[Pvfs2-developers] threaded client-core and the device thread

Murali Vilayannur murali.vilayannur at gmail.com
Fri Oct 13 23:00:12 EDT 2006


Hi Sam,
> Dean and I are looking at trying to push the efficiency of requests 
> from the kernel module up through the device to client-core.  I added 
> the --threaded option to the client to allow the client-core to run 
> with multiple threads (one each for bmi, dev, and main -- and also a 
> remount thread, but let's ignore that for now), so the device thread 
> should be able to keep pulling requests off the device without having 
> to wait for bmi operations to complete.
Cool!
This could address some of the performance problems that Phil also had 
pointed out a while back, where multiple outstanding requests were slower 
than a single outstanding request.

> PINT_dev_test_unexpected takes an incount of 5, so it's only going to 
> read at most 5 requests off the device for each call.  Once it 
> returns, each of the unexpected requests is added to the completed 
> jobs array and then we signal the jobs completed condition variable 
> _for each request_.  It seems like this will be 5x the number of 
> context switches between the device thread and the main thread that we 
> need.
>
> Also, we poll every time before reading another request off the 
> device.  What about trying to read a number of requests off the device 
> at once with one read (or possibly a readv so we can keep separate 
> buffers per request).
Hmm.. both of these are good points. I had dabbled with doing a readv a 
while back. It might make a difference, although I suspect the gain will 
be in the noise, since if there are requests to be serviced, poll() costs 
only a single syscall, which should be pretty fast these days.. but it's 
worth a shot.
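
For reference, a rough sketch of the readv idea in C. It assumes, 
hypothetically, that every upcall on the device is a fixed-size message 
of MSG_SIZE bytes and that readv() scatters consecutive messages into 
consecutive iovecs; neither is confirmed for the real pvfs2 device, and 
read_batch, MSG_SIZE and MAX_BATCH are illustrative names only.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical: every upcall is a fixed-size MSG_SIZE-byte message.
 * The real pvfs2 device protocol may differ. */
#define MSG_SIZE 64
#define MAX_BATCH 5 /* mirrors the incount of 5 used today */

/* Read up to max_msgs messages with a single readv() call, giving each
 * request its own buffer (one iovec per buffer).
 * Returns the number of complete messages read, or -1 on error.
 * A trailing partial message is ignored here; a real implementation
 * would have to keep it. */
static int read_batch(int fd, char bufs[][MSG_SIZE], int max_msgs)
{
    struct iovec iov[MAX_BATCH];
    ssize_t n;
    int i;

    if (max_msgs > MAX_BATCH)
        max_msgs = MAX_BATCH;
    if (max_msgs <= 0)
        return 0;
    for (i = 0; i < max_msgs; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = MSG_SIZE;
    }
    do {
        n = readv(fd, iov, max_msgs);
    } while (n < 0 && errno == EINTR);

    return n < 0 ? -1 : (int)(n / MSG_SIZE);
}
```

With a pipe standing in for the device, writing three 64-byte messages 
and then calling read_batch(fd, bufs, MAX_BATCH) fills bufs[0..2] with a 
single syscall instead of three poll()/read() pairs.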

> Also, it looks like we do a malloc for each new request buffer, and 
> then a free once we're done with it, and a memset of the info struct.  
> It seems like we could manage the buffers on the stack instead of the 
> heap, and save on a few system calls there.
Now we are definitely in the noise region.. :) just kidding. glibc's 
malloc implementation should typically amortize overheads in invoking 
system calls (sbrk, etc.).
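
If the per-request malloc/free/memset ever did show up in a profile, one 
option is a preallocated pool with a free list. This is a swapped-in 
technique rather than literal stack buffers: a request buffer outlives 
the device read that filled it, so it cannot simply live in the reading 
function's frame. The sizes and the req_pool_* names are hypothetical, 
not taken from pvfs2-client-core.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sizes, not values from pvfs2-client-core. */
#define REQ_BUF_SIZE 1024
#define POOL_DEPTH 16

struct req_buf {
    struct req_buf *next;      /* free-list link while the buffer is unused */
    char data[REQ_BUF_SIZE];
};

struct req_pool {
    pthread_mutex_t lock;      /* device thread gets, main thread puts */
    struct req_buf *free_list;
    struct req_buf slots[POOL_DEPTH];
};

/* Thread every slot onto the free list once, up front. */
static void req_pool_init(struct req_pool *p)
{
    int i;

    pthread_mutex_init(&p->lock, NULL);
    p->free_list = NULL;
    for (i = 0; i < POOL_DEPTH; i++) {
        p->slots[i].next = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Take a buffer; returns NULL when all are in use, at which point the
 * caller could stop pulling requests off the device for a while. */
static struct req_buf *req_pool_get(struct req_pool *p)
{
    struct req_buf *b;

    pthread_mutex_lock(&p->lock);
    b = p->free_list;
    if (b)
        p->free_list = b->next;
    pthread_mutex_unlock(&p->lock);
    return b;
}

/* Give a buffer back once the request it carried has completed. */
static void req_pool_put(struct req_pool *p, struct req_buf *b)
{
    pthread_mutex_lock(&p->lock);
    b->next = p->free_list;
    p->free_list = b;
    pthread_mutex_unlock(&p->lock);
}
```

Both get and put are O(1) and never enter the allocator, so the only 
shared cost is one uncontended mutex round trip per buffer.
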
> For both threaded and nonthreaded, with the workload that Dean is 
> using, he found that the PINT_dev_test_unexpected always returned 5 
> requests in the outcount.  So it looks like there are always requests 
> sitting on the device, waiting to be read by client-core.  Are we just 
> not able to process requests fast enough through BMI and the state 
> machines, or is the cost of polling and signaling every time we read a 
> request off the device slowing us down?  In other words, does it make 
> sense to rework the code a little bit or will we just get bottlenecked 
> elsewhere?
It is definitely interesting to try all this out, but I am not sure if 
the bottlenecks are here or elsewhere.
What does this workload do, by the way?
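
On the polling and signaling question, a sketch of the alternative Sam 
describes: poll() once, drain up to a batch of requests with non-blocking 
reads until the device would block, then publish the whole batch under 
one mutex acquisition with a single pthread_cond_signal() instead of one 
signal per request. It assumes, hypothetically, that the device hands out 
one fixed-size upcall per read(); the queue layout and all names 
(drain_device, publish_batch, MSG_SIZE, BATCH_MAX) are illustrative only.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MSG_SIZE 64   /* hypothetical fixed upcall size */
#define BATCH_MAX 5   /* mirrors the incount of 5 */
#define QUEUE_MAX 64  /* hypothetical queue depth */

/* Completed requests handed from the device thread to the main thread. */
struct completed_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    int count;
    char msgs[QUEUE_MAX][MSG_SIZE];
};

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* poll() once, then read whole messages until the non-blocking fd
 * would block or BATCH_MAX messages have been read. A short read is
 * treated as the end of the batch in this sketch. */
static int drain_device(int fd, char batch[][MSG_SIZE], int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return 0; /* timeout or error: nothing to read */
    while (n < BATCH_MAX) {
        ssize_t r = read(fd, batch[n], MSG_SIZE);
        if (r == MSG_SIZE) {
            n++;
        } else if (r < 0 && errno == EINTR) {
            continue;
        } else {
            break; /* EAGAIN (drained), EOF, short read or error */
        }
    }
    return n;
}

/* Publish a batch under one lock acquisition and wake the main thread
 * with a single signal. Returns how many messages were queued; a real
 * caller would need back-pressure when the queue is full. */
static int publish_batch(struct completed_queue *q,
                         char batch[][MSG_SIZE], int n)
{
    int i, queued = 0;

    if (n <= 0)
        return 0;
    pthread_mutex_lock(&q->lock);
    for (i = 0; i < n && q->count < QUEUE_MAX; i++) {
        memcpy(q->msgs[q->count++], batch[i], MSG_SIZE);
        queued++;
    }
    pthread_mutex_unlock(&q->lock);
    if (queued > 0)
        pthread_cond_signal(&q->nonempty); /* one wakeup per batch */
    return queued;
}
```

With 7 requests queued on a pipe, the first drain_device() call returns 5 
and the second returns 2, so the main thread is woken twice rather than 
seven times. Whether that matters depends on whether BMI and the state 
machines are the real bottleneck.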

thanks,
Murali

