[Pvfs2-developers] threaded client-core and the device thread

Phil Carns pcarns at wastedcycles.org
Mon Oct 16 09:37:15 EDT 2006


Sam Lang wrote:
> 
> Hi All,
> 
> Dean and I are looking at trying to push the efficiency of requests
> from the kernel module up through the device to client-core.  I added
> the --threaded option to the client to allow the client-core to run
> with multiple threads (one each for bmi, dev, and main -- and also a
> remount thread, but let's ignore that for now), so the device thread
> should be able to keep pulling requests off the device without having
> to wait for bmi operations to complete.
> 
> I noticed a couple things with the device thread that I wanted to ask  
> about.
> 
> PINT_dev_test_unexpected takes an incount of 5, so it's only going to
> read at most 5 requests off the device for each call.  Once it
> returns, each of the unexpected requests is added to the completed
> jobs array and then we signal the jobs completed condition variable
> _for each request_.  It seems like this will be 5x the number of
> context switches between the device thread and the main thread that
> we need.
> 
> Also, we poll every time before reading another request off the
> device.  What about trying to read a number of requests off the
> device at once with one read (or possibly a readv so we can keep
> separate buffers per request)?
> 
> Also, it looks like we do a malloc for each new request buffer, and
> then a free once we're done with it, and a memset of the info struct.
> It seems like we could manage the buffers on the stack instead of the
> heap, and save on a few system calls there.
> 
> For both threaded and nonthreaded, with the workload that Dean is
> using, he found that PINT_dev_test_unexpected always returned 5
> requests in the outcount.  So it looks like there are always requests
> sitting on the device, waiting to be read by client-core.  Are we just
> not able to process requests fast enough through BMI and the state
> machines, or is the cost of polling and signaling every time we read a
> request off the device slowing us down?  In other words, does it make
> sense to rework the code a little bit or will we just get bottlenecked
> elsewhere?

I am just speculating, but out of the things you list I would guess that 
these two things would be most likely to show improvement without much 
coding effort:

- increasing the testcount to something higher than 5 (since it sounds 
like that is getting maxed out for this workload)
- fixing the "signalling on every request" problem
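
As a rough illustration of the second item, here is a minimal sketch of
posting a whole batch of completions under one lock and signalling the
condition variable once per batch rather than once per request.  The
names (completion_queue, cq_post_batch, cq_drain, MAX_COMPLETED) are
hypothetical and are not taken from the PVFS2 job code; requests are
stood in for by plain ints.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPLETED 64  /* hypothetical capacity, not a PVFS2 constant */

/* Hypothetical completion queue shared by the device and main threads. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             items[MAX_COMPLETED];
    size_t          count;
    unsigned long   signals_sent;  /* instrumentation for this sketch only */
};

static void cq_init(struct completion_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->ready, NULL);
    q->count = 0;
    q->signals_sent = 0;
}

/* Producer side: queue as many of the n requests as fit, then signal
 * once for the whole batch.  Returns the number actually queued. */
static size_t cq_post_batch(struct completion_queue *q,
                            const int *reqs, size_t n)
{
    size_t added = 0;

    pthread_mutex_lock(&q->lock);
    while (added < n && q->count < MAX_COMPLETED) {
        q->items[q->count++] = reqs[added++];
    }
    if (added > 0) {
        pthread_cond_signal(&q->ready);  /* one wakeup per batch */
        q->signals_sent++;
    }
    pthread_mutex_unlock(&q->lock);
    return added;
}

/* Consumer side: block until at least one completion is queued, then
 * take up to max of them in a single pass.  Returns the number taken. */
static size_t cq_drain(struct completion_queue *q, int *out, size_t max)
{
    size_t n;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0) {
        pthread_cond_wait(&q->ready, &q->lock);
    }
    n = q->count < max ? q->count : max;
    memcpy(out, q->items, n * sizeof(out[0]));
    q->count -= n;
    /* Slide any entries that were left behind to the front. */
    memmove(q->items, q->items + n, q->count * sizeof(q->items[0]));
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With a batch of 5 requests this issues one pthread_cond_signal instead
of five, and the consumer picks up everything that is queued each time
it wakes.  Whether this maps cleanly onto the existing completed jobs
array is an open question that the sketch does not address.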

The need for multiple reads and the mallocs could be a problem, but I
agree with Murali that problems in this area are more likely related to
inefficient threading or I/O stalls than to CPU or memory overhead.
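
For reference, the readv idea from the original message could look
roughly like the sketch below: one readv() call scattering into
caller-provided, per-request buffers that can live on the stack, so
there is no malloc/free per request.  It assumes fixed-size request
records and a descriptor that hands back several records in one call,
as a pipe does; whether the PVFS2 device behaves that way is not
established here.  REQ_SIZE, MAX_BATCH and read_request_batch are
hypothetical names.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  64  /* hypothetical fixed request size in bytes */
#define MAX_BATCH 16  /* hypothetical upper bound on requests per call */

/* Read up to max_reqs fixed-size requests with a single readv() call.
 * bufs is supplied by the caller (for example a stack array), one
 * REQ_SIZE buffer per request.  Returns the number of complete
 * requests read, 0 if max_reqs is not positive, or -1 on error. */
static int read_request_batch(int fd, char bufs[][REQ_SIZE], int max_reqs)
{
    struct iovec iov[MAX_BATCH];
    ssize_t nread;
    int i;

    if (max_reqs <= 0) {
        return 0;
    }
    if (max_reqs > MAX_BATCH) {
        max_reqs = MAX_BATCH;
    }
    for (i = 0; i < max_reqs; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = REQ_SIZE;
    }

    nread = readv(fd, iov, max_reqs);
    if (nread < 0) {
        return -1;
    }
    /* Only whole records are counted; a trailing partial one is ignored. */
    return (int)(nread / REQ_SIZE);
}
```

A caller would declare something like char bufs[MAX_BATCH][REQ_SIZE] on
its own stack and pass it in.  This only removes per-request allocation
and read calls, which, as noted above, may not be where the time is
going.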

-Phil

