[Pvfs2-developers] threaded client-core and the device thread

Sam Lang slang at mcs.anl.gov
Fri Oct 13 23:12:44 EDT 2006


On Oct 13, 2006, at 10:00 PM, Murali Vilayannur wrote:

> Hi Sam,
>> Dean and I are looking at trying to push the efficiency of  
>> requests from the kernel module up through the device to client- 
>> core.  I added the --threaded option to the client to allow the  
>> client-core to run with multiple threads (one each for bmi, dev,  
>> and main -- and also a remount thread, but let's ignore that for  
>> now), so the device thread should be able to keep pulling requests  
>> off the device without having to wait for bmi operations to complete.
> Cool!
> This could address some of the performance problems that Phil also  
> had pointed out a while back, where multiple outstanding requests were  
> slower than a single outstanding request.

Well, it doesn't seem to make a difference, at least with the  
workloads that we were trying.

>
>> PINT_dev_test_unexpected takes an incount of 5, so it's only going  
>> to read at most 5 requests off the device for each call.  Once it  
>> returns, each of the unexpected requests is added to the completed  
>> jobs array and then we signal the jobs completed condition  
>> variable _for each request_.  It seems like this will be 5x the  
>> number of context switches between the device thread and the main  
>> thread that we need.
>>
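
To make the signaling point concrete, here is a minimal sketch of the two styles. It is not the PVFS2 code: struct completion_queue, post_each and post_batch are hypothetical names, and the signals_sent counter exists only to show the difference.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_COMPLETED 64

/* Hypothetical stand-in for the completed-jobs array; not the PVFS2 type. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             ids[MAX_COMPLETED];
    size_t          count;
    unsigned        signals_sent;   /* instrumentation for this sketch only */
};

#define COMPLETION_QUEUE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, {0}, 0, 0 }

/* Per-request style: one lock/signal round trip for every request. */
static void post_each(struct completion_queue *q, const int *ids, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&q->lock);
        assert(q->count < MAX_COMPLETED);
        q->ids[q->count++] = ids[i];
        q->signals_sent++;
        pthread_cond_signal(&q->ready);
        pthread_mutex_unlock(&q->lock);
    }
}

/* Batched style: append the whole batch under one lock, signal once. */
static void post_batch(struct completion_queue *q, const int *ids, size_t n)
{
    if (n == 0)
        return;
    pthread_mutex_lock(&q->lock);
    assert(q->count + n <= MAX_COMPLETED);
    for (size_t i = 0; i < n; i++)
        q->ids[q->count++] = ids[i];
    q->signals_sent++;
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}
```

With a batch of 5, post_each signals five times and post_batch once. The batched version only works if the waiting thread drains every queued entry each time it wakes up, rather than handling one entry per wakeup.
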
>> Also, we poll every time before reading another request off the  
>> device.  What about trying to read a number of requests off the  
>> device at once with one read (or possibly a readv so we can keep  
>> separate buffers per request)?
> Hmm.. both of these are good points. I had dabbled with doing a  
> readv a while back. It might make a difference, although I suspect  
> this might be in the noise region, since if there are requests to be  
> serviced, poll() will only take the time of a syscall, which should  
> be pretty fast these days.. but worth a shot.
>
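
For reference, a rough sketch of the readv idea with one buffer per request. It assumes fixed-size requests (REQ_SIZE is a made-up value) and a descriptor that hands data back as a plain byte stream; whether the pvfs2 device delivers one request per iovec entry is an assumption that would need checking against the kernel module. read_request_batch is a hypothetical helper, not an existing function.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  16   /* hypothetical fixed request size */
#define MAX_BATCH 5    /* same as the incount of 5 discussed above */

/* Read up to MAX_BATCH fixed-size requests with a single readv() call,
 * each into its own buffer. Returns the number of complete requests
 * read, or -1 on error. A trailing partial request is not counted. */
static int read_request_batch(int fd, char bufs[MAX_BATCH][REQ_SIZE])
{
    struct iovec iov[MAX_BATCH];

    for (int i = 0; i < MAX_BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = REQ_SIZE;
    }

    ssize_t n = readv(fd, iov, MAX_BATCH);
    if (n < 0)
        return -1;
    return (int)(n / REQ_SIZE);
}
```

This replaces up to five poll/read pairs with one poll and one readv per batch, which is the saving being weighed against "in the noise" above.
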
>> Also, it looks like we do a malloc for each new request buffer,  
>> and then a free once we're done with it, and a memset of the info  
>> struct.  It seems like we could manage the buffers on the stack  
>> instead of the heap, and save on a few system calls there.
> Now we are definitely in the noise region.. :) just kidding.  
> glibc's malloc implementation should typically amortize overheads  
> in invoking system calls (sbrk etc).

Dean was seeing memset at the top of the list while running oprofile on  
pvfs2-client-core.  malloc and free were also up there.
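
One way to avoid the per-request malloc/free/memset would be a small set of preallocated slots that get reused. The sketch below is only an illustration under assumptions: req_slot, req_pool, POOL_SLOTS and BUF_SIZE are hypothetical, and the pool could just as well live on the device thread's stack.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 5      /* hypothetical: one slot per request in a batch */
#define BUF_SIZE   4096   /* hypothetical request buffer size */

struct req_slot {
    int    in_use;
    size_t len;               /* bytes of data[] that are valid */
    char   data[BUF_SIZE];
};

/* Must start out zeroed (static storage or a one-time initialization). */
struct req_pool {
    struct req_slot slots[POOL_SLOTS];
};

/* Hand out a free slot, or NULL if all are busy. Only the small header
 * fields are reset; the payload is not cleared on each reuse. */
static struct req_slot *pool_acquire(struct req_pool *p)
{
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            p->slots[i].len = 0;
            return &p->slots[i];
        }
    }
    return NULL;
}

/* Return a slot to the pool once the request has been processed. */
static void pool_release(struct req_slot *s)
{
    assert(s != NULL && s->in_use);
    s->in_use = 0;
}
```

Skipping the clear is only safe if readers trust len and never look past it in data[].
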

>> For both threaded and nonthreaded, with the workload that Dean is  
>> using, he found that the PINT_dev_test_unexpected always returned  
>> 5 requests in the outcount.  So it looks like there are always  
>> requests sitting on the device, waiting to be read by client- 
>> core.  Are we just not able to process requests fast enough  
>> through BMI and the state machines, or is the cost of polling and  
>> signaling every time we read a request off the device slowing us  
>> down?  In other words, does it make sense to rework the code a  
>> little bit or will we just get bottlenecked elsewhere?
> It is definitely interesting to try all this out, but I am not sure  
> if the bottlenecks are here or elsewhere.
> What does this workload do by the way?

If I understand it correctly, there are a number of threads doing  
simultaneous reads or writes (64K block sizes) on the same file.

-sam

>
> thanks,
> Murali
>


