[Pvfs2-developers] threaded client-core and the device thread

Dean Hildebrand dhildebz at eecs.umich.edu
Tue Oct 17 11:57:43 EDT 2006


Hi Murali/Phil,

Murali Vilayannur wrote:
> Hi Sam,
>> Dean and I are looking at trying to push the efficiency of requests 
>> from the kernel module up through the device to client-core.  I added 
>> the --threaded option to the client to allow the client-core to run 
>> with multiple threads (one each for bmi, dev, and main -- and also a 
>> remount thread, but let's ignore that for now), so the device thread 
>> should be able to keep pulling requests off the device without having 
>> to wait for bmi operations to complete.
> Cool!
> This could address some of the performance problems that Phil also had 
> pointed out a while back, where multiple outstanding requests were slower 
> than a single outstanding request.
Just to see if I'm noticing the same issue, what was the exact problem 
Phil was noticing?  Shouldn't multiple requests take longer than a 
single request?

The workload I was using was multiple rpc.nfsd threads issuing 64 KB 
requests (through the writev/readv interface) to the PVFS2 kernel module 
(and then to client-core and so on).  To make things easy, I bet using 
iozone with multiple threads and a random workload would simulate this 
workload quite well.  What I was noticing is that although we haven't 
reached disk, CPU, or network limits, the I/O throughput is fixed at 
some low value.

One test Sam and I tried was to increase the number of kernel mmapped 
buffers.  Instead of five 4MB buffers, we used sixty-four 128KB 
buffers.  This reduced performance considerably, especially read 
performance.  Since we are using 64KB requests, this should not be an 
issue, but it was.  One thing we didn't get a chance to determine was 
whether the reduced performance was caused by the increase in the number 
of buffers or by the reduction in their size.  My guess would be the 
increase, but why would this be?
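
To make the comparison between the two buffer configurations concrete, here is a small arithmetic sketch.  The struct and helper names are hypothetical (not PVFS2 code), and it assumes 1 MB = 1024 KB and that 64 KB requests pack into a buffer with no per-request overhead.

```c
#include <assert.h>

/* Hypothetical description of one mmapped buffer layout (not PVFS2 code). */
struct buf_layout {
    unsigned count;    /* number of mmapped buffers */
    unsigned size_kb;  /* size of each buffer, in KB */
};

/* Total mapped space for a layout, in KB. */
static unsigned total_kb(struct buf_layout l)
{
    return l.count * l.size_kb;
}

/* How many requests of req_kb fit in one buffer (assumes no overhead). */
static unsigned reqs_per_buf(struct buf_layout l, unsigned req_kb)
{
    assert(req_kb > 0);
    return l.size_kb / req_kb;
}
```

Calling these with {5, 4 * 1024} and {64, 128} gives 20480 KB versus 8192 KB of total mapped space, and 64 versus 2 requests of 64 KB per buffer.  So the experiment changed three things at once (count, size, and total space), which is part of why it is hard to say which one hurt.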

Beyond inefficient coding issues, Sam and I talked about where the 
bottleneck could be from a design standpoint.  We came up with the 
following list:
0) kmapping and copying data is going as fast as possible
1) Sending messages through the pvfs2-req device can only happen at a 
constant rate.
2) client-core reading messages off the pvfs2-req device (should no 
longer be an issue with the --threaded option, but maybe reading 5 at a 
time is still inefficient)
3) A single BMI thread issuing I/O requests.  Are multiple threads 
necessary to issue the multiple I/O requests from the kernel?

Can anyone think of other parts of the I/O path that might be a 
bottleneck?  So far, we have only started to investigate items 1 and 2.

Thanks for everyone's help.
Dean
>
>> PINT_dev_test_unexpected takes an incount of 5, so it's only going to 
>> read at most 5 requests off the device for each call.  Once it 
>> returns, each of the unexpected requests is added to the completed 
>> jobs array and then we signal the jobs completed condition variable 
>> _for each request_.  It seems like this will be 5x the number of 
>> context switches between the device thread and the main thread that 
>> we need.
>>
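
The per-request signaling described above can be contrasted with batching in a rough sketch like the one below.  Everything here (struct done_queue, dq_post_each, dq_post_batch, MAX_DONE) is a hypothetical model, not the client-core job code, and the signal counter only counts pthread_cond_signal calls; it does not measure actual context switches.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_DONE 64  /* assumed queue capacity, for illustration only */

/* Hypothetical completion queue shared by a device thread and a main thread. */
struct done_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             ids[MAX_DONE];
    size_t          count;
    unsigned long   signals;  /* number of pthread_cond_signal calls made */
};

static void dq_init(struct done_queue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
    q->count = 0;
    q->signals = 0;
}

/* Model of the per-request pattern: one lock and one signal per request. */
static size_t dq_post_each(struct done_queue *q, const int *ids, size_t n)
{
    size_t posted = 0;
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&q->lock);
        if (q->count < MAX_DONE) {
            q->ids[q->count++] = ids[i];
            posted++;
            q->signals++;
            pthread_cond_signal(&q->cond);
        }
        pthread_mutex_unlock(&q->lock);
    }
    return posted;
}

/* Batched alternative: append everything under one lock, signal once. */
static size_t dq_post_batch(struct done_queue *q, const int *ids, size_t n)
{
    size_t posted = 0;
    pthread_mutex_lock(&q->lock);
    while (posted < n && q->count < MAX_DONE)
        q->ids[q->count++] = ids[posted++];
    if (posted > 0) {
        q->signals++;
        pthread_cond_signal(&q->cond);
    }
    pthread_mutex_unlock(&q->lock);
    return posted;
}
```

For a batch of 5 requests, the first function signals 5 times and the second once.  The batched form only works if the consumer drains the whole queue each time it wakes, rather than taking a single entry per wakeup.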
>> Also, we poll every time before reading another request off the 
>> device.  What about trying to read a number of requests off the 
>> device at once with one read (or possibly a readv so we can keep 
>> separate buffers per request).
> Hmm.. both of these are good points. I had dabbled with doing a readv 
> a while back. It might make a difference although I suspect this might 
> be in the noise region since
> if there are requests to be serviced, poll() will only take the time 
> of a syscall which should be pretty fast these days.. but worth a shot.
>
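
As a sketch of the readv idea, the function below fills several per-request buffers with a single system call.  REQ_SIZE, MAX_BATCH, and read_request_batch are made-up names, and the fixed request size is an assumption for illustration.  Whether the pvfs2-req device would actually hand back more than one request per readv depends on how its read path is implemented in the kernel module, which this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE  64  /* assumed fixed request size, illustration only */
#define MAX_BATCH 5   /* mirrors the incount of 5 mentioned in the thread */

/*
 * Read up to n requests with one readv() call, one buffer per request.
 * Returns the number of complete requests read, or -1 on error.
 * A trailing partial request, if any, is not counted.
 */
static int read_request_batch(int fd, char bufs[][REQ_SIZE], int n)
{
    struct iovec iov[MAX_BATCH];
    ssize_t got;
    int i;

    assert(n > 0 && n <= MAX_BATCH);
    for (i = 0; i < n; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = REQ_SIZE;
    }

    got = readv(fd, iov, n);
    if (got < 0)
        return -1;
    return (int)(got / REQ_SIZE);
}
```

On an ordinary pipe or file this replaces up to five read calls with one.  The poll before each batch would still be needed on a blocking descriptor, so the saving is in syscalls per request rather than in wakeups.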
>> Also, it looks like we do a malloc for each new request buffer, and 
>> then a free once we're done with it, and a memset of the info 
>> struct.  It seems like we could manage the buffers on the stack 
>> instead of the heap, and save on a few system calls there.
> Now we are definitely in the noise region.. :) just kidding. glibc's 
> malloc implementation should typically amortize overheads in invoking 
> system calls (sbrk etc).
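
If the per-request malloc/free ever did show up in a profile, one alternative is a small preallocated pool rather than stack buffers.  This is a swapped-in technique, not what the thread proposes verbatim, and the names (struct req_pool, pool_get, pool_put) and sizes are hypothetical.  It is single-threaded as written; sharing it between threads would need a lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_SLOTS 5    /* one slot per request in a batch (assumed) */
#define SLOT_SIZE  256  /* assumed request buffer size, illustration only */

/* Hypothetical fixed pool of request buffers, allocated once up front. */
struct req_pool {
    char slots[POOL_SLOTS][SLOT_SIZE];
    int  in_use[POOL_SLOTS];
};

static void pool_init(struct req_pool *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Return a free buffer, or NULL if every slot is taken. */
static void *pool_get(struct req_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return p->slots[i];
        }
    }
    return NULL;
}

/* Give a buffer obtained from pool_get back to the pool. */
static void pool_put(struct req_pool *p, void *buf)
{
    size_t i = (size_t)((char *)buf - &p->slots[0][0]) / SLOT_SIZE;
    assert(i < POOL_SLOTS && p->in_use[i]);
    p->in_use[i] = 0;
}
```

As noted above, glibc's malloc rarely enters the kernel for small allocations, so any gain from this is likely to be small compared with the cost of the device reads themselves.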
>> For both threaded and nonthreaded, with the workload that Dean is 
>> using, he found that the PINT_dev_test_unexpected always returned 5 
>> requests in the outcount.  So it looks like there are always requests 
>> sitting on the device, waiting to be read by client-core.  Are we 
>> just not able to process requests fast enough through BMI and the 
>> state machines, or is the cost of polling and signaling every time we 
>> read a request off the device slowing us down?  In other words, does 
>> it make sense to rework the code a little bit or will we just get 
>> bottlenecked elsewhere?
> It is definitely interesting to try all this out, but I am not sure if 
> the bottlenecks are here or elsewhere.
> What does this workload do by the way?
>
> thanks,
> Murali
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

-- 
Dean Hildebrand
Ph.D. Candidate
University of Michigan


