[Pvfs2-developers] threaded client-core and the device thread
Sam Lang
slang at mcs.anl.gov
Fri Oct 13 23:12:44 EDT 2006
On Oct 13, 2006, at 10:00 PM, Murali Vilayannur wrote:
> Hi Sam,
>> Dean and I are looking at trying to push the efficiency of
>> requests from the kernel module up through the device to client-
>> core. I added the --threaded option to the client to allow the
>> client-core to run with multiple threads (one each for bmi, dev,
>> and main -- and also a remount thread, but let's ignore that for
>> now), so the device thread should be able to keep pulling requests
>> off the device without having to wait for bmi operations to complete.
> Cool!
> This could address some of the performance problems that Phil had
> also pointed out a while back, where multiple outstanding requests
> were slower than a single outstanding request.
Well, it doesn't seem to make a difference, at least with the
workloads that we were trying.
>
>> PINT_dev_test_unexpected takes an incount of 5, so it's only going
>> to read at most 5 requests off the device for each call. Once it
>> returns, each of the unexpected requests is added to the completed
>> jobs array and then we signal the jobs completed condition
>> variable _for each request_. It seems like this will be 5x the
>> number of context switches between the device thread and the main
>> thread that we need.
>>
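To make the signaling point concrete, here is a minimal sketch of queuing a whole batch under one lock and signaling the condition variable once, with the consumer draining everything per wakeup. None of this is the actual job code: struct done_queue, done_queue_push_batch and done_queue_drain are hypothetical names, the int items stand in for completed job entries, and a single consumer thread is assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_CAP 64            /* hypothetical capacity */

/* Hypothetical stand-in for the completed-jobs array. */
struct done_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             items[QUEUE_CAP];
    size_t          count;
};

#define DONE_QUEUE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, {0}, 0 }

/* Producer side: append up to n items, then signal once for the batch.
 * Returns how many items fit in the queue. */
static size_t done_queue_push_batch(struct done_queue *q,
                                    const int *items, size_t n)
{
    size_t added = 0;

    pthread_mutex_lock(&q->lock);
    while (added < n && q->count < QUEUE_CAP)
        q->items[q->count++] = items[added++];
    if (added > 0)
        pthread_cond_signal(&q->nonempty);   /* one wakeup per batch */
    pthread_mutex_unlock(&q->lock);
    return added;
}

/* Consumer side: block until something is queued, then take up to max
 * items in one wakeup. Returns how many items were copied to out. */
static size_t done_queue_drain(struct done_queue *q, int *out, size_t max)
{
    size_t n;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    n = q->count < max ? q->count : max;
    memcpy(out, q->items, n * sizeof out[0]);
    /* Keep anything that did not fit at the front of the queue. */
    memmove(q->items, q->items + n, (q->count - n) * sizeof q->items[0]);
    q->count -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With 5 requests per test call this would be one signal and one wakeup instead of five, though whether that shows up in the numbers is exactly the open question.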
>> Also, we poll every time before reading another request off the
>> device. What about trying to read a number of requests off the
>> device at once with one read (or possibly a readv so we can keep
>> separate buffers per request).
> Hmm.. both of these are good points. I had dabbled with doing a
> readv a while back. It might make a difference, although I suspect
> this might be in the noise region, since if there are requests to be
> serviced, poll() will only take the time of a syscall, which should
> be pretty fast these days.. but worth a shot.
>
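For reference, the readv idea could look roughly like the sketch below: one call scatters the incoming bytes across per-request buffers. It assumes the file descriptor can return several fixed-size requests in a single read, which may not hold for the pvfs2 device as it stands. REQ_SIZE, MAX_BATCH and read_request_batch are made-up names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>      /* used by the pipe-based usage example */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>      /* used by the pipe-based usage example */

#define MAX_BATCH 5      /* mirrors the incount of 5 discussed above */
#define REQ_SIZE  64     /* hypothetical fixed request size */

/* Read up to MAX_BATCH fixed-size requests with a single readv(), each
 * into its own buffer. Returns the number of complete requests read, or
 * -1 on error. A trailing partial request is not counted here; real code
 * would have to deal with it. */
static int read_request_batch(int fd, char bufs[MAX_BATCH][REQ_SIZE])
{
    struct iovec iov[MAX_BATCH];
    ssize_t nread;
    int i;

    assert(fd >= 0);
    for (i = 0; i < MAX_BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = REQ_SIZE;
    }

    /* readv() fills the buffers in order and returns the total bytes read. */
    nread = readv(fd, iov, MAX_BATCH);
    if (nread < 0)
        return -1;
    return (int)(nread / REQ_SIZE);
}
```

One way to try it without the device is to write a few REQ_SIZE-byte records into a pipe and call read_request_batch() on the read end; each record should land in its own buffer.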
>> Also, it looks like we do a malloc for each new request buffer,
>> and then a free once we're done with it, and a memset of the info
>> struct. It seems like we could manage the buffers on the stack
>> instead of the heap, and save on a few system calls there.
> Now we are definitely in the noise region.. :) just kidding.
> glibc's malloc implementation should typically amortize overheads
> in invoking system calls (sbrk etc).
Dean was seeing memset at the top of the list while running oprofile on
pvfs2-client-core. malloc and free were also up there.
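One way to avoid the per-request malloc/free/memset would be a small fixed pool of request slots that get reused, resetting only the metadata on reuse. This is a sketch under assumptions, not the client-core code: struct req_slot, struct req_pool, pool_get, pool_put and REQ_BUF_SIZE are hypothetical, the pool must start out zero-initialized, and it is not thread-safe as written.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS   5       /* one slot per request in a batch of 5 */
#define REQ_BUF_SIZE 4096    /* hypothetical upper bound on request size */

/* Hypothetical request slot: a buffer plus the little state we track. */
struct req_slot {
    int    in_use;
    size_t len;               /* bytes currently valid in buf */
    char   buf[REQ_BUF_SIZE];
};

/* Must be zero-initialized before first use (e.g. static storage). */
struct req_pool {
    struct req_slot slots[POOL_SLOTS];
};

/* Hand out a free slot, or NULL if all are busy. Only the metadata is
 * reset; the buffer is left as-is because the next read overwrites it. */
static struct req_slot *pool_get(struct req_pool *p)
{
    int i;

    for (i = 0; i < POOL_SLOTS; i++) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            p->slots[i].len = 0;
            return &p->slots[i];
        }
    }
    return NULL;
}

/* Return a slot to the pool once the request has been handled. */
static void pool_put(struct req_slot *s)
{
    assert(s != NULL && s->in_use);
    s->in_use = 0;
}
```

The pool could sit in static storage or in the device thread's context; with multiple threads touching it, it would need a lock or per-thread pools.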
>> For both threaded and nonthreaded, with the workload that Dean is
>> using, he found that the PINT_dev_test_unexpected always returned
>> 5 requests in the outcount. So it looks like there are always
>> requests sitting on the device, waiting to be read by client-
>> core. Are we just not able to process requests fast enough
>> through BMI and the state machines, or is the cost of polling and
>> signaling every time we read a request off the device slowing us
>> down? In other words, does it make sense to rework the code a
>> little bit or will we just get bottlenecked elsewhere?
> It is definitely interesting to try all this out, but I am not sure
> if the bottlenecks are here or elsewhere.
> What does this workload do by the way?
If I understand it correctly, there are a number of threads doing
simultaneous reads or writes (64K block sizes) on the same file.
-sam
>
> thanks,
> Murali
>