[Pvfs2-developers] threaded client-core and the device thread
Murali Vilayannur
murali.vilayannur at gmail.com
Fri Oct 13 23:00:12 EDT 2006
Hi Sam,
> Dean and I are looking at trying to push the efficiency of requests
> from the kernel module up through the device to client-core. I added
> the --threaded option to the client to allow the client-core to run
> with multiple threads (one each for bmi, dev, and main -- and also a
> remount thread, but let's ignore that for now), so the device thread
> should be able to keep pulling requests off the device without having
> to wait for bmi operations to complete.
Cool!
This could address some of the performance problems that Phil pointed
out a while back, where multiple outstanding requests were slower
than a single outstanding request.
> PINT_dev_test_unexpected takes an incount of 5, so it's only going to
> read at most 5 requests off the device for each call. Once it
> returns, each of the unexpected requests is added to the completed
> jobs array and then we signal the jobs completed condition variable
> _for each request_. It seems like this will be 5x the number of
> context switches between the device thread and the main thread that we
> need.
>
> Also, we poll every time before reading another request off the
> device. What about trying to read a number of requests off the device
> at once with one read (or possibly a readv so we can keep separate
> buffers per request).
Hmm.. both of these are good points. I dabbled with doing a readv a
while back. It might make a difference, although I suspect it will be
in the noise region: if there are requests to be serviced, poll() only
costs a syscall, which should be pretty fast these days.. but worth a
shot.
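To make the two ideas concrete, here is a rough sketch rather than the actual client-core code. It assumes fixed-size requests and a device whose read path can hand back several requests in one readv(), which the kernel module may not support as it stands. REQ_SIZE, struct completion_queue, and the cq_* and read_request_batch names are all hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define REQ_SIZE      16   /* hypothetical fixed request size */
#define MAX_BATCH      5   /* mirrors the incount of 5 discussed above */
#define MAX_COMPLETED 64   /* hypothetical capacity of the completed array */

/* Hypothetical stand-in for the completed-jobs array and its condvar. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    void           *items[MAX_COMPLETED];
    size_t          count;
    unsigned long   signals;   /* condvar signals issued, for measurement */
};

static void cq_init(struct completion_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->count = 0;
    q->signals = 0;
}

/* Read up to nbufs fixed-size requests with a single readv(), one
 * buffer per request. Returns the number of complete requests read,
 * or -1 on error. */
static int read_request_batch(int fd, char bufs[][REQ_SIZE], int nbufs)
{
    struct iovec iov[MAX_BATCH];
    ssize_t got;
    int i;

    assert(nbufs > 0);
    if (nbufs > MAX_BATCH)
        nbufs = MAX_BATCH;
    for (i = 0; i < nbufs; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = REQ_SIZE;
    }
    got = readv(fd, iov, nbufs);
    if (got < 0)
        return -1;
    return (int)(got / REQ_SIZE);
}

/* Queue a whole batch under one lock acquisition and signal once,
 * instead of signaling once per request. Returns the number queued. */
static size_t cq_push_batch(struct completion_queue *q, void **reqs, size_t n)
{
    size_t added = 0;

    pthread_mutex_lock(&q->lock);
    while (added < n && q->count < MAX_COMPLETED)
        q->items[q->count++] = reqs[added++];
    if (added > 0) {
        pthread_cond_signal(&q->nonempty);
        q->signals++;
    }
    pthread_mutex_unlock(&q->lock);
    return added;
}

/* Consumer side: block until work arrives, then drain in FIFO order. */
static size_t cq_drain(struct completion_queue *q, void **out, size_t max)
{
    size_t n;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    n = q->count < max ? q->count : max;
    memcpy(out, q->items, n * sizeof(void *));
    memmove(q->items, q->items + n, (q->count - n) * sizeof(void *));
    q->count -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With something like this, the device thread would call read_request_batch() once and hand the result to cq_push_batch(), so the main thread is woken once per batch rather than once per request. The signals counter is only there to compare against the per-request scheme; whether the difference is measurable is exactly the open question.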
> Also, it looks like we do a malloc for each new request buffer, and
> then a free once we're done with it, and a memset of the info struct.
> It seems like we could manage the buffers on the stack instead of the
> heap, and save on a few system calls there.
Now we are definitely in the noise region.. :) just kidding. glibc's
malloc implementation should typically amortize the overhead of the
underlying system calls (sbrk etc.), so most malloc/free calls never
enter the kernel at all.
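If the per-request malloc/free ever did show up in a profile, one alternative to stack buffers would be a small preallocated pool that the device thread reuses. The sketch below is only an illustration under assumptions: struct buf_pool, POOL_SLOTS, SLOT_SIZE, and the pool_* functions are hypothetical, and the pool is not thread-safe, so it would need a lock if buffers are released from another thread.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 5      /* hypothetical: one slot per request in a batch */
#define SLOT_SIZE  4096   /* hypothetical maximum request size */

/* Fixed set of buffers plus a stack of free slot indices. */
struct buf_pool {
    char slots[POOL_SLOTS][SLOT_SIZE];
    int  free_idx[POOL_SLOTS];
    int  nfree;
};

static void pool_init(struct buf_pool *p)
{
    int i;

    for (i = 0; i < POOL_SLOTS; i++)
        p->free_idx[i] = i;
    p->nfree = POOL_SLOTS;
}

/* Take a buffer from the pool; returns NULL when all slots are in use. */
static char *pool_get(struct buf_pool *p)
{
    if (p->nfree == 0)
        return NULL;
    return p->slots[p->free_idx[--p->nfree]];
}

/* Return a buffer previously obtained from pool_get(). */
static void pool_put(struct buf_pool *p, char *buf)
{
    ptrdiff_t off = buf - &p->slots[0][0];

    /* The pointer must be the start of one of this pool's slots. */
    assert(off >= 0 && off % SLOT_SIZE == 0 && off / SLOT_SIZE < POOL_SLOTS);
    assert(p->nfree < POOL_SLOTS);
    p->free_idx[p->nfree++] = (int)(off / SLOT_SIZE);
}
```

A caller that gets NULL from pool_get() could fall back to malloc, so the pool only covers the common case.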
> For both threaded and nonthreaded, with the workload that Dean is
> using, he found that the PINT_dev_test_unexpected always returned 5
> requests in the outcount. So it looks like there are always requests
> sitting on the device, waiting to be read by client-core. Are we just
> not able to process requests fast enough through BMI and the state
> machines, or is the cost of polling and signaling every time we read a
> request off the device slowing us down? In other words, does it make
> sense to rework the code a little bit or will we just get bottlenecked
> elsewhere?
It is definitely interesting to try all this out, but I am not sure if
the bottlenecks are here or elsewhere.
What does this workload do by the way?
thanks,
Murali