[Pvfs2-developers] fix BMI multiplexing of multiple methods
Sam Lang
slang at mcs.anl.gov
Thu Jan 8 19:18:29 EST 2009
On Jan 8, 2009, at 4:19 PM, Pete Wyckoff wrote:
> slang at mcs.anl.gov wrote on Wed, 07 Jan 2009 16:06 -0600:
>> Right now if multiple methods are enabled in BMI, we tend to get poor
>> performance from the "fast" network, because BMI_testcontext iterates
>> through all the active methods calling testcontext for each one. It
>> tries to be smart about which methods get scheduled ;-) to prevent
>> starvation, but it treats all the methods fairly, which tends to make
>> tcp (the slow one) hog the time spent in testcontext. I have a few
>> ideas for this, so I'll go ahead and propose them and let you all
>> shoot them down or propose others.
>
> I've always been fond of a third Option: CENTRALIZED_POLLING. All
> BMI methods are changed to hand back an fd to some core BMI routine.
> Individual BMI methods do not poll their devices. The core BMI
> routine sticks all fds in a single select() or epoll(), and when it
> gets one that triggers, calls back into the appropriate BMI method
> to do its business. No need to balance across all the methods.
>
> This can work today with TCP obviously, and IB with some minor
> manipulation. GM cannot fit in such a framework, I believe, being
> completely poll-driven. MX should work however, I think. If a
> method wants to poll for a bit after getting an fd trigger, it can
> get away with that.
>
> This is how pretty much all externally driven server applications
> work today. Lots of threads are still not as good a way to manage
> concurrency.

Hi Pete,

Good to hear from you. I think I understand what you're describing,
but I want to make sure. It probably seems like I'm parroting what
you just told me back to you, sorry about that.

Each method that doesn't already use file descriptors (as tcp does)
creates a pipe, and hands back one end of the pipe to the BMI generic
code. The method then registers a callback with the underlying
networking API, which writes to its end of the pipe (an operation id
or something). The BMI generic code maps the fds that changed back to
their methods, and calls each method's completion call in turn. Is
that the idea?
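
The pipe scheme described above can be sketched roughly as follows. This is a minimal illustration in Python rather than BMI's C, assuming a POSIX system where select() works on pipes. The names `Method`, `network_callback`, `complete`, and `central_poll` are hypothetical and are not part of BMI.

```python
import os
import select


class Method:
    """Hypothetical stand-in for a BMI method that has no fd of its own."""

    def __init__(self, name):
        self.name = name
        # The read end goes to the generic code; the method keeps the write end.
        self.read_fd, self.write_fd = os.pipe()

    def network_callback(self, op_id):
        # Invoked by the (imaginary) networking API when an operation finishes.
        # One byte per operation id keeps the sketch simple (ids 0..255 only).
        os.write(self.write_fd, bytes([op_id]))

    def complete(self):
        # Completion call: drain the ids currently waiting in the pipe.
        return list(os.read(self.read_fd, 4096))


def central_poll(methods, timeout):
    """Wait on every method's fd at once; call back only into ready methods."""
    by_fd = {m.read_fd: m for m in methods}
    ready, _, _ = select.select(list(by_fd), [], [], timeout)
    return {by_fd[fd].name: by_fd[fd].complete() for fd in ready}
```

A method with nothing pending is never called at all, so no per-method timeout is spent on it.
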
For methods like GM that can't asynchronously notify via a callback, a
separate thread would have to poll and write to its pipe on changes.
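
For a poll-only method, the helper thread might look something like the sketch below. It is only an illustration in Python: `poller`, `make_fake_device`, and the 1 ms sleep are assumptions made for the example, and the fake device merely stands in for a poll-driven API such as GM.

```python
import collections
import os
import threading
import time


def make_fake_device(op_ids):
    """Hypothetical poll-only device: returns a finished op id, or None."""
    pending = collections.deque(op_ids)

    def poll_device():
        return pending.popleft() if pending else None

    return poll_device


def poller(poll_device, write_fd, stop, interval=0.001):
    """Poll the device in a loop and turn each completion into a pipe write."""
    while not stop.is_set():
        op_id = poll_device()
        if op_id is not None:
            # Wake the generic code by making the pipe's read end readable.
            os.write(write_fd, bytes([op_id]))
        else:
            # Nothing finished; back off briefly instead of spinning.
            time.sleep(interval)
```

The generic code then sees this method through the read end of the pipe, exactly like any other fd, at the cost of one extra thread per poll-only method.
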

This does solve the problem of testing a method when nothing is ready,
so I no longer wait needlessly up to the timeout for that method. But
what if two methods both have work to be done? Let's say I poll in the
BMI generic code and discover that work for both tcp and ib can be
done, so I first call the completion call for tcp, and then the
completion call for ib. The completed ib operations are still held in
the completion list while the tcp method is doing its work, and don't
get returned to the job layer (or flow) until the tcp completion call
returns. The callback idea attempts to address this, as completed
operations are reported via callback pretty much right away.
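
The ordering concern can be shown with a toy model. The costs below are made-up time units, not measurements, and `polled_delivery` and `callback_delivery` are hypothetical names. The callback case assumes each method's callback fires independently of the other methods.

```python
def polled_delivery(call_order):
    """call_order: list of (method, cost, ops) as the generic code visits them."""
    clock = 0
    delivered = {}
    for _method, cost, ops in call_order:
        # Each completion call must return before the next method is visited.
        clock += cost
        for op in ops:
            delivered[op] = clock
    return delivered


def callback_delivery(call_order):
    """Assumed callback model: an op is handed up after its own method's cost."""
    return {op: cost for _method, cost, ops in call_order for op in ops}


# Made-up example: tcp's completion call takes 5 units, ib's takes 1.
order = [("tcp", 5, ["tcp-op"]), ("ib", 1, ["ib-op"])]
polled = polled_delivery(order)
via_callback = callback_delivery(order)
```

In this model the ib operation reaches the upper layer at time 6 when polled after tcp, but at time 1 when reported by callback.
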
-sam
>
>
> Tweaking construct_poll_plan() is doomed to fail. I've tried.
> Maybe I misunderstood your CALLBACK option and you're thinking like
> this too.
>
> -- Pete