[Pvfs2-developers] fix BMI multiplexing of multiple methods

Sam Lang slang at mcs.anl.gov
Thu Jan 8 12:34:24 EST 2009


On Jan 8, 2009, at 10:48 AM, Rob Ross wrote:

> Hey,
>
> For the CALLBACK option, you would use that to have the individual  
> methods filling in things at the "generic BMI" layer (for lack of  
> the right terminology), but the overall user API would be the same?

I was thinking that the callback would get passed all the way down to  
the method, and it would call the callback on completion of an  
operation.  We could keep the callback at the generic BMI level, and  
call the callback for completed operations on return from a method's  
testcontext.  That does still avoid the multiplexing issues we see at  
present, and it's less of a change to the BMI code overall, so maybe  
that's the way to go.
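
To make that second variant concrete, here is a rough sketch of a generic
layer that fires a callback for each completed operation as soon as a
method's testcontext returns.  Every name in it (cb_context,
generic_testcontext, the toy_* methods) is made up for illustration and is
not taken from the BMI sources; the real method interface takes more
arguments than this.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none are taken from the BMI headers. */
typedef long op_id;
typedef void (*completion_cb)(op_id id, int error_code, void *user_arg);

/* Stand-in for one method's testcontext: writes up to max completed
 * operations into ids/errors and returns how many it wrote. */
typedef int (*method_test_fn)(op_id *ids, int *errors, int max, int timeout_ms);

struct cb_context {
    completion_cb cb;   /* invoked once per completed operation */
    void *user_arg;     /* passed back to cb untouched */
};

/* Generic layer: poll each method once and fire the callback for every
 * operation that method reports, before moving on to the next method.
 * Returns the total number of completions delivered. */
int generic_testcontext(const struct cb_context *ctx,
                        const method_test_fn *methods, int nmethods,
                        int timeout_ms)
{
    enum { BATCH = 16 };
    op_id ids[BATCH];
    int errors[BATCH];
    int total = 0;

    assert(ctx != NULL && ctx->cb != NULL);
    for (int m = 0; m < nmethods; m++) {
        int n = methods[m](ids, errors, BATCH, timeout_ms);
        for (int i = 0; i < n; i++)
            ctx->cb(ids[i], errors[i], ctx->user_arg);
        if (n > 0)
            total += n;
    }
    return total;
}

/* Toy methods and callback so the sketch can be exercised on its own. */
int toy_fast_test(op_id *ids, int *errors, int max, int timeout_ms)
{
    (void)timeout_ms;
    if (max < 2) return 0;
    ids[0] = 1; errors[0] = 0;
    ids[1] = 2; errors[1] = 0;
    return 2;
}

int toy_tcp_test(op_id *ids, int *errors, int max, int timeout_ms)
{
    (void)timeout_ms;
    if (max < 1) return 0;
    ids[0] = 100; errors[0] = 0;
    return 1;
}

void count_completion(op_id id, int error_code, void *user_arg)
{
    (void)id; (void)error_code;
    ++*(int *)user_arg;   /* user_arg points at an int counter */
}
```

The point of the shape is that the fast method's completions reach the
consumer before the next method is polled at all, rather than after the
whole loop has finished.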

The user API would have to change fairly significantly for callbacks,  
because if a "callback context" were specified, completion would be  
notified via the callback instead of as a list of completed  
operations.  For example, in our job code, instead of copying  
completed BMI operations to the job completion list with each call to  
BMI_testcontext, we would copy completed BMI operations to the job  
completion list whenever the callback was called.  This still doesn't  
fix the issue for our metadata operations though, because completed  
operations are just going to sit in the job completion queue while  
we're calling BMI_testcontext (it still takes just as long to iterate  
through all the methods).  So we would need to modify the job  
interfaces to take callbacks as well, and define a callback that  
starts up the associated state machine.  For our metadata operations,  
this ends up being fairly invasive.

For I/O, it actually ends up being a win, because flow already uses  
callbacks to bounce between BMI and trove operations.

A potential drawback to the callback idea is that synchronization  
occurs on a per-operation basis, instead of for potentially many  
operations.  A way around that would be to require a callback that  
could take many completed operations instead of just one, although I  
don't know if mutex locks are a real bottleneck for us anymore.
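
The difference between the two callback shapes, sketched against a
made-up job completion queue (job_queue and both on_complete_* functions
are hypothetical names, not the real job interfaces).  The
lock_acquisitions counter exists only to make the cost visible.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_MAX 64

/* Hypothetical job completion queue guarded by a single mutex. */
struct job_queue {
    pthread_mutex_t lock;
    long ids[QUEUE_MAX];
    int count;
    int lock_acquisitions;   /* instrumentation for this example only */
};

void job_queue_init(struct job_queue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    q->count = 0;
    q->lock_acquisitions = 0;
}

/* Per-operation callback: one lock/unlock pair for every completed op. */
void on_complete_one(long id, void *arg)
{
    struct job_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    q->lock_acquisitions++;
    if (q->count < QUEUE_MAX)
        q->ids[q->count++] = id;
    pthread_mutex_unlock(&q->lock);
}

/* Batched callback: one lock/unlock pair covers the whole batch.
 * Operations that do not fit in the queue are dropped here; a real
 * queue would have to handle that case. */
void on_complete_many(const long *ids, int n, void *arg)
{
    struct job_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    q->lock_acquisitions++;
    for (int i = 0; i < n && q->count < QUEUE_MAX; i++)
        q->ids[q->count++] = ids[i];
    pthread_mutex_unlock(&q->lock);
}
```

Delivering four completions through on_complete_one takes the lock four
times; delivering the same four through on_complete_many takes it once.
Whether that matters in practice is exactly the open question above.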

>
>
> I don't think that the CONTEXT option is appropriate. I don't want  
> to expose the specifics of the underlying networks any more than we  
> have already.
>
> There should be relevant research in the MPI space related to the  
> POLL_PLAN option.
>
> Do we consider this to be a problem for both clients and servers, or  
> is it really a server-specific issue? If this is something we think  
> will be solely (or mostly) a server thing, we could consider throwing a  
> thread at the issue. One option might be to kick off a thread to  
> wait on the TCP side of things, since the kernel is doing most of  
> the work for us anyway, and put completed TCP events into the  
> completion list asynchronously (for servers only)?

I think the problem has been raised only on clients, but it exists on  
both the server and clients.

Maybe I'm just missing some details, but I don't think a tcp thread  
will help us, or it at least needs to be combined with the POLL_PLAN  
or CALLBACK option.  The tcp testcontext call will sleep (epoll_wait)  
up to the timeout passed in if there are no completed operations and no  
work to be done.  With a thread, we would just have tcp testcontext  
return immediately even if nothing was in the completion list.  But  
that means that a tcp-only scenario will cause the BMI_testcontext  
calls on the client to spin and peg the cpu.  We could add in a  
condition variable, but then we're right back where we started.  I  
think an appropriate POLL_PLAN option could adjust timeouts to the tcp  
testcontext call, but it requires a lot more smarts in the code to get  
that right in general, whereas the callback option just allows you to  
get completion right away.
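
For what the timeout adjustment might look like, here is a deliberately
naive sketch.  The method_state struct and plan_timeouts function are
invented for this example and are not what construct_poll_plan does
today.  The rule is: a method whose testcontext sleeps gets a zero
timeout whenever some other method has work pending, and the full
timeout otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-method state; not taken from the BMI sources. */
struct method_state {
    const char *name;
    int pending_ops;     /* operations posted but not yet completed */
    int blocks_in_test;  /* nonzero if testcontext sleeps, e.g. tcp in epoll_wait */
};

/* Fill timeouts[i] with the timeout to hand to method i's testcontext.
 * A sleeping method only gets to sleep when no other method is busy. */
void plan_timeouts(const struct method_state *m, int n,
                   int requested_ms, int *timeouts)
{
    assert(m != NULL && timeouts != NULL);
    for (int i = 0; i < n; i++) {
        int others_busy = 0;

        for (int j = 0; j < n; j++) {
            if (j != i && m[j].pending_ops > 0)
                others_busy = 1;
        }
        timeouts[i] = (m[i].blocks_in_test && others_busy) ? 0 : requested_ms;
    }
}
```

In a tcp-only scenario tcp keeps its full timeout, so the client does
not spin.  With two sleeping methods that both have pending work, though,
each gets a zero timeout and we are spinning again, which is the kind of
case that needs the extra smarts.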

-sam

>
>
> Rob
>
> On Jan 7, 2009, at 4:06 PM, Sam Lang wrote:
>
>>
>> Hi All,
>>
>> Right now if multiple methods are enabled in BMI, we tend to get  
>> poor performance from the "fast" network, because BMI_testcontext  
>> iterates through all the active methods calling testcontext for  
>> each one.  It tries to be smart about which methods get  
>> scheduled ;-) to prevent starvation, but it treats all the methods  
>> fairly, which tends to make tcp (the slow one) hog the time spent  
>> in testcontext.  I have a few ideas for this, so I'll go ahead and  
>> propose them and let you all shoot them down or propose others.
>>
>> Option CALLBACK:  Instead of returning completion as a list in  
>> testcontext, we allow a BMI context to be constructed with a  
>> callback, and on completion of operations, the callback is called.   
>> This allows each method to drive its own operations, and notify the  
>> consumer of completion immediately.  There would still need to be a  
>> testcontext call for methods that only service operations during  
>> that call.  The changes might not be that significant: the  
>> BMI_open_context call could just take an extra parameter that was  
>> the callback function.  If the parameter is null, we just use the  
>> completion list as before.
>>
>> Option CONTEXT:  Require separate contexts for separate methods.   
>> This pushes the problem up to the application, probably not where  
>> it belongs, since active methods are opaque from the BMI api.
>>
>> Option POLL_PLAN:  Modify the construct_poll_plan function in bmi  
>> that already tries to be fair, so that it's aware of the performance  
>> discrepancy between methods.  Maybe it can just skip tcp every  
>> other time for example.  This is probably the easiest, since it  
>> doesn't require API changes and the like.
>>
>> -sam


