[Pvfs2-developers] help debugging request processor/distribution

Rob Ross rross at mcs.anl.gov
Tue Jun 13 18:47:41 EDT 2006


There's a fundamental issue here that I don't quite get: if we're in 
rendezvous mode, why is there data on the wire if we aren't ready to 
receive it? The whole point of rendezvous mode is to *not* send the data 
until the matching receive has been posted.

What am I missing?

Thanks,

Rob

Phil Carns wrote:
> Ok, I think I _might_ see what the problem is with the BMI messaging.
> 
> I haven't 100% confirmed yet, but it looks like we have the following 
> scenario:
> 
> On the client side:
> --------------------
> - pvfs2-client-core starts an I/O operation (write) to server X
>   - a send (for the request) is posted, which is a small buffer
>   - the flow is posted before an ack is received
>   - the flow itself posts another send for data, which is a large buffer
>   - ...
> 
> A quick note: I think the above is a performance optimization; 
> we try to go ahead and get the flow going before receiving a positive 
> ack from the server.  It will be canceled if we get a negative ack (or 
> fail to get an ack altogether).
> 
> - while the above is in progress, pvfs2-client-core starts another write 
> operation to server X (from another application that is hitting the same 
> server)
>   - a send for this second request is posted
>   - another flow is posted before an ack is received
>   - depending on the timing, it may manage to post a send for data as 
> well, which is another large buffer
>   - this traffic is interleaved on the same socket as is being used for 
> the first flow, which is still running at this point
> 
> On the server side:
> --------------------
> - the first I/O request arrives
>   - it gets past the request scheduler
>   - a flow is started and receives the first (large) data buffer
> - a different request for the same handle arrives
>   - getattr would be a good example, could be from any client
>   - this getattr gets queued in the request scheduler behind the write
> - the second I/O request arrives
>   - it gets queued behind the getattr in the request scheduler
> 
> At this point on the server side, we have a flow in progress that is 
> waiting on a data buffer.  However, the next message is for a different 
> flow (the tags don't match).  Since this message is relatively large 
> (256K), it is in rendezvous mode within bmi_tcp and cannot be pulled out 
> of the socket until a matching receive is posted.  The flow that is 
> expected to post that receive is not running yet because the second I/O 
> request is stuck in the scheduler.
> 
> ... so we have a deadlock.  The socket is filled with data that the 
> server isn't allowed to recv yet, and the data that it really needs next 
> is stuck behind it.
> 
> I'm not sure that I described that all that well.  At a high level we 
> have two flows sharing the same socket.  The client started both of them 
> and the messages got interleaved.  The server only started one of them, 
> but is now stuck because it can't deal with data arriving for the second 
> one.
> 
> I am going to try to find a brute force way to serialize I/O from each 
> pvfs2-client-core just to see if that solves the problem (maybe only 
> allowing one buffer between pvfs2-client-core and kernel, rather than 
> 5).  If that does look like it fixes the problem, then we need a more 
> elegant solution: maybe waiting for acks before starting flows, or 
> just somehow serializing flows that share sockets.
> 
> -Phil
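
The scenario Phil describes can be reproduced with a small toy model. The
sketch below is not PVFS2 or BMI code: `drain`, `EAGER_LIMIT`, and the integer
tags are hypothetical names invented for illustration, and the 16K threshold
is an assumed value (the thread only says that a 256K message is in rendezvous
mode). It treats the shared socket as a FIFO, lets the server pull a large
message only when its tag matches the flow that is currently running, and
starts the next flow only after the running one has received all of its
buffers.

```python
from collections import deque

BUF = 256 * 1024          # data buffer size mentioned in the thread
EAGER_LIMIT = 16 * 1024   # ASSUMED threshold; larger messages need a posted receive


def drain(wire, buffers_per_flow, scheduler_order):
    """Toy model of one server reading one shared socket.

    wire: (tag, size) messages in the order the client put them on the socket.
    scheduler_order: flow tags in the order the request scheduler releases them;
        a flow starts only after the previous one has all of its buffers.
    Returns (delivered, stuck), where stuck is whatever remains in the socket.
    """
    sock = deque(wire)
    pending = deque(scheduler_order)
    active = pending.popleft()        # the only flow with a receive posted
    remaining = buffers_per_flow
    delivered = []
    while sock:
        tag, size = sock[0]
        if size > EAGER_LIMIT and tag != active:
            # Rendezvous message with no matching receive: it cannot be pulled
            # out, and everything behind it on the socket is unreachable.
            return delivered, list(sock)
        sock.popleft()
        delivered.append((tag, size))
        if tag == active:
            remaining -= 1
            if remaining == 0:        # flow finished; scheduler releases the next
                active = pending.popleft() if pending else None
                remaining = buffers_per_flow
    return delivered, []


# Client interleaves two flows on the socket (the problem case).
delivered_i, stuck_i = drain([(1, BUF), (2, BUF), (1, BUF), (2, BUF)], 2, [1, 2])

# Client serializes flows that share the socket (one of the proposed fixes).
delivered_s, stuck_s = drain([(1, BUF), (1, BUF), (2, BUF), (2, BUF)], 2, [1, 2])

print("interleaved: delivered", len(delivered_i), "stuck", len(stuck_i))
print("serialized:  delivered", len(delivered_s), "stuck", len(stuck_s))
```

With the interleaved ordering the server takes one buffer for flow 1 and then
stalls with flow 2's buffer at the head of the socket, while the buffer flow 1
actually needs sits behind it. With the serialized ordering all four buffers
are delivered.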

