[Pvfs2-developers] help debugging request processor/distribution

Phil Carns pcarns at wastedcycles.org
Wed Jun 14 09:02:10 EDT 2006


Sorry- "rendezvous" is the wrong terminology here for what is happening 
within bmi_tcp at the individual message level.  It doesn't implicitly 
exchange control messages before putting each buffer on the wire.

bmi_tcp will send any size message without using control messages to 
handshake within bmi_tcp.  It is making the assumption that someone at a 
higher level has already performed handshaking and agreed that both 
sides are going to post the appropriate matching operations.

Unexpected messages are the only things BMI is permitted to send without 
a guarantee that the other side is going to post a recv.

The difference between small and large "normal" messages in bmi_tcp is 
not that larger ones will wait to transmit.   Both are basically sent 
the same way.  The difference is on the recv side.  Small messages are 
allowed to be temporarily buffered by the receiver until a matching recv 
is posted, while large messages will not be read into memory until a 
matching receive buffer is posted.  So the actual network transfer will 
not _complete_ until both sides have posted, but it can definitely begin 
before the recv is posted.
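
As a rough illustration of the receive-side rule above (a toy model, not 
bmi_tcp code), the sketch below treats one socket as a FIFO of (tag, size) 
messages.  The names ToyReceiver and SMALL_LIMIT are made up for this 
example, and the 16K cutoff is an assumption; the real threshold is not 
stated in this thread.

```python
from collections import deque

# Hypothetical cutoff between "small" and "large" messages; the thread does
# not give the real bmi_tcp value.
SMALL_LIMIT = 16 * 1024


class ToyReceiver:
    """Toy model of the receive side of a single socket."""

    def __init__(self):
        self.socket = deque()   # (tag, size) messages in arrival order
        self.posted = set()     # tags that have a receive posted
        self.buffered = []      # small messages held until a recv is posted
        self.delivered = []     # messages matched to a posted recv

    def arrive(self, tag, size):
        self.socket.append((tag, size))

    def post_recv(self, tag):
        self.posted.add(tag)

    def progress(self):
        # Small messages buffered earlier complete once their recv is posted.
        for msg in list(self.buffered):
            if msg[0] in self.posted:
                self.posted.discard(msg[0])
                self.buffered.remove(msg)
                self.delivered.append(msg)
        # The socket can only be read strictly in arrival order.
        while self.socket:
            tag, size = self.socket[0]
            if tag in self.posted:
                self.posted.discard(tag)
                self.delivered.append(self.socket.popleft())
            elif size <= SMALL_LIMIT:
                # Small and unmatched: buffer it temporarily.
                self.buffered.append(self.socket.popleft())
            else:
                # Large and unmatched: it stays in the socket, and so does
                # everything queued behind it.
                break
```

In this model a small message arriving ahead of its recv is pulled off the 
socket and held, while a large one sits at the head of the socket until 
post_recv() is called with its tag.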

Flows have the same sort of restriction as large buffers do in bmi_tcp. 
  When a flow is posted, it does not do any handshaking to make sure 
both sides are ready to transmit before moving data.  It assumes that 
the request protocol has already sorted that out, so it just starts 
transmitting.

I think that is what's leading to the problem here- the client has been 
told to proceed with the flow before waiting to make sure that the 
server is also ready to transmit a flow.
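
The stall can be sketched with a small simulation (a simplified model, not 
PVFS2 code).  It assumes every data buffer is large, that the server runs 
one flow at a time in scheduler order, and that a large buffer can only be 
read when it belongs to the flow currently running.  The function name 
server_drains and its parameters are hypothetical.

```python
from collections import deque


def server_drains(wire, buffers_per_flow, scheduler_order):
    """Return True if every flow completes, False if the server stalls.

    wire: flow ids, one entry per large data buffer, in socket order.
    buffers_per_flow: dict mapping flow id -> number of buffers it expects.
    scheduler_order: flow ids in the order the request scheduler starts
        them; in this model a flow starts only after the previous one ends.
    """
    socket = deque(wire)
    remaining = dict(buffers_per_flow)
    for flow in scheduler_order:
        while remaining[flow] > 0:
            # Only the running flow has a receive posted.  If the buffer at
            # the head of the socket belongs to another flow, it cannot be
            # read, and the data we need is stuck behind it.
            if not socket or socket[0] != flow:
                return False
            socket.popleft()
            remaining[flow] -= 1
    return True


# Client started both flows and their buffers interleaved on one socket.
interleaved = [1, 2, 1, 2]
# Flow 2's data is sent only after flow 1's data.
serialized = [1, 1, 2, 2]
```

With two buffers per flow and scheduler order [1, 2], the interleaved order 
stalls in this model and the serialized order completes.  The serialized 
order is roughly what waiting for acks, or serializing flows that share a 
socket, would be expected to produce.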

Once upon a time sys-io.sm did wait for write acks before starting write 
flows, but that was changed at some point to try to improve performance. 
  We just didn't notice the case that it breaks for until now.

-Phil

Rob Ross wrote:
> There's a fundamental issue here that I don't quite get: if we're in 
> rendezvous mode, why is there data on the wire if we aren't ready to 
> receive it? The whole point of rendezvous mode is to *not* send the data 
> until the matching receive has been posted.
> 
> What am I missing?
> 
> Thanks,
> 
> Rob
> 
> Phil Carns wrote:
> 
>> Ok, I think I _might_ see what the problem is with the BMI messaging.
>>
>> I haven't 100% confirmed yet, but it looks like we have the following 
>> scenario:
>>
>> On the client side:
>> --------------------
>> - pvfs2-client-core starts an I/O operation (write) to server X
>>   - a send (for the request) is posted, which is a small buffer
>>   - the flow is posted before an ack is received
>>   - the flow itself posts another send for data, which is a large buffer
>>   - ...
>>
>> A few notes real quick- I think the above is a performance 
>> optimization; we try to go ahead and get the flow going before 
>> receiving a positive ack from the server.  It will be canceled if we 
>> get a negative ack (or fail to get an ack altogether)
>>
>> - while the above is in progress, pvfs2-client-core starts another 
>> write operation to server X (from another application that is hitting 
>> the same server)
>>   - a send for this second request is posted
>>   - another flow is posted before an ack is received
>>   - depending on the timing, it may manage to post a send for data as 
>> well, which is another large buffer
>>   - this traffic is interleaved on the same socket as is being used 
>> for the first flow, which is still running at this point
>>
>> On the server side:
>> --------------------
>> - the first I/O request arrives
>>   - it gets past the request scheduler
>>   - a flow is started and receives the first (large) data buffer
>> - a different request for the same handle arrives
>>   - getattr would be a good example, could be from any client
>>   - this getattr gets queued in the request scheduler behind the write
>> - the second I/O request arrives
>>   - it gets queued behind the getattr in the request scheduler
>>
>> At this point on the server side, we have a flow in progress that is 
>> waiting on a data buffer.  However, the next message is for a 
>> different flow (the tags don't match).  Since this message is 
>> relatively large (256K), it is in rendezvous mode within bmi_tcp and 
>> cannot be pulled out of the socket until a matching receive is 
>> posted.  The flow that is expected to post that receive is not running 
>> yet because the second I/O request is stuck in the scheduler.
>>
>> ... so we have a deadlock.  The socket is filled with data that the 
>> server isn't allowed to recv yet, and the data that it really needs 
>> next is stuck behind it.
>>
>> I'm not sure that I described that all that well.  At a high level we 
>> have two flows sharing the same socket.  The client started both of 
>> them and the messages got interleaved.  The server only started one of 
>> them, but is now stuck because it can't deal with data arriving for 
>> the second one.
>>
>> I am going to try to find a brute force way to serialize I/O from each 
>> pvfs2-client-core just to see if that solves the problem (maybe only 
>> allowing one buffer between pvfs2-client-core and kernel, rather than 
>> 5).  If that does look like it fixed the problem, then we need a more 
>> elegant solution.   Maybe waiting for acks before starting flows, or 
>> just somehow serializing flows that share sockets.
>>
>> -Phil


