[Pvfs2-developers] help debugging request processor/distribution

Phil Carns pcarns at wastedcycles.org
Wed Jun 14 09:21:41 EDT 2006


If we did this within BMI, we would be paying an extra round trip 
latency time for each large TCP message, which we should probably try to 
avoid.

I vote for just changing the ordering of sys-io.sm so that it does not 
post write flows until a positive write ack is received from the server. 
  That is basically equivalent to performing one handshake (or 
rendezvous) for the whole flow rather than one per BMI message.

The sys-io.sm state machine already does that for reads.  Read flows do 
not get posted until an ack is received from the server.
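
To make the ordering argument concrete, here is a toy model of one client socket carrying two writes. It is only a sketch and not PVFS2 code: `simulate`, `wire`, `scheduler` and the other names are hypothetical, each write has two data buffers, and the server scheduler is simplified so that the second write only becomes active once the first flow has finished (standing in for the getattr queued between them).

```python
from collections import deque


def simulate(wait_for_ack, writes=("w1", "w2"), buffers=2):
    """Toy model: two write flows sharing one FIFO socket to one server."""
    # The client posts every request up front; requests are small messages
    # that the server can always pull off the socket.
    wire = deque(("req", w) for w in writes)
    to_send = dict.fromkeys(writes, buffers)  # data buffers not yet on the wire
    to_recv = dict.fromkeys(writes, buffers)  # data buffers the server still needs
    acked = set()        # writes for which the client holds a positive ack
    scheduler = deque()  # requests queued on the server (assumed strictly serial)
    active = None        # write whose flow the server is currently running

    while any(to_recv.values()):
        progress = False

        # Client side: each eligible flow puts one data buffer on the wire.
        for w in writes:
            if to_send[w] and (w in acked or not wait_for_ack):
                wire.append(("data", w))
                to_send[w] -= 1
                progress = True

        # Server side: consume from the head of the socket while allowed.
        while wire:
            kind, w = wire[0]
            if kind == "data" and w != active:
                break  # large message with no matching receive posted
            wire.popleft()
            progress = True
            if kind == "req":
                scheduler.append(w)
            else:
                to_recv[w] -= 1
                if to_recv[w] == 0:
                    active = None  # flow finished
            if active is None and scheduler:
                active = scheduler.popleft()
                acked.add(active)  # positive ack goes back to the client

        if not progress:
            return "deadlock"
    return "completed"
```

Under these assumptions, `simulate(wait_for_ack=False)` returns "deadlock" because data for the second write lands on the socket ahead of the first flow's remaining buffer, while `simulate(wait_for_ack=True)` returns "completed".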

-Phil

Rob Ross wrote:
> Ok. Well we screwed up here. We've either got to be able to pull that 
> data off the wire (presumably at the BMI layer) or we've got to ACK for 
> large messages (either in BMI or flow or elsewhere).
> 
> Suggestions on which approach to take and where to implement? Probably 
> most straightforward to do this in BMI, but likely not the most efficient.
> 
> Rob
> 
> Phil Carns wrote:
> 
>> Sorry- "rendezvous" is the wrong terminology here for what is happening 
>> within bmi_tcp at the individual message level.  It doesn't implicitly 
>> exchange control messages before putting each buffer on the wire.
>>
>> bmi_tcp will send any size message without using control messages to 
>> handshake within bmi_tcp.  It is making the assumption that someone at 
>> a higher level has already performed handshaking and agreed that both 
>> sides are going to post the appropriate matching operations.
>>
>> Unexpected messages are the only things BMI is permitted to send 
>> without a guarantee that the other side is going to post a recv.
>>
>> The difference between small and large "normal" messages in bmi_tcp 
>> is not that larger ones will wait to transmit.   Both are basically 
>> sent the same way.  The difference is on the recv side.  Small 
>> messages are allowed to be temporarily buffered by the receiver until 
>> a matching recv is posted, while large messages will not be read into 
>> memory until a matching receive buffer is posted.  So the actual 
>> network transfer will not _complete_ until both sides have posted, but 
>> it can definitely begin before the recv is posted.
>>
>> Flows have the same sort of restriction as large buffers do in 
>> bmi_tcp.  When a flow is posted, it does not do any hand shaking to 
>> make sure both sides are ready to transmit before moving data.  It 
>> assumes that the request protocol has already sorted that out, so it 
>> just starts transmitting.
>>
>> I think that is what's leading to the problem here- the client has 
>> been told to proceed with the flow before waiting to make sure that 
>> the server is also ready to transmit a flow.
>>
>> Once upon a time sys-io.sm did wait for write acks before starting 
>> write flows, but that was changed at some point to try to improve 
>> performance.  We just didn't notice the case that it breaks for until 
>> now.
>>
>> -Phil
>>
>> Rob Ross wrote:
>>
>>> There's a fundamental issue here that I don't quite get: if we're in 
>>> rendezvous mode, why is there data on the wire if we aren't ready to 
>>> receive it? The whole point of rendezvous mode is to *not* send the 
>>> data until the matching receive has been posted.
>>>
>>> What am I missing?
>>>
>>> Thanks,
>>>
>>> Rob
>>>
>>> Phil Carns wrote:
>>>
>>>> Ok, I think I _might_ see what the problem is with the BMI messaging.
>>>>
>>>> I haven't 100% confirmed yet, but it looks like we have the 
>>>> following scenario:
>>>>
>>>> On the client side:
>>>> --------------------
>>>> - pvfs2-client-core starts an I/O operation (write) to server X
>>>>   - a send (for the request) is posted, which is a small buffer
>>>>   - the flow is posted before an ack is received
>>>>   - the flow itself posts another send for data, which is a large 
>>>> buffer
>>>>   - ...
>>>>
>>>> A few notes real quick- I think the above is a performance 
>>>> optimization; we try to go ahead and get the flow going before 
>>>> receiving a positive ack from the server.  It will be canceled if we 
>>>> get a negative ack (or fail to get an ack altogether)
>>>>
>>>> - while the above is in progress, pvfs2-client-core starts another 
>>>> write operation to server X (from another application that is 
>>>> hitting the same server)
>>>>   - a send for this second request is posted
>>>>   - another flow is posted before an ack is received
>>>>   - depending on the timing, it may manage to post a send for data 
>>>> as well, which is another large buffer
>>>>   - this traffic is interleaved on the same socket as is being used 
>>>> for the first flow, which is still running at this point
>>>>
>>>> On the server side:
>>>> --------------------
>>>> - the first I/O request arrives
>>>>   - it gets past the request scheduler
>>>>   - a flow is started and receives the first (large) data buffer
>>>> - a different request for the same handle arrives
>>>>   - getattr would be a good example, could be from any client
>>>>   - this getattr gets queued in the request scheduler behind the write
>>>> - the second I/O request arrives
>>>>   - it gets queued behind the getattr in the request scheduler
>>>>
>>>> At this point on the server side, we have a flow in progress that is 
>>>> waiting on a data buffer.  However, the next message is for a 
>>>> different flow (the tags don't match).  Since this message is 
>>>> relatively large (256K), it is in rendezvous mode within bmi_tcp and 
>>>> cannot be pulled out of the socket until a matching receive is 
>>>> posted.  The flow that is expected to post that receive is not 
>>>> running yet because the second I/O request is stuck in the scheduler.
>>>>
>>>> ... so we have a deadlock.  The socket is filled with data that the 
>>>> server isn't allowed to recv yet, and the data that it really needs 
>>>> next is stuck behind it.
>>>>
>>>> I'm not sure that I described that all that well.  At a high level 
>>>> we have two flows sharing the same socket.  The client started both 
>>>> of them and the messages got interleaved.  The server only started 
>>>> one of them, but is now stuck because it can't deal with data 
>>>> arriving for the second one.
>>>>
>>>> I am going to try to find a brute force way to serialize I/O from 
>>>> each pvfs2-client-core just to see if that solves the problem (maybe 
>>>> only allowing one buffer between pvfs2-client-core and kernel, 
>>>> rather than 5).  If that does look like it fixed the problem, then 
>>>> we need a more elegant solution.   Maybe waiting for acks before 
>>>> starting flows, or just somehow serializing flows that share sockets.
>>>>
>>>> -Phil
>>
>>
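
The receive-side rule described in the thread (small messages may be buffered temporarily, large ones stay in the socket until a matching receive is posted) can be sketched as follows. This is an illustration only, not bmi_tcp code: `drain_socket` is a hypothetical helper, the 16 KiB `eager_limit` is an assumed threshold that the thread does not state, and a posted tag is treated as staying posted for the whole flow.

```python
def drain_socket(messages, posted_tags, eager_limit=16 * 1024):
    """Walk a FIFO socket and report what the receiver can do with it.

    messages    -- list of (tag, size_in_bytes) in arrival order
    posted_tags -- tags for which a matching receive is posted
    eager_limit -- assumed size below which a message may be buffered
    """
    messages = list(messages)
    delivered = []  # read straight into a posted receive buffer
    buffered = []   # small messages held until a receive shows up
    for i, (tag, size) in enumerate(messages):
        if tag in posted_tags:
            delivered.append((tag, size))
        elif size <= eager_limit:
            buffered.append((tag, size))
        else:
            # Large message without a matching receive: it cannot be read,
            # and everything behind it on the socket is stuck as well.
            return delivered, buffered, messages[i:]
    return delivered, buffered, []


KB = 1024
socket_contents = [("flow1", 256 * KB), ("flow2", 256 * KB), ("flow1", 256 * KB)]
delivered, buffered, stuck = drain_socket(socket_contents, posted_tags={"flow1"})
```

With only the first flow posted, the 256K buffer for the second flow blocks the head of the socket and the next buffer for the first flow ends up in `stuck` behind it, which is the situation described above.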


