[Pvfs2-developers] help debugging request processor/distribution
pcarns at wastedcycles.org
Wed Jun 14 09:21:41 EDT 2006
If we did this within BMI, we would be paying an extra round-trip
latency for each large TCP message, which we should probably try to
avoid.
I vote for just changing the ordering of sys-io.sm so that it does not
post write flows until a positive write ack is received from the server.
That is basically equivalent to performing one handshake (or
rendezvous) for the whole flow rather than one per BMI message.
The sys-io.sm state machine already does that for reads. Read flows do
not get posted until an ack is received from the server.
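As a sketch of what that reordering amounts to (placeholder names only --
the real logic is the sys-io.sm state machine, not straight-line C):

    /* stubs, declared only to keep the sketch compilable */
    void post_io_request(void);
    void post_write_flow(void);
    void wait_for_io_ack(void);

    void toy_write_current(void)
    {
        post_io_request();   /* small request message                  */
        post_write_flow();   /* optimistic: large data starts moving   */
        wait_for_io_ack();   /* ack may arrive after data is in flight */
    }

    void toy_write_proposed(void)
    {
        post_io_request();   /* small request message                  */
        wait_for_io_ack();   /* server has scheduled the write         */
        post_write_flow();   /* one handshake covers the whole flow    */
    }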
Rob Ross wrote:
> Ok. Well we screwed up here. We've either got to be able to pull that
> data off the wire (presumably at the BMI layer) or we've got to ACK for
> large messages (either in BMI or flow or elsewhere).
> Suggestions on which approach to take and where to implement? Probably
> most straightforward to do this in BMI, but likely not the most efficient.
> Phil Carns wrote:
>> Sorry- "rendezvous" is the wrong terminology here for what is happening
>> within bmi_tcp at the individual message level. It doesn't implicitly
>> exchange control messages before putting each buffer on the wire.
>> bmi_tcp will send any size message without using control messages to
>> handshake within bmi_tcp. It is making the assumption that someone at
>> a higher level has already performed handshaking and agreeed that both
>> sides are going to post the appropriate matching operations.
>> Unexpected messages are the only things BMI is permitted to send
>> without a guarantee that the other side is going to post a recv.
>> The difference between small and large "normal" messages in bmi_tcp
>> is not that larger ones will wait to transmit. Both are basically
>> sent the same way. The difference is on the recv side. Small
>> messages are allowed to be temporarily buffered by the receiver until
>> a matching recv is posted, while large messages will not be read into
>> memory until a matching receive buffer is posted. So the actual
>> network transfer will not _complete_ until both sides have posted, but
>> it can definitely begin before the recv is posted.
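A toy model of that recv-side decision, in case it helps (the cutoff,
struct, and function are invented for illustration; bmi_tcp's real
threshold and data structures are different):

    #include <stdio.h>
    #include <stddef.h>

    #define TOY_EAGER_LIMIT (16 * 1024)   /* pretend "small message" cutoff */

    struct toy_msg {
        size_t size;
        int    recv_posted;   /* has a matching recv been posted yet? */
    };

    /* May the receiver pull this message out of the socket right now? */
    static int can_drain(const struct toy_msg *m)
    {
        if (m->size <= TOY_EAGER_LIMIT)
            return 1;               /* small: buffer it until a recv shows up */
        return m->recv_posted;      /* large: must wait for the posted buffer */
    }

    int main(void)
    {
        struct toy_msg small = {   4 * 1024, 0 };
        struct toy_msg large = { 256 * 1024, 0 };

        printf("small, no recv posted: %s\n",
               can_drain(&small) ? "drained and buffered" : "left in socket");
        printf("large, no recv posted: %s\n",
               can_drain(&large) ? "drained and buffered" : "left in socket");
        return 0;
    }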
>> Flows have the same sort of restriction as large buffers do in
>> bmi_tcp. When a flow is posted, it does not do any handshaking to
>> make sure both sides are ready to transmit before moving data. It
>> assumes that the request protocol has already sorted that out, so it
>> just starts transmitting.
>> I think that is what's leading to the problem here- the client has
>> been told to proceed with the flow before waiting to make sure that
>> the server is also ready to transmit a flow.
>> Once upon a time sys-io.sm did wait for write acks before starting
>> write flows, but that was changed at some point to try to improve
>> performance. We just didn't notice the case that it breaks until now.
>> Rob Ross wrote:
>>> There's a fundamental issue here that I don't quite get: if we're in
>>> rendezvous mode, why is there data on the wire if we aren't ready to
>>> receive it? The whole point of rendezvous mode is to *not* send the
>>> data until the matching receive has been posted.
>>> What am I missing?
>>> Phil Carns wrote:
>>>> Ok, I think I _might_ see what the problem is with the BMI messaging.
>>>> I haven't 100% confirmed yet, but it looks like we have the
>>>> following scenario:
>>>> On the client side:
>>>> - pvfs2-client-core starts an I/O operation (write) to server X
>>>> - a send (for the request) is posted, which is a small buffer
>>>> - the flow is posted before an ack is received
>>>> - the flow itself posts another send for data, which is a large buffer
>>>> - ...
>>>> A few notes real quick- I think the above is a performance
>>>> optimization; we try to go ahead and get the flow going before
>>>> receiving a positive ack from the server. It will be canceled if we
>>>> get a negative ack (or fail to get an ack altogether).
>>>> - while the above is in progress, pvfs2-client-core starts another
>>>> write operation to server X (from another application that is
>>>> hitting the same server)
>>>> - a send for this second request is posted
>>>> - another flow is posted before an ack is received
>>>> - depending on the timing, it may manage to post a send for data
>>>> as well, which is another large buffer
>>>> - this traffic is interleaved on the same socket as is being used
>>>> for the first flow, which is still running at this point
>>>> On the server side:
>>>> - the first I/O request arrives
>>>> - it gets past the request scheduler
>>>> - a flow is started and receives the first (large) data buffer
>>>> - a different request for the same handle arrives
>>>> - getattr would be a good example, could be from any client
>>>> - this getattr gets queued in the request scheduler behind the write
>>>> - the second I/O request arrives
>>>> - it gets queued behind the getattr in the request scheduler
>>>> At this point on the server side, we have a flow in progress that is
>>>> waiting on a data buffer. However, the next message is for a
>>>> different flow (the tags don't match). Since this message is
>>>> relatively large (256K), it is in rendezvous mode within bmi_tcp and
>>>> cannot be pulled out of the socket until a matching receive is
>>>> posted. The flow that is expected to post that receive is not
>>>> running yet because the second I/O request is stuck in the scheduler.
>>>> ... so we have a deadlock. The socket is filled with data that the
>>>> server isn't allowed to recv yet, and the data that it really needs
>>>> next is stuck behind it.
>>>> I'm not sure that I described that all that well. At a high level
>>>> we have two flows sharing the same socket. The client started both
>>>> of them and the messages got interleaved. The server only started
>>>> one of them, but is now stuck because it can't deal with data
>>>> arriving for the second one.
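A toy version of the stall, if it helps make it concrete (tags, sizes,
and names invented to match the description above; TCP delivers in
order, so the head of the socket has to be drained first):

    #include <stdio.h>

    struct toy_msg { int tag; const char *desc; };

    int main(void)
    {
        /* What the server sees next on the shared socket, in order. */
        struct toy_msg socket_queue[] = {
            { 200, "256K data for flow 2 (its recv is not posted yet)" },
            { 100, "256K data for flow 1 (its recv IS posted)"         },
        };
        int posted_recv_tag = 100;   /* only flow 1 got past the scheduler */

        if (socket_queue[0].tag != posted_recv_tag) {
            printf("stuck: head of socket is \"%s\",\n", socket_queue[0].desc);
            printf("but the only posted recv has tag %d, and the data the\n",
                   posted_recv_tag);
            printf("server actually needs (\"%s\") is behind it.\n",
                   socket_queue[1].desc);
        }
        return 0;
    }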
>>>> I am going to try to find a brute force way to serialize I/O from
>>>> each pvfs2-client-core just to see if that solves the problem (maybe
>>>> only allowing one buffer between pvfs2-client-core and kernel,
>>>> rather than 5). If that does look like it fixed the problem, then
>>>> we need a more elegant solution. Maybe waiting for acks before
>>>> starting flows, or just somehow serializing flows that share sockets.
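If the "serialize flows that share sockets" route turns out to be the
right one, the guard could be as small as something like this (names
invented; this is not existing pvfs2-client-core code, just the shape
of the idea -- at most one active flow per server connection):

    #include <pthread.h>

    struct toy_conn {
        pthread_mutex_t lock;
        pthread_cond_t  idle;
        int             flow_active;
    };

    void toy_flow_begin(struct toy_conn *c)
    {
        pthread_mutex_lock(&c->lock);
        while (c->flow_active)              /* wait out the previous flow */
            pthread_cond_wait(&c->idle, &c->lock);
        c->flow_active = 1;                 /* claim the socket           */
        pthread_mutex_unlock(&c->lock);
    }

    void toy_flow_end(struct toy_conn *c)
    {
        pthread_mutex_lock(&c->lock);
        c->flow_active = 0;
        pthread_cond_signal(&c->idle);      /* let the next flow start    */
        pthread_mutex_unlock(&c->lock);
    }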