[Pvfs2-developers] help debugging request processor/distribution
Phil Carns
pcarns at wastedcycles.org
Thu Jun 15 08:45:26 EDT 2006
I don't think that the same problem would happen with bmi_gm. For large
buffers it has to do a full rendezvous handshake to exchange memory
addresses etc. for remote put operations (directed sends in GM-speak).
There isn't any way to get a transmission started without a buffer
available.
-Phil
Bradley W Settlemyer wrote:
> I've seen similar behavior to this using Myrinet (jazz) also. I haven't
> ever gotten it with TCP though, so maybe mine was due to something a
> little different (basically, it looked like a message with the tag was
> sent, but never delivered). Are you certain that its just a BMI tcp issue?
>
> Cheers,
> Brad
>
>
> Phil Carns wrote:
>
>> Ok, I think I _might_ see what the problem is with the BMI messaging.
>>
>> I haven't 100% confirmed yet, but it looks like we have the following
>> scenario:
>>
>> On the client side:
>> --------------------
>> - pvfs2-client-core starts a I/O operation (write) to server X
>> - a send (for the request) is posted, which is a small buffer
>> - the flow is posted before an ack is received
>> - the flow itself posts another send for data, which is a large buffer
>> - ...
>>
>> A few notes real quick- I think the above is a performance
>> optimization; we try to go ahead and get the flow going before
>> receiving a positive ack from the server. It will be canceled if we
>> get a negative ack (or fail to get an ack altogether)
>>
>> - while the above is in progress, pvfs2-client-core starts another
>> write operation to server X (from another application that is hitting
>> the same server)
>> - a send for this second request is posted
>> - another flow is posted before an ack is received
>> - depending on the timing, it may manage to post a send for data as
>> well, which is another large buffer
>> - this traffic is interleaved on the same socket as is being used
>> for the first flow, which is still running at this point
>>
>> On the server side:
>> --------------------
>> - the first I/O request arrives
>> - it gets past the request scheduler
>> - a flow is started and receives the first (large) data buffer
>> - a different request for the same handle arrives
>> - getattr would be a good example, could be from any client
>> - this getattr gets queued in the request scheduler behind the write
>> - the second I/O request arrives
>> - it gets queued behind the getattr in the request scheduler
>>
>> At this point on the server side, we have a flow in progress that is
>> waiting on a data buffer. However, the next message is for a
>> different flow (the tags don't match). Since this message is
>> relatively large (256K), it is in rendezvous mode within bmi_tcp and
>> cannot be pulled out of the socket until a matching receive is
>> posted. The flow that is expected to post that receive is not running
>> yet because the second I/O request is stuck in the scheduler.
>>
>> ... so we have a deadlock. The socket is filled with data that the
>> server isn't allowed to recv yet, and the data that it really needs
>> next is stuck behind it.
>>
>> I'm not sure that I described that all that well. At a high level we
>> have two flows sharing the same socket. The client started both of
>> them and the messages got interleaved. The server only started one of
>> them, but is now stuck because it can't deal with data arriving for
>> the second one.
>>
>> I am going to try to find a brute force way to serialize I/O from each
>> pvfs2-client-core just to see if that solves the problem (maybe only
>> allowing one buffer between pvfs2-client-core and kernel, rather than
>> 5). If that does look like it fixed the problem, then we need a more
>> elegant solution. Maybe waiting for acks before starting flows, or
>> just somehow serializing flows that share sockets.
>>
>> -Phil
>>
>> Phil Carns wrote:
>>
>>> Hi Murali,
>>>
>>> I'm not seeing any problems with the dd scenario that you describe (I
>>> tried running 8-10 of them simultaneously), but I may just have
>>> different timing characteristics.
>>>
>>> One thing that I neglected to mention about the application that I am
>>> running is that it actually has 2 processes per client. That may be
>>> something that I have in common with your test case (multiple
>>> applications writing concurrently from the same client). I will
>>> reconfigure and try some test runs with 1 process per client and see
>>> if that impacts behavior for my test case.
>>>
>>> -Phil
>>>
>>> Phil Carns wrote:
>>>
>>>> I reverted my src/io/bmi directory back to a snapshot of what was in
>>>> CVS around March, but I still have the same problems. It doesn't
>>>> seem to be impacted by the pipe changes or the socket buffer size
>>>> changes.
>>>>
>>>> I haven't tried your dd test case yet, but I will give that a shot
>>>> soon.
>>>>
>>>> -Phil
>>>>
>>>> Murali Vilayannur wrote:
>>>>
>>>>> Hi Phil, others,
>>>>> I think I am seeing the same problem (or some variants of it) with the
>>>>> NFSv4 bug that I reported a few days ago..
>>>>> Here is a very simple way that I am able to reproduce it reliably..
>>>>> Mount a pvfs2 volume and run a few instances of
>>>>> dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100
>>>>> All the dd instances simply wait for something to happen,
>>>>> client-core and
>>>>> servers are also waiting for something to happen and no process
>>>>> makes any
>>>>> progress..
>>>>> It is perhaps related to the pipe changes to bmi-tcp with poll? Let me
>>>>> revert those changes and try it out as well..
>>>>> Thanks,
>>>>> Murali
>>>>>
>>>>>
>>>>> On Tue, 13 Jun 2006, Phil Carns wrote:
>>>>>
>>>>>
>>>>>> I went back and added some much more specific debugging messages
>>>>>> and put
>>>>>> some special prefixes on flow, bmi, and request processor messages
>>>>>> so I
>>>>>> could group them a little easier, and got rid of the extra mutexes.
>>>>>>
>>>>>> After running a few more tests and double checking the logs, this is
>>>>>> what I am seeing in every failure case:
>>>>>>
>>>>>> - it is a write flow
>>>>>> - it gets broken into two buffers within the flow protocol
>>>>>> - the first is 256K
>>>>>> - the second is 64K
>>>>>> - both sides _do_ get the request processing correct
>>>>>> - both sides post the correct BMI operations
>>>>>> - the server posts 2 receives
>>>>>> - the client posts 2 sends
>>>>>> - the second data message never shows up at the server
>>>>>>
>>>>>> I must have missed part of the flow/request processing on the client
>>>>>> side when I looked before. Sorry about the wild goose chase with the
>>>>>> request process question. The request processing definitely looks
>>>>>> correct in the logs that I am seeing now (or at least both the client
>>>>>> and server agree on it anyway).
>>>>>>
>>>>>> BMI has logging that shows when incoming messages appear on a socket,
>>>>>> and it includes the tag number. The 64K message never shows up on a
>>>>>> socket at the server side. The BMI operation therefore never
>>>>>> completes,
>>>>>> and the callback never gets triggered to complete the flow.
>>>>>>
>>>>>> I don't know if this is a bmi_tcp problem, or if there is some
>>>>>> sort of
>>>>>> general network wackiness in my environment. I'll try to revert
>>>>>> to some
>>>>>> older versions of bmi_tcp tomorrow and see if the problem
>>>>>> persists. I
>>>>>> already tried just backing out changes to the poll notification in
>>>>>> socket-collection.[ch] and (assuming I got that right) it didn't make
>>>>>> any difference.
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> Phil Carns wrote:
>>>>>>
>>>>>>> Hi Sam,
>>>>>>>
>>>>>>> I may be wrong, but so far it looks like the problem is a little
>>>>>>> different this time. In the scenario that I am seeing now it
>>>>>>> doesn't
>>>>>>> look like the flow has finished everything and then just failed
>>>>>>> to mark
>>>>>>> completion- it looks like one side actually posted more BMI messages
>>>>>>> than the other. The server posted two receives of 256K, while the
>>>>>>> client only posted one send of 256K. I don't see any zero byte
>>>>>>> results
>>>>>>> from the request processor or any cases where the flow should have
>>>>>>> marked completion.
>>>>>>>
>>>>>>> I'm still trying to get a better log. After adding mutexes in
>>>>>>> PINT_process_request() on the server side, the problem has gotten
>>>>>>> much
>>>>>>> harder to reproduce. I don't know if there is any correlation or
>>>>>>> if I
>>>>>>> am just perturbing the timing enough to hide the problem. So far
>>>>>>> out of
>>>>>>> 9 runs I have had only 1 failure, and the logs from that failure
>>>>>>> were
>>>>>>> inconclusive.
>>>>>>>
>>>>>>> I definitely agree with both of your points about reworking the flow
>>>>>>> code; it is difficult to debug right now.
>>>>>>>
>>>>>>> -Phil
>>>>>>>
>>>>>>> Sam Lang wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am looking at an I/O problem that I don't completely understand.
>>>>>>>>> The setup is that there are 15 servers and 20 clients (all RHEL3
>>>>>>>>> SMP). The clients are running a proprietary application. At the
>>>>>>>>> end of the run they each write their share of a data set into a 36
>>>>>>>>> GB file. So each is doing a contiguous write of 36/20 GB of data.
>>>>>>>>>
>>>>>>>>> Roughly 80-90% of the time, one or more of the clients and servers
>>>>>>>>> hang, though it isn't predictable which one will do it. After
>>>>>>>>> quite
>>>>>>>>> a bit of digging through logs, it turns out that there is a flow
>>>>>>>>> that never completes on the server side. There are two problems
>>>>>>>>> contributing here:
>>>>>>>>>
>>>>>>>>> 1) The flow timeout mechanism doesn't seem to be working right for
>>>>>>>>> me (the timer for the flow pops and checks progress one time, then
>>>>>>>>> never gets triggered again). This isn't the core problem, but
>>>>>>>>> it is
>>>>>>>>> causing the flow to get stuck forever and hold up the request
>>>>>>>>> scheduler rather than giving up eventually.
>>>>>>>>>
>>>>>>>>> 2) For this particular flow, the client and server appear to
>>>>>>>>> disagree on how much data to exchange. The client sends one 256K
>>>>>>>>> buffer and completes the flow. The server receives that 256K
>>>>>>>>> buffer
>>>>>>>>> but still expects at least one more to arrive (which never does).
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Phil, these problems sound very similar to the bug I was seeing
>>>>>>>> about a month ago with the flow code for certain request types.
>>>>>>>> The
>>>>>>>> particular case I was looking at was with reads, so given that
>>>>>>>> there's a lot of code duplication in flow I could imagine the same
>>>>>>>> problem existing (but not getting fixed) for writes. It may not be
>>>>>>>> the same as the problem you're seeing, but I'll walk through the
>>>>>>>> bug
>>>>>>>> and how I fixed it, in case it helps.
>>>>>>>>
>>>>>>>> Basically, the request was ending right at the boundary of a flow
>>>>>>>> buffer (so bytes == bytemax), this caused the request processing
>>>>>>>> code
>>>>>>>> to finish and the bmi_send_callback_fn would return, but it
>>>>>>>> would get
>>>>>>>> called again by the post code for the next buffer, except this time
>>>>>>>> the request processing would return 0 bytes, and this queue item
>>>>>>>> would get marked as the last one. It was that last queue item that
>>>>>>>> was the problem: since the trove_bstream_read_list was never
>>>>>>>> getting
>>>>>>>> posted, the queue item who's last field was set, was never
>>>>>>>> making it
>>>>>>>> into the trove_read_callback_fn, so the dest_last_posted field for
>>>>>>>> the flow data wasn't getting set either.
>>>>>>>>
>>>>>>>> The fix was to add some extra checking in bmi_send_callback_fn. In
>>>>>>>> the case that the request processing returns 0 bytes processed,
>>>>>>>> I can
>>>>>>>> set the dest_last_posted field to 1 immediately. The only
>>>>>>>> caveat to
>>>>>>>> this is in the case that other queue items in the source list
>>>>>>>> haven't
>>>>>>>> made it to the dest list yet, so we need to check for that
>>>>>>>> case. The
>>>>>>>> diff of the changes I made might help:
>>>>>>>>
>>>>>>>> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
>>>>>>>> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
>>>>>>>>
>>>>>>>>
>>>>>>>> In any case, I wasn't able to make use of the debug messages for
>>>>>>>> the
>>>>>>>> request processing code, it was too verbose to make any sense of
>>>>>>>> it.
>>>>>>>> I think I ended up just adding debug statements for the entry and
>>>>>>>> exit points of the flow callbacks along with all the different
>>>>>>>> side-
>>>>>>>> state fields as they were changed. I'm not sure those debug
>>>>>>>> statements made it into the trunk though. :-(
>>>>>>>>
>>>>>>>> Since we're here, I think the flow code could benefit from some
>>>>>>>> reworking in two ways:
>>>>>>>>
>>>>>>>> 1. Less side-state variables to be set. Each queue item has
>>>>>>>> fields
>>>>>>>> that change the behavior of when things should complete. Its
>>>>>>>> hard to
>>>>>>>> keep track of all that.
>>>>>>>>
>>>>>>>> 2. More code abstraction. We end up doing very similar things
>>>>>>>> between reads and writes on the server (and on the client too),
>>>>>>>> especially with respect to the request processing. It might make
>>>>>>>> sense to abstract out some of that code duplication into separate
>>>>>>>> functions.
>>>>>>>>
>>>>>>>> -sam
>>>>>>>>
>>>>>>>>
>>>>>>>>> I am pretty sure I can get to the bottom of the first problem.
>>>>>>>>> The
>>>>>>>>> second one is trickier.
>>>>>>>>>
>>>>>>>>> I attached a few log file excerpts:
>>>>>>>>>
>>>>>>>>> client-1st.txt: This is the request processing log output from the
>>>>>>>>> client side when it is working on the flow in question
>>>>>>>>>
>>>>>>>>> server-1st.txt: request processing log output from the first
>>>>>>>>> iteration of the server side flow
>>>>>>>>>
>>>>>>>>> server-2nd.txt: request processing log output from the second
>>>>>>>>> iteration of the server side flow
>>>>>>>>>
>>>>>>>>> Anyone have any suggestions, or see any indication from those log
>>>>>>>>> portions about why the server and client arn't on the same page?
>>>>>>>>>
>>>>>>>>> Some of this log output (the server-2nd.txt in particular?)
>>>>>>>>> looks a
>>>>>>>>> little jumbled, like maybe there are two flows doing request
>>>>>>>>> processing at once, but I don't really know how to interpret it.
>>>>>>>>>
>>>>>>>>> I may try adding a mutex around PINT_process_request() next,
>>>>>>>>> just to
>>>>>>>>> see if I can make the logs clearer by making sure only one
>>>>>>>>> invocation is going at once.
>>>>>>>>>
>>>>>>>>> -Phil
>>
>>
>>
>>
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
More information about the Pvfs2-developers
mailing list