[Pvfs2-developers] help debugging request processor/distribution
Bradley W Settlemyer
bradles at parl.clemson.edu
Wed Jun 14 14:58:41 EDT 2006
I've seen similar behavior to this using Myrinet (jazz) also. I haven't
ever gotten it with TCP though, so maybe mine was due to something a
little different (basically, it looked like a message with the tag was
sent, but never delivered). Are you certain that its just a BMI tcp issue?
Cheers,
Brad
Phil Carns wrote:
> Ok, I think I _might_ see what the problem is with the BMI messaging.
>
> I haven't 100% confirmed yet, but it looks like we have the following
> scenario:
>
> On the client side:
> --------------------
> - pvfs2-client-core starts a I/O operation (write) to server X
> - a send (for the request) is posted, which is a small buffer
> - the flow is posted before an ack is received
> - the flow itself posts another send for data, which is a large buffer
> - ...
>
> A few notes real quick- I think the above is a performance optimization;
> we try to go ahead and get the flow going before receiving a positive
> ack from the server. It will be canceled if we get a negative ack (or
> fail to get an ack altogether)
>
> - while the above is in progress, pvfs2-client-core starts another write
> operation to server X (from another application that is hitting the same
> server)
> - a send for this second request is posted
> - another flow is posted before an ack is received
> - depending on the timing, it may manage to post a send for data as
> well, which is another large buffer
> - this traffic is interleaved on the same socket as is being used for
> the first flow, which is still running at this point
>
> On the server side:
> --------------------
> - the first I/O request arrives
> - it gets past the request scheduler
> - a flow is started and receives the first (large) data buffer
> - a different request for the same handle arrives
> - getattr would be a good example, could be from any client
> - this getattr gets queued in the request scheduler behind the write
> - the second I/O request arrives
> - it gets queued behind the getattr in the request scheduler
>
> At this point on the server side, we have a flow in progress that is
> waiting on a data buffer. However, the next message is for a different
> flow (the tags don't match). Since this message is relatively large
> (256K), it is in rendezvous mode within bmi_tcp and cannot be pulled out
> of the socket until a matching receive is posted. The flow that is
> expected to post that receive is not running yet because the second I/O
> request is stuck in the scheduler.
>
> ... so we have a deadlock. The socket is filled with data that the
> server isn't allowed to recv yet, and the data that it really needs next
> is stuck behind it.
>
> I'm not sure that I described that all that well. At a high level we
> have two flows sharing the same socket. The client started both of them
> and the messages got interleaved. The server only started one of them,
> but is now stuck because it can't deal with data arriving for the second
> one.
>
> I am going to try to find a brute force way to serialize I/O from each
> pvfs2-client-core just to see if that solves the problem (maybe only
> allowing one buffer between pvfs2-client-core and kernel, rather than
> 5). If that does look like it fixed the problem, then we need a more
> elegant solution. Maybe waiting for acks before starting flows, or
> just somehow serializing flows that share sockets.
>
> -Phil
>
> Phil Carns wrote:
>
>> Hi Murali,
>>
>> I'm not seeing any problems with the dd scenario that you describe (I
>> tried running 8-10 of them simultaneously), but I may just have
>> different timing characteristics.
>>
>> One thing that I neglected to mention about the application that I am
>> running is that it actually has 2 processes per client. That may be
>> something that I have in common with your test case (multiple
>> applications writing concurrently from the same client). I will
>> reconfigure and try some test runs with 1 process per client and see
>> if that impacts behavior for my test case.
>>
>> -Phil
>>
>> Phil Carns wrote:
>>
>>> I reverted my src/io/bmi directory back to a snapshot of what was in
>>> CVS around March, but I still have the same problems. It doesn't
>>> seem to be impacted by the pipe changes or the socket buffer size
>>> changes.
>>>
>>> I haven't tried your dd test case yet, but I will give that a shot soon.
>>>
>>> -Phil
>>>
>>> Murali Vilayannur wrote:
>>>
>>>> Hi Phil, others,
>>>> I think I am seeing the same problem (or some variants of it) with the
>>>> NFSv4 bug that I reported a few days ago..
>>>> Here is a very simple way that I am able to reproduce it reliably..
>>>> Mount a pvfs2 volume and run a few instances of
>>>> dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100
>>>> All the dd instances simply wait for something to happen,
>>>> client-core and
>>>> servers are also waiting for something to happen and no process
>>>> makes any
>>>> progress..
>>>> It is perhaps related to the pipe changes to bmi-tcp with poll? Let me
>>>> revert those changes and try it out as well..
>>>> Thanks,
>>>> Murali
>>>>
>>>>
>>>> On Tue, 13 Jun 2006, Phil Carns wrote:
>>>>
>>>>
>>>>> I went back and added some much more specific debugging messages
>>>>> and put
>>>>> some special prefixes on flow, bmi, and request processor messages
>>>>> so I
>>>>> could group them a little easier, and got rid of the extra mutexes.
>>>>>
>>>>> After running a few more tests and double checking the logs, this is
>>>>> what I am seeing in every failure case:
>>>>>
>>>>> - it is a write flow
>>>>> - it gets broken into two buffers within the flow protocol
>>>>> - the first is 256K
>>>>> - the second is 64K
>>>>> - both sides _do_ get the request processing correct
>>>>> - both sides post the correct BMI operations
>>>>> - the server posts 2 receives
>>>>> - the client posts 2 sends
>>>>> - the second data message never shows up at the server
>>>>>
>>>>> I must have missed part of the flow/request processing on the client
>>>>> side when I looked before. Sorry about the wild goose chase with the
>>>>> request process question. The request processing definitely looks
>>>>> correct in the logs that I am seeing now (or at least both the client
>>>>> and server agree on it anyway).
>>>>>
>>>>> BMI has logging that shows when incoming messages appear on a socket,
>>>>> and it includes the tag number. The 64K message never shows up on a
>>>>> socket at the server side. The BMI operation therefore never
>>>>> completes,
>>>>> and the callback never gets triggered to complete the flow.
>>>>>
>>>>> I don't know if this is a bmi_tcp problem, or if there is some sort of
>>>>> general network wackiness in my environment. I'll try to revert to
>>>>> some
>>>>> older versions of bmi_tcp tomorrow and see if the problem persists. I
>>>>> already tried just backing out changes to the poll notification in
>>>>> socket-collection.[ch] and (assuming I got that right) it didn't make
>>>>> any difference.
>>>>>
>>>>> -Phil
>>>>>
>>>>> Phil Carns wrote:
>>>>>
>>>>>> Hi Sam,
>>>>>>
>>>>>> I may be wrong, but so far it looks like the problem is a little
>>>>>> different this time. In the scenario that I am seeing now it doesn't
>>>>>> look like the flow has finished everything and then just failed to
>>>>>> mark
>>>>>> completion- it looks like one side actually posted more BMI messages
>>>>>> than the other. The server posted two receives of 256K, while the
>>>>>> client only posted one send of 256K. I don't see any zero byte
>>>>>> results
>>>>>> from the request processor or any cases where the flow should have
>>>>>> marked completion.
>>>>>>
>>>>>> I'm still trying to get a better log. After adding mutexes in
>>>>>> PINT_process_request() on the server side, the problem has gotten
>>>>>> much
>>>>>> harder to reproduce. I don't know if there is any correlation or
>>>>>> if I
>>>>>> am just perturbing the timing enough to hide the problem. So far
>>>>>> out of
>>>>>> 9 runs I have had only 1 failure, and the logs from that failure were
>>>>>> inconclusive.
>>>>>>
>>>>>> I definitely agree with both of your points about reworking the flow
>>>>>> code; it is difficult to debug right now.
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> Sam Lang wrote:
>>>>>>
>>>>>>
>>>>>>> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am looking at an I/O problem that I don't completely understand.
>>>>>>>> The setup is that there are 15 servers and 20 clients (all RHEL3
>>>>>>>> SMP). The clients are running a proprietary application. At the
>>>>>>>> end of the run they each write their share of a data set into a 36
>>>>>>>> GB file. So each is doing a contiguous write of 36/20 GB of data.
>>>>>>>>
>>>>>>>> Roughly 80-90% of the time, one or more of the clients and servers
>>>>>>>> hang, though it isn't predictable which one will do it. After
>>>>>>>> quite
>>>>>>>> a bit of digging through logs, it turns out that there is a flow
>>>>>>>> that never completes on the server side. There are two problems
>>>>>>>> contributing here:
>>>>>>>>
>>>>>>>> 1) The flow timeout mechanism doesn't seem to be working right for
>>>>>>>> me (the timer for the flow pops and checks progress one time, then
>>>>>>>> never gets triggered again). This isn't the core problem, but
>>>>>>>> it is
>>>>>>>> causing the flow to get stuck forever and hold up the request
>>>>>>>> scheduler rather than giving up eventually.
>>>>>>>>
>>>>>>>> 2) For this particular flow, the client and server appear to
>>>>>>>> disagree on how much data to exchange. The client sends one 256K
>>>>>>>> buffer and completes the flow. The server receives that 256K
>>>>>>>> buffer
>>>>>>>> but still expects at least one more to arrive (which never does).
>>>>>>>>
>>>>>>>
>>>>>>> Hi Phil, these problems sound very similar to the bug I was seeing
>>>>>>> about a month ago with the flow code for certain request types. The
>>>>>>> particular case I was looking at was with reads, so given that
>>>>>>> there's a lot of code duplication in flow I could imagine the same
>>>>>>> problem existing (but not getting fixed) for writes. It may not be
>>>>>>> the same as the problem you're seeing, but I'll walk through the bug
>>>>>>> and how I fixed it, in case it helps.
>>>>>>>
>>>>>>> Basically, the request was ending right at the boundary of a flow
>>>>>>> buffer (so bytes == bytemax), this caused the request processing
>>>>>>> code
>>>>>>> to finish and the bmi_send_callback_fn would return, but it would
>>>>>>> get
>>>>>>> called again by the post code for the next buffer, except this time
>>>>>>> the request processing would return 0 bytes, and this queue item
>>>>>>> would get marked as the last one. It was that last queue item that
>>>>>>> was the problem: since the trove_bstream_read_list was never getting
>>>>>>> posted, the queue item who's last field was set, was never making it
>>>>>>> into the trove_read_callback_fn, so the dest_last_posted field for
>>>>>>> the flow data wasn't getting set either.
>>>>>>>
>>>>>>> The fix was to add some extra checking in bmi_send_callback_fn. In
>>>>>>> the case that the request processing returns 0 bytes processed, I
>>>>>>> can
>>>>>>> set the dest_last_posted field to 1 immediately. The only caveat to
>>>>>>> this is in the case that other queue items in the source list
>>>>>>> haven't
>>>>>>> made it to the dest list yet, so we need to check for that case.
>>>>>>> The
>>>>>>> diff of the changes I made might help:
>>>>>>>
>>>>>>> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
>>>>>>> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
>>>>>>>
>>>>>>>
>>>>>>> In any case, I wasn't able to make use of the debug messages for the
>>>>>>> request processing code, it was too verbose to make any sense of it.
>>>>>>> I think I ended up just adding debug statements for the entry and
>>>>>>> exit points of the flow callbacks along with all the different side-
>>>>>>> state fields as they were changed. I'm not sure those debug
>>>>>>> statements made it into the trunk though. :-(
>>>>>>>
>>>>>>> Since we're here, I think the flow code could benefit from some
>>>>>>> reworking in two ways:
>>>>>>>
>>>>>>> 1. Less side-state variables to be set. Each queue item has fields
>>>>>>> that change the behavior of when things should complete. Its
>>>>>>> hard to
>>>>>>> keep track of all that.
>>>>>>>
>>>>>>> 2. More code abstraction. We end up doing very similar things
>>>>>>> between reads and writes on the server (and on the client too),
>>>>>>> especially with respect to the request processing. It might make
>>>>>>> sense to abstract out some of that code duplication into separate
>>>>>>> functions.
>>>>>>>
>>>>>>> -sam
>>>>>>>
>>>>>>>
>>>>>>>> I am pretty sure I can get to the bottom of the first problem. The
>>>>>>>> second one is trickier.
>>>>>>>>
>>>>>>>> I attached a few log file excerpts:
>>>>>>>>
>>>>>>>> client-1st.txt: This is the request processing log output from the
>>>>>>>> client side when it is working on the flow in question
>>>>>>>>
>>>>>>>> server-1st.txt: request processing log output from the first
>>>>>>>> iteration of the server side flow
>>>>>>>>
>>>>>>>> server-2nd.txt: request processing log output from the second
>>>>>>>> iteration of the server side flow
>>>>>>>>
>>>>>>>> Anyone have any suggestions, or see any indication from those log
>>>>>>>> portions about why the server and client arn't on the same page?
>>>>>>>>
>>>>>>>> Some of this log output (the server-2nd.txt in particular?) looks a
>>>>>>>> little jumbled, like maybe there are two flows doing request
>>>>>>>> processing at once, but I don't really know how to interpret it.
>>>>>>>>
>>>>>>>> I may try adding a mutex around PINT_process_request() next,
>>>>>>>> just to
>>>>>>>> see if I can make the logs clearer by making sure only one
>>>>>>>> invocation is going at once.
>>>>>>>>
>>>>>>>> -Phil
>
>
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
More information about the Pvfs2-developers
mailing list