[Pvfs2-developers] help debugging request processor/distribution

Phil Carns pcarns at wastedcycles.org
Thu Jun 15 08:45:26 EDT 2006


I don't think that the same problem would happen with bmi_gm.  For large 
buffers it has to do a full rendezvous handshake to exchange memory 
addresses etc. for remote put operations (directed sends in GM-speak). 
There isn't any way to get a transmission started without a buffer 
available.

-Phil

Bradley W Settlemyer wrote:
> I've seen similar behavior to this using Myrinet (jazz) also.  I haven't 
> ever gotten it with TCP though, so maybe mine was due to something a 
> little different (basically, it looked like a message with the tag was 
> sent, but never delivered).  Are you certain that its just a BMI tcp issue?
> 
> Cheers,
> Brad
> 
> 
> Phil Carns wrote:
> 
>> Ok, I think I _might_ see what the problem is with the BMI messaging.
>>
>> I haven't 100% confirmed yet, but it looks like we have the following 
>> scenario:
>>
>> On the client side:
>> --------------------
>> - pvfs2-client-core starts a I/O operation (write) to server X
>>   - a send (for the request) is posted, which is a small buffer
>>   - the flow is posted before an ack is received
>>   - the flow itself posts another send for data, which is a large buffer
>>   - ...
>>
>> A few notes real quick- I think the above is a performance 
>> optimization; we try to go ahead and get the flow going before 
>> receiving a positive ack from the server.  It will be canceled if we 
>> get a negative ack (or fail to get an ack altogether)
>>
>> - while the above is in progress, pvfs2-client-core starts another 
>> write operation to server X (from another application that is hitting 
>> the same server)
>>   - a send for this second request is posted
>>   - another flow is posted before an ack is received
>>   - depending on the timing, it may manage to post a send for data as 
>> well, which is another large buffer
>>   - this traffic is interleaved on the same socket as is being used 
>> for the first flow, which is still running at this point
>>
>> On the server side:
>> --------------------
>> - the first I/O request arrives
>>   - it gets past the request scheduler
>>   - a flow is started and receives the first (large) data buffer
>> - a different request for the same handle arrives
>>   - getattr would be a good example, could be from any client
>>   - this getattr gets queued in the request scheduler behind the write
>> - the second I/O request arrives
>>   - it gets queued behind the getattr in the request scheduler
>>
>> At this point on the server side, we have a flow in progress that is 
>> waiting on a data buffer.  However, the next message is for a 
>> different flow (the tags don't match).  Since this message is 
>> relatively large (256K), it is in rendezvous mode within bmi_tcp and 
>> cannot be pulled out of the socket until a matching receive is 
>> posted.  The flow that is expected to post that receive is not running 
>> yet because the second I/O request is stuck in the scheduler.
>>
>> ... so we have a deadlock.  The socket is filled with data that the 
>> server isn't allowed to recv yet, and the data that it really needs 
>> next is stuck behind it.
>>
>> I'm not sure that I described that all that well.  At a high level we 
>> have two flows sharing the same socket.  The client started both of 
>> them and the messages got interleaved.  The server only started one of 
>> them, but is now stuck because it can't deal with data arriving for 
>> the second one.
>>
>> I am going to try to find a brute force way to serialize I/O from each 
>> pvfs2-client-core just to see if that solves the problem (maybe only 
>> allowing one buffer between pvfs2-client-core and kernel, rather than 
>> 5).  If that does look like it fixed the problem, then we need a more 
>> elegant solution.   Maybe waiting for acks before starting flows, or 
>> just somehow serializing flows that share sockets.
>>
>> -Phil
>>
>> Phil Carns wrote:
>>
>>> Hi Murali,
>>>
>>> I'm not seeing any problems with the dd scenario that you describe (I 
>>> tried running 8-10 of them simultaneously), but I may just have 
>>> different timing characteristics.
>>>
>>> One thing that I neglected to mention about the application that I am 
>>> running is that it actually has 2 processes per client.  That may be 
>>> something that I have in common with your test case (multiple 
>>> applications writing concurrently from the same client).  I will 
>>> reconfigure and try some test runs with 1 process per client and see 
>>> if that impacts behavior for my test case.
>>>
>>> -Phil
>>>
>>> Phil Carns wrote:
>>>
>>>> I reverted my src/io/bmi directory back to a snapshot of what was in 
>>>> CVS around March, but I still have the same problems.  It doesn't 
>>>> seem to be impacted by the pipe changes or the socket buffer size 
>>>> changes.
>>>>
>>>> I haven't tried your dd test case yet, but I will give that a shot 
>>>> soon.
>>>>
>>>> -Phil
>>>>
>>>> Murali Vilayannur wrote:
>>>>
>>>>> Hi Phil, others,
>>>>> I think I am seeing the same problem (or some variants of it) with the
>>>>> NFSv4 bug that I reported a few days ago..
>>>>> Here is a very simple way that I am able to reproduce it reliably..
>>>>> Mount a pvfs2 volume and run a few instances of
>>>>> dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100
>>>>> All the dd instances simply wait for something to happen, 
>>>>> client-core and
>>>>> servers are also waiting for something to happen and no process 
>>>>> makes any
>>>>> progress..
>>>>> It is perhaps related to the pipe changes to bmi-tcp with poll? Let me
>>>>> revert those changes and try it out as well..
>>>>> Thanks,
>>>>> Murali
>>>>>
>>>>>
>>>>> On Tue, 13 Jun 2006, Phil Carns wrote:
>>>>>
>>>>>
>>>>>> I went back and added some much more specific debugging messages 
>>>>>> and put
>>>>>> some special prefixes on flow, bmi, and request processor messages 
>>>>>> so I
>>>>>> could group them a little easier, and got rid of the extra mutexes.
>>>>>>
>>>>>> After running a few more tests and double checking the logs, this is
>>>>>> what I am seeing in every failure case:
>>>>>>
>>>>>> - it is a write flow
>>>>>> - it gets broken into two buffers within the flow protocol
>>>>>>   - the first is 256K
>>>>>>   - the second is 64K
>>>>>> - both sides _do_ get the request processing correct
>>>>>> - both sides post the correct BMI operations
>>>>>>   - the server posts 2 receives
>>>>>>   - the client posts 2 sends
>>>>>> - the second data message never shows up at the server
>>>>>>
>>>>>> I must have missed part of the flow/request processing on the client
>>>>>> side when I looked before.  Sorry about the wild goose chase with the
>>>>>> request process question.  The request processing definitely looks
>>>>>> correct in the logs that I am seeing now (or at least both the client
>>>>>> and server agree on it anyway).
>>>>>>
>>>>>> BMI has logging that shows when incoming messages appear on a socket,
>>>>>> and it includes the tag number.  The 64K message never shows up on a
>>>>>> socket at the server side.  The BMI operation therefore never 
>>>>>> completes,
>>>>>> and the callback never gets triggered to complete the flow.
>>>>>>
>>>>>> I don't know if this is a bmi_tcp problem, or if there is some 
>>>>>> sort of
>>>>>> general network wackiness in my environment.  I'll try to revert 
>>>>>> to some
>>>>>> older versions of bmi_tcp tomorrow and see if the problem 
>>>>>> persists.  I
>>>>>> already tried just backing out changes to the poll notification in
>>>>>> socket-collection.[ch] and (assuming I got that right) it didn't make
>>>>>> any difference.
>>>>>>
>>>>>> -Phil
>>>>>>
>>>>>> Phil Carns wrote:
>>>>>>
>>>>>>> Hi Sam,
>>>>>>>
>>>>>>> I may be wrong, but so far it looks like the problem is a little
>>>>>>> different this time.  In the scenario that I am seeing now it 
>>>>>>> doesn't
>>>>>>> look like the flow has finished everything and then just failed 
>>>>>>> to mark
>>>>>>> completion- it looks like one side actually posted more BMI messages
>>>>>>> than the other.  The server posted two receives of 256K, while the
>>>>>>> client only posted one send of 256K.  I don't see any zero byte 
>>>>>>> results
>>>>>>> from the request processor or any cases where the flow should have
>>>>>>> marked completion.
>>>>>>>
>>>>>>> I'm still trying to get a better log.  After adding mutexes in
>>>>>>> PINT_process_request() on the server side, the problem has gotten 
>>>>>>> much
>>>>>>> harder to reproduce.  I don't know if there is any correlation or 
>>>>>>> if I
>>>>>>> am just perturbing the timing enough to hide the problem.  So far 
>>>>>>> out of
>>>>>>> 9 runs I have had only 1 failure, and the logs from that failure 
>>>>>>> were
>>>>>>> inconclusive.
>>>>>>>
>>>>>>> I definitely agree with both of your points about reworking the flow
>>>>>>> code; it is difficult to debug right now.
>>>>>>>
>>>>>>> -Phil
>>>>>>>
>>>>>>> Sam Lang wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am looking at an I/O problem that I don't completely understand.
>>>>>>>>> The setup is that there are 15 servers and 20 clients (all RHEL3
>>>>>>>>> SMP).  The clients are running a proprietary application.  At the
>>>>>>>>> end of the run they each write their share of a data set into a 36
>>>>>>>>> GB file.  So each is doing a contiguous write of 36/20 GB of data.
>>>>>>>>>
>>>>>>>>> Roughly 80-90% of the time, one or more of the clients and servers
>>>>>>>>> hang, though it isn't predictable which one will do it.  After  
>>>>>>>>> quite
>>>>>>>>> a bit of digging through logs, it turns out that there is a  flow
>>>>>>>>> that never completes on the server side.  There are two  problems
>>>>>>>>> contributing here:
>>>>>>>>>
>>>>>>>>> 1) The flow timeout mechanism doesn't seem to be working right for
>>>>>>>>> me (the timer for the flow pops and checks progress one time, then
>>>>>>>>> never gets triggered again).  This isn't the core problem, but 
>>>>>>>>> it  is
>>>>>>>>> causing the flow to get stuck forever and hold up the request
>>>>>>>>> scheduler rather than giving up eventually.
>>>>>>>>>
>>>>>>>>> 2) For this particular flow, the client and server appear to
>>>>>>>>> disagree on how much data to exchange.  The client sends one 256K
>>>>>>>>> buffer and completes the flow.  The server receives that 256K  
>>>>>>>>> buffer
>>>>>>>>> but still expects at least one more to arrive (which never  does).
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Phil, these problems sound very similar to the bug I was seeing
>>>>>>>> about a month ago with the flow code for certain request types.  
>>>>>>>> The
>>>>>>>> particular case I was looking at was with reads, so given that
>>>>>>>> there's a lot of code duplication in flow I could imagine the same
>>>>>>>> problem existing (but not getting fixed) for writes.  It may not be
>>>>>>>> the same as the problem you're seeing, but I'll walk through the 
>>>>>>>> bug
>>>>>>>> and how I fixed it, in case it helps.
>>>>>>>>
>>>>>>>> Basically, the request was ending right at the boundary of a flow
>>>>>>>> buffer (so bytes == bytemax), this caused the request processing 
>>>>>>>> code
>>>>>>>> to finish and the bmi_send_callback_fn would return, but it 
>>>>>>>> would get
>>>>>>>> called again by the post code for the next buffer, except this time
>>>>>>>> the request processing would return 0 bytes, and this queue item
>>>>>>>> would get marked as the last one.  It was that last queue item that
>>>>>>>> was the problem: since the trove_bstream_read_list was never 
>>>>>>>> getting
>>>>>>>> posted, the queue item who's last field was set, was never 
>>>>>>>> making it
>>>>>>>> into the trove_read_callback_fn, so the dest_last_posted field for
>>>>>>>> the flow data wasn't getting set either.
>>>>>>>>
>>>>>>>> The fix was to add some extra checking in bmi_send_callback_fn.  In
>>>>>>>> the case that the request processing returns 0 bytes processed, 
>>>>>>>> I can
>>>>>>>> set the dest_last_posted field to 1 immediately.  The only 
>>>>>>>> caveat to
>>>>>>>> this is in the case that other queue items in the source list 
>>>>>>>> haven't
>>>>>>>> made it to the dest list yet, so we need to check for that 
>>>>>>>> case.  The
>>>>>>>> diff of the changes I made might help:
>>>>>>>>
>>>>>>>> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
>>>>>>>> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106 
>>>>>>>>
>>>>>>>>
>>>>>>>> In any case, I wasn't able to make use of the debug messages for 
>>>>>>>> the
>>>>>>>> request processing code, it was too verbose to make any sense of 
>>>>>>>> it.
>>>>>>>> I think I ended up just adding debug statements for the entry and
>>>>>>>> exit points of the flow callbacks along with all the different 
>>>>>>>> side-
>>>>>>>> state fields as they were changed.  I'm not sure those debug
>>>>>>>> statements made it into the trunk though. :-(
>>>>>>>>
>>>>>>>> Since we're here, I think the flow code could benefit from some
>>>>>>>> reworking in two ways:
>>>>>>>>
>>>>>>>> 1.  Less side-state variables to be set.  Each queue item has 
>>>>>>>> fields
>>>>>>>> that change the behavior of when things should complete.  Its 
>>>>>>>> hard to
>>>>>>>> keep track of all that.
>>>>>>>>
>>>>>>>> 2.  More code abstraction.  We end up doing very similar things
>>>>>>>> between reads and writes on the server (and on the client too),
>>>>>>>> especially with respect to the request processing.  It might make
>>>>>>>> sense to abstract out some of that code duplication into separate
>>>>>>>> functions.
>>>>>>>>
>>>>>>>> -sam
>>>>>>>>
>>>>>>>>
>>>>>>>>> I am pretty sure I can get to the bottom of the first problem.  
>>>>>>>>> The
>>>>>>>>> second one is trickier.
>>>>>>>>>
>>>>>>>>> I attached a few log file excerpts:
>>>>>>>>>
>>>>>>>>> client-1st.txt: This is the request processing log output from the
>>>>>>>>> client side when it is working on the flow in question
>>>>>>>>>
>>>>>>>>> server-1st.txt: request processing log output from the first
>>>>>>>>> iteration of the server side flow
>>>>>>>>>
>>>>>>>>> server-2nd.txt: request processing log output from the second
>>>>>>>>> iteration of the server side flow
>>>>>>>>>
>>>>>>>>> Anyone have any suggestions, or see any indication from those log
>>>>>>>>> portions about why the server and client arn't on the same page?
>>>>>>>>>
>>>>>>>>> Some of this log output (the server-2nd.txt in particular?) 
>>>>>>>>> looks a
>>>>>>>>> little jumbled, like maybe there are two flows doing request
>>>>>>>>> processing at once, but I don't really know how to interpret it.
>>>>>>>>>
>>>>>>>>> I may try adding a mutex around PINT_process_request() next, 
>>>>>>>>> just  to
>>>>>>>>> see if I can make the logs clearer by making sure only one
>>>>>>>>> invocation is going at once.
>>>>>>>>>
>>>>>>>>> -Phil
>>
>>
>>
>>
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



More information about the Pvfs2-developers mailing list