[Pvfs2-developers] help debugging request processor/distribution

Phil Carns pcarns at wastedcycles.org
Tue Jun 13 13:13:26 EDT 2006


Hi Murali,

I'm not seeing any problems with the dd scenario that you describe (I 
tried running 8-10 of them simultaneously), but I may just have 
different timing characteristics.

One thing that I neglected to mention about the application that I am 
running is that it actually has 2 processes per client.  That may be 
something that I have in common with your test case (multiple 
applications writing concurrently from the same client).  I will 
reconfigure and try some test runs with 1 process per client and see if 
that impacts behavior for my test case.

-Phil

Phil Carns wrote:
> I reverted my src/io/bmi directory back to a snapshot of what was in CVS 
> around March, but I still have the same problems.  It doesn't seem to be 
> impacted by the pipe changes or the socket buffer size changes.
> 
> I haven't tried your dd test case yet, but I will give that a shot soon.
> 
> -Phil
> 
> Murali Vilayannur wrote:
> 
>> Hi Phil, others,
>> I think I am seeing the same problem (or some variants of it) with the
>> NFSv4 bug that I reported a few days ago..
>> Here is a very simple way that I am able to reproduce it reliably..
>> Mount a pvfs2 volume and run a few instances of
>> dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100
>> All the dd instances simply wait for something to happen, client-core and
>> servers are also waiting for something to happen and no process makes any
>> progress..
>> It is perhaps related to the pipe changes to bmi-tcp with poll? Let me
>> revert those changes and try it out as well..
>> Thanks,
>> Murali
>>
>>
>> On Tue, 13 Jun 2006, Phil Carns wrote:
>>
>>
>>> I went back and added some much more specific debugging messages and put
>>> some special prefixes on flow, bmi, and request processor messages so I
>>> could group them a little easier, and got rid of the extra mutexes.
>>>
>>> After running a few more tests and double checking the logs, this is
>>> what I am seeing in every failure case:
>>>
>>> - it is a write flow
>>> - it gets broken into two buffers within the flow protocol
>>>   - the first is 256K
>>>   - the second is 64K
>>> - both sides _do_ get the request processing correct
>>> - both sides post the correct BMI operations
>>>   - the server posts 2 receives
>>>   - the client posts 2 sends
>>> - the second data message never shows up at the server
>>>
>>> I must have missed part of the flow/request processing on the client
>>> side when I looked before.  Sorry about the wild goose chase with the
>>> request process question.  The request processing definitely looks
>>> correct in the logs that I am seeing now (or at least both the client
>>> and server agree on it anyway).
>>>
>>> BMI has logging that shows when incoming messages appear on a socket,
>>> and it includes the tag number.  The 64K message never shows up on a
>>> socket at the server side.  The BMI operation therefore never completes,
>>> and the callback never gets triggered to complete the flow.
>>>
>>> I don't know if this is a bmi_tcp problem, or if there is some sort of
>>> general network wackiness in my environment.  I'll try to revert to some
>>> older versions of bmi_tcp tomorrow and see if the problem persists.  I
>>> already tried just backing out changes to the poll notification in
>>> socket-collection.[ch] and (assuming I got that right) it didn't make
>>> any difference.
>>>
>>> -Phil
>>>
>>> Phil Carns wrote:
>>>
>>>> Hi Sam,
>>>>
>>>> I may be wrong, but so far it looks like the problem is a little
>>>> different this time.  In the scenario that I am seeing now it doesn't
>>>> look like the flow has finished everything and then just failed to mark
>>>> completion- it looks like one side actually posted more BMI messages
>>>> than the other.  The server posted two receives of 256K, while the
>>>> client only posted one send of 256K.  I don't see any zero byte results
>>>> from the request processor or any cases where the flow should have
>>>> marked completion.
>>>>
>>>> I'm still trying to get a better log.  After adding mutexes in
>>>> PINT_process_request() on the server side, the problem has gotten much
>>>> harder to reproduce.  I don't know if there is any correlation or if I
>>>> am just perturbing the timing enough to hide the problem.  So far 
>>>> out of
>>>> 9 runs I have had only 1 failure, and the logs from that failure were
>>>> inconclusive.
>>>>
>>>> I definitely agree with both of your points about reworking the flow
>>>> code; it is difficult to debug right now.
>>>>
>>>> -Phil
>>>>
>>>> Sam Lang wrote:
>>>>
>>>>
>>>>> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>>>>
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am looking at an I/O problem that I don't completely understand.
>>>>>> The setup is that there are 15 servers and 20 clients (all RHEL3
>>>>>> SMP).  The clients are running a proprietary application.  At the
>>>>>> end of the run they each write their share of a data set into a 36
>>>>>> GB file.  So each is doing a contiguous write of 36/20 GB of data.
>>>>>>
>>>>>> Roughly 80-90% of the time, one or more of the clients and servers
>>>>>> hang, though it isn't predictable which one will do it.  After  quite
>>>>>> a bit of digging through logs, it turns out that there is a  flow
>>>>>> that never completes on the server side.  There are two  problems
>>>>>> contributing here:
>>>>>>
>>>>>> 1) The flow timeout mechanism doesn't seem to be working right for
>>>>>> me (the timer for the flow pops and checks progress one time, then
>>>>>> never gets triggered again).  This isn't the core problem, but it  is
>>>>>> causing the flow to get stuck forever and hold up the request
>>>>>> scheduler rather than giving up eventually.
>>>>>>
>>>>>> 2) For this particular flow, the client and server appear to
>>>>>> disagree on how much data to exchange.  The client sends one 256K
>>>>>> buffer and completes the flow.  The server receives that 256K  buffer
>>>>>> but still expects at least one more to arrive (which never  does).
>>>>>>
>>>>>
>>>>> Hi Phil, these problems sound very similar to the bug I was seeing
>>>>> about a month ago with the flow code for certain request types.  The
>>>>> particular case I was looking at was with reads, so given that
>>>>> there's a lot of code duplication in flow I could imagine the same
>>>>> problem existing (but not getting fixed) for writes.  It may not be
>>>>> the same as the problem you're seeing, but I'll walk through the bug
>>>>> and how I fixed it, in case it helps.
>>>>>
>>>>> Basically, the request was ending right at the boundary of a flow
>>>>> buffer (so bytes == bytemax), this caused the request processing code
>>>>> to finish and the bmi_send_callback_fn would return, but it would get
>>>>> called again by the post code for the next buffer, except this time
>>>>> the request processing would return 0 bytes, and this queue item
>>>>> would get marked as the last one.  It was that last queue item that
>>>>> was the problem: since the trove_bstream_read_list was never getting
>>>>> posted, the queue item who's last field was set, was never making it
>>>>> into the trove_read_callback_fn, so the dest_last_posted field for
>>>>> the flow data wasn't getting set either.
>>>>>
>>>>> The fix was to add some extra checking in bmi_send_callback_fn.  In
>>>>> the case that the request processing returns 0 bytes processed, I can
>>>>> set the dest_last_posted field to 1 immediately.  The only caveat to
>>>>> this is in the case that other queue items in the source list haven't
>>>>> made it to the dest list yet, so we need to check for that case.  The
>>>>> diff of the changes I made might help:
>>>>>
>>>>> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
>>>>> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
>>>>>
>>>>> In any case, I wasn't able to make use of the debug messages for the
>>>>> request processing code, it was too verbose to make any sense of it.
>>>>> I think I ended up just adding debug statements for the entry and
>>>>> exit points of the flow callbacks along with all the different side-
>>>>> state fields as they were changed.  I'm not sure those debug
>>>>> statements made it into the trunk though. :-(
>>>>>
>>>>> Since we're here, I think the flow code could benefit from some
>>>>> reworking in two ways:
>>>>>
>>>>> 1.  Less side-state variables to be set.  Each queue item has fields
>>>>> that change the behavior of when things should complete.  Its hard to
>>>>> keep track of all that.
>>>>>
>>>>> 2.  More code abstraction.  We end up doing very similar things
>>>>> between reads and writes on the server (and on the client too),
>>>>> especially with respect to the request processing.  It might make
>>>>> sense to abstract out some of that code duplication into separate
>>>>> functions.
>>>>>
>>>>> -sam
>>>>>
>>>>>
>>>>>> I am pretty sure I can get to the bottom of the first problem.  The
>>>>>> second one is trickier.
>>>>>>
>>>>>> I attached a few log file excerpts:
>>>>>>
>>>>>> client-1st.txt: This is the request processing log output from the
>>>>>> client side when it is working on the flow in question
>>>>>>
>>>>>> server-1st.txt: request processing log output from the first
>>>>>> iteration of the server side flow
>>>>>>
>>>>>> server-2nd.txt: request processing log output from the second
>>>>>> iteration of the server side flow
>>>>>>
>>>>>> Anyone have any suggestions, or see any indication from those log
>>>>>> portions about why the server and client arn't on the same page?
>>>>>>
>>>>>> Some of this log output (the server-2nd.txt in particular?) looks a
>>>>>> little jumbled, like maybe there are two flows doing request
>>>>>> processing at once, but I don't really know how to interpret it.
>>>>>>
>>>>>> I may try adding a mutex around PINT_process_request() next, just  to
>>>>>> see if I can make the logs clearer by making sure only one
>>>>>> invocation is going at once.
>>>>>>
>>>>>> -Phil
>>>
>>>
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>
>>>
> 
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



More information about the Pvfs2-developers mailing list