[Pvfs2-developers] help debugging request processor/distribution
Phil Carns
pcarns at wastedcycles.org
Tue Jun 13 12:38:29 EDT 2006
I reverted my src/io/bmi directory back to a snapshot of what was in CVS
around March, but I still have the same problems. It doesn't seem to be
impacted by the pipe changes or the socket buffer size changes.
I haven't tried your dd test case yet, but I will give that a shot soon.
-Phil
Murali Vilayannur wrote:
> Hi Phil, others,
> I think I am seeing the same problem (or some variants of it) with the
> NFSv4 bug that I reported a few days ago..
> Here is a very simple way that I am able to reproduce it reliably..
> Mount a pvfs2 volume and run a few instances of
> dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100
> All the dd instances simply wait for something to happen, client-core and
> servers are also waiting for something to happen and no process makes any
> progress..
> It is perhaps related to the pipe changes to bmi-tcp with poll? Let me
> revert those changes and try it out as well..
> Thanks,
> Murali
>
>
> On Tue, 13 Jun 2006, Phil Carns wrote:
>
>
>>I went back and added some much more specific debugging messages and put
>>some special prefixes on flow, bmi, and request processor messages so I
>>could group them a little easier, and got rid of the extra mutexes.
>>
>>After running a few more tests and double checking the logs, this is
>>what I am seeing in every failure case:
>>
>>- it is a write flow
>>- it gets broken into two buffers within the flow protocol
>> - the first is 256K
>> - the second is 64K
>>- both sides _do_ get the request processing correct
>>- both sides post the correct BMI operations
>> - the server posts 2 receives
>> - the client posts 2 sends
>>- the second data message never shows up at the server
>>
>>I must have missed part of the flow/request processing on the client
>>side when I looked before. Sorry about the wild goose chase with the
>>request process question. The request processing definitely looks
>>correct in the logs that I am seeing now (or at least both the client
>>and server agree on it anyway).
>>
>>BMI has logging that shows when incoming messages appear on a socket,
>>and it includes the tag number. The 64K message never shows up on a
>>socket at the server side. The BMI operation therefore never completes,
>>and the callback never gets triggered to complete the flow.
>>
>>I don't know if this is a bmi_tcp problem, or if there is some sort of
>>general network wackiness in my environment. I'll try to revert to some
>>older versions of bmi_tcp tomorrow and see if the problem persists. I
>>already tried just backing out changes to the poll notification in
>>socket-collection.[ch] and (assuming I got that right) it didn't make
>>any difference.
>>
>>-Phil
>>
>>Phil Carns wrote:
>>
>>>Hi Sam,
>>>
>>>I may be wrong, but so far it looks like the problem is a little
>>>different this time. In the scenario that I am seeing now it doesn't
>>>look like the flow has finished everything and then just failed to mark
>>>completion- it looks like one side actually posted more BMI messages
>>>than the other. The server posted two receives of 256K, while the
>>>client only posted one send of 256K. I don't see any zero byte results
>>>from the request processor or any cases where the flow should have
>>>marked completion.
>>>
>>>I'm still trying to get a better log. After adding mutexes in
>>>PINT_process_request() on the server side, the problem has gotten much
>>>harder to reproduce. I don't know if there is any correlation or if I
>>>am just perturbing the timing enough to hide the problem. So far out of
>>>9 runs I have had only 1 failure, and the logs from that failure were
>>>inconclusive.
>>>
>>>I definitely agree with both of your points about reworking the flow
>>>code; it is difficult to debug right now.
>>>
>>>-Phil
>>>
>>>Sam Lang wrote:
>>>
>>>
>>>>On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>>>
>>>>
>>>>>Hi all,
>>>>>
>>>>>I am looking at an I/O problem that I don't completely understand.
>>>>>The setup is that there are 15 servers and 20 clients (all RHEL3
>>>>>SMP). The clients are running a proprietary application. At the
>>>>>end of the run they each write their share of a data set into a 36
>>>>>GB file. So each is doing a contiguous write of 36/20 GB of data.
>>>>>
>>>>>Roughly 80-90% of the time, one or more of the clients and servers
>>>>>hang, though it isn't predictable which one will do it. After quite
>>>>>a bit of digging through logs, it turns out that there is a flow
>>>>>that never completes on the server side. There are two problems
>>>>>contributing here:
>>>>>
>>>>>1) The flow timeout mechanism doesn't seem to be working right for
>>>>>me (the timer for the flow pops and checks progress one time, then
>>>>>never gets triggered again). This isn't the core problem, but it is
>>>>>causing the flow to get stuck forever and hold up the request
>>>>>scheduler rather than giving up eventually.
>>>>>
>>>>>2) For this particular flow, the client and server appear to
>>>>>disagree on how much data to exchange. The client sends one 256K
>>>>>buffer and completes the flow. The server receives that 256K buffer
>>>>>but still expects at least one more to arrive (which never does).
>>>>>
>>>>
>>>>Hi Phil, these problems sound very similar to the bug I was seeing
>>>>about a month ago with the flow code for certain request types. The
>>>>particular case I was looking at was with reads, so given that
>>>>there's a lot of code duplication in flow I could imagine the same
>>>>problem existing (but not getting fixed) for writes. It may not be
>>>>the same as the problem you're seeing, but I'll walk through the bug
>>>>and how I fixed it, in case it helps.
>>>>
>>>>Basically, the request was ending right at the boundary of a flow
>>>>buffer (so bytes == bytemax), this caused the request processing code
>>>>to finish and the bmi_send_callback_fn would return, but it would get
>>>>called again by the post code for the next buffer, except this time
>>>>the request processing would return 0 bytes, and this queue item
>>>>would get marked as the last one. It was that last queue item that
>>>>was the problem: since the trove_bstream_read_list was never getting
>>>>posted, the queue item who's last field was set, was never making it
>>>>into the trove_read_callback_fn, so the dest_last_posted field for
>>>>the flow data wasn't getting set either.
>>>>
>>>>The fix was to add some extra checking in bmi_send_callback_fn. In
>>>>the case that the request processing returns 0 bytes processed, I can
>>>>set the dest_last_posted field to 1 immediately. The only caveat to
>>>>this is in the case that other queue items in the source list haven't
>>>>made it to the dest list yet, so we need to check for that case. The
>>>>diff of the changes I made might help:
>>>>
>>>>http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
>>>>flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
>>>>
>>>>In any case, I wasn't able to make use of the debug messages for the
>>>>request processing code, it was too verbose to make any sense of it.
>>>>I think I ended up just adding debug statements for the entry and
>>>>exit points of the flow callbacks along with all the different side-
>>>>state fields as they were changed. I'm not sure those debug
>>>>statements made it into the trunk though. :-(
>>>>
>>>>Since we're here, I think the flow code could benefit from some
>>>>reworking in two ways:
>>>>
>>>>1. Less side-state variables to be set. Each queue item has fields
>>>>that change the behavior of when things should complete. Its hard to
>>>>keep track of all that.
>>>>
>>>>2. More code abstraction. We end up doing very similar things
>>>>between reads and writes on the server (and on the client too),
>>>>especially with respect to the request processing. It might make
>>>>sense to abstract out some of that code duplication into separate
>>>>functions.
>>>>
>>>>-sam
>>>>
>>>>
>>>>>I am pretty sure I can get to the bottom of the first problem. The
>>>>>second one is trickier.
>>>>>
>>>>>I attached a few log file excerpts:
>>>>>
>>>>>client-1st.txt: This is the request processing log output from the
>>>>>client side when it is working on the flow in question
>>>>>
>>>>>server-1st.txt: request processing log output from the first
>>>>>iteration of the server side flow
>>>>>
>>>>>server-2nd.txt: request processing log output from the second
>>>>>iteration of the server side flow
>>>>>
>>>>>Anyone have any suggestions, or see any indication from those log
>>>>>portions about why the server and client arn't on the same page?
>>>>>
>>>>>Some of this log output (the server-2nd.txt in particular?) looks a
>>>>>little jumbled, like maybe there are two flows doing request
>>>>>processing at once, but I don't really know how to interpret it.
>>>>>
>>>>>I may try adding a mutex around PINT_process_request() next, just to
>>>>>see if I can make the logs clearer by making sure only one
>>>>>invocation is going at once.
>>>>>
>>>>>-Phil
>>
>>_______________________________________________
>>Pvfs2-developers mailing list
>>Pvfs2-developers at beowulf-underground.org
>>http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>
>>
More information about the Pvfs2-developers
mailing list