[Pvfs2-developers] help debugging request processor/distribution
Phil Carns
pcarns at wastedcycles.org
Mon Jun 12 22:55:21 EDT 2006
I went back and added some much more specific debugging messages and put
some special prefixes on flow, bmi, and request processor messages so I
could group them a little easier, and got rid of the extra mutexes.
After running a few more tests and double checking the logs, this is
what I am seeing in every failure case:
- it is a write flow
- it gets broken into two buffers within the flow protocol
- the first is 256K
- the second is 64K
- both sides _do_ get the request processing correct
- both sides post the correct BMI operations
- the server posts 2 receives
- the client posts 2 sends
- the second data message never shows up at the server
I must have missed part of the flow/request processing on the client
side when I looked before. Sorry about the wild goose chase with the
request process question. The request processing definitely looks
correct in the logs that I am seeing now (or at least both the client
and server agree on it anyway).
BMI has logging that shows when incoming messages appear on a socket,
and it includes the tag number. The 64K message never shows up on a
socket at the server side. The BMI operation therefore never completes,
and the callback never gets triggered to complete the flow.
I don't know if this is a bmi_tcp problem, or if there is some sort of
general network wackiness in my environment. I'll try to revert to some
older versions of bmi_tcp tomorrow and see if the problem persists. I
already tried just backing out changes to the poll notification in
socket-collection.[ch] and (assuming I got that right) it didn't make
any difference.
-Phil
Phil Carns wrote:
> Hi Sam,
>
> I may be wrong, but so far it looks like the problem is a little
> different this time. In the scenario that I am seeing now it doesn't
> look like the flow has finished everything and then just failed to mark
> completion- it looks like one side actually posted more BMI messages
> than the other. The server posted two receives of 256K, while the
> client only posted one send of 256K. I don't see any zero byte results
> from the request processor or any cases where the flow should have
> marked completion.
>
> I'm still trying to get a better log. After adding mutexes in
> PINT_process_request() on the server side, the problem has gotten much
> harder to reproduce. I don't know if there is any correlation or if I
> am just perturbing the timing enough to hide the problem. So far out of
> 9 runs I have had only 1 failure, and the logs from that failure were
> inconclusive.
>
> I definitely agree with both of your points about reworking the flow
> code; it is difficult to debug right now.
>
> -Phil
>
> Sam Lang wrote:
>
>>
>> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>
>>> Hi all,
>>>
>>> I am looking at an I/O problem that I don't completely understand.
>>> The setup is that there are 15 servers and 20 clients (all RHEL3
>>> SMP). The clients are running a proprietary application. At the
>>> end of the run they each write their share of a data set into a 36
>>> GB file. So each is doing a contiguous write of 36/20 GB of data.
>>>
>>> Roughly 80-90% of the time, one or more of the clients and servers
>>> hang, though it isn't predictable which one will do it. After quite
>>> a bit of digging through logs, it turns out that there is a flow
>>> that never completes on the server side. There are two problems
>>> contributing here:
>>>
>>> 1) The flow timeout mechanism doesn't seem to be working right for
>>> me (the timer for the flow pops and checks progress one time, then
>>> never gets triggered again). This isn't the core problem, but it is
>>> causing the flow to get stuck forever and hold up the request
>>> scheduler rather than giving up eventually.
>>>
>>> 2) For this particular flow, the client and server appear to
>>> disagree on how much data to exchange. The client sends one 256K
>>> buffer and completes the flow. The server receives that 256K buffer
>>> but still expects at least one more to arrive (which never does).
>>>
>>
>> Hi Phil, these problems sound very similar to the bug I was seeing
>> about a month ago with the flow code for certain request types. The
>> particular case I was looking at was with reads, so given that
>> there's a lot of code duplication in flow I could imagine the same
>> problem existing (but not getting fixed) for writes. It may not be
>> the same as the problem you're seeing, but I'll walk through the bug
>> and how I fixed it, in case it helps.
>>
>> Basically, the request was ending right at the boundary of a flow
>> buffer (so bytes == bytemax), this caused the request processing code
>> to finish and the bmi_send_callback_fn would return, but it would get
>> called again by the post code for the next buffer, except this time
>> the request processing would return 0 bytes, and this queue item
>> would get marked as the last one. It was that last queue item that
>> was the problem: since the trove_bstream_read_list was never getting
>> posted, the queue item who's last field was set, was never making it
>> into the trove_read_callback_fn, so the dest_last_posted field for
>> the flow data wasn't getting set either.
>>
>> The fix was to add some extra checking in bmi_send_callback_fn. In
>> the case that the request processing returns 0 bytes processed, I can
>> set the dest_last_posted field to 1 immediately. The only caveat to
>> this is in the case that other queue items in the source list haven't
>> made it to the dest list yet, so we need to check for that case. The
>> diff of the changes I made might help:
>>
>> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
>> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
>>
>> In any case, I wasn't able to make use of the debug messages for the
>> request processing code, it was too verbose to make any sense of it.
>> I think I ended up just adding debug statements for the entry and
>> exit points of the flow callbacks along with all the different side-
>> state fields as they were changed. I'm not sure those debug
>> statements made it into the trunk though. :-(
>>
>> Since we're here, I think the flow code could benefit from some
>> reworking in two ways:
>>
>> 1. Less side-state variables to be set. Each queue item has fields
>> that change the behavior of when things should complete. Its hard to
>> keep track of all that.
>>
>> 2. More code abstraction. We end up doing very similar things
>> between reads and writes on the server (and on the client too),
>> especially with respect to the request processing. It might make
>> sense to abstract out some of that code duplication into separate
>> functions.
>>
>> -sam
>>
>>> I am pretty sure I can get to the bottom of the first problem. The
>>> second one is trickier.
>>>
>>> I attached a few log file excerpts:
>>>
>>> client-1st.txt: This is the request processing log output from the
>>> client side when it is working on the flow in question
>>>
>>> server-1st.txt: request processing log output from the first
>>> iteration of the server side flow
>>>
>>> server-2nd.txt: request processing log output from the second
>>> iteration of the server side flow
>>>
>>> Anyone have any suggestions, or see any indication from those log
>>> portions about why the server and client arn't on the same page?
>>>
>>> Some of this log output (the server-2nd.txt in particular?) looks a
>>> little jumbled, like maybe there are two flows doing request
>>> processing at once, but I don't really know how to interpret it.
>>>
>>> I may try adding a mutex around PINT_process_request() next, just to
>>> see if I can make the logs clearer by making sure only one
>>> invocation is going at once.
>>>
>>> -Phil
More information about the Pvfs2-developers
mailing list