[Pvfs2-developers] help debugging request processor/distribution

Phil Carns pcarns at wastedcycles.org
Mon Jun 12 22:55:21 EDT 2006


I went back and added some much more specific debugging messages and put 
some special prefixes on flow, bmi, and request processor messages so I 
could group them a little easier, and got rid of the extra mutexes.

After running a few more tests and double checking the logs, this is 
what I am seeing in every failure case:

- it is a write flow
- it gets broken into two buffers within the flow protocol
   - the first is 256K
   - the second is 64K
- both sides _do_ get the request processing correct
- both sides post the correct BMI operations
   - the server posts 2 receives
   - the client posts 2 sends
- the second data message never shows up at the server

I must have missed part of the flow/request processing on the client 
side when I looked before.  Sorry about the wild goose chase with the 
request process question.  The request processing definitely looks 
correct in the logs that I am seeing now (or at least both the client 
and server agree on it anyway).

BMI has logging that shows when incoming messages appear on a socket, 
and it includes the tag number.  The 64K message never shows up on a 
socket at the server side.  The BMI operation therefore never completes, 
and the callback never gets triggered to complete the flow.

I don't know if this is a bmi_tcp problem, or if there is some sort of 
general network wackiness in my environment.  I'll try to revert to some 
older versions of bmi_tcp tomorrow and see if the problem persists.  I 
already tried just backing out changes to the poll notification in 
socket-collection.[ch] and (assuming I got that right) it didn't make 
any difference.

-Phil

Phil Carns wrote:
> Hi Sam,
> 
> I may be wrong, but so far it looks like the problem is a little 
> different this time.  In the scenario that I am seeing now it doesn't 
> look like the flow has finished everything and then just failed to mark 
> completion- it looks like one side actually posted more BMI messages 
> than the other.  The server posted two receives of 256K, while the 
> client only posted one send of 256K.  I don't see any zero byte results 
> from the request processor or any cases where the flow should have 
> marked completion.
> 
> I'm still trying to get a better log.  After adding mutexes in 
> PINT_process_request() on the server side, the problem has gotten much 
> harder to reproduce.  I don't know if there is any correlation or if I 
> am just perturbing the timing enough to hide the problem.  So far out of 
> 9 runs I have had only 1 failure, and the logs from that failure were 
> inconclusive.
> 
> I definitely agree with both of your points about reworking the flow 
> code; it is difficult to debug right now.
> 
> -Phil
> 
> Sam Lang wrote:
> 
>>
>> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
>>
>>> Hi all,
>>>
>>> I am looking at an I/O problem that I don't completely understand.   
>>> The setup is that there are 15 servers and 20 clients (all RHEL3  
>>> SMP).  The clients are running a proprietary application.  At the  
>>> end of the run they each write their share of a data set into a 36  
>>> GB file.  So each is doing a contiguous write of 36/20 GB of data.
>>>
>>> Roughly 80-90% of the time, one or more of the clients and servers  
>>> hang, though it isn't predictable which one will do it.  After  quite 
>>> a bit of digging through logs, it turns out that there is a  flow 
>>> that never completes on the server side.  There are two  problems 
>>> contributing here:
>>>
>>> 1) The flow timeout mechanism doesn't seem to be working right for  
>>> me (the timer for the flow pops and checks progress one time, then  
>>> never gets triggered again).  This isn't the core problem, but it  is 
>>> causing the flow to get stuck forever and hold up the request  
>>> scheduler rather than giving up eventually.
>>>
>>> 2) For this particular flow, the client and server appear to  
>>> disagree on how much data to exchange.  The client sends one 256K  
>>> buffer and completes the flow.  The server receives that 256K  buffer 
>>> but still expects at least one more to arrive (which never  does).
>>>
>>
>> Hi Phil, these problems sound very similar to the bug I was seeing  
>> about a month ago with the flow code for certain request types.  The  
>> particular case I was looking at was with reads, so given that  
>> there's a lot of code duplication in flow I could imagine the same  
>> problem existing (but not getting fixed) for writes.  It may not be  
>> the same as the problem you're seeing, but I'll walk through the bug  
>> and how I fixed it, in case it helps.
>>
>> Basically, the request was ending right at the boundary of a flow  
>> buffer (so bytes == bytemax), this caused the request processing code  
>> to finish and the bmi_send_callback_fn would return, but it would get  
>> called again by the post code for the next buffer, except this time  
>> the request processing would return 0 bytes, and this queue item  
>> would get marked as the last one.  It was that last queue item that  
>> was the problem: since the trove_bstream_read_list was never getting  
>> posted, the queue item who's last field was set, was never making it  
>> into the trove_read_callback_fn, so the dest_last_posted field for  
>> the flow data wasn't getting set either.
>>
>> The fix was to add some extra checking in bmi_send_callback_fn.  In  
>> the case that the request processing returns 0 bytes processed, I can  
>> set the dest_last_posted field to 1 immediately.  The only caveat to  
>> this is in the case that other queue items in the source list haven't  
>> made it to the dest list yet, so we need to check for that case.  The  
>> diff of the changes I made might help:
>>
>> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/ 
>> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
>>
>> In any case, I wasn't able to make use of the debug messages for the  
>> request processing code, it was too verbose to make any sense of it.   
>> I think I ended up just adding debug statements for the entry and  
>> exit points of the flow callbacks along with all the different side- 
>> state fields as they were changed.  I'm not sure those debug  
>> statements made it into the trunk though. :-(
>>
>> Since we're here, I think the flow code could benefit from some  
>> reworking in two ways:
>>
>> 1.  Less side-state variables to be set.  Each queue item has fields  
>> that change the behavior of when things should complete.  Its hard to  
>> keep track of all that.
>>
>> 2.  More code abstraction.  We end up doing very similar things  
>> between reads and writes on the server (and on the client too),  
>> especially with respect to the request processing.  It might make  
>> sense to abstract out some of that code duplication into separate  
>> functions.
>>
>> -sam
>>
>>> I am pretty sure I can get to the bottom of the first problem.  The  
>>> second one is trickier.
>>>
>>> I attached a few log file excerpts:
>>>
>>> client-1st.txt: This is the request processing log output from the  
>>> client side when it is working on the flow in question
>>>
>>> server-1st.txt: request processing log output from the first  
>>> iteration of the server side flow
>>>
>>> server-2nd.txt: request processing log output from the second  
>>> iteration of the server side flow
>>>
>>> Anyone have any suggestions, or see any indication from those log  
>>> portions about why the server and client arn't on the same page?
>>>
>>> Some of this log output (the server-2nd.txt in particular?) looks a  
>>> little jumbled, like maybe there are two flows doing request  
>>> processing at once, but I don't really know how to interpret it.
>>>
>>> I may try adding a mutex around PINT_process_request() next, just  to 
>>> see if I can make the logs clearer by making sure only one  
>>> invocation is going at once.
>>>
>>> -Phil



More information about the Pvfs2-developers mailing list