[Pvfs2-developers] Re: pvfs2-client issues over IB

Kyle Schochenmaier kschoche at scl.ameslab.gov
Mon Sep 25 13:32:22 EDT 2006


Pete Wyckoff wrote:
> kschoche at scl.ameslab.gov wrote on Mon, 25 Sep 2006 10:50 -0500:
>   
>> Can anyone make any sense of this?
>> I have a feeling these are related to the hangups I'm having w/o the 
>> client interface in openib.
>> This is built off of latest cvs head.  6 server nodes, 1 client node. 
>> mounted via pvfs2-client over openib.
>>     
>
> They are exactly related.
>
>   
>> Log message from the client:
>>
>> [E 10:19:33.127182] fp_multiqueue_cancel: flow proto cancel called on 
>> 0x10151cf0
>> [E 10:19:33.127283] handle_io_error: flow proto error cleanup started on 
>> 0x10151cf0, error_code: -1610612737
>>     
>
> The client is bored of waiting for one of its IO flows to finish.  A
> read or write operation.  The error code translates to "Operation
> cancelled (possibly due to timeout)", indicating the client itself
> did BMI_Cancel() after 30 sec of waiting for a response.  Things are
> designed to recover after this, but they may not as that's not well
> debugged and "should not happen" in normal operation.  Even if it
> did recover properly, your performance would be terrible.
>
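For anyone decoding that number by hand, here is a minimal sketch of
how the error code breaks down. The two flag-bit positions are an
assumption recalled from pvfs2-types.h and should be checked against
your own tree; the helper name decode_pvfs_error is hypothetical and
not part of PVFS2.

```python
# Assumed flag bits (recalled from pvfs2-types.h; verify in your tree).
PVFS_ERROR_BIT = 1 << 30            # marks the value as a PVFS error
PVFS_NON_ERRNO_ERROR_BIT = 1 << 29  # marks an error with no errno equivalent


def decode_pvfs_error(error_code):
    """Split a logged (negative) error code into flag bits and remainder.

    Hypothetical helper for illustration only.
    """
    magnitude = abs(error_code)
    flags = PVFS_ERROR_BIT | PVFS_NON_ERRNO_ERROR_BIT
    return {
        "hex": hex(magnitude),
        "error_bit": bool(magnitude & PVFS_ERROR_BIT),
        "non_errno_bit": bool(magnitude & PVFS_NON_ERRNO_ERROR_BIT),
        "remainder": magnitude & ~flags,
    }


if __name__ == "__main__":
    # The value from the client log above.
    print(decode_pvfs_error(-1610612737))
```

Under those assumptions the logged value is 0x60000001 with both flag
bits set and a remainder of 1, which would be the code Pete translated
as "Operation cancelled (possibly due to timeout)".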
> Did you take a look at debugging your netpipe failure testcase?
> That seems like the lowest level where we can figure out what is
> going wrong.  You should not be getting timeouts at all, and
> appearances point to messages getting lost in the network somehow.
> I cannot get your testcase to fail here, after over 72 hours of
> continuous testing.
>   
I wish that were the case on our end. Hopefully the utility you sent me 
will point out some points of failure on the network, and I can 
resolve those. Also, Pete, if you're interested in testing on our end, 
I'm sending you an offline mail in a few minutes with instructions.
> Also did you have a chance to run the network debugging tool I sent
> you offline?  Both of these last mails from me should have reached
> you on Monday last week.
>
>   
I'm running the network debugging tool right now, though I'm not sure 
if it's running correctly. Maybe I need to look back at that email to 
see when it will complete. I tried the '8 minute' test you suggested 
and we're at 25 minutes now.
> You really can't expect to get the full system working until you fix
> the basic failure.
>   
I thought I'd take a blind shot and try it out, in hopes that the error 
messages I get here would provide some insight into what's breaking 
underneath, since those messages aren't helping out right now.
> 		-- Pete
>
>
>
>   


-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 


