[Pvfs2-developers] Re: pvfs2-client issues over IB
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Mon Sep 25 13:32:22 EDT 2006
Pete Wyckoff wrote:
> kschoche at scl.ameslab.gov wrote on Mon, 25 Sep 2006 10:50 -0500:
>
>> Can anyone make any sense of this?
>> I have a feeling these are related to the hangups I'm having w/o the
>> client interface in openib.
>> This is built off of latest cvs head. 6 server nodes, 1 client node.
>> mounted via pvfs2-client over openib.
>>
>
> They are exactly related.
>
>
>> Log message from the client:
>>
>> [E 10:19:33.127182] fp_multiqueue_cancel: flow proto cancel called on
>> 0x10151cf0
>> [E 10:19:33.127283] handle_io_error: flow proto error cleanup started on
>> 0x10151cf0, error_code: -1610612737
>>
>
> The client is bored of waiting for one of its IO flows to finish. A
> read or write operation. The error code translates to "Operation
> cancelled (possibly due to timeout)", indicating the client itself
> did BMI_Cancel() after 30 sec of waiting for a response. Things are
> designed to recover after this, but they may not as that's not well
> debugged and "should not happen" in normal operation. Even if it
> did recover properly, your performance would be terrible.
>
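As a side note on the error_code in the log above: a minimal sketch of how it seems to break down, assuming (from my reading of the PVFS2 headers, not from anything in this thread) that PVFS-specific codes are returned negated and carry two high flag bits. The names ERROR_BIT, NON_ERRNO_ERROR_BIT and decode() are illustrative only.

```python
# Assumed bit layout, not confirmed in this thread: PVFS2 appears to tag its
# own error codes with two high flag bits and to return them negated.
ERROR_BIT = 1 << 30            # assumed "this is a PVFS error" flag
NON_ERRNO_ERROR_BIT = 1 << 29  # assumed "not a plain errno value" flag


def decode(error_code):
    """Split a negative error_code from the log into its assumed parts."""
    magnitude = -error_code
    return {
        "hex": hex(magnitude),
        "error_bit": bool(magnitude & ERROR_BIT),
        "non_errno_bit": bool(magnitude & NON_ERRNO_ERROR_BIT),
        # What is left after masking off the two flag bits.
        "index": magnitude & ~(ERROR_BIT | NON_ERRNO_ERROR_BIT),
    }


print(decode(-1610612737))
```

Under that assumption the value is 0x60000001: both flag bits set plus index 1, which would be consistent with a PVFS-specific "cancelled" code rather than an ordinary errno.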
> Did you take a look at debugging your netpipe failure testcase?
> That seems like the lowest level where we can figure out what is
> going wrong. You should not be getting timeouts at all, and
> appearances point to messages getting lost in the network somehow.
> I cannot get your testcase to fail here, after over 72 hours of
> continuous testing.
>
I wish that were the case on our end. Hopefully the utility you sent me
will point out some points of failure on the network that I can
resolve. Pete, if you're interested in testing on our end, I'm sending
you an offline mail in a few minutes with instructions.
> Also did you have a chance to run the network debugging tool I sent
> you offline? Both these last mails from me should have appeared me
> on Monday last week.
>
>
I'm running the network debugging tool right now, though I'm not sure if
it's running correctly; maybe I need to look back at that email to see
when it will complete. I tried the '8 minute' test you suggested and
we're at 25 minutes now.
> You really can't expect to get the full system working until you fix
> the basic failure.
>
I thought I'd take a blind shot and try it out, in hopes that the error
messages I get here would provide some insight into what's breaking
underneath since those messages aren't helping out right now.
> -- Pete
>
>
>
>
--
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory