[Pvfs2-developers] Re: openib-vfs failure
kschoche at scl.ameslab.gov
Tue Jan 29 17:58:58 EST 2008
> kschoche at scl.ameslab.gov wrote on Tue, 29 Jan 2008 16:09 -0600:
>> I've been running GAMESS tests with about 160 GB on the filesystem,
>> trying to stress the network a bit, and can reproducibly get the
>> pvfs2-client to die on an assertion failure at
>> "src/io/bmi/bmi_ib/ib.c:611".
>>
>> I haven't been able to figure out exactly what is causing this
>> assertion failure, but from the code it really appears that this
>> should never occur, obviously (it's an assertion) ;) Maybe we're
>> getting duplicate messages or double-testing a message somehow.
>>
>> I'm running CVS HEAD on Debian with a 2.6.18 kernel, using the bmi_ib
>> modules over the VFS.
>>
>> [E 15:54:20.197159] Error: encourage_recv_incoming: RTS_DONE to rq wrong state RQ_RTS_WAITING_USER_TEST.
>> [E 15:54:20.200927] [bt] pvfs2-client-core(error+0xca) [0x41a2ba]
>> [E 15:54:20.200940] [bt] pvfs2-client-core [0x41779f]
>> [E 15:54:20.200948] [bt] pvfs2-client-core [0x417e3a]
>> [E 15:54:20.200955] [bt] pvfs2-client-core [0x4181fd]
>> [E 15:54:20.200963] [bt] pvfs2-client-core(job_bmi_recv+0xea) [0x422f0a]
>> [E 15:54:20.200971] [bt] pvfs2-client-core [0x441a18]
>> [E 15:54:20.200978] [bt] pvfs2-client-core(PINT_state_machine_invoke+0xd2) [0x431be2]
>> [E 15:54:20.200986] [bt] pvfs2-client-core(PINT_state_machine_next+0xcc) [0x43198c]
>> [E 15:54:20.200994] [bt] pvfs2-client-core(PINT_client_state_machine_post+0x99) [0x4383e9]
>> [E 15:54:20.201001] [bt] pvfs2-client-core(PVFS_isys_io+0x324) [0x4430a4]
>> [E 15:54:20.201009] [bt] pvfs2-client-core [0x4117a6]
>> [E 15:54:20.205453] pvfs2-client-core with pid 6251 exited with value 1
>
> That is indeed scary. The server has sent MSG_RTS_DONE to the
> client. The client looks up the mop_id (64-bit number in header)
> and finds it corresponds to a message that it thought had already
> been completed. The message is in "waiting user test" which means
> IB is all done, it just is waiting for the upper layers to ask for
> the completion status.
>
> You could turn on debugging, level 2, which I think is the default.
> Enable it on the client core by starting it up with
>
> pvfs2-client --gossip-mask=network
>
> then look at the /tmp/pvfs-client.log (or whatever I forget) and try
> to find some patterns. You will see these messages:
>
> debug(2, "%s: recv RTS_DONE mop_id %llx", __func__,
>
> whenever the client gets a MSG_RTS_DONE. If you see a duplicate
> mop_id (or not) before your assert, that will help us narrow the
> problem.
>
> You can also turn on debugging on the server, with "pvfs2-set-debugmask
> -m /pvfs network", and watch it report sending RTS_DONE for a given
> mop_id. I don't think this will add any information yet, but FYI.
>
> Probably easier to deal with all this on a single-server setup, if
> possible.
>
> -- Pete
>
Thanks for the quick response!
I knew this was going to be tricky to debug; this failure usually doesn't
occur until about 100 GBytes into a read for us. I get an identical
failure using a single node. So far I've eliminated all but our Opteron
systems from the tests, so we're on relatively 'stable' systems with
respect to IB.
I'll look at this tomorrow and see if I can get the logs to be of any help.
From what you are saying, this isn't likely a duplicate message from IB,
but some duplicate or corrupt mop_id?
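
For the log check, a minimal Python sketch could look like the following. It
scans a client log for the "recv RTS_DONE mop_id" debug lines quoted above and
reports any mop_id that appears more than once. The regular expression assumes
the hex id directly follows "mop_id " on the line, and the function name
find_duplicate_mop_ids is made up for illustration; adjust both to whatever the
log actually contains.

```python
import re
import sys
from collections import defaultdict

# Assumed format: the debug line prints "... recv RTS_DONE mop_id <hex>".
# Anything before that text (timestamps, function name) is ignored.
RTS_DONE = re.compile(r"recv RTS_DONE mop_id ([0-9a-fA-F]+)")


def find_duplicate_mop_ids(lines):
    """Return {mop_id: [line numbers]} for ids seen in more than one RTS_DONE line."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        match = RTS_DONE.search(line)
        if match:
            seen[int(match.group(1), 16)].append(lineno)
    return {mop_id: nums for mop_id, nums in seen.items() if len(nums) > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_dup_mopid.py <path to client log>
    with open(sys.argv[1], errors="replace") as log:
        for mop_id, nums in sorted(find_duplicate_mop_ids(log).items()):
            print(f"mop_id {mop_id:x} seen on lines {nums}")
```

If ids turn out to be reused over a long run, only the repeats close to the
assertion are likely to matter, so the reported line numbers should be compared
against where the error shows up in the log.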
~Kyle