[Pvfs2-developers] Re: openib-vfs failure
Kyle Schochenmaier
kschoche at gmail.com
Wed Feb 6 15:14:06 EST 2008
Pete -
I applied your patch, and got the following immediately on some io:
[E 02/06 14:12] Error: openib_check_cq: unknown opcode 11171.
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server(error+0xca) [0x4293ba]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x429f9b]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x425dc3]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x428129]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server(BMI_testunexpected+0x19f) [0x
424d2f]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x4397a7]
[E 02/06 14:12] [bt] /lib/libpthread.so.0 [0x2ba396d2df1a]
[E 02/06 14:12] [bt] /lib/libc.so.6(__clone+0x72) [0x2ba39711c602]
On Feb 6, 2008 12:43 PM, Pete Wyckoff <pw at osc.edu> wrote:
> kschoche at scl.ameslab.gov wrote on Fri, 01 Feb 2008 19:07 -0600:
>
> > Troy Benjegerdes wrote:
> >>
> >>> Pete -
> >>>
> >>> I've attached a link to a log of the failure with network debugging on in
> >>> the client, single IO node. The whole log is 5.9GB so I only attached
> >>> the
> >>> last 10k lines. Same error as before of course.
> >>>
> >>> http://www.scl.ameslab.gov/~kschoche/pvfs2-client.log.gz
> >>>
> >>> The mopids are fairly difficult to track as they are used all over the
> >>> place and end up here and there, I cant make out anything useful from it
> >>> :'(
> >>>
> >>> Any advice would be great,
> >>>
> >>> ~Kyle
> >>>
> >>>
> >> Here is another, full logfile of another failure: (93M compressed, 1GB
> >> unpacked)
> >>
> >> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log.gz
> >
> > Kind of an update....
> > after doing some tracing of function calls and trying to figure out why the
> > same "mop_id" was used 10,000+ times during my failed run, troy and I
> > stumbled upon some of the fmr code.. and after changing the
> > id_gen_fast(mopid) functions to use the id_gen_safe(mopid) functions in
> > id_generator.c... We have possibly fixed the problem, however, this does
> > introduce some amount of overhead. I'll attempt to do some tests to
> > quantify the exact amount next week, but for now it seems to at least allow
> > my tests to complete.
> >
> > Maybe something is wrong with the id_gen_fast() stuff, locking or other
> > issues maybe?
> >
> > Troy and I had some questions about how these mop_id's, which are just
> > addresses, are generated, and whether or not there is the possibility for
> > two I/O servers to generate the same address, and send that to the client
> > somehow?
> > Can you give us a brief description of the process Pete?
>
> Sorry for the delay here. You may be on to something there.
>
> The mop_id is a number assigned at each send or recv operation by
> the caller, e.g. the client. It is used internally in BMI/PVFS to
> track outstanding operations. This is true for all the transports.
>
> In IB, I thought I'd be clever and use it as the tag in RTS
> operations, so that an incoming CTS from the server would echo the
> mop_id that the client sent in its RTS, and the client can use that
> to lookup what message the server is saying is clear to send.
>
> The difference between "fast" and "safe", is fast just casts the
> pointer to an int, done. Safe uses cross-thread locking and adds
> one to a unique counter.
>
> So the reason you see lots of the same mop_ids is tied to malloc()
> returning the same block over and over. This should be perfectly
> fine, and good to keep caches hot. There should be only one send or
> recv state structure per mop_id, guaranteeing uniqueness, but
> perhaps something is broken in here.
>
> There could be a multi-server problem, but a check on sq->c is
> supposed to be everywhere, and looks okay from here.
>
> The sender goes through these phases for a message that needs RTS:
>
> post_send()
> assign mop_id
> SQ_WAITING_BUFFER
> encourage_send_waiting_buffer()
> send RTS with mop_id
> SQ_WAITING_RTS_SEND_COMPLETION
> ib_check_cq()
> get mop_id back
> SQ_WAITING_CTS
> encourage_send_incoming_cts()
> get mop_id, lookup send state structure (sq)
> send data
> SQ_WAITING_DATA_SEND_COMPLETION
> ib_check_cq()
> SQ_WAITING_RTS_DONE_BUFFER
> encourage_rts_done_waiting_buffer()
> send done message with mop_id
> SQ_WAITING_RTS_DONE_SEND_COMPLETION
> ib_check_cq()
> SQ_WAITING_USER_TEST
> test() et al
> free mop_id
>
> The receiver goes like this. Case preposted:
>
> post_recv()
> assign mop_id
> RQ_WAITING_INCOMING
> ib_check_cq()
> (his rts arrives)
> assign rq->rts_mop_id with sender's mop_id
> RQ_RTS_WAITING_CTS_BUFFER
> goto common path
>
> Case non-preposted:
>
> ib_check_cq()
> (his rts arrives)
> alloc new recv state rq
> assign rq->rts_mop_id with sender's mop_id
> RQ_RTS_WAITING_USER_POST
> post_recv()
> assign our mop_id to existing rq
> goto common path
>
> Common recv path:
>
> send_cts()
> send cts with sender's mop_id
> RQ_RTS_WAITING_CTS_SEND_COMPLETION
> ib_check_cq()
> (rts send completed)
> RQ_RTS_WAITING_RTS_DONE
> ib_check_cq()
> (his done message arrived)
> compare sender's mop_id against his stored mop_id
> RQ_RTS_WAITING_USER_TEST
> test() et al
> free mop_id
>
> The bug can be seen from your trace. The client posts 4 recvs: 8k,
> 1200k, 8k, 8k. The sender sends something to match recv 1 mopid 1,
> then recv 3, mopid 3, then recv 4 with mopid 4. But mopid 1 ==
> mopid 4. This is valid as the server could have done a "test" and
> retired the mopid, only to use it again for send 4. But to the
> client, this mopid is still on the "active recv" list twice.
>
> Your trace, by the way, has lots of 8192-byte sends. These end up
> going RTS rather than eager, as the limit is like 8170. You might
> want to bump up DEFAULT_EAGER_BUF_SIZE in ib.h to get better
> performance.
>
> Here's a patch (untested), with a bit more debugging. Let me know
> if it works and I'll check it in. Thanks for tracking this down.
>
> -- Pete
>
> Index: src/io/bmi/bmi_ib/ib.c
> ===================================================================
> RCS file: /projects/cvsroot/pvfs2/src/io/bmi/bmi_ib/ib.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 ib.c
> --- src/io/bmi/bmi_ib/ib.c 5 Dec 2007 20:03:17 -0000 1.62
> +++ src/io/bmi/bmi_ib/ib.c 6 Feb 2008 18:40:09 -0000
> @@ -217,8 +217,9 @@ static int ib_check_cq(void)
> assert(0, "%s: unknown send state %s (%d) of sq %p",
> __func__, sq_state_name(sq->state.send),
> sq->state.send, sq);
> - debug(2, "%s: send to %s completed locally: -> %s",
> - __func__, bh->c->peername, sq_state_name(sq->state.send));
> + debug(2, "%s: send to %s completed locally: sq %p -> %s",
> + __func__, bh->c->peername, sq,
> + sq_state_name(sq->state.send));
>
> } else {
> struct ib_work *rq = sq; /* rename */
> @@ -229,8 +230,9 @@ static int ib_check_cq(void)
> else
> assert(0, "%s: unknown send state %s of rq %p",
> __func__, rq_state_name(rq->state.recv), rq);
> - debug(2, "%s: send to %s completed locally: -> %s",
> - __func__, bh->c->peername, rq_state_name(rq->state.recv));
> + debug(2, "%s: send to %s completed locally: rq %p -> %s",
> + __func__, bh->c->peername, rq,
> + rq_state_name(rq->state.recv));
> }
>
> qlist_add_tail(&bh->list, &bh->c->eager_send_buf_free);
> @@ -599,7 +601,8 @@ encourage_recv_incoming(struct buf_head
>
> rq = NULL;
> qlist_for_each_entry(rqt, &ib_device->recvq, list) {
> - if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id) {
> + if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
> + rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
> rq = rqt;
> break;
> }
> @@ -607,9 +610,6 @@ encourage_recv_incoming(struct buf_head
>
> assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
> __func__, llu(mh_rts_done.mop_id));
> - assert(rq->state.recv == RQ_RTS_WAITING_RTS_DONE,
> - "%s: RTS_DONE to rq wrong state %s",
> - __func__, rq_state_name(rq->state.recv));
>
> #if MEMCACHE_BOUNCEBUF
> memcpy_to_buflist(&rq->buflist, reg_recv_buflist_buf,
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
--
Kyle Schochenmaier
More information about the Pvfs2-developers
mailing list