[Pvfs2-developers] Re: bmi_ib resource constraints with older hardware
Troy Benjegerdes
troy at scl.ameslab.gov
Fri Mar 7 11:26:19 EST 2008
For further information, the diff to libmthca we used to figure this
out, and some extensive logfiles of the problem occurring on the servers
with full PVFS2_DEBUGMASK=network are at:
http://scl.ameslab.gov/~troy/pvfs/ibv_post_send/
(This is the error showing ibv_post_send failing with the -1001 error
code I added)
[D 12:33:00.861561] PVFS2 Server version 2.7.1pre1-2008-03-05-215140 starting.
[E 13:51:39.436430] openib_post_sr_rdmaw: ibv_post_send failed ret: -1001 errno: 0
[E 13:51:39.445031] wr_id: 0x0 next: (nil) sg_list 0x65bb30 num_sge 1
[E 13:51:39.445073] opcode: 0x0 send_flags: 0x0 imm_data: 0x0
[E 13:51:39.445091] sr.wr.rdma.remote_addr: 0xf509c000 rkey 0x300055
[E 13:51:39.445195] openib_post_sr_rdmaw: QP_request sge: 1
[E 13:51:39.445249] Error: openib_post_sr_rdmaw: QP_sge: 28
: Unknown error 18446744073709550615.
Included in the logfiles is an attempt where I simply reposted the send
after 100 us, for up to 10 retries, but that didn't seem to help. I am
wondering if something like calling the ib_poll_cq function needs to be
done to make some progress, or if there's some other way to back off
out of openib_post_sr_rdmaw when the queues are full.
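
To make the "poll, then repost" idea concrete, here is a minimal sketch,
not a patch against bmi_ib. All names in it (post_with_reap, try_post_fn,
reap_fn, and the toy_* helpers) are made up for illustration. The
assumption is that try_post would wrap ibv_post_send() and reap would
wrap ibv_poll_cq() on the send CQ; the toy queue only stands in for the
hardware so the loop can be tried without a nic.

```c
#include <stddef.h>

/* Hypothetical hooks, not PVFS2 or libibverbs names. In bmi_ib, try_post
 * would wrap ibv_post_send() and reap would wrap ibv_poll_cq(). */
typedef int (*try_post_fn)(void *ctx); /* 0 on success, nonzero if SQ is full */
typedef int (*reap_fn)(void *ctx);     /* completions reaped, or < 0 on error */

/*
 * Try to post; when the post fails, reap send completions to free
 * work-queue slots, then try again. Gives up after max_retries reposts
 * or as soon as reaping reports an error.
 * Returns 0 if the post succeeded, -1 otherwise.
 */
static int post_with_reap(try_post_fn try_post, reap_fn reap,
                          void *ctx, int max_retries)
{
    int attempt;

    for (attempt = 0; attempt <= max_retries; attempt++) {
        if (try_post(ctx) == 0)
            return 0;
        if (attempt == max_retries)
            break;              /* out of retries, do not reap again */
        if (reap(ctx) < 0)
            return -1;          /* reaping itself failed */
    }
    return -1;
}

/* Toy send queue so the loop can be exercised without hardware. */
struct toy_sq {
    int in_flight;  /* posted but not yet reaped */
    int depth;      /* queue capacity */
    int done;       /* finished by the "hardware", waiting to be reaped */
};

/* Accept a post only while there is a free slot. */
static int toy_try_post(void *ctx)
{
    struct toy_sq *sq = ctx;

    if (sq->in_flight >= sq->depth)
        return -1;
    sq->in_flight++;
    return 0;
}

/* Reap everything that has finished, freeing that many slots. */
static int toy_reap(void *ctx)
{
    struct toy_sq *sq = ctx;
    int n = sq->done;

    sq->in_flight -= n;
    sq->done = 0;
    return n;
}
```

With a full toy queue and one finished entry, post_with_reap() succeeds
on the second attempt. With nothing finished it fails after max_retries,
which would match the behaviour of sleeping and reposting if slots are
only released when completions are polled. A real version would probably
want to sleep or wait on the completion channel when reap returns 0.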
Kyle Schochenmaier wrote:
> Pete -
>
> We're still trying to track down this "bug" with how we use our ib
> nics with pvfs2. This is a continuation of previous emails regarding
> failures inside the openib_post_sr_rdmaw() function in openib.c, where
> we run out of wq entries on the server nics when multiple client
> processes hammer the filesystem. We only see the servers failing here;
> the client eventually ends up with a timeout.
>
> Troy and I have tracked the specific resources that were being
> reported down to the driver (unfortunately, they share the same
> -errno codes).
> So as far as we can tell, every time we get into this situation, we
> get a wq_overflow() from the driver, and then of course the post_send
> fails, leading to problems that are unrecoverable. We are pretty sure
> that we're just running into the hw constraints of our nics, and that
> the best way to deal with this type of thing would be to create some
> sort of ib_flush_outgoing_requests() functionality for pvfs2 that
> would either implement a backoff mechanism for the send requests to
> wait for things at the nic level to be processed, or would just
> 'flush' everything out. We're not sure exactly how to go about this,
> where or if this would be appropriate, or if we're missing something
> obvious.
> Can we recover from this elegantly?
>
> What does everyone think?
>
> ~~Kyle
>
>
> Included is the path from pvfs2 to what we are seeing in the driver:
>
> pvfs2/src/io/bmi/bmi_ib/openib.c
>
> static void openib_post_sr_rdmaw(struct ib_work *sq, msg_header_cts_t *mh_cts,
> void *mh_cts_buf)
> {
> <snip>
>
> ret = ibv_post_send(oc->qp, &sr, &bad_wr);
>
> <snip>
> }
>
> -------------------------
> ibv_post_send() points to this function for memfull mellanox cards in
> libmthca-*/src/qp.c
> -------------------------
> int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
> struct ibv_send_wr **bad_wr)
>
> {
> struct mthca_qp *qp = to_mqp(ibqp);
> void *wqe, *prev_wqe;
> int ind;
> int nreq;
> int ret = 0;
> int size;
> int size0 = 0;
> int i;
> /*
> * f0 and op0 cannot be used unless nreq > 0, which means this
> * function makes it through the loop at least once. So the
> * code inside the if (!size0) will be executed, and f0 and
> * op0 will be initialized. So any gcc warning about "may be
> used uninitialized" is bogus.
> */
> uint32_t f0;
> uint32_t op0;
>
> pthread_spin_lock(&qp->sq.lock);
>
> ind = qp->sq.next_ind;
>
> for (nreq = 0; wr; ++nreq, wr = wr->next) {
> ****** if (wq_overflow(&qp->sq, nreq,
> to_mcq(qp->ibv_qp.send_cq))) {
> ret = -1;
> *bad_wr = wr;
> goto out;
> }
>
> <snip>
>
>
>
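
As a rough model of the check marked with asterisks above, the sketch
below mimics the accounting with plain counters. It is a simplification
based on a reading of the driver loop, not the libmthca code: the toy_*
types and functions are invented, locking is left out, and the
assumption is that the tail only advances when send completions are
polled from the CQ. It also shows the bad_wr convention: work requests
accepted before the overflow stay posted, and bad_wr points at the first
one that was not.

```c
#include <stddef.h>

/* Toy stand-ins; these are not the libmthca types. */
struct toy_wr { struct toy_wr *next; };
struct toy_wq { unsigned head, tail, max; };

/* Nonzero when one more entry would not fit after nreq accepted ones.
 * Unsigned subtraction keeps head - tail correct across wraparound. */
static int toy_wq_overflow(const struct toy_wq *wq, unsigned nreq)
{
    unsigned cur = wq->head - wq->tail;

    return cur + nreq >= wq->max;
}

/* Post a chain of work requests. On overflow, stop, report the first
 * unposted WR through bad_wr, and still account for the WRs that were
 * accepted before it. */
static int toy_post_send(struct toy_wq *wq, struct toy_wr *wr,
                         struct toy_wr **bad_wr)
{
    unsigned nreq;
    int ret = 0;

    for (nreq = 0; wr; ++nreq, wr = wr->next) {
        if (toy_wq_overflow(wq, nreq)) {
            ret = -1;
            *bad_wr = wr;
            break;
        }
    }
    wq->head += nreq;
    return ret;
}

/* Stands in for polling n send completions from the CQ (assumption:
 * this is the only place where slots are released). */
static void toy_complete(struct toy_wq *wq, unsigned n)
{
    wq->tail += n;
}
```

In this model a queue of depth 4 accepts four requests out of a chain of
five and fails the fifth. The failed request can be reposted only after
toy_complete() has been called, no matter how long the caller waits in
between.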