[Pvfs2-developers] Re: bmi_ib resource constraints with older hardware

Troy Benjegerdes troy at scl.ameslab.gov
Fri Mar 7 11:26:19 EST 2008


For further information, the diff to libmthca we used to figure this 
out and some extensive logfiles of the problem occurring on the servers 
with full PVFS2_DEBUGMASK=network are at:

http://scl.ameslab.gov/~troy/pvfs/ibv_post_send/

(This is the error showing ibv_post_send failing with the -1001 error 
code I added)

[D 12:33:00.861561] PVFS2 Server version 2.7.1pre1-2008-03-05-215140 starting.
[E 13:51:39.436430] openib_post_sr_rdmaw: ibv_post_send failed ret: -1001 errno: 0
[E 13:51:39.445031]  wr_id: 0x0 next: (nil) sg_list 0x65bb30 num_sge 1
[E 13:51:39.445073]  opcode: 0x0 send_flags: 0x0 imm_data: 0x0
[E 13:51:39.445091]  sr.wr.rdma.remote_addr: 0xf509c000 rkey 0x300055
[E 13:51:39.445195] openib_post_sr_rdmaw: QP_request sge: 1
[E 13:51:39.445249] Error: openib_post_sr_rdmaw: QP_sge: 28
: Unknown error 18446744073709550615.

Included in the logfiles is an attempt where I just tried to repost the 
send after 100 us, for up to 10 retries, but that didn't seem to help. I 
am wondering if something like calling the ib_poll_cq function needs to 
be done to make some progress, or if there's some other way to back 
off out of openib_post_wr_rdma when the queues are full.
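
A rough sketch of the "poll the CQ between retries" idea, using a mock 
queue rather than real verbs calls. Everything here (mock_sq, 
mock_post_send, mock_poll_cq, mock_hw_complete, post_with_drain, 
SQ_DEPTH) is hypothetical and only models the assumption that a send 
queue slot is released when its completion is reaped, not when the 
hardware finishes the request. If that assumption holds for the real 
driver, sleeping and reposting alone would never free a slot.

```c
#include <assert.h>

#define SQ_DEPTH 4   /* hypothetical send queue depth */

/* Hypothetical stand-in for a send queue; not a libibverbs type. */
struct mock_sq {
    unsigned outstanding;  /* posted and not yet reaped from the CQ */
    unsigned completed;    /* finished by the "hardware", waiting in the CQ */
};

/* Fails immediately when the queue is full, like a non-blocking post. */
static int mock_post_send(struct mock_sq *sq)
{
    if (sq->outstanding >= SQ_DEPTH)
        return -1;
    sq->outstanding++;
    return 0;
}

/* Simulates the NIC finishing up to n of the requests still in flight. */
static void mock_hw_complete(struct mock_sq *sq, unsigned n)
{
    unsigned in_flight = sq->outstanding - sq->completed;
    sq->completed += (n < in_flight) ? n : in_flight;
}

/* Reaps all pending completions. Only this frees queue slots.
 * Returns the number of completions reaped. */
static unsigned mock_poll_cq(struct mock_sq *sq)
{
    unsigned n = sq->completed;
    assert(n <= sq->outstanding);
    sq->outstanding -= n;
    sq->completed = 0;
    return n;
}

/* Post with a bounded number of retries, draining the CQ between
 * attempts. Returns 0 on success, -1 if no slot ever became free. */
static int post_with_drain(struct mock_sq *sq, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (mock_post_send(sq) == 0)
            return 0;
        /* Reap completions to free slots. A real caller might also
         * sleep or yield here when nothing was reaped. */
        (void)mock_poll_cq(sq);
    }
    return -1;
}
```

In this model a plain repost keeps failing even after the hardware has 
finished some requests, while post_with_drain() succeeds as soon as one 
completion has been reaped. Whether the real bmi_ib code can safely poll 
the CQ from inside the post path is a separate question.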

Kyle Schochenmaier wrote:
> Pete -
>
> We're still trying to track down this "bug" with how we use our ib
> nics with pvfs2.  This is a continuation of previous emails regarding
> failures inside the openib_post_wr_rdma() functions for openib.c where
> we get into a situation with running out of wq entries on the server
> nics with multiple client processes hammering the filesystem.  We only
> see the servers going out here, the client ends up with a timeout
> eventually.
>
> Troy and I have tracked the specific resources that were being
> reported down to the driver (unfortunately they share the same -errno
> codes).
> So as far as we can tell, every time we get into this situation, we
> get a wq_overflow() from the driver, and then of course the post_send
> fails, leading to problems that are unrecoverable.  We are pretty sure
> that we're just running into the hw constraints of our nics, and that
> the best way to deal with this type of thing would be to create some
> sort of ib_flush_outgoing_requests() functionality for pvfs2 that
> would either implement a backoff mechanism for the send requests to
> wait for things at the nic level to be processed, or would just
> 'flush' everything out. We're not sure exactly how to go about this,
> where or if this would be appropriate, or if we're missing something
> obvious.
> Can we recover from this elegantly?
>
> What does everyone think?
>
> ~~Kyle
>
>
> Included is the path from pvfs2 to what we are seeing in the driver:
>
> pvfs2/src/io/bmi/bmi_ib/openib.c
>
> static void openib_post_sr_rdmaw(struct ib_work *sq, msg_header_cts_t *mh_cts,
>                                  void *mh_cts_buf)
> {
> <snip>
>
>         ret = ibv_post_send(oc->qp, &sr, &bad_wr);
>
> <snip>
> }
>
> -------------------------
> ibv_post_send()  points to this function for memfull mellanox cards in
> libmthca-*/src/qp.c
> -------------------------
> int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>                           struct ibv_send_wr **bad_wr)
>
> {
>         struct mthca_qp *qp = to_mqp(ibqp);
>         void *wqe, *prev_wqe;
>         int ind;
>         int nreq;
>         int ret = 0;
>         int size;
>         int size0 = 0;
>         int i;
>         /*
>          * f0 and op0 cannot be used unless nreq > 0, which means this
>          * function makes it through the loop at least once.  So the
>          * code inside the if (!size0) will be executed, and f0 and
>          * op0 will be initialized.  So any gcc warning about "may be
>          * used unitialized" is bogus.
>          */
>         uint32_t f0;
>         uint32_t op0;
>
>         pthread_spin_lock(&qp->sq.lock);
>
>         ind = qp->sq.next_ind;
>
>         for (nreq = 0; wr; ++nreq, wr = wr->next) {
> ******                if (wq_overflow(&qp->sq, nreq,
>                                 to_mcq(qp->ibv_qp.send_cq))) {
>                         ret = -1;
>                         *bad_wr = wr;
>                         goto out;
>                 }
>
> <snip>
>
>
>
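
For reference, the kind of check that wq_overflow() performs in the 
quoted loop can be modeled as simple head/tail arithmetic. This is a 
simplified sketch and not the libmthca source: the struct and function 
names (model_wq, model_wq_overflow) are made up, and the real function 
also takes the completion queue as an argument, as the call above shows.

```c
#include <assert.h>

/* Hypothetical, simplified work queue; field names are illustrative. */
struct model_wq {
    unsigned head;      /* incremented once per posted work request */
    unsigned tail;      /* incremented once per reaped completion */
    unsigned max_post;  /* maximum number of outstanding requests */
};

/* Returns nonzero if posting one more request, on top of the nreq
 * already staged in the current call, would exceed the capacity. */
static int model_wq_overflow(const struct model_wq *wq, unsigned nreq)
{
    /* Unsigned subtraction stays correct when head wraps around. */
    unsigned cur = wq->head - wq->tail;

    assert(wq->max_post > 0);
    return cur + nreq >= wq->max_post;
}
```

Under this model the check can only start passing again once tail 
advances, which is why the number of completions reaped matters more 
than how long the caller waits before reposting.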