[Pvfs2-users] PVFS2 over Infiniband error
Florin Isaila
florin.isaila at gmail.com
Fri Oct 19 11:11:57 EDT 2007
Hi Pete,
I did the tracing that you suggested, this time with 1 client and
1 PVFS2 server. Apparently the completion queue has enough entries.
Memory registration now seems to be the problem (although, as I
said, everything runs fine on the front end):
[D 10:04:01.500768] PVFS2 Server version 2.6.3 starting.
[D 10:04:01.778135] BMI_ib_initialize: init.
[D 10:04:01.778252] openib_ib_initialize: init.
[D 10:04:01.779038] openib_ib_initialize: max 65408 completion queue entries.
[D 10:04:01.779380] BMI_ib_initialize: done.
[E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
[E 10:04:01.781763] [bt] ./bt.A.1.mpi_io_full(error+0xf4) [0x533738]
[E 10:04:01.781771] [bt] ./bt.A.1.mpi_io_full [0x53614a]
[E 10:04:01.781776] [bt] ./bt.A.1.mpi_io_full [0x534214]
[E 10:04:01.781780] [bt] ./bt.A.1.mpi_io_full [0x533166]
[E 10:04:01.781784] [bt] ./bt.A.1.mpi_io_full [0x50a644]
[E 10:04:01.781788] [bt] ./bt.A.1.mpi_io_full [0x504ac1]
[E 10:04:01.781792] [bt] ./bt.A.1.mpi_io_full [0x4ce576]
[E 10:04:01.781795] [bt] ./bt.A.1.mpi_io_full [0x4ce277]
[E 10:04:01.781799] [bt] ./bt.A.1.mpi_io_full [0x4ed598]
[E 10:04:01.781803] [bt] ./bt.A.1.mpi_io_full [0x4ed5d1]
[E 10:04:01.781807] [bt] ./bt.A.1.mpi_io_full [0x4ff1b5]
[D 10/19 10:04] PVFS2 Server: storage space created. Exiting.
[D 10:04:01.896168] PVFS2 Server version 2.6.3 starting.
Any suggestions?
Florin
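
One thing that may be worth ruling out, although nothing in the log above confirms it: memory registration pins pages, so it counts against the locked-memory limit (RLIMIT_MEMLOCK, what `ulimit -l` shows), and a job launched on compute nodes can end up with a smaller limit than a login shell on the front end. A minimal sketch in C for checking the limit from inside the failing process follows. The helper name `memlock_allows` is hypothetical and is not part of PVFS2 or libibverbs; only `getrlimit` and `RLIMIT_MEMLOCK` are standard.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: returns 1 if the soft locked-memory limit is
 * unlimited or at least `needed` bytes, 0 if it is smaller, and -1 if
 * getrlimit() itself fails.
 */
static int memlock_allows(unsigned long long needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return (unsigned long long) rl.rlim_cur >= needed ? 1 : 0;
}

/* Print the soft limit so that front-end and compute-node runs can be compared. */
static void print_memlock_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_MEMLOCK soft limit: unlimited\n");
    else
        printf("RLIMIT_MEMLOCK soft limit: %llu bytes\n",
               (unsigned long long) rl.rlim_cur);
}
```

Calling `print_memlock_limit()` early in both the front-end run and the cluster run would show whether the two environments differ; if they do not, the cause lies elsewhere.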
On 10/16/07, Pete Wyckoff <pw at osc.edu> wrote:
> florin.isaila at gmail.com wrote on Mon, 15 Oct 2007 11:31 -0500:
> > I am trying to run PVFS over IB on the lonestar cluster at TACC with
> > BTIO: http://www.tacc.utexas.edu/services/userguides/lonestar/
> >
> > On the front end everything works perfectly. However, when I launch
> > PVFS2 and the applications on the cluster, they fail.
> >
> > [D 10:35:59.457502] PVFS2 Server version 2.6.3 starting.
> > [E 10:35:59.476341] Error: openib_ib_initialize: ibv_create_cq failed.
> > ....
> >
> > [E 10:35:59.548287] [bt] ./bt.B.16.mpi_io_full(error+0xf4) [0x53355c]
> > [E 10:35:59.548589] [bt] ./bt.B.16.mpi_io_full(openib_ib_initialize+0x4c3) [0x5365a0]
> >
> > Has anyone seen this problem before?
>
> Haven't seen exactly this, but I'll guess that we're asking for
> too many CQE slots. Try changing the value in this line
> (pvfs2/src/io/bmi/bmi_ib/openib.c:85):
>
> static const unsigned int IBV_NUM_CQ_ENTRIES = 1024;
>
> to 100. More is better. You can fish around for something that
> works. You can also debug the client to see how many it is
> asking for:
>
> PVFS2_DEBUGMASK=network ./bt.B.16
>
> I'd like to see what these lines print out:
>
>     debug(1, "%s: max %d completion queue entries", __func__, hca_cap.max_cq);
>     cqe_num = IBV_NUM_CQ_ENTRIES;
>     od->nic_max_sge = hca_cap.max_sge;
>     od->nic_max_wr = hca_cap.max_qp_wr;
>
>     if (hca_cap.max_cq < cqe_num) {
>         cqe_num = hca_cap.max_cq;
>         warning("%s: hardly enough completion queue entries %d, hoping for %d",
>                 __func__, hca_cap.max_cq, cqe_num);
>     }
>
> There is code there to ask the NIC how many CQEs it can support,
> then it is careful not to ask for too many, given the reported
> limit. However, the OpenFabrics API has this long-standing problem
> where the reported limits cannot always be used as reported.
>
> Would be interesting to know the details of your NIC. We might want
> to add some work-arounds for it.
>
> -- Pete
>
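
The "fish around for something that works" advice above can be sketched as a clamp followed by a halving retry. This is only an illustration under stated assumptions, not the PVFS2 code: `create_cq_fn`, `clamp_cqe`, `create_cq_with_backoff` and `fake_create` are hypothetical names, and the callback stands in for the real CQ creation call so that the example runs without InfiniBand hardware. Two observations on the quoted snippet motivate the shape. As quoted, the warning is printed after `cqe_num` has already been overwritten, so both numbers in it would be the same; the sketch keeps the original request separate. Also, if `hca_cap` is a libibverbs `struct ibv_device_attr`, its `max_cq` field is the number of CQs the device supports, while `max_cqe` is the number of entries per CQ, so which field is printed may be worth double-checking.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real CQ creation call; returns 0 on success. */
typedef int (*create_cq_fn)(unsigned int cqe, void *ctx);

/* Clamp the request to the reported limit, warning with both numbers. */
static unsigned int clamp_cqe(unsigned int requested, unsigned int reported_max)
{
    if (reported_max < requested) {
        fprintf(stderr,
                "only %u completion queue entries reported, wanted %u\n",
                reported_max, requested);
        return reported_max;
    }
    return requested;
}

/*
 * Try the clamped size first; on failure halve the request and retry
 * until it drops below min_cqe. Returns the size that worked, or 0 if
 * none did.
 */
static unsigned int create_cq_with_backoff(create_cq_fn create, void *ctx,
                                           unsigned int requested,
                                           unsigned int reported_max,
                                           unsigned int min_cqe)
{
    unsigned int cqe = clamp_cqe(requested, reported_max);

    if (min_cqe == 0)
        min_cqe = 1;            /* guarantees that the loop terminates */
    while (cqe >= min_cqe) {
        if (create(cqe, ctx) == 0)
            return cqe;
        cqe /= 2;
    }
    return 0;
}

/*
 * Fake device for illustration only: accepts any request that is no
 * larger than the limit stored in *ctx, mimicking a NIC whose usable
 * limit is lower than the one it reports.
 */
static int fake_create(unsigned int cqe, void *ctx)
{
    unsigned int usable = *(unsigned int *) ctx;

    return cqe <= usable ? 0 : -1;
}
```

With a fake device whose usable limit is 300 and a request of 1024, the retry tries 1024, 512 and then succeeds at 256. In real code the callback would wrap the actual creation call, and the size that finally worked would be worth logging.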