[Pvfs2-users] PVFS2 over Infiniband error
Florin Isaila
florin.isaila at gmail.com
Fri Nov 16 17:30:11 EST 2007
Hi,
I am coming back to a problem I still have with PVFS 2.6.3 over IB.
I run it on Lonestar, a 64-bit dual-core Intel Xeon cluster at TACC:
http://www.tacc.utexas.edu/services/userguides/lonestar/
As a reminder, PVFS-IB works on the front end, but fails when I try
to start it on the compute nodes.
As Pete suggested, I set the debug level to network.
I found that on each run one of two types of errors shows up:
1) this is from the previous message I sent to the list
> > [E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
2) this I just got (the full messages are at the end of this mail):
[E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.
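To see how often each of the two errors occurs across runs, the debug logs can be tallied with a small script. This is only a sketch: the function name tally_errors and the regular expression are my own, and the pattern assumes error lines look exactly like the two shown above.

```python
import re
from collections import Counter

# Matches lines such as
#   [E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.
# Backtrace lines ("[E ...] [bt] ...") do not contain "Error:" and are skipped.
ERROR_RE = re.compile(r"\[E [\d:.]+\] Error: (\w+): (.+?)\.?\s*$")


def tally_errors(lines):
    """Count (function, message) pairs found in PVFS2 debug log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py run1.log run2.log ...
    total = Counter()
    for path in sys.argv[1:]:
        with open(path) as handle:
            total.update(tally_errors(handle))
    for (func, msg), n in total.most_common():
        print(f"{n:4d}  {func}: {msg}")
```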
As Pete suggested I looked in /etc/security/limits.conf: soft and hard
memlock are set to unlimited.
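The values in limits.conf are applied at login, so a process started through the batch system on a compute node may still run with a different limit. A minimal sketch for printing the limit the process actually sees (the helper name memlock_limits is my own; it would have to be run inside a job on a compute node):

```python
import resource


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as strings."""

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"memlock soft={soft} hard={hard}")
```

If this prints a small number on the compute nodes but "unlimited" on the front end, that would be consistent with the ibv_register_mr failure appearing only on the compute nodes.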
I do not have control over the nodes and cannot install anything; I am
just a user :)
Pete, how can I find out what type of Infiniband fabric is installed?
The configuration file /etc/infiniband/openib.conf :
# Start HCA driver upon boot
ONBOOT=yes
# Load UCM module
UCM_LOAD=no
# Load RDMA_CM module
RDMA_CM_LOAD=yes
# Load RDMA_UCM module
RDMA_UCM_LOAD=yes
# Increase ib_mad thread priority
RENICE_IB_MAD=no
# Load MTHCA
MTHCA_LOAD=yes
# Load IPATH
IPATH_LOAD=yes
# Load IPoIB
IPOIB_LOAD=yes
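Since both the mthca and ipath drivers are loaded, this file alone does not say which adapter is actually in the nodes. A sketch that reads the device entries from sysfs, which needs no root access. The attribute names (hca_type, board_id, fw_ver) are an assumption and differ between drivers, so missing ones are simply skipped; where the libibverbs utilities are installed, ibv_devinfo reports similar information.

```python
import os

SYSFS_IB = "/sys/class/infiniband"  # usual Linux location; assumed to be readable


def list_hcas(root=SYSFS_IB, attrs=("hca_type", "board_id", "fw_ver")):
    """Map each InfiniBand device name to whichever attributes it exposes."""
    found = {}
    if not os.path.isdir(root):
        return found  # no IB class directory: driver not loaded, or not Linux
    for dev in sorted(os.listdir(root)):
        info = {}
        for attr in attrs:
            try:
                with open(os.path.join(root, dev, attr)) as handle:
                    info[attr] = handle.read().strip()
            except OSError:
                pass  # this driver does not provide the attribute
        found[dev] = info
    return found


if __name__ == "__main__":
    for dev, info in list_hcas().items():
        print(dev, info)
```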
Here is the full error message:
[D 12:05:07.675267] BMI_ib_initialize: init.
[D 12:05:07.675423] openib_ib_initialize: init.
[D 12:05:07.676266] openib_ib_initialize: max 65408 completion queue entries.
[E 12:05:07.676399] Error: openib_ib_initialize: ibv_create_cq failed.
[E 12:05:07.712529] [bt] ./bt.S.1.mpi_io_full(error+0xf4) [0x598700]
[E 12:05:07.712545] [bt] ./bt.S.1.mpi_io_full(openib_ib_initialize+0x4c3) [0x59b744]
[E 12:05:07.712550] [bt] ./bt.S.1.mpi_io_full [0x5982eb]
[E 12:05:07.712555] [bt] ./bt.S.1.mpi_io_full [0x570e86]
[E 12:05:07.712558] [bt] ./bt.S.1.mpi_io_full [0x570122]
[E 12:05:07.712562] [bt] ./bt.S.1.mpi_io_full [0x55233c]
[E 12:05:07.712566] [bt] ./bt.S.1.mpi_io_full [0x552599]
[E 12:05:07.712570] [bt] ./bt.S.1.mpi_io_full [0x56417d]
[E 12:05:07.712574] [bt] ./bt.S.1.mpi_io_full [0x4fdef4]
[E 12:05:07.712577] [bt] ./bt.S.1.mpi_io_full [0x4fdcd2]
[E 12:05:07.712581] [bt] ./bt.S.1.mpi_io_full [0x4a5a73]
Thanks
Florin
On Oct 20, 2007 8:17 AM, Pete Wyckoff <pw at osc.edu> wrote:
> florin.isaila at gmail.com wrote on Fri, 19 Oct 2007 10:11 -0500:
> > I did the tracing that you are suggesting, this time with 1 client and
> > 1 PVFS2 server. Apparently the queue has enough completion queue
> > entries. The memory registration seems to be the problem (however as I
> > said, on the front-end runs):
> >
> > [D 10:04:01.500768] PVFS2 Server version 2.6.3 starting.
> > [D 10:04:01.778135] BMI_ib_initialize: init.
> > [D 10:04:01.778252] openib_ib_initialize: init.
> > [D 10:04:01.779038] openib_ib_initialize: max 65408 completion queue entries.
> > [D 10:04:01.779380] BMI_ib_initialize: done.
> > [E 10:04:01.781047] Error: openib_mem_register: ibv_register_mr.
> > [E 10:04:01.781763] [bt] ./bt.A.1.mpi_io_full(error+0xf4) [0x533738]
> > [E 10:04:01.781771] [bt] ./bt.A.1.mpi_io_full [0x53614a]
> > [E 10:04:01.781776] [bt] ./bt.A.1.mpi_io_full [0x534214]
> > [E 10:04:01.781780] [bt] ./bt.A.1.mpi_io_full [0x533166]
> > [E 10:04:01.781784] [bt] ./bt.A.1.mpi_io_full [0x50a644]
> > [E 10:04:01.781788] [bt] ./bt.A.1.mpi_io_full [0x504ac1]
> > [E 10:04:01.781792] [bt] ./bt.A.1.mpi_io_full [0x4ce576]
> > [E 10:04:01.781795] [bt] ./bt.A.1.mpi_io_full [0x4ce277]
> > [E 10:04:01.781799] [bt] ./bt.A.1.mpi_io_full [0x4ed598]
> > [E 10:04:01.781803] [bt] ./bt.A.1.mpi_io_full [0x4ed5d1]
> > [E 10:04:01.781807] [bt] ./bt.A.1.mpi_io_full [0x4ff1b5]
> > [D 10/19 10:04] PVFS2 Server: storage space created. Exiting.
> > [D 10:04:01.896168] PVFS2 Server version 2.6.3 starting.
>
> Then the CQ allocation fail did not happen this time around? How
> did that get fixed? 65408 seems way too big. I still wonder what
> type of silicon you have.
>
> This MR issue might be due to process locked memory limits. Look
> around in the IB world for "ulimit -l" or /etc/security/limits.conf
> and set it to lots, or unlimited.
>
> -- Pete
>