[Pvfs2-users] Pvfs2 over infiniband stops working
Eric J. Walter
ewalter at particle.physics.wm.edu
Mon Mar 31 19:47:27 EST 2008
Kyle,
Clients: ~120 dual core / dual proc 2.6-3.0 GHz Opterons w/ 8-32GB of
memory each with one SilverStorm 9000 DDR PCI-Express single port HCA
(lspci says: InfiniBand: Mellanox Technologies MT25204 [InfiniHost III
Lx HCA] (rev 20)). All mount the pvfs filesystem via Infiniband so I
guess the ethernet NIC isn't important (just in case: Ethernet
controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet
PCI Express (rev 21)).
3 I/O servers (2 I/O + 1 Metadata+I/O). Each is a dual core / dual
proc 2.8 GHz Opteron with 8 GB memory and the same Infiniband HCA as
the clients. Each server has 4 x 146 GB SAS5 with hardware RAID 0. The
total file system is ~1.5 TB.
The Infiniband switch is a SilverStorm/Qlogic 9120 4x DDR.
Did I leave something out?
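In case it helps anyone check their own logs for the same failure, here is a rough Python sketch that scans a pvfs2-server log for the RTS_DONE error quoted below. The line format is guessed only from the few log lines in this thread, and the names find_rts_done_errors and scan_log are made up for illustration; this is not a PVFS2 tool.

```python
import re

# Pattern for the error line quoted in this thread. The log format is inferred
# from those few lines only, so treat it as an assumption.
RTS_DONE_RE = re.compile(
    r"^\[E (?P<stamp>\d{2}/\d{2} \d{2}:\d{2})\] "
    r"Error: encourage_recv_incoming: "
    r"mop_id (?P<mop_id>[0-9a-fA-F]+) in RTS_DONE message not found"
)


def find_rts_done_errors(lines):
    """Return a (timestamp, mop_id) tuple for every matching error line."""
    hits = []
    for line in lines:
        match = RTS_DONE_RE.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("mop_id")))
    return hits


def scan_log(path):
    """Scan one log file (hypothetical helper); undecodable bytes are replaced."""
    with open(path, errors="replace") as handle:
        return find_rts_done_errors(handle)


# Two lines copied from the log excerpt in this thread.
SAMPLE = [
    "[E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 "
    "in RTS_DONE message not found.",
    "[E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/"
    "pvfs2-server(error+0xbd) [0x45d9ed]",
]

if __name__ == "__main__":
    for stamp, mop_id in find_rts_done_errors(SAMPLE):
        print(stamp, mop_id)
```

Running scan_log on each server's log and comparing the timestamps against the job start times might show whether the hang lines up with load or is random.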
Thanks again,
Eric
On Sun, Mar 30, 2008 at 06:35:30PM -0500, Kyle Schochenmaier wrote:
> We are currently trying to track down this bug, as well as one other
> involving potential data corruption under heavy load.
> I would like to say that I haven't seen this bug since some patches
> that were committed a while back.
>
> Can you include some more detailed information about your hardware
> setup, specifically the types of NICs?
> We've found some bugs that occur on slower NICs but not on faster
> NICs, so knowing what hardware you are running might help us out here.
>
> Tomorrow I can sit down and look at this further. I'm also going to cc
> this to the pvfs2-dev list.
>
> ~Kyle
>
>
> On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter
> <ewalter at particle.physics.wm.edu> wrote:
> > Dear pvfs2-users,
> >
> > I have been trying to get pvfs2 working over infiniband for a few
> > weeks now and have made a lot of progress. I am still stuck on one
> > last thing I can't seem to fix.
> >
> > Basically, everything will be fine for a while (like a few days), then
> > I see the following in one of the pvfs2-server.logs (when the
> > debugging mask is set to "all"):
> >
> > [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
> > [E 03/30 11:50] [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
> > [E 03/30 11:50] [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
> >
> > At this point all mounts will be hung and will require a
> > restart/remount of all servers and clients, and all jobs using this
> > space will need to be restarted.
> >
> > Only one server at a time seems to suffer this problem. We have 3
> > servers total for I/O (one handles both metadata and I/O), and this
> > message can occur on any of the 3.
> >
> > It seems that this occurs only when the number of clients accessing
> > the file system gets larger than, say, 15-20, or perhaps it is a
> > filesystem load issue? I haven't been able to tell...
> >
> > I am using the CVS version from 03/23/08 (I have also tried version
> > 2.6.3 but this had other problems mentioned in the pvfs2 users mailing
> > list, so I decided to go to the CVS version).
> >
> > I am using OFED version 1.1 on a cluster of dual core/processor
> > Opterons running kernel 2.6.9-42.ELsmp. We have 114 clients which
> > mount the pvfs file space over infiniband and use it as scratch space.
> > They don't use mpi-io/romio; they just write directly to the pvfs2 file
> > space mounted via IB (I guess they write through the kernel
> > interface). The errors seem to occur when more than 15-20 processors'
> > worth of jobs try to read/write to the pvfs scratch space, or they
> > could be just random.
> >
> > Does anyone have some clues for how to debug this further or track
> > down what the problem is?
> >
> > Any suggestions are welcome.
> >
> > Thanks,
> >
> > Eric J. Walter
> > Department of Physics
> > College of William and Mary
>
>
>
> --
> Kyle Schochenmaier