[Pvfs2-developers] Re: [Pvfs2-users] Pvfs2 over infiniband stops working
Kyle Schochenmaier
kschoche at gmail.com
Sun Mar 30 18:35:30 EST 2008
We are currently trying to track down this bug, as well as one other
involving potential data corruption under heavy load.
I would like to say that I haven't seen this bug since some patches
that were committed a while back.
Can you include some more detailed information about your hardware
setup, specifically the types of NICs?
We've found some bugs that occur on slower NICs but not on faster
NICs, so knowing what hardware you are running might help us out here.
Tomorrow I can sit down and look at this further. I'm also going to cc
this to the pvfs2-dev list.
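In the meantime, it may help to know exactly when each server logged the
error. Below is a rough sketch in Python for pulling those lines out of a
server log. The regular expression is modelled only on the single error
line quoted in this thread, so other versions may format it differently,
and the function names are made up for illustration; nothing here comes
from the PVFS2 tree.

```python
import re

# Assumption: the message format matches the one log line quoted in this
# thread. Adjust the pattern if your pvfs2-server.log looks different.
RTS_DONE_RE = re.compile(
    r"^\[E (?P<stamp>\d{2}/\d{2} \d{2}:\d{2})\] Error: "
    r"encourage_recv_incoming: mop_id (?P<mop_id>[0-9a-fA-F]+) "
    r"in RTS_DONE message not found\."
)


def find_rts_done_errors(lines):
    """Return a (timestamp, mop_id) tuple for every matching error line."""
    hits = []
    for line in lines:
        match = RTS_DONE_RE.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("mop_id")))
    return hits


def summarize(logs):
    """Map each (hypothetical) server name to its RTS_DONE error count.

    `logs` is a dict of server name -> iterable of log lines.
    """
    return {name: len(find_rts_done_errors(lines)) for name, lines in logs.items()}
```

Feeding it the lines of each server's log (for example an open file
object) should show which of the servers hit the error and at what time,
which can then be lined up against the job load on the clients.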
~Kyle
On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter
<ewalter at particle.physics.wm.edu> wrote:
> Dear pvfs2-users,
>
> I have been trying to get pvfs2 working over infiniband for a few
> weeks now and have made a lot of progress. I am still stuck on one
> last thing I can't seem to fix.
>
> Basically, everything will be fine for a while (like a few days), then
> I see the following in one of the pvfs2-server.logs (when the
> debugging mask is set to "all"):
>
> [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
> [E 03/30 11:50] [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
> [E 03/30 11:50] [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
>
> At this point all mounts will be hung and will require a
> restart/remount of all servers and clients, and all jobs using this
> space will need to be restarted.
>
> Only one server at a time seems to suffer this problem. We have 3
> servers total for I/O (one of which handles both metadata and I/O),
> and this message can occur on any of the 3 servers.
>
> It seems that this occurs only when the number of clients accessing
> the file system gets larger than, say, 15-20, or perhaps it is a
> filesystem load issue? I haven't been able to tell...
>
> I am using the CVS version from 03/23/08 (I have also tried version
> 2.6.3 but this had other problems mentioned in the pvfs2 users mailing
> list, so I decided to go to the CVS version).
>
> I am using OFED version 1.1 on a cluster of dual core/processor
> Opterons running kernel 2.6.9-42.ELsmp. We have 114 clients which
> mount the pvfs file space over infiniband and use it as scratch space.
> They don't use MPI-IO/ROMIO; they just write directly to the pvfs2 file
> space mounted via IB (I guess they write through the kernel
> interface). The errors seem to occur when more than 15-20 processors'
> worth of jobs try to read/write to the pvfs scratch space, or they
> could be just random.
>
> Does anyone have some clues for how to debug this further or track
> down what the problem is?
>
> Any suggestions are welcome.
>
> Thanks,
>
> Eric J. Walter
> Department of Physics
> College of William and Mary
>
>
> _______________________________________________
> Pvfs2-users mailing list
> Pvfs2-users at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>
--
Kyle Schochenmaier