[PVFS-users] FW: PVFS Hangups during concurrent read/writes

Brannen S Hough bshough at impactsci.com
Fri Aug 6 13:03:13 EDT 2004


	I recompiled the pvfs-kernel code with -g and set up the two-node PVFS
cluster I've been testing on.  When I ran the two test program copies on
different machines and they hung, I attached ddd to pvfsd and looked at the
stack.  It points back to a call to do_jobs_handle_error(), which called
check_socks(), and check_socks() is where it is stuck.  (Unfortunately, I
didn't recompile the rest of the PVFS code with -g, so I have no debug info
or exact line number there, but that function is pretty small.)
	There is a long note in that code about different techniques for
waiting for data on the socket, along with notes about different kernel
revisions.  I'm using kernel 2.4.20-6 right now.
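	As an illustration only (this is not the PVFS check_socks() code,
which isn't reproduced in this thread), here is a minimal Python sketch of
one common technique for waiting on a socket: select() with a bounded
timeout, so a silent peer shows up as a timeout instead of an indefinite
block.  The helper name wait_readable and the 0.1 s timeout are
hypothetical.

```python
import select
import socket

def wait_readable(sock, timeout_s):
    """Hypothetical helper: return True if `sock` has data (or EOF) ready
    to read within `timeout_s` seconds, False if the timeout expires."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)

if __name__ == "__main__":
    # A connected pair of local sockets stands in for a client/daemon link.
    a, b = socket.socketpair()
    try:
        print(wait_readable(a, 0.1))  # nothing sent yet, so this times out
        b.sendall(b"x")
        print(wait_readable(a, 0.1))  # one byte is pending, so a is readable
    finally:
        a.close()
        b.close()
```

Whether check_socks() uses select(), poll(), or something else on a 2.4.20
kernel is not shown here; the sketch only shows why a bounded wait turns a
stalled connection into a detectable timeout.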
	I'll look into it in more depth on Monday.

	Thanks,
	- Brannen

> -----Original Message-----
> From: Rob Ross [mailto:rross at mcs.anl.gov]
> Sent: Thursday, August 05, 2004 5:34 PM
> To: Brannen S Hough
> Cc: pvfs-users at beowulf-underground.org
> Subject: RE: [PVFS-users] FW: PVFS Hangups during concurrent read/writes
> 
> What sort of network is this?  Are you getting errors on any of the
> interfaces?
> 
> Thanks,
> 
> Rob
> 
> On Thu, 5 Aug 2004, Brannen S Hough wrote:
> 
> >
> > 	Yet more information (not sure whether it helps): I set up a two-node
> > cluster.  I can run multiple test programs concurrently on the Manager
> > node, and I can run multiple test programs concurrently on the other node
> > (which is only an I/O node), but my test programs locked up when I ran
> > one on each node, one doing reads and the other doing writes.
> > 	The strange thing is that they aren't actually hung.  After a long
> > time (a couple of hours), one of the two started up again right where it
> > left off and ran at about normal speed to completion.  The other struggled
> > slowly through a couple more files, then stopped again for good; it was
> > still in the same place when I killed it a couple of hours later.  This
> > gets weirder and weirder.
> > 	Are people using PVFS mostly as a file server (where writes are rare
> > and most accesses are reads)?  Any suggestions are welcome.
> >
> > 	- Brannen
> >
> > > -----Original Message-----
> > > From: Rob Ross [mailto:rross at mcs.anl.gov]
> > > Sent: Tuesday, August 03, 2004 2:58 PM
> > > To: Brannen S Hough
> > > Cc: pvfs-users at beowulf-underground.org
> > > Subject: Re: [PVFS-users] FW: PVFS Hangups during concurrent read/writes
> > >
> > > Hi Brannen,
> > >
> > > Thanks for the problem report and for trying the newest prerelease
> > > before getting back to us!
> > >
> > > What exactly is your test program?
> > >
> > > Does this only happen when you have the two iods running on the same
> > > node?
> > >
> > > It would be helpful to us if you could recompile with "-g", attach to
> > > the pvfsd, and get a stack dump.  Some debugging output could also
> > > help... try 0x077 as a start and see whether that puts much into the
> > > pvfsdlog file.
> > >
> > > The iods don't coordinate read and write operations in a way that
> > > would cause deadlock, so that shouldn't be it.  The mgr just serializes
> > > everything it does, so that should be ok too.  I'm not sure what is
> > > going on quite yet...
> > >
> > > Rob
> > >