[PVFS-users] RE: Hangup is back

Rob Ross rross at mcs.anl.gov
Thu Aug 19 16:52:33 EDT 2004


You might want to find a TCP testing program and just run that for a 
while, to see if the network will give out for something other than the 
file system.  Do big scp runs work, for example?

Rob
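
A TCP test of the kind suggested above does not need to be elaborate. The sketch below is only an illustration, not something from this thread: it streams a fixed number of bytes over one TCP connection and counts what arrives, with socket timeouts so that a stall shows up as an exception rather than a silent hang. The names receive_all, send_bytes and loopback_check are hypothetical, and the 30 second timeout and 64 KB chunk size are arbitrary assumptions. Existing tools such as ttcp or netperf serve the same purpose.

```python
import socket
import threading

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def receive_all(listener, timeout=30.0):
    """Accept one connection and return how many bytes arrived before EOF."""
    listener.settimeout(timeout)
    conn, _peer = listener.accept()
    total = 0
    with conn:
        conn.settimeout(timeout)  # a stalled link raises socket.timeout
        while True:
            chunk = conn.recv(CHUNK)
            if not chunk:
                return total
            total += len(chunk)


def send_bytes(host, port, total_bytes, timeout=30.0):
    """Stream total_bytes of zeros to host:port and return the count sent."""
    payload = bytes(CHUNK)
    sent = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    return sent


def loopback_check(total_bytes=4 * 1024 * 1024):
    """Self-test over 127.0.0.1; returns (bytes_sent, bytes_received)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []
    worker = threading.Thread(
        target=lambda: received.append(receive_all(listener)))
    worker.start()
    sent = send_bytes("127.0.0.1", port, total_bytes)
    worker.join()
    listener.close()
    return sent, received[0]


if __name__ == "__main__":
    print(loopback_check())
```

The loopback check only confirms that the script itself works. To exercise the real network, receive_all would run on one node and send_bytes on the other, in a loop, for roughly as long as the file system test takes to hang.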

On Thu, 19 Aug 2004, Brannen S Hough wrote:

> 
> 	Thanks for getting back so quickly.  I knew there had to be
> something strange about my case - seems too simple otherwise (I agree -
> everyone would be seeing it if it were a PVFS bug, or even an OS bug I'd
> imagine).  Though it is a lot harder to make it happen - it took over a half
> hour of nonstop traffic in 100 MB and 2 GB files to get it to happen this
> time, and that was after a full day of throwing everything at it I could
> (and not getting it to hang once).
> 
> 	The netstat is from the hung node, which was running the mgr and one
> iod (10.0.0.3).  The other node running the other iod is the 10.0.0.6
> machine.
> 
> 	Thanks for the explanation of the output - too bad it's normal (as
> in no clue there).   Thanks for all your help - not sure what else to do
> either.
> 
> 	- Brannen
> 
> > -----Original Message-----
> > From: Rob Ross [mailto:rross at mcs.anl.gov]
> > Sent: Thursday, August 19, 2004 4:40 PM
> > To: Brannen S Hough
> > Cc: pvfs-users at beowulf-underground.org
> > Subject: Re: Hangup is back
> > 
> > Hi Brannen,
> > 
> > Damn.
> > 
> > I don't know what to tell you, Brannen.  It is certainly possible that this
> > is a PVFS bug; however, given that you're able to repeat it with a
> > relatively simple test that works for others, I really have to think it's
> > something about your system.  If this were something in PVFS I would
> > expect to be hearing about it from virtually everyone running the system!
> > 
> > That netstat is from the hung node?  This is 10.0.0.3, right?  And it is
> > also the one running the mgr?
> > 
> > You can see that there is a connection between the mgr and the client
> > (3000-32784).  That's normal.
> > 
> > There are actually *four* connections to iods in this case, since the
> > mgr is running here.  There should be two connections from the mgr to the
> > iods and two from the client to the iods.
> > 
> > If you look, you can see two connections from the client to the iods (the
> > client connects to the iods right after the mgr, so it tends to get port
> > #s right after that one):
> >   32785-7000 (iod on 10.0.0.6)
> >   32786-7000 (iod on 10.0.0.3)
> > 
> > There are two other connections for the mgr (again, making assumptions
> > about port allocation):
> >   32773-7000 (iod on 10.0.0.6)
> >   32774-7000 (iod on 10.0.0.3)
> > 
> > You don't see the other end of the connections to the remote system,
> > which is why there are only 6 entries instead of 8.
> > 
> > Anyway, that all looks like a normal quiescent system, unfortunately.
> > 
> > I don't know how to further help with this one.  Sorry!
> > 
> > Rob
> > 
> > 
> > On Thu, 19 Aug 2004, Brannen S Hough wrote:
> > 
> > >              Just when I thought I was out of the woods - I thought I'd
> > > found that the networking issue causing the problems (though not
> > > directly) was related to having 2 gigabit Ethernet cards in one machine
> > > and trying to be on the local testing network (for the PVFS nodes) and
> > > the company network (for general use) at the same time.  I spent a day
> > > trying to reproduce the hangup to get more information, but could not.
> > >
> > >              The problem popped up again - while 2 copies of my test
> > > program were running (one on each node of a 2 node PVFS cluster).  It
> > > took 35 solid minutes of reading and writing files at full speed for it
> > > to show up, and only one of the programs hung.  The other completed
> > > normally.  They were both reading and writing files using the PVFS
> > > library calls (pvfs_open, etc).
> > >
> > >              I've got a run of 'netstat -tan' that might be useful,
> > > caught after the good test completed and while the bad test was still
> > > hung.  I didn't wait around to see if it would spontaneously restart.
> > >
> > > Recorded 3:15 PM, Aug 19, 2004 - running 2 test programs concurrently
> > > on 2 node PVFS.
> > > >
> > > Tests running in direct mode (PVFS library calls, not standard file I/O)
> > >
> > > Active Internet connections (servers and established)
> > > Proto Recv-Q Send-Q Local Address           Foreign Address         State
> > > tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
> > > tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
> > > tcp        0      0 10.0.0.3:7000           10.0.0.3:32786          ESTABLISHED
> > > tcp        0      0 10.0.0.3:32774          10.0.0.3:7000           ESTABLISHED
> > > tcp        0      0 10.0.0.3:32786          10.0.0.3:7000           ESTABLISHED
> > > tcp        0      0 10.0.0.3:3000           10.0.0.3:32784          ESTABLISHED
> > > tcp        0      0 10.0.0.3:32773          10.0.0.6:7000           ESTABLISHED
> > > tcp        0      0 10.0.0.3:7000           10.0.0.3:32774          ESTABLISHED
> > > tcp        0      0 10.0.0.3:32785          10.0.0.6:7000           ESTABLISHED
> > > tcp        0      0 10.0.0.3:32784          10.0.0.3:3000           ESTABLISHED
> > >
> > >              What do you think?  It seems odd that there are seven
> > > references to port 7000 (the 2 iods), but there isn't any traffic built
> > > up and pending.  Strange.
> 
> 
> 
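
The port accounting walked through in the quoted reply can be checked mechanically. The following is a rough sketch, assuming only what the thread states: iods listen on port 7000, the mgr on port 3000, and the input is 'netstat -tan' output in the six-column layout quoted above. The function names parse_netstat and summarize are hypothetical. It counts outbound connections to iods, the iod-side ends visible on the same host, and any bytes sitting in the Recv-Q and Send-Q columns.

```python
# The netstat rows quoted in this thread, one connection per line.
NETSTAT = """\
tcp 0 0 0.0.0.0:7000 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:3000 0.0.0.0:* LISTEN
tcp 0 0 10.0.0.3:7000 10.0.0.3:32786 ESTABLISHED
tcp 0 0 10.0.0.3:32774 10.0.0.3:7000 ESTABLISHED
tcp 0 0 10.0.0.3:32786 10.0.0.3:7000 ESTABLISHED
tcp 0 0 10.0.0.3:3000 10.0.0.3:32784 ESTABLISHED
tcp 0 0 10.0.0.3:32773 10.0.0.6:7000 ESTABLISHED
tcp 0 0 10.0.0.3:7000 10.0.0.3:32774 ESTABLISHED
tcp 0 0 10.0.0.3:32785 10.0.0.6:7000 ESTABLISHED
tcp 0 0 10.0.0.3:32784 10.0.0.3:3000 ESTABLISHED
"""

IOD_PORT = "7000"  # iod port, as stated in the thread


def parse_netstat(text):
    """Turn 'netstat -tan' rows into dicts; skip headers and other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] != "tcp":
            continue
        _proto, recvq, sendq, local, foreign, state = fields
        rows.append({
            "recvq": int(recvq),
            "sendq": int(sendq),
            "local_port": local.rsplit(":", 1)[1],
            "foreign_host": foreign.rsplit(":", 1)[0],
            "foreign_port": foreign.rsplit(":", 1)[1],
            "state": state,
        })
    return rows


def summarize(rows):
    """Count iod-related entries and any queued bytes."""
    est = [r for r in rows if r["state"] == "ESTABLISHED"]
    outbound = [r for r in est if r["foreign_port"] == IOD_PORT]
    server_ends = [r for r in est if r["local_port"] == IOD_PORT]
    return {
        "outbound_to_iod": len(outbound),
        "iod_server_ends": len(server_ends),
        "iod_entries": len(outbound) + len(server_ends),
        "queued_bytes": sum(r["recvq"] + r["sendq"] for r in rows),
    }


if __name__ == "__main__":
    print(summarize(parse_netstat(NETSTAT)))
```

On the quoted output this gives four outbound connections to iods, two iod-side ends (only the local iod's ends are visible on this host), six established entries involving port 7000 in total, and zero queued bytes. That matches the "6 entries instead of 8" reading in the reply, and the seventh reference to port 7000 is the LISTEN socket.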

