[PVFS-users] Re: Hangup is back

Rob Ross rross at mcs.anl.gov
Thu Aug 19 16:39:55 EDT 2004


Hi Brannen,

Damn.

I don't know what to tell you, Brannen.  It is certainly possible that this 
is a PVFS bug; however, given that you're able to repeat it with a 
relatively simple test that works for others, I really have to think it's 
something about your system.  If this were something in PVFS I would expect to be
hearing about it from virtually everyone running the system!

That netstat is from the hung node?  This is 10.0.0.3, right?  And it is 
also the one running the mgr?

You can see that there is a connection between the mgr and the client 
(3000-32784).  That's normal.

There are actually *four* connections to iods in this case, since the
mgr is running here.  There should be 2 connections from the mgr to the
iods and two from the client to the iods.

If you look, you can see two connections from the client to the iods (the 
client connects to the iods right after the mgr, so it tends to get port 
#s right after that one):
  32785-7000 (iod on 10.0.0.6)
  32786-7000 (iod on 10.0.0.3)

There are two other connections for the mgr (again, making assumptions about port
allocation):
  32773-7000 (iod on 10.0.0.6)
  32774-7000 (iod on 10.0.0.3)

You don't see the other end of the connections to the remote system, which is why
there are only 6 entries instead of 8.
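
For anyone who wants to check that counting mechanically, here is a minimal
Python sketch.  It is not a PVFS tool, and the helper names are made up for
illustration.  It assumes the mgr is on port 3000 and the iods are on port
7000, as in the listing quoted below, and it treats two entries with swapped
local/foreign addresses as the two ends of one local connection.

```python
# Sketch only: helper names are hypothetical, not part of PVFS.
# Assumes mgr on port 3000 and iods on port 7000, as in the quoted listing.

NETSTAT = """\
tcp 0 0 0.0.0.0:7000   0.0.0.0:*      LISTEN
tcp 0 0 0.0.0.0:3000   0.0.0.0:*      LISTEN
tcp 0 0 10.0.0.3:7000  10.0.0.3:32786 ESTABLISHED
tcp 0 0 10.0.0.3:32774 10.0.0.3:7000  ESTABLISHED
tcp 0 0 10.0.0.3:32786 10.0.0.3:7000  ESTABLISHED
tcp 0 0 10.0.0.3:3000  10.0.0.3:32784 ESTABLISHED
tcp 0 0 10.0.0.3:32773 10.0.0.6:7000  ESTABLISHED
tcp 0 0 10.0.0.3:7000  10.0.0.3:32774 ESTABLISHED
tcp 0 0 10.0.0.3:32785 10.0.0.6:7000  ESTABLISHED
tcp 0 0 10.0.0.3:32784 10.0.0.3:3000  ESTABLISHED
"""


def parse(text):
    """Return (local, foreign, state) for each tcp line of 'netstat -tan'."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] == "tcp":
            rows.append((fields[3], fields[4], fields[5]))
    return rows


def port_of(addr):
    """Port part of 'ip:port' (may be '*' for a listening socket)."""
    return addr.rsplit(":", 1)[1]


def established_on(rows, port):
    """ESTABLISHED entries where either end uses the given port."""
    port = str(port)
    return [(local, foreign) for (local, foreign, state) in rows
            if state == "ESTABLISHED"
            and port in (port_of(local), port_of(foreign))]


def connections(entries):
    """Collapse entries into connections: a local connection appears twice
    (once per end), a connection to a remote host appears once."""
    return {frozenset(pair) for pair in entries}


rows = parse(NETSTAT)
iod_entries = established_on(rows, 7000)
iod_conns = connections(iod_entries)
mgr_entries = established_on(rows, 3000)
mgr_conns = connections(mgr_entries)

print(len(iod_entries), len(iod_conns))  # 6 entries, 4 connections
print(2 * len(iod_conns))                # 8 entries if every end were visible
print(len(mgr_entries), len(mgr_conns))  # 2 entries, 1 connection
```

The two connections to 10.0.0.6 each contribute one entry, and the two to
the local iod each contribute two, which gives the 6 entries.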

Anyway, that all looks like a normal quiescent system, unfortunately.

I don't know how to further help with this one.  Sorry!

Rob


On Thu, 19 Aug 2004, Brannen S Hough wrote:

>              Just when I thought I was out of the woods - I thought I'd
> found that the networking issues that caused the problems (though not
> directly) were related to having 2 gigabit Ethernet cards in one machine
> and trying to be on the local testing network (for the PVFS nodes) and the
> company network (for general use) at the same time.  I spent a day trying
> to recreate the hangup to get more information, but could not.
> 
>              The problem popped up again - while 2 copies of my test program
> were running (one on each node of a 2 node PVFS cluster).  It took 35 solid
> minutes of reading and writing files at full speed for it to show up, and
> only one of the programs hung.  The other completed normally.  They were
> both reading and writing files using the PVFS library calls (pvfs_open,
> etc).
> 
>              I've got a run of 'netstat -tan' that might be useful, caught
> after the good test completed and while the bad test was still hung.  I
> didn't wait around to see if it would spontaneously restart.
> 
> Recorded 3:15 PM, Aug 19, 2004 - running 2 test programs concurrently on 2
> node PVFS.
> 
> Tests running in direct mode (PVFS library calls, not standard file I/O)
> 
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
> tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
> tcp        0      0 10.0.0.3:7000           10.0.0.3:32786          ESTABLISHED
> tcp        0      0 10.0.0.3:32774          10.0.0.3:7000           ESTABLISHED
> tcp        0      0 10.0.0.3:32786          10.0.0.3:7000           ESTABLISHED
> tcp        0      0 10.0.0.3:3000           10.0.0.3:32784          ESTABLISHED
> tcp        0      0 10.0.0.3:32773          10.0.0.6:7000           ESTABLISHED
> tcp        0      0 10.0.0.3:7000           10.0.0.3:32774          ESTABLISHED
> tcp        0      0 10.0.0.3:32785          10.0.0.6:7000           ESTABLISHED
> tcp        0      0 10.0.0.3:32784          10.0.0.3:3000           ESTABLISHED
> 
>              What do you think?  It seems odd that there are seven
> references to port 7000 (the 2 iods), but there isn't any traffic built up
> and pending.  Strange.

