[PVFS-users] Re: Hangup is back
Rob Ross
rross at mcs.anl.gov
Thu Aug 19 16:39:55 EDT 2004
Hi Brannen,
Damn.
I don't know what to tell you, Brannen. It is certainly possible that this
is a PVFS bug; however, given that you're able to repeat it with a
relatively simple test that works for others, I really have to think it's
something about your system. If this were something in PVFS I would expect to be
hearing about it from virtually everyone running the system!
That netstat is from the hung node? This is 10.0.0.3, right? And it is
also the one running the mgr?
You can see that there is a connection between the mgr and the client
(3000-32784). That's normal.
There are actually *four* connections to iods in this case, since the
mgr is running here. There should be 2 connections from the mgr to the
iods and two from the client to the iods.
If you look, you can see two connections from the client to the iods (the
client connects to the iods right after the mgr, so it tends to get port
#s right after that one):
32785-7000 (iod on 10.0.0.6)
32786-7000 (iod on 10.0.0.3)
There are two other connections for the mgr (again, making assumptions about port
allocation):
32773-7000 (iod on 10.0.0.6)
32774-7000 (iod on 10.0.0.3)
You don't see the other end of the connections to the remote system
(10.0.0.6), which is why there are only 6 iod entries instead of 8.
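For anyone who wants to repeat that accounting, here is a rough Python sketch. It is not part of PVFS, and the names parse and iod_summary are made up for illustration. It assumes the iods listen on port 7000, as in the netstat output quoted below, and it counts the two lines netstat prints for a local connection as a single connection.

```python
# Sketch only: tally iod connections from 'netstat -tan' output.
# Assumption: iods listen on port 7000 (as in the quoted output).
NETSTAT = """\
tcp 0 0 0.0.0.0:7000 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:3000 0.0.0.0:* LISTEN
tcp 0 0 10.0.0.3:7000 10.0.0.3:32786 ESTABLISHED
tcp 0 0 10.0.0.3:32774 10.0.0.3:7000 ESTABLISHED
tcp 0 0 10.0.0.3:32786 10.0.0.3:7000 ESTABLISHED
tcp 0 0 10.0.0.3:3000 10.0.0.3:32784 ESTABLISHED
tcp 0 0 10.0.0.3:32773 10.0.0.6:7000 ESTABLISHED
tcp 0 0 10.0.0.3:7000 10.0.0.3:32774 ESTABLISHED
tcp 0 0 10.0.0.3:32785 10.0.0.6:7000 ESTABLISHED
tcp 0 0 10.0.0.3:32784 10.0.0.3:3000 ESTABLISHED
"""

IOD_PORT = "7000"  # assumed iod listening port


def parse(text):
    """Turn netstat lines into dicts; skip anything that is not a tcp row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] != "tcp":
            continue
        local = tuple(fields[3].rsplit(":", 1))    # (host, port)
        remote = tuple(fields[4].rsplit(":", 1))   # (host, port)
        rows.append({
            "recvq": int(fields[1]),
            "sendq": int(fields[2]),
            "local": local,
            "remote": remote,
            "state": fields[5],
        })
    return rows


def iod_summary(rows):
    """Return (iod entries, distinct iod connections, total queued bytes)."""
    established = [r for r in rows if r["state"] == "ESTABLISHED"]
    iod_entries = [
        r for r in established
        if IOD_PORT in (r["local"][1], r["remote"][1])
    ]
    # A local connection shows up twice (once per end); an unordered
    # endpoint pair collapses both lines into one connection.
    connections = {frozenset((r["local"], r["remote"])) for r in iod_entries}
    queued = sum(r["recvq"] + r["sendq"] for r in established)
    return len(iod_entries), len(connections), queued


entries, connections, queued = iod_summary(parse(NETSTAT))
print(entries, connections, queued)
```

On the quoted output this reports 6 iod entries, 4 distinct iod connections, and nothing sitting in the send or receive queues.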
Anyway, that all looks like a normal quiescent system, unfortunately.
I don't know how to further help with this one. Sorry!
Rob
On Thu, 19 Aug 2004, Brannen S Hough wrote:
> Just when I thought I was out of the woods - I thought I'd
> found that the networking issue causing the problems (though not directly)
> was related to having 2 gigabit Ethernet cards in one machine and trying to
> be on the local testing network (for the PVFS nodes) and the company network
> (for general use) at the same time. I spent a day trying to reproduce the
> hangup to get more information, but could not.
>
> The problem popped up again - while 2 copies of my test program
> were running (one on each node of a 2 node PVFS cluster). It took 35 solid
> minutes of reading and writing files at full speed for it to show up, and
> only one of the programs hung. The other completed normally. They were
> both reading and writing files using the PVFS library calls (pvfs_open,
> etc).
>
> I've got a run of 'netstat -tan' that might be useful, caught
> after the good test completed and while the bad test was still hung. I
> didn't wait around to see if it would spontaneously restart.
>
> Recorded 3:15 PM, Aug 19, 2004 - running 2 test programs concurrently on 2
> node PVFS.
>
> Tests running in direct mode (PVFS library calls, not standard file I/O).
>
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address     Foreign Address   State
> tcp        0      0 0.0.0.0:7000      0.0.0.0:*         LISTEN
> tcp        0      0 0.0.0.0:3000      0.0.0.0:*         LISTEN
> tcp        0      0 10.0.0.3:7000     10.0.0.3:32786    ESTABLISHED
> tcp        0      0 10.0.0.3:32774    10.0.0.3:7000     ESTABLISHED
> tcp        0      0 10.0.0.3:32786    10.0.0.3:7000     ESTABLISHED
> tcp        0      0 10.0.0.3:3000     10.0.0.3:32784    ESTABLISHED
> tcp        0      0 10.0.0.3:32773    10.0.0.6:7000     ESTABLISHED
> tcp        0      0 10.0.0.3:7000     10.0.0.3:32774    ESTABLISHED
> tcp        0      0 10.0.0.3:32785    10.0.0.6:7000     ESTABLISHED
> tcp        0      0 10.0.0.3:32784    10.0.0.3:3000     ESTABLISHED
>
> What do you think? It seems odd that there are seven
> references to port 7000 (the 2 iods), but there isn't any traffic built up
> and pending. Strange.