[PVFS-users] RE: PVFS Hangups during concurrent reads/writes

Brannen S Hough bshough at impactsci.com
Tue Aug 17 18:28:31 EDT 2004


	OK - I was working on other things for a few days, so I just got
back to PVFS today.  In the meantime I was doing other network-heavy
testing and was seeing erratic behavior on my Xeon box.  To eliminate
variables, I disconnected the 2nd gigabit Ethernet card in the Xeon, and
the erratic behavior went away (so now I am just using the one built into
the motherboard).  I then reran the scenario that was causing the
hang-ups and, lo and behold, it runs fine.
	Not sure whether I should be encouraged or discouraged by this,
since I still have no good idea what the root cause of the problem is.
While running the tests, network traffic would stop for a second or so, then
get right back to blinking madly.  I captured some 'netstat -tan' dumps
while it was going on, but they don't seem to point the finger at anything.
	I haven't read anything about RedHat 9 having problems with multiple
NICs/networks set up.  The dual-NIC setup probably helps explain why no one
else has seen something like this.

	Anyway, here are the dumps:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0    128 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0    293 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0    136 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0     48 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp       48      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0     48 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0     48 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp      128      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp       48      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0     48 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0     48 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32802          ESTABLISHED
[root at TestHAClient testpvfs]# /bin/netstat -tan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:32768           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:32769         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:32770           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:937             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.3:3000           10.0.0.4:35976          ESTABLISHED
tcp        0      0 10.0.0.3:32771          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:32783          10.0.0.4:22             ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.3:32777          ESTABLISHED
tcp        0      0 10.0.0.3:32777          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32802          10.0.0.3:7000           ESTABLISHED
tcp        0      0 10.0.0.3:3000           10.0.0.3:32791          ESTABLISHED
tcp        0      0 10.0.0.3:32778          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:32803          10.0.0.4:7000           ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078          ESTABLISHED
tcp        0      0 10.0.0.3:32791          10.0.0.3:3000           ESTABLISHED
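
For anyone who wants to scan dumps like these for sockets with data sitting in a queue, here is a minimal Python sketch. It is only an illustration: the names `Conn`, `parse_netstat`, `queued`, and `SAMPLE` are made up for this example, and the parser assumes the 'netstat -tan' column layout shown above (it also tolerates the state being wrapped onto the following line).

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str


def parse_netstat(text: str) -> List[Conn]:
    """Parse `netstat -tan` text; headers and shell prompts are skipped."""
    conns = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        fields = lines[i].split()
        i += 1
        if len(fields) < 5 or fields[0] != "tcp":
            continue  # header line, prompt, or a stray wrapped state
        if len(fields) >= 6:
            state = fields[5]
        elif i < len(lines) and len(lines[i].split()) == 1:
            # The state was wrapped onto its own line (as mail clients do).
            state = lines[i].split()[0]
            i += 1
        else:
            state = ""
        conns.append(Conn(int(fields[1]), int(fields[2]),
                          fields[3], fields[4], state))
    return conns


def queued(conns: List[Conn]) -> List[Conn]:
    """Return only the connections with a non-empty Recv-Q or Send-Q."""
    return [c for c in conns if c.recv_q or c.send_q]


# A few lines taken from the first dump, including wrapped states.
SAMPLE = """\
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0    128 10.0.0.3:32803          10.0.0.4:7000
ESTABLISHED
tcp        0      0 10.0.0.3:7000           10.0.0.4:36078
ESTABLISHED
tcp        0    293 10.0.0.3:7000           10.0.0.3:32802
ESTABLISHED
"""

if __name__ == "__main__":
    for c in queued(parse_netstat(SAMPLE)):
        print(f"{c.local} -> {c.remote}: Recv-Q={c.recv_q} Send-Q={c.send_q}")
```

Running the same filter over successive dumps would show whether a given queue stays non-empty from one capture to the next, which is what a stalled peer would presumably look like; a queue that drains between captures is probably just normal traffic in flight.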


> -----Original Message-----
> From: Rob Ross [mailto:rross at mcs.anl.gov]
> Sent: Thursday, August 12, 2004 3:51 PM
> To: Brannen S Hough
> Cc: pvfs-users at beowulf-underground.org
> Subject: RE: PVFS Hangups during concurrent reads/writes
> 
> Brannen,
> 
> Murali attempted to reproduce this with no luck.  It appears that someone
> else has reproduced it though...
> 
> The iods use a single socket for listening, but of course you end up with
> a separate socket per client (pvfsd or application using the library).
> It's fairly unlikely that it is getting "confused" -- that would be
> something that would have come up sooner I think.
> 
> It is possible, I guess, that the iod is somehow losing track of the
> pvfsd's socket.  If that were to happen, it's possible that the
> pvfsd would hang indefinitely.  If the pvfsd were to later time out, then
> it would be able to re-establish communication.
> 
> It would be helpful to me for you to run "netstat -tan" on the clients and
> servers during one of these stalled periods to see where data is sitting.
> This might help us figure out if the iod is somehow not seeing data
> waiting for it or if the pvfsd is somehow just hanging up.
> 
> I'm truly sorry that I don't have a better solution to this problem, but
> it's very difficult to debug these things when they can't be easily
> replicated here.  Thanks for all your patience.
> 
> Rob
> 
> On Tue, 10 Aug 2004, Brannen S Hough wrote:
> 
> >       Hi Rob,
> >
> >       Another follow up. I'm attaching the /etc/pvfstab file I'm using.  I
> > mount the PVFS file system with the command
> >
> > "mount -t pvfs TestHAClient:/pvfs-meta /mnt/pvfs".
> >
> >       I linked in the latest headers and recompiled my test program to run
> > it again.  No dice, same effect happens when running 2 test programs
> > simultaneously on 2 different machines.  Running one of the test programs
> > on a machine that wasn't acting as part of the cluster (Manager or IONode)
> > didn't make a difference either.
> >
> >       Can you reproduce what I'm seeing there using my test program?  And
> > if you have a test program that works well for you, could you send me a
> > copy?  Maybe I can reproduce you not seeing what I see.
> >
> >       Do the IONodes use just the one socket (7000) for reading and
> > writing?  Is it possible that it has two connections open, one for each of
> > the pvfsd instances communicating with it, and is getting reads vs. writes
> > confused (i.e. thinking it needs to wait on incoming data on a socket it
> > actually needs to write data to, and vice versa)?  The hang-ups only happen
> > when one test program is reading while the other one is writing (though
> > never to the same files - I'm careful about that) - though that does not
> > explain how they can "wake up" after a long time and continue where they
> > left off.
> >
> >
> > > -----Original Message-----
> > > From: Rob Ross [mailto:rross at mcs.anl.gov]
> > > Sent: Monday, August 09, 2004 5:42 PM
> > > To: Brannen S Hough
> > > Cc: pvfs-users at beowulf-underground.org
> > > Subject: Re: PVFS Hangups during concurrent reads/writes
> > >
> > > Hi Brannen,
> > >
> > > I did a quick search and couldn't find any mention of 2.4.20 select()
> > > problems.  Of course I would like this to be a kernel problem, or perhaps
> > > a libc problem, but I don't see anything indicating that others have had
> > > the same issues.
> > >
> > > At the same time, no, we haven't had anything like this reported either!
> > > It's particularly odd to me that things fail when on different
> > > machines while working just fine on the same machine!  Usually it is the
> > > other way around :).
> > >
> > > Your test program is a little odd in that it moves back and forth between
> > > using the kernel and using the user library (if my cursory skim got the
> > > right impression).  Also, you're playing a dangerous game keeping extra
> > > copies of the PVFS headers in the test subdirectory; there are changes
> > > between what I see in there and CVS for sure.
> > >
> > > Have you tried just using the kernel interface or just using the library?
> > > If so, did those work ok?  Do you have an /etc/pvfstab file set up on your
> > > machine pointing exactly to the same directory as the mount point?
> > >
> > > Can you verify for me that PVFS_USE_NODELAY is defined in pvfs/config.h
> > > (not pvfs-kernel)?  It's probably defined twice (it's ok).
> > >
> > > Thanks, and sorry we don't have a quick solution for you!
> > >
> > > Rob
> > >
> > > On Mon, 9 Aug 2004, Brannen S Hough wrote:
> > >
> > > >              I've been trying to isolate this problem and find a way
> > > > around it.  At its core it seems to be a select() call problem, which
> > > > would mean a Linux kernel problem.   Attached is a screen shot of the
> > > > trace from running ddd on pvfsd; it gets hung up on line 199 in
> > > > sockset.c, which is calling dfd_select() (in
> > > > pvfs-1.6.3-pre3/shared/dfd_set.c), which is calling select().
> > > >
> > > >              I tried updating my RedHat 9 to kernel version 2.4.20-31.9,
> > > > recompiling everything, and rerunning my tests, but I got the same
> > > > results.  Any other ideas?   I could try rewriting the dfd_select
> > > > routine to break out each socket file descriptor individually and call
> > > > select() on each instead of passing the array of file descriptors to
> > > > select(), but I'm not sure that would fix the problem (and it would make
> > > > things slightly less efficient).
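
The per-descriptor select() idea from the oldest message quoted above can be sketched roughly as follows. This is not the PVFS code (dfd_select() is C, and nothing in this thread shows its body); it uses Python's standard `select` and `socket` modules only to contrast the two calling patterns. The function names `ready_together` and `ready_individually`, and the explicit timeouts, are assumptions made for the illustration.

```python
import select
import socket


def ready_together(socks, timeout):
    """One select() call over the whole set of sockets."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


def ready_individually(socks, timeout):
    """One select() call per socket.

    The timeout applies to each call, so the worst case is
    len(socks) * timeout, which is the efficiency cost mentioned above.
    """
    readable = []
    for s in socks:
        r, _, _ = select.select([s], [], [], timeout)
        readable.extend(r)
    return readable


if __name__ == "__main__":
    # Two local socket pairs stand in for two client connections.
    a_tx, a_rx = socket.socketpair()
    b_tx, b_rx = socket.socketpair()
    try:
        b_tx.sendall(b"ping")  # only the second connection has data waiting
        together = ready_together([a_rx, b_rx], 0.5)
        individually = ready_individually([a_rx, b_rx], 0.1)
        print("same result:", together == individually == [b_rx])
    finally:
        for s in (a_tx, a_rx, b_tx, b_rx):
            s.close()
```

Both patterns report the same ready sockets here, which fits the doubt expressed in the quoted message: splitting the call changes how long the loop can block, but by itself it would not be expected to reveal data that a single select() over the whole set misses.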




