[PVFS-users] RE: PVFS Hangups during concurrent reads/writes
David S Metheny
david.s.metheny at conwaycorp.net
Thu Aug 12 16:04:38 EDT 2004
I've not seen the job come back to life and I've let one of the jobs sit for
more than 12 hours. However, I don't let all the jobs sit for very long in
the hung state before canceling them.
A couple more notes:
- When the client node hung up during the 2 client node test, I could still
access the PVFS cluster from other nodes. When trying to do a "df -h" or an
"ls" to PVFS on the "hung client", the command never returned.
- We switched to using the pvfsd (user space daemon) on the client nodes
from the 1.6.3-pre3 release. We could not replicate the problem with the
user space daemon.
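A minimal sketch of how the "command never returned" check above could be scripted, so that the probe itself does not hang on a stuck client. The mount point /mnt/pvfs and the helper name responds_within are assumptions for illustration and are not from the original report.

```python
import subprocess


def responds_within(command, timeout_s=10.0):
    """Return True if `command` exits within timeout_s seconds, else False.

    The exit status is ignored: the only question is whether the command
    returns at all, which is what distinguishes a hung mount.
    """
    proc = subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait for the process
        # afterwards, because a process blocked in kernel I/O may not die.
        proc.kill()
        return False


if __name__ == "__main__":
    mount = "/mnt/pvfs"  # hypothetical mount point; adjust for your site
    for cmd in (["df", "-h", mount], ["ls", mount]):
        state = "returned" if responds_within(cmd) else "did not return (possible hang)"
        print(" ".join(cmd), "->", state)
```

Running this on each client node gives a quick way to tell which nodes can still reach the file system and which one is stuck.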
From: Rob Ross [mailto:rross at mcs.anl.gov]
Sent: Thursday, August 12, 2004 2:53 PM
To: David S Metheny
Cc: 'Brannen S Hough'; pvfs-users at beowulf-underground.org
Subject: RE: [PVFS-users] RE: PVFS Hangups during concurrent reads/writes
This sounds different from what Brannen is seeing, as he was seeing the same
problem for a wide variety of versions.
What type of job is it, does it come back to life later, etc?
On Thu, 12 Aug 2004, David S Metheny wrote:
> We are seeing very similar problems with the kpvfsd. We verified
> that our job runs on 2 client nodes with the 1.6.3-pre1 release. The
> same job "hangs up" on 1 of the clients using the 1.6.3-pre3 with
> kpvfsd. When I run using the PVFS library from the 1.6.3-pre3 release, the
> job runs fine.
> Each client will be doing read/write accesses to the PVFS cluster.
> The clients read portions of the same input file, and write to
> individual output files. Then each client reads its individual output
> files, and writes to different locations in a single output file. One
> client is successfully able to complete all these tasks, and the other