[PVFS-users] RE: PVFS Hangups during concurrent reads/writes

David S Metheny david.s.metheny at conwaycorp.net
Thu Aug 12 16:04:38 EDT 2004


I've not seen the job come back to life, and I've let one of the jobs sit for
more than 12 hours. However, I don't usually let jobs sit in the hung state
for very long before canceling them.

A couple more notes:

- When the client node hung during the 2-client-node test, I could still
access the PVFS cluster from other nodes. On the hung client, a "df -h" or an
"ls" against PVFS never returned.

- We switched the client nodes to pvfsd (the user-space daemon) from the
1.6.3-pre3 release. We could not reproduce the problem with the user-space
daemon.

-----Original Message-----
From: Rob Ross [mailto:rross at mcs.anl.gov] 
Sent: Thursday, August 12, 2004 2:53 PM
To: David S Metheny
Cc: 'Brannen S Hough'; pvfs-users at beowulf-underground.org
Subject: RE: [PVFS-users] RE: PVFS Hangups during concurrent reads/writes

This sounds different from what Brannen is seeing, as he was seeing the same
problem for a wide variety of versions.

What type of job is it, does it come back to life later, etc?

Thanks,

Rob

On Thu, 12 Aug 2004, David S Metheny wrote:

> Rob/Brannen,
>     We are seeing very similar problems with the kpvfsd. We verified
> that our job runs on 2 client nodes with the 1.6.3-pre1 release. The
> same job "hangs up" on 1 of the clients using the 1.6.3-pre3 release
> with kpvfsd. When I run using the PVFS library from the 1.6.3-pre3
> release, the job runs fine.
>  
>     Each client does read/write accesses to the PVFS cluster.
> The clients read portions of the same input file and write to
> individual output files. Then each client reads its individual
> output files and writes to different locations in a single output
> file. One client is able to complete all these tasks successfully,
> and the other hangs up.
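
For anyone trying to reproduce this, the access pattern described in the
quoted message can be sketched roughly as below. This is only an illustration
under assumptions: the actual job is not shown in this thread, so the file
names, the chunk size, and the run_client/run_demo helpers are all
hypothetical. It also runs the "clients" one after another in a scratch
directory, whereas the real job runs them concurrently on separate nodes
against a PVFS mount.

```python
import os
import tempfile

CHUNK = 4096  # bytes handled per client; arbitrary size chosen for this sketch


def run_client(rank, input_path, workdir, combined_path):
    """One client's share of the workload (rank is a hypothetical client number)."""
    offset = rank * CHUNK

    # Phase 1: read this client's portion of the shared input file.
    fd = os.open(input_path, os.O_RDONLY)
    try:
        data = os.pread(fd, CHUNK, offset)
    finally:
        os.close(fd)

    # Phase 2: write that portion to this client's individual output file.
    part_path = os.path.join(workdir, "part.%d" % rank)
    with open(part_path, "wb") as f:
        f.write(data)

    # Phase 3: read the individual output file back.
    with open(part_path, "rb") as f:
        data = f.read()

    # Phase 4: write it at this client's location in the single output file.
    fd = os.open(combined_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)
    return len(data)


def run_demo(nclients=2, workdir=None):
    """Run every client in turn and return (input bytes, combined output bytes)."""
    workdir = workdir or tempfile.mkdtemp()
    input_path = os.path.join(workdir, "input.dat")
    combined_path = os.path.join(workdir, "combined.dat")
    with open(input_path, "wb") as f:
        f.write(os.urandom(nclients * CHUNK))
    for rank in range(nclients):
        run_client(rank, input_path, workdir, combined_path)
    with open(input_path, "rb") as f:
        original = f.read()
    with open(combined_path, "rb") as f:
        combined = f.read()
    return original, combined
```

To exercise it against a mounted file system, pass a directory on that mount
as workdir and start one run_client call per node instead of using the
sequential loop in run_demo.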


