[Pvfs2-users] PVFS2 v.1.5.1 'Job time out' on some pvfs2-cp and
pvfs2-rm
Mark Van De Vyver
mvyver at gmail.com
Thu Feb 15 20:47:41 EST 2007
Hi,
Thank you for all the effort put into making PVFS2 available.
I'm relatively new to Linux (from WinXP), and have built a 3 node
cluster using the Rocks Cluster software v4.2.1. I've installed the
PVFS2 roll and by following the PVFS2 roll guide all has proceeded
very smoothly - really, thanks - I'd expected a few days/weeks to get
to this point.
At the end of this email I pose some questions that the following
behavior has raised.
About my set-up:
A single user. I made no changes to the PVFS configuration
established by the PVFS2 roll, and have one head node and two
compute-I/O nodes.
PVFS version 1.5.1
The unexpected behavior:
Using pvfs2-cp I have copied approximately 900GB of files from several
DVDs (I dd each disc to a tmpfs area, then pvfs2-cp this 'image' to
/mnt/pvfs2/some/path).
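For reference, the copy procedure looks roughly like the sketch below.
The function name, the DRY_RUN switch, and the example device and tmpfs
paths are placeholders for illustration, not my exact commands; only
dd, pvfs2-cp and the /mnt/pvfs2/some/path destination come from the
description above.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: stage a disc image in tmpfs, then copy it into PVFS2.
#   $1  source device        (assumed, e.g. /dev/dvd)
#   $2  staging file in tmpfs (assumed, e.g. /dev/shm/disc1.iso)
#   $3  destination under the PVFS2 mount point
stage_and_copy() {
    _run dd "if=$1" "of=$2" bs=1M || return 1   # read the disc into tmpfs
    _run pvfs2-cp "$2" "$3" || return 1         # copy the image into PVFS2
    _run rm -f "$2"                             # free the tmpfs space
}

# Example (prints the commands only):
#   DRY_RUN=1 stage_and_copy /dev/dvd /dev/shm/disc1.iso /mnt/pvfs2/some/path/disc1.iso
```

Running it with DRY_RUN=1 first shows the three commands without
touching the file system.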
I have noticed that this runs fine so long as it is the first time the
file is copied. If I use pvfs2-rm to delete a file, not necessarily
from the same node used to make the copy, the following occurs (all
nodes seem to be up and working fine):
- I can see the file is removed using the gnome file browser.
- The pvfs2-rm seems to hang, and the following message is displayed:
[E 15:10:02.584608] Job time out: cancelling bmi operation, job_id: 21.
[E 15:10:02.584769] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
If I try to re-copy the file (using pvfs2-cp), again not necessarily
from the same node it was first copied on, then I see the following
and the copy fails:
[E 15:26:53.690560] Job time out: cancelling bmi operation, job_id: 25.
[E 15:26:53.690710] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
[E 15:26:53.690733] *** msgpairarray_completion_fn: msgpair to server
tcp://pvfs2-compute-0-1:3334 failed: Operation cancelled (possibly due
to timeout)
[E 15:26:53.690743] *** No retries requested.
pvfs2-cp: src/client/sysint/sys-getattr.sm:331: getattr_acache_lookup:
Assertion `object_ref.handle != ((PVFS_handle)0)' failed.
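In case it helps narrow things down, a small filter like the one below
could tally which server addresses appear in saved client error output.
This is only a sketch: the function name and the log file name are
made up, and it assumes the messages have the same tcp://host:port form
as the ones quoted above.

```shell
# Hypothetical helper: count how often each tcp://host:port address
# appears in PVFS2 client error output read from standard input.
failed_servers() {
    grep -o 'tcp://[^ ]*' | sort | uniq -c | sort -rn
}

# Example (assumes the errors were saved to a file named client-errors.log):
#   failed_servers < client-errors.log
```

If one address dominates the tally, that node would be the first place
I'd look.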
On rebooting one of the nodes I was forced to run fsck, after this the
cluster seems to have returned to 'normal'.
The good news is that the standard Linux commands cp and rm don't seem
to have any trouble, so I am using those at the moment. I couldn't
find any advice on whether cp, etc., is preferred to pvfs2-cp, or vice
versa.
1) Is this a known issue that is fixed in PVFS 2.6?
2) Is it fine to continue to use v1.5.1 so long as I don't use the
pvfs2-* commands?
3) Is upgrading to v2.6 on a Rocks cluster straightforward, or is
it likely to involve some 'debugging' and a few days' work? Bear in
mind my relative inexperience with Linux.
Regards
Mark