[Pvfs2-users] PVFS2 v.1.5.1 'Job time out' on some pvfs2-cp and pvfs2-rm

Mark Van De Vyver mvyver at gmail.com
Thu Feb 15 22:53:40 EST 2007


Hi Sam,
Thank you for the prompt response.
I attach the /tmp/pvfs2-client.log files I found on each machine.  I
didn't see any /tmp/pvfs2-server.log files.  Are they one and the same?

I may have been a little hasty earlier in claiming that cp and rm work
fine... I think I am seeing some error when I cp from the tmpfs area
to the PVFS2 area.
I'm still trying to work out what is happening where and when.
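
One way to narrow down the "where and when" is to pull the timeout
errors, with their timestamps, out of each node's client log.  A minimal
Python sketch, assuming the log lines have the "[E HH:MM:SS.usec] message"
shape seen in the excerpts quoted below; the helper name find_timeouts is
hypothetical and not part of PVFS2:

```python
import re
from typing import Iterable, List, Tuple

# Line shape assumed from the excerpts quoted in this thread, e.g.
# "[E 15:10:02.584608] Job time out: cancelling bmi operation, job_id: 21."
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\] (?P<msg>.*)$"
)
# Matches both "time out" and "timeout", as both spellings appear in the logs.
TIMEOUT = re.compile(r"time ?out", re.IGNORECASE)


def find_timeouts(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) for error-level lines mentioning a timeout."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") == "E" and TIMEOUT.search(m.group("msg")):
            hits.append((m.group("time"), m.group("msg")))
    return hits
```

Calling find_timeouts(open("/tmp/pvfs2-client.log")) on each node and
comparing the timestamps should show which machine reported a timeout
first.  Lines that do not match the assumed shape are simply skipped.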

Hope this helps.
Regards
Mark


On 2/16/07, Sam Lang <slang at mcs.anl.gov> wrote:
>
> On Feb 15, 2007, at 7:47 PM, Mark Van De Vyver wrote:
>
> > Hi,
> > Thank you for all the effort put into making PVFS2 available.
> > I'm relatively new to Linux (from WinXP), and have built a 3 node
> > cluster using the Rocks  Cluster software v4.2.1.  I've installed the
> > PVFS2 roll and by following the PVFS2 roll guide all has proceeded
> > very smoothly - really, thanks - I'd expected a few days/weeks to get
> > to this point.
> >
> > At the end of this email I pose some questions that the following
> > behavior has raised.
> >
> > About my set-up:
> > A single user.  I made no changes to the PVFS configuration
> > established by the PVFS2 roll, and have one head node and two
> > compute-I/O nodes.
> > PVFS version 1.5.1
> >
> > The unexpected behavior:
> > Using pvfs2-cp I have copied approx 900GB of files from several DVDs
> > using dd (I dd to a tmpfs area then pvfs2-cp this 'image' to
> > /mnt/pvfs2/some/path).
> > I have noticed that this runs fine so long as it is the first time the
> > file is copied.  If I use pvfs2-rm to delete a file, not necessarily
> > from the same node used to make the copy, the following occurs (all
> > nodes seem to be up and working fine):
> > - I can see the file is removed using the gnome file browser.
> > - The pvfs2-rm seems to hang, and the following message is displayed:
> >
> > [E 15:10:02.584608] Job time out: cancelling bmi operation, job_id:
> > 21.
> > [E 15:10:02.584769] msgpair failed, will retry: Operation cancelled
> > (possibly due to timeout)
> >
> Hi Mark,
>
> It looks like the first failure with pvfs2-rm caused one of the
> servers to crash, giving the appearance that pvfs2-rm was hanging.
> It probably timed out at about 5 minutes or so?  The error message is
> that timeout.
>
> > If I try to re-copy the file (using pvfs2-cp), again not necessarily
> > from the same node it was first copied on, then I see the following
> > and the copy fails.
> >
> > [E 15:26:53.690560] Job time out: cancelling bmi operation, job_id:
> > 25.
> > [E 15:26:53.690710] msgpair failed, will retry: Operation cancelled
> > (possibly due to timeout)
> > [E 15:26:53.690733] *** msgpairarray_completion_fn: msgpair to server
> > tcp://pvfs2-compute-0-1:3334 failed: Operation cancelled (possibly due
> > to timeout)
>
> The failure here with pvfs2-cp at this point is also because the
> server crashed in the previous pvfs2-rm.
>
> > [E 15:26:53.690743] *** No retries requested.
> > pvfs2-cp: src/client/sysint/sys-getattr.sm:331: getattr_acache_lookup:
> > Assertion `object_ref.handle != ((PVFS_handle)0)' failed.
> > /
> >
>
> This is a bug: when pvfs2-cp fails due to a timeout, it shouldn't hit
> an assertion failure.  I will look into this, although it may have
> already been fixed since 1.5.1.
>
> > On rebooting one of the nodes I was forced to run fsck; after this the
> > cluster seems to have returned to 'normal'.
>
> You can probably just restart the servers to get things back.
>
> >
> > The good news is that the std linux commands: cp and rm don't seem to
> > have any trouble, so I am using those at the moment..... I couldn't
> > find any advice that cp, etc, is preferred to pvfs2-cp, or vice versa.
>
> I think in general a lot more effort is made to get the kernel module
> working properly than the client tools (pvfs2-*).  That being said,
> we don't discourage the use of the client tools, they just don't get
> as much pounding, and they aren't written to match the functionality
> that the VFS provides.
>
> >
> > 1) Is this a known issue that is fixed in PVFS 2.6?
>
> The issue I think is why pvfs2-rm causes the server(s) to crash.  If
> possible, could you send us the logs of the servers?  They should be
> in /tmp/pvfs2-server.log.
>
> > 2) Is it fine to continue to use v1.5.1 so long as I don't use the
> > PVFS-* commands?
>
> Yes.  There are known bugs in the 1.5.1 release, but they aren't
> likely to cause any problems for what you're doing.
>
> > 3) Is upgrading to v2.6 on a Rocks cluster 'straightforward', or is
> > it likely to involve some 'debugging' and a few days work - bear in
> > mind my relative inexperience with Linux.
>
> I've never installed Rocks so I'm going to have to let someone else
> answer that.  We pride ourselves on making PVFS easy to install and
> deploy, and that hasn't changed in the newer releases.
>
> -sam
>
> >
> > Regards
> > Mark
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pvfs2-client.log.frontend
Type: application/octet-stream
Size: 115352 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070216/5f07a6fe/pvfs2-client.log-0003.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pvfs2-client.log.compute-0-0
Type: application/octet-stream
Size: 1968 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070216/5f07a6fe/pvfs2-client.log-0004.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pvfs2-client.log.compute-0-1
Type: application/octet-stream
Size: 119 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070216/5f07a6fe/pvfs2-client.log-0005.obj

