[Pvfs2-users] PVFS2 v.1.5.1 'Job time out' on some pvfs2-cp and pvfs2-rm

Sam Lang slang at mcs.anl.gov
Thu Feb 15 23:03:22 EST 2007


Hi Mark,

It looks like you've got a number of configuration problems on your
nodes that need to be fixed before PVFS can be run.  On the frontend
node, there are a bunch of "Network is unreachable" errors in the
client log.  It looks like the network on that node isn't set up
properly.  On the compute-0-0 node, there are "Host name lookup
failure" errors, where it's trying to resolve 'frontend' but fails.
You need to set up DNS properly for that to work.
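
If it helps, here is a minimal shell sketch for counting those two
errors in a client log on each node.  The function name
count_net_errors is made up for illustration, and I'm assuming the
error strings appear in the log exactly as quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a log file that contain either
# of the two error strings quoted above.
count_net_errors() {
    # grep -c prints the number of matching lines.  It exits non-zero when
    # nothing matches, so "|| true" keeps the function's status at 0.
    grep -c -E 'Network is unreachable|Host name lookup failure' "$1" || true
}

# Report on every log file named on the command line.
for log in "$@"; do
    if [ ! -r "$log" ]; then
        echo "$log: not readable" >&2
        continue
    fi
    echo "$log: $(count_net_errors "$log") network/DNS error line(s)"
done
```

Saved under some name of your choosing, it would be run with the log
path as its argument, for example /tmp/pvfs2-client.log.  A non-zero
count on a node means that node still has the network or name lookup
problem.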

Probably a good first step would be to verify that you can ping one  
node from another.  From compute-0-0:

 >  ping frontend

When that succeeds, and pinging the other nodes in your cluster  
succeeds as well, then you can proceed with the PVFS setup.
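
To check all the nodes in one go, something along these lines might
work.  It is only a sketch: check_node is a made-up name, the -W
timeout flag assumes the Linux (iputils) ping, and getent is used on
the assumption that it goes through the same name lookup the rest of
the system uses.

```shell
#!/bin/sh
# Hypothetical helper: check name resolution first, then send one ping.
check_node() {
    host=$1
    # getent consults the system's configured name sources.
    if ! getent hosts "$host" >/dev/null 2>&1; then
        echo "$host: name lookup failed"
        return 1
    fi
    # One echo request, wait at most 2 seconds for the reply (Linux ping).
    if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host: resolves but does not answer ping"
        return 2
    fi
    echo "$host: ok"
}

# Check every host named on the command line and remember any failure.
failed=0
for node in "$@"; do
    check_node "$node" || failed=1
done
```

On each node you would pass it the other node names, for example
frontend, compute-0-0 and compute-0-1.  Splitting the lookup from the
ping tells you whether a failure is a DNS problem or a network problem.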

It doesn't look like you've set up any PVFS servers.  PVFS is an
asymmetric network file system: the servers usually run on backend  
nodes separate from the clients on the compute nodes.  There's more  
information about the configurations that PVFS works well for in the  
user's guide on the website.

-sam

On Feb 15, 2007, at 9:53 PM, Mark Van De Vyver wrote:

> Hi Sam,
> Thank you for the prompt response.
> I attach the /tmp/pvfs2-client.log files I found on each machine.  I
> didn't see any /tmp/pvfs2-server.log files.  Are they one and the same?
>
> I may have been a little hasty earlier in claiming that cp and rm work
> fine... I think I am seeing some error when I cp from the tmpfs area
> to the PVFS2 area.
> I'm still trying to work out what is happening where and when.
>
> Hope this helps.
> Regards
> Mark
>
>
> On 2/16/07, Sam Lang <slang at mcs.anl.gov> wrote:
>>
>> On Feb 15, 2007, at 7:47 PM, Mark Van De Vyver wrote:
>>
>> > Hi,
>> > Thank you for all the effort put into making PVFS2 available.
>> > I'm relatively new to Linux (from WinXP), and have built a 3 node
>> > cluster using the Rocks Cluster software v4.2.1.  I've installed the
>> > PVFS2 roll and by following the PVFS2 roll guide all has proceeded
>> > very smoothly - really, thanks - I'd expected a few days/weeks to get
>> > to this point.
>> >
>> > At the end of this email I pose some questions that the following
>> > behavior has raised.
>> >
>> > About my set-up:
>> > A single user.  I made no changes to the PVFS configuration
>> > established by the PVFS2 roll, and have one head node and two
>> > compute-I/O nodes.
>> > PVFS version 1.5.1
>> >
>> > The unexpected behavior:
>> > Using pvfs2-cp I have copied approx 900GB of files from several DVDs
>> > using dd (I dd to a tmpfs area then pvfs2-cp this 'image' to
>> > /mnt/pvfs2/some/path).
>> > I have noticed that this runs fine so long as it is the first time the
>> > file is copied.  If I use pvfs2-rm to delete a file, not necessarily
>> > from the same node used to make the copy, the following occurs (all
>> > nodes seem to be up and working fine):
>> > - I can see the file is removed using the gnome file browser.
>> > - The pvfs2-rm seems to hang, and the following message is displayed:
>> >
>> > [E 15:10:02.584608] Job time out: cancelling bmi operation, job_id: 21.
>> > [E 15:10:02.584769] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
>> >
>> Hi Mark,
>>
>> It looks like the first failure with pvfs2-rm caused one of the
>> servers to crash, giving the appearance that pvfs2-rm was hanging.
>> Did it time out after about 5 minutes or so?  The error message
>> reports that timeout.
>>
>> > If I try to re-copy the file (using pvfs2-cp), again, not necessarily
>> > from the same node it was first copied on, then I see the following
>> > and the copy fails.
>> >
>> > [E 15:26:53.690560] Job time out: cancelling bmi operation, job_id: 25.
>> > [E 15:26:53.690710] msgpair failed, will retry: Operation cancelled (possibly due to timeout)
>> > [E 15:26:53.690733] *** msgpairarray_completion_fn: msgpair to server tcp://pvfs2-compute-0-1:3334 failed: Operation cancelled (possibly due to timeout)
>>
>> The failure here with pvfs2-cp at this point is also because the
>> server crashed in the previous pvfs2-rm.
>>
>> > [E 15:26:53.690743] *** No retries requested.
>> > pvfs2-cp: src/client/sysint/sys-getattr.sm:331: getattr_acache_lookup: Assertion `object_ref.handle != ((PVFS_handle)0)' failed.
>> > /
>> >
>>
>> This is a bug: when pvfs2-cp fails due to a timeout, it shouldn't
>> hit an assertion failure.  I will look into this, although it may
>> have already been fixed since 1.5.1.
>>
>> > On rebooting one of the nodes I was forced to run fsck, after this the
>> > cluster seems to have returned to 'normal'.
>>
>> You can probably just restart the servers to get things back.
>>
>> >
>> > The good news is that the std linux commands: cp and rm don't seem to
>> > have any trouble, so I am using those at the moment..... I couldn't
>> > find any advice that cp, etc, is preferred to pvfs2-cp, or vice versa.
>>
>> I think in general a lot more effort is made to get the kernel module
>> working properly than the client tools (pvfs2-*).  That being said,
>> we don't discourage the use of the client tools, they just don't get
>> as much pounding, and they aren't written to match the functionality
>> that the VFS provides.
>>
>> >
>> > 1) Is this a known issue that is fixed in PVFS 2.6?
>>
>> The issue I think is why pvfs2-rm causes the server(s) to crash.  If
>> possible, could you send us the logs of the servers?  They should be
>> in /tmp/pvfs2-server.log.
>>
>> > 2) Is it fine to continue to use v1.5.1 so long as I don't use the
>> > PVFS-* commands?
>>
>> Yes.  There are known bugs in the 1.5.1 release, but they aren't
>> likely to cause any problems for what you're doing.
>>
>> > 3) Is upgrading to v2.6 on a rocks cluster 'straight forward', or is
>> > it likely to involve some 'debugging' and a few days work - bear in
>> > mind my relative inexperience with Linux.
>>
>> I've never installed Rocks so I'm going to have to let someone else
>> answer that.  We pride ourselves on making PVFS easy to install and
>> deploy, and that hasn't changed in the newer releases.
>>
>> -sam
>>
>> >
>> > Regards
>> > Mark
>> > _______________________________________________
>> > Pvfs2-users mailing list
>> > Pvfs2-users at beowulf-underground.org
>> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>> >
>>
>> <pvfs2-client.log.frontend>
>> <pvfs2-client.log.compute-0-0>
>> <pvfs2-client.log.compute-0-1>


