[Pvfs2-developers] PVFS2 Removal of large files
Rob Ross
rross at mcs.anl.gov
Thu Oct 2 16:51:06 EDT 2008
Maybe removing the 2TByte file takes longer than 30 seconds on ext3,
so the client times out. It would be useful to know when the server first
succeeds. Maybe some tuning on the client side could catch the case where,
on retry, the objects aren't there?
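
A rough sketch of that retry idea, in Python for brevity rather than the client's actual C code. The names here (`remove_with_retry`, `send_remove`, `RemoveTimeout`) are hypothetical and not PVFS2 APIs; the assumption is that a "no such object" error seen only after an earlier attempt timed out means the server finished the first remove late:

```python
import errno


class RemoveTimeout(Exception):
    """Hypothetical: one remove request timed out on the client side."""


def remove_with_retry(send_remove, retries=5):
    """Issue a remove, retrying on timeout.

    send_remove is a hypothetical callable that sends one remove request.
    It raises RemoveTimeout on a client timeout, or OSError on a server
    error. Returns True if the object is gone, False if retries run out.
    """
    timed_out_before = False
    for _ in range(retries + 1):  # first attempt plus the retries
        try:
            send_remove()
            return True
        except RemoveTimeout:
            # The server may still be working on the earlier request.
            timed_out_before = True
        except OSError as exc:
            # Assumption: ENOENT after a timed-out attempt means the
            # earlier request eventually succeeded on the server.
            if exc.errno == errno.ENOENT and timed_out_before:
                return True
            raise
    return False
```

An ENOENT on the very first attempt is still raised as an error, since nothing suggests an earlier request removed the object.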
-- Rob
On Oct 2, 2008, at 3:21 PM, "David Metheny" <david.metheny at gmail.com>
wrote:
> I’m seeing an issue when removing large files from a PVFS2 file
> system. My example setup is a 12 node PVFS2 file system with 2.2TB EXT3
> SAN mounts to each pvfs2 server. The server is configured for 30
> second timeouts and 5 retries. We really don’t want to change the
> timeout values and retries if possible.
>
>
>
> There is a 2TB file that exists. When the client tries to ‘rm’
> the 2TB file, the client basically goes through the 30 second
> timeout and exhausts the retries and then reports back to the command line
> “Invalid Argument”. From everything I can tell, the file
> *really* gets deleted and doesn’t show up in a directory listing.
>
>
>
> I’ve included the client command line results and the log messages
> from the delete below
>
>
>
> bash-2.05b$ rm cmsdb_silo_mstr_20080606a
>
> rm: cannot remove `cmsdb_silo_mstr_20080606a': Invalid argument
>
>
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955100.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955103.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955106.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955109.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955112.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955115.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955118.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955121.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955124.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955127.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955130.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955133.
>
> Oct 2 10:29:35 clientNode1 PVFS2: [E] msgpair failed, will retry:
> Operation cancelled (possibly due to timeout)
>
> Oct 2 10:29:36 clientNode1 last message repeated 11 times.
>
>
>
> <SKIPPING REPEAT OF THE ABOVE 5 MORE TIMES>
>
>
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server1HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server2HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server3HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server4HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server5HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server6HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server7HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server8HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server9HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server10HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server11HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] ***
> msgpairarray_completion_fn: msgpair to server tcp://server12HA:3334
> failed: Operation cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] Error: failed removing one or
> more datafiles associated with the meta handle 1610612708
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] WARNING: PVFS_sys_remove()
> encountered an error which may lead to inconsistent state: Operation
> cancelled (possibly due to timeout)
>
> Oct 2 10:32:10 clientNode1 PVFS2: [E] WARNING: PVFS2 fsck (if
> available) may be needed.
>
> Oct 2 10:32:10 clientNode1 kernel: pvfs2: warning: got error code
> without errno equivalent: -1610612865.
>
> Oct 2 10:32:59 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955696.
>
> Oct 2 10:32:59 clientNode1 PVFS2: [E] msgpair failed, will retry:
> Operation cancelled (possibly due to timeout)
>
> Oct 2 10:33:29 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955732.
>
> Oct 2 10:33:29 clientNode1 PVFS2: [E] msgpair failed, will retry:
> Operation cancelled (possibly due to timeout)
>
> Oct 2 10:33:59 clientNode1 PVFS2: [E] job_time_mgr_expire: job time
> out: cancelling bmi operation, job_id: 192955766.
>
> Oct 2 10:33:59 clientNode1 PVFS2: [E] msgpair failed, will retry:
> Operation cancelled (possibly due to timeout)
>
> Oct 2 10:34:20 clientNode1 PVFS2: [E] Error: failed removing one or
> more datafiles associated with the meta handle 1252698765
>
> Oct 2 10:34:20 clientNode1 PVFS2: [E] WARNING: PVFS_sys_remove()
> encountered an error which may lead to inconsistent state: No such
> file or directory
>
> Oct 2 10:34:20 clientNode1 PVFS2: [E] WARNING: PVFS2 fsck (if
> available) may be needed.
>