[Pvfs2-developers] Operation cancelled (possibly due to timeout) error

Phil Carns carns at mcs.anl.gov
Fri Oct 17 10:46:56 EDT 2008


Hi Brian,

The second error message that you reported (final-response.sm line 127) 
is a minor bug that was triggered by the timeouts.  That has been fixed 
in CVS, and you can find a description and patch at the following links 
if you want to try it:

http://www.pvfs.org/fisheye/changelog/PVFS/?cs=MAIN:pcarns:20081008183827
http://www.pvfs.org/fisheye/rdiff/PVFS?csid=MAIN:pcarns:20081008183827&u&N

Your main error messages do seem to indicate that something on your 
system isn't keeping up, though.  It is hard to tell if it is the 
network or the disk that stalled, but your information about the SAN 
does seem to implicate the disk.

I have two configuration suggestions that you can try.  First, modify 
the <StorageHints> section in the server configuration file to include 
this line:

    TroveMethod alt-aio

That will switch the disk I/O method in PVFS to a faster mechanism. 
This worked fine in the 2.7.1 release but it was not yet the default then.
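
As a rough sketch of that change, the section could end up looking like
the fragment below.  The TroveSyncData line is the setting discussed
later in this mail; any other lines already present in your
<StorageHints> section are omitted here and should stay as they are.

    <StorageHints>
        TroveSyncData no
        TroveMethod alt-aio
    </StorageHints>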

Second, as far as timeouts are concerned, I would start by increasing 
"ServerJobFlowTimeoutSecs" from 30 seconds to maybe 300.
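
A minimal sketch of that change follows.  It assumes the timeout
settings sit in the <Defaults> section of the server configuration
file, which is an assumption about your file's layout, so check where
the keyword appears in yours.  Other lines in the section are omitted
and left unchanged.

    <Defaults>
        ServerJobFlowTimeoutSecs 300
    </Defaults>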

None of the things that you are seeing will harm your data, but your 
system certainly won't perform very well.

One final configuration option that you can try is changing 
"TroveSyncData no" to "TroveSyncData yes".  I would suggest saving that 
one for last after you have resolved your timeout problem, and then try 
your benchmark with both settings.  On some SANs you may have better 
performance with syncing enabled but I don't know how to find out except 
to test it and see.
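
For that comparison run, the only line that changes is the sync
setting.  A sketch, assuming the alt-aio change suggested above is
already in place and with other lines in the section omitted:

    <StorageHints>
        TroveSyncData yes
        TroveMethod alt-aio
    </StorageHints>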

Good luck, and let us know what you find.

-Phil

brain at autistici.org wrote:
> Hello,
> 
>  I am continuing my pvfs2 tests and have found another problem. I
> will describe the new configuration, since it has changed since
> my last mail:
> 
> - 1 host with a ~30 GB ext3 slice mounted from a SAN via QLogic
>   FC, acting as metadata server and client;
> - 5 hosts with a ~400 GB ext3 slice each, mounted as above, acting
>   as I/O servers and clients;
> - 24 hosts acting as clients only.
> - Debian 4.0, kernel 2.6.24, pvfs2 module 2.7.1.
> 
> The test I am doing runs the following on every machine in the
> cluster (29 in total) except the metadata server:
> 
> iozone -Cce -s8g -r256k -i0 -i1 -t4 -F /mnt/test/iozone{1,2,3,4}
> 
> which should write 32 GB of data in 4 files of 8 GB each, then
> rewrite and read that data.
> 
> While the machines are still in the first test (writing), the
> server logs start filling with the following:
> 
> ----- cut here -----
> [E 10/15 15:51] handle_io_error: flow proto error cleanup started
> on 0x2aaab40252c0: Operation cancelled (possibly due to timeout)
> 
> [E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
> canceled 3 operations, will clean up.
> 
> [E 10/15 15:51] handle_io_error: flow proto 0x2aaab40252c0
> error cleanup finished: Operation cancelled (possibly due to timeout)
> ----- cut here -----
> 
> Only once I found the following too:
> 
> ----- cut here -----
> [E 10/15 15:54] src/server/final-response.sm line 127: Error: PINT_encode()
> failure.
> [E 10/15 15:54]         [bt] pvfs2-server [0x4507fb]
> [E 10/15 15:54]         [bt] pvfs2-server(PINT_state_machine_invoke+0xe8)
> [0x440d28]
> [E 10/15 15:54]         [bt] pvfs2-server(PINT_state_machine_next+0xc9)
> [0x441049]
> [E 10/15 15:54]         [bt] pvfs2-server(PINT_state_machine_continue+0x1e)
> [0x440b9e]
> [E 10/15 15:54]         [bt] pvfs2-server(main+0xe3e) [0x41215e]
> [E 10/15 15:54]         [bt] /lib/libc.so.6(__libc_start_main+0xe6)
> [0x2b3b3d1811a6]
> [E 10/15 15:54]         [bt] pvfs2-server [0x40f7d9]
> [E 10/15 15:54] Server Response 0x2aaaac032690 is of type:
> PVFS_SERV_SMALL_IO
> [E 10/15 15:54] FIXME: unimplemented resp type to print
> ----- cut here -----
> 
> I can see the above errors only on three of the five I/O servers.
> Precisely those servers which are using a 'logical volume' from the same
> 'virtual disk' in the SAN. The other two I/O servers are using a
> different virtual disk and show no errors.
> 
> Is it possible that the error reported by pvfs is actually a SAN/FC related
> error, indicating that the SAN is overloaded? This would explain
> why only three servers are having problems...
> 
> Is there any chance these errors can harm the data being written?
> 
> If increasing the timeout is the solution, which of the following
> parameters should I modify: ServerJobBMITimeoutSecs,
> ServerJobFlowTimeoutSecs, ClientJobBMITimeoutSecs,
> ClientJobFlowTimeoutSecs, ClientRetryLimit, ClientRetryDelayMilliSecs?
> 
> 
> Thank you very much,
> 
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


