[Pvfs2-developers] Getting errors when using mpich2-1.0.5p3+pvfs2.6.0

Sam Lang slang at mcs.anl.gov
Wed Aug 15 16:55:35 EDT 2007


Hi Christina,

Job timeouts are sometimes caused by timeout values that are set too
low for the particular system, especially on older setups.  You can
try increasing the timeouts in fs.conf (ClientJobFlowTimeoutSecs); the
value usually defaults to 300 seconds (5 minutes).  It may also be that
there are failures on the servers and they're not returning responses
to the client.  Do you see any messages in the server logs?  Can you
verify that the servers are still running after you see this error?
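
If it helps, here is a rough sketch of scripting the timeout change.
It assumes the option appears on its own line as
"ClientJobFlowTimeoutSecs <seconds>"; the sample text, the <Defaults>
wrapper, the set_option helper, and the value 600 are all illustrative
assumptions, so check them against your real fs.conf before relying
on this.

```python
import re

def set_option(conf_text, name, value):
    """Return conf_text with the line `name <old>` rewritten to `name <value>`.

    Raises KeyError if no line starts with the option name.
    """
    # Capture leading indentation, the option name, and the separator;
    # only the old value (\S+) is replaced.
    pattern = re.compile(r"^([ \t]*" + re.escape(name) + r"[ \t]+)\S+", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), conf_text)
    if count == 0:
        raise KeyError(name)
    return new_text

# Hypothetical excerpt, not copied from a real config file.
SAMPLE = """<Defaults>
    ClientJobFlowTimeoutSecs 300
</Defaults>
"""

# 600 is an arbitrary example value, not a recommendation.
updated = set_option(SAMPLE, "ClientJobFlowTimeoutSecs", 600)
print(updated)
```

The helper raises KeyError rather than appending the option when it is
missing, so a misspelled name does not silently leave the old timeout
in place.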

-sam

On Aug 15, 2007, at 3:17 PM, Christina Patrick wrote:

> Hi Everybody,
>
> I have recently been facing some problems when using mpich2 and pvfs2.
> My programs worked fine earlier, and I did not face any problems
> while executing them. All of a sudden, when I run my programs now
> on a reconfigured setup (8 IO servers, 8 clients and 4 metadata
> servers), I get the error messages below. I have browsed through
> the forums and there have been similar reports before. However, I
> couldn't really figure out if anybody got a solution to the
> problem. I generally get the error when I scale the number of
> instances running to 16 or 32.
>
>
>
> 6: [E 06:13:46.025685] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 67.
> 6: [E 06:13:46.025976] fp_multiqueue_cancel: flow proto cancel called on 0x8cebcac
> 6: [E 06:13:46.026004] handle_io_error: flow proto error cleanup started on 0x8cebcac, error_code: -1610613121
> 6: [E 06:13:46.026099] handle_io_error: flow proto 0x8cebcac canceled 1 operations, will clean up.
> 6: [E 06:13:46.026138] handle_io_error: flow proto 0x8cebcac error cleanup finished, error_code: -1610613121
> 11: [E 06:13:46.075671] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 71.
> 11: [E 06:13:46.075994] fp_multiqueue_cancel: flow proto cancel called on 0x96f3aac
> 11: [E 06:13:46.076022] handle_io_error: flow proto error cleanup started on 0x96f3aac, error_code: -1610613121
> 11: [E 06:13:46.076117] handle_io_error: flow proto 0x96f3aac canceled 1 operations, will clean up.
> 11: [E 06:13:46.076152] handle_io_error: flow proto 0x96f3aac error cleanup finished, error_code: -1610613121
> 14: [E 06:19:45.563289] handle_io_error: flow proto error cleanup started on 0x9c6349c, error_code: -1073741973
>
>
> I would appreciate any help and suggestions that you all can offer.
>
> Regards,
> Christina.
