[PVFS2-developers] job timeouts

Pete Wyckoff pw at osc.edu
Fri Jul 9 13:49:43 EDT 2004


I was trying again just now to run performance benchmarks for large
files from large numbers of clients with MPI-IO on the current CVS tree.
I ran into failures for certain configurations, though, and narrowed
them down to a few manageable cases.  These values are for TCP on
Ethernet, running good old "perf" with local disks on the servers:

    6 clients, 2 servers:   450 MB ok,  470 MB bad
    6 clients, 1 server :   200 MB ok,  250 MB bad
    4 clients, 1 server :   300 MB ok,  350 MB bad
    1 client,  1 server :  1200 MB ok, 1300 MB bad

The size figure is how much data each client writes to its own
non-overlapping region of a shared file.  Note how, in the 1-server
case, the per-client failure point scales inversely with the number of
clients: the largest working total is 1200 MB in every 1-server row.
The clients complain like this:

    io_datafile_complete_operations failed: Operation cancelled (possibly
    due to timeout)
    *** error path with 0 msgpairs pending, 1 flows pending, 1 write acks
    pending

and sometimes the run goes to completion, albeit very slowly, and
sometimes the clients SEGV.  The failures are very repeatable here at
the sizes listed above, and do not seem tied to any particular nodes.
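
To make the scaling in the table above easier to see, here is a small
sketch that multiplies out the per-client sizes into total data per
server.  The even split across servers is an assumption (the actual
distribution depends on the file's striping), and the function name
per_server_mb is just illustrative:

```python
# (clients, servers, largest per-client MB that worked, smallest that failed),
# copied from the table above.
runs = [
    (6, 2, 450, 470),
    (6, 1, 200, 250),
    (4, 1, 300, 350),
    (1, 1, 1200, 1300),
]


def per_server_mb(clients, servers, per_client_mb):
    """Total data written by all clients, split evenly across servers.

    The even split is an assumption, not something measured.
    """
    return clients * per_client_mb / servers


for clients, servers, ok_mb, bad_mb in runs:
    ok = per_server_mb(clients, servers, ok_mb)
    bad = per_server_mb(clients, servers, bad_mb)
    print(f"{clients} clients, {servers} server(s): "
          f"ok at {ok:.0f} MB/server, bad at {bad:.0f} MB/server")
```

For the three 1-server rows the largest working total comes out to
1200 MB each time, and the smallest failing total is between 1300 and
1500 MB.  The 2-server row still works at 1350 MB per server under the
even-split assumption, so the per-server figure is only a rough guide
there.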

Having narrowed it down as above, I turned on debugging and traced the
failures to the job timeout code:

    PVFS_isys_io calling PINT_client_state_machine_test()
    job_timer: expiring job!
    Job timer: cancelling bmi.
    bmi canceling: 36
    job_timer: expiring job!
    ...

If I disable __job_time_mgr_add(), as if all timeouts were INF,
everything seems to work fine.  Why does the timeout cause failures for
large runs?  Does anyone else see issues running these sorts of tests?
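
One way to picture the behavior is the toy model below.  It is only a
guess at a mechanism, not PVFS2's job timer code: it assumes a job is
expired after a fixed timeout even if the server is still draining
data queued ahead of it.  The drain rate and timeout are made-up
numbers, not measurements, and job_expires and completion_time are
hypothetical helpers:

```python
import math


def completion_time(queued_mb, drain_mb_per_s):
    """Seconds until a job queued behind queued_mb of data would finish."""
    return queued_mb / drain_mb_per_s


def job_expires(wait_seconds, timeout_seconds):
    """True if the timer would fire before the job completes."""
    return wait_seconds > timeout_seconds


# Hypothetical values, chosen only for illustration.
DRAIN_MB_PER_S = 40.0
TIMEOUT_S = 30.0

for queued_mb in (1000, 1200, 1300):
    wait = completion_time(queued_mb, DRAIN_MB_PER_S)
    print(f"{queued_mb} MB queued: "
          f"fixed timeout expires={job_expires(wait, TIMEOUT_S)}, "
          f"infinite timeout expires={job_expires(wait, math.inf)}")
```

In this model a fixed timeout turns into a cap on how much data can be
queued, regardless of how many clients produced it, while an infinite
timeout never expires anything.  Whether that is what actually happens
in the job code is exactly the open question.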

		-- Pete

