[PVFS2-developers] job timeouts

Phil Carns pcarns at parl.clemson.edu
Mon Jul 12 11:59:55 EDT 2004


On Friday 09 July 2004 19:29, neillm at mcs.anl.gov wrote:
> On Fri, Jul 09, 2004 at 12:49:43PM -0400, Pete Wyckoff wrote:
> >     6 clients, 2 servers:   450 MB ok,  470 MB bad
> >     6 clients, 1 server :   200 MB ok,  250 MB bad
> >     4 clients, 1 server :   300 MB ok,  350 MB bad
> >     1 client,  1 server :  1200 MB ok, 1300 MB bad
> >
> > The size figure is how much data each client writes to a non-overlapping
> > shared file.  Note how the failure point seems to scale linearly with
> > number of clients in the 1-server case.  The clients complain like this:
>
> Pete -- thanks for getting this data!  Much appreciated.
>
> >     io_datafile_complete_operations failed: Operation cancelled (possibly
> >     due to timeout)
> >     *** error path with 0 msgpairs pending, 1 flows pending, 1 write acks
> >     pending
>
> This looks OK to me in the general cancellation case.
>
> > Having narrowed it down as above, I turned on debugging to track it
> > down to the job timeout thing:
> >
> >     PVFS_isys_io calling PINT_client_state_machine_test()
> >     job_timer: expiring job!
> >     Job timer: cancelling bmi.
> >     bmi canceling: 36
> >     job_timer: expiring job!
> >     ...
>
> ... But this doesn't.  Phil can comment better here on the (likely
> intentional but unqualified) client operation timeout cancellation
> you're seeing, but there may be another problem here in that I didn't
> consider this case when re-working the sys-io for cancellation
> handling.  If sys-io can have internal jobs cancelled outside of the
> io_cancel method, we have to make some changes (in that sys-io has to
> be aware of this; set op_cancelled flag, update context phases and
> flags, etc).  It shouldn't be very difficult, but it'll take a second
> to nail down the specifics.  I suspect the segfault you're seeing is
> exactly because of this behaviour.  Phil?

I don't think that this should be a problem.  When a job gets cancelled via
the job timeout mechanism, it pops out with a negative error code, just as
if the job had failed altogether.  It's just that the error code happens to
be -PVFS_ECANCELED.  I don't really think it should have to be handled as a
special case.  We should maybe test this out some more by setting the job
timeout to something extremely low (like 1 second) and watching what it does
when cancellations happen frequently.
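
To make the "no special case" point concrete, here is a minimal sketch in
Python (not PVFS2 code).  The numeric value of PVFS_ECANCELED, the PVFS_EIO
constant, and the complete_job helper are all hypothetical and used only for
illustration; the idea is simply that any negative code takes the same error
path, whether it came from a timeout cancellation or from another failure.

```python
# Hypothetical stand-ins: the real constants are defined in the PVFS2 headers.
PVFS_ECANCELED = 125  # assumed value, illustration only
PVFS_EIO = 5          # assumed value, illustration only


def complete_job(error_code, pending_jobs):
    """Generic completion handler (hypothetical helper).

    Any negative error code goes down the same error path.  A timeout
    cancellation is just one more negative code, so it needs no branch
    of its own; it is only recorded for reporting.
    """
    if error_code < 0:
        pending_jobs.clear()  # same cleanup for every kind of failure
        return {
            "status": "error",
            "code": error_code,
            "cancelled": error_code == -PVFS_ECANCELED,
        }
    return {"status": "ok", "code": error_code, "cancelled": False}


if __name__ == "__main__":
    print(complete_job(-PVFS_ECANCELED, ["flow", "write-ack"]))
    print(complete_job(-PVFS_EIO, ["flow"]))
    print(complete_job(0, []))
```

The only difference between the first two calls is the value stored under
"cancelled"; the control flow is identical.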

I'm curious as to why the jobs are timing out in the first place.  I wonder if
the server-side request scheduler is stalling something for longer than our
default client-side timeout in some cases.

For example, maybe a getattr operation is waiting on a big I/O operation to
finish before it can continue.  The client-side job timeout is currently set
to 30 seconds, so if an I/O operation (or a series of them that got queued
before the hypothetical getattr) takes longer than that to finish, it
would cause a job timeout on the client side.
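
As a rough check on the numbers quoted earlier in the thread, the sketch
below (Python, not PVFS2 code) converts the per-client sizes into total data
per server.  The per_server_mb helper is hypothetical, and dividing evenly
across servers is an assumption about how the data is distributed.

```python
# (clients, servers, MB per client that worked, MB per client that failed),
# copied from the figures quoted earlier in the thread.
RUNS = [
    (6, 2, 450, 470),
    (6, 1, 200, 250),
    (4, 1, 300, 350),
    (1, 1, 1200, 1300),
]


def per_server_mb(clients, servers, mb_per_client):
    """Total MB each server receives, assuming an even split across servers."""
    return clients * mb_per_client / servers


if __name__ == "__main__":
    for clients, servers, ok_mb, bad_mb in RUNS:
        print(
            f"{clients} clients, {servers} servers: "
            f"ok at {per_server_mb(clients, servers, ok_mb):.0f} MB/server, "
            f"bad at {per_server_mb(clients, servers, bad_mb):.0f} MB/server"
        )
```

Under that assumption every run starts failing somewhere between roughly
1200 and 1500 MB per server, which is at least consistent with the stall
being on the server side rather than depending on the number of clients.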

In this case, it is an I/O operation that is timing out.  But that could
happen fairly easily too if a meta operation on a datafile (such as checking
file size) got queued _between_ a couple of I/O operations, thereby
serializing some of the items in the request scheduler.  I/O operations are
only allowed to operate concurrently with other I/O operations; they get
queued behind anything else.
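
The queueing rule described above can be illustrated with a toy model in
Python.  This is a sketch under stated assumptions, not the actual request
scheduler: the schedule and timed_out helpers, the operation names, and the
durations are all made up, every operation is assumed to be queued at t=0,
and concurrent I/O operations are assumed not to slow each other down.

```python
CLIENT_TIMEOUT_S = 30  # client-side job timeout mentioned above


def schedule(ops):
    """Toy model of the queueing rule.

    ops: list of (name, kind, duration_s) in arrival order, kind is "io"
    or "meta".  Consecutive I/O operations run concurrently; anything else
    waits for everything queued before it, and later operations wait for it.
    Returns a list of (name, start_s, finish_s).
    """
    results = []
    barrier = 0.0    # earliest start time for the current group
    group_end = 0.0  # latest finish time seen so far
    prev_kind = None
    for name, kind, duration in ops:
        if not (kind == "io" and prev_kind == "io"):
            # Not joining a running I/O group: wait for all earlier work.
            barrier = group_end
        start = barrier
        finish = start + duration
        group_end = max(group_end, finish)
        results.append((name, start, finish))
        prev_kind = kind
    return results


def timed_out(results, timeout_s=CLIENT_TIMEOUT_S):
    """Names of operations that finish after the client timeout."""
    return [name for name, _start, finish in results if finish > timeout_s]


if __name__ == "__main__":
    back_to_back = [("write-1", "io", 20), ("write-2", "io", 20)]
    with_meta = [("write-1", "io", 20), ("getsize", "meta", 1),
                 ("write-2", "io", 20)]
    print(timed_out(schedule(back_to_back)))  # no timeouts
    print(timed_out(schedule(with_meta)))     # write-2 finishes at t=41
```

With two 20-second writes queued back to back, both finish at t=20.  Putting
a 1-second meta operation between them pushes the second write out to t=41,
past the 30-second timeout, even though no single operation is slow.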

-Phil

