[PVFS2-developers] job timeouts
Phil Carns
pcarns at parl.clemson.edu
Mon Jul 12 11:59:55 EDT 2004
On Friday 09 July 2004 19:29, neillm at mcs.anl.gov wrote:
> On Fri, Jul 09, 2004 at 12:49:43PM -0400, Pete Wyckoff wrote:
> > 6 clients, 2 servers: 450 MB ok, 470 MB bad
> > 6 clients, 1 server : 200 MB ok, 250 MB bad
> > 4 clients, 1 server : 300 MB ok, 350 MB bad
> > 1 client, 1 server : 1200 MB ok, 1300 MB bad
> >
> > The size figure is how much data each client writes to a non-overlapping
> > shared file. Note how the failure point seems to scale linearly with
> > number of clients in the 1-server case. The clients complain like this:
>
> Pete -- thanks for getting this data! Much appreciated.
>
> > io_datafile_complete_operations failed: Operation cancelled (possibly
> > due to timeout)
> > *** error path with 0 msgpairs pending, 1 flows pending, 1 write acks
> > pending
>
> This looks OK to me in the general cancellation case.
>
> > Having narrowed it down as above, I turned on debugging to track it
> > down to the job timeout thing:
> >
> > PVFS_isys_io calling PINT_client_state_machine_test()
> > job_timer: expiring job!
> > Job timer: cancelling bmi.
> > bmi canceling: 36
> > job_timer: expiring job!
> > ...
>
> ... But this doesn't. Phil can comment better here on the (likely
> intentional but unqualified) client operation timeout cancellation
> you're seeing, but there may be another problem here in that I didn't
> consider this case when re-working the sys-io for cancellation
> handling. If sys-io can have internal jobs cancelled outside of the
> io_cancel method, we have to make some changes (in that sys-io has to
> be aware of this; set op_cancelled flag, update context phases and
> flags, etc). It shouldn't be very difficult, but it'll take a second
> to nail down the specifics. I suspect the segfault you're seeing is
> exactly because of this behaviour. Phil?
I don't think that this should be a problem. When a job gets cancelled via
the job timeout mechanism, it pops out with a negative error code, just as
if the job had failed altogether. It's just that the error code happens to be
-PVFS_ECANCELED. I don't really think it should have to be handled as a
special case. We should maybe test this out some more by setting the job
timeout to something extremely low (like 1 second) and watching
what it does when cancellations happen frequently.
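To make the "no special case" point concrete, here is a minimal sketch of a
completion check that treats every negative code the same way. It is not the
actual sys-io code: DEMO_ECANCELED and demo_check_job() are hypothetical names,
and the numeric value is a placeholder rather than the real PVFS_ECANCELED
definition.

```c
#include <stdio.h>

/* Hypothetical stand-in for PVFS_ECANCELED; the value is a placeholder,
 * not the definition from the PVFS2 headers. */
#define DEMO_ECANCELED 125

enum demo_outcome { DEMO_OK, DEMO_FAILED };

/* Any negative code takes the same failure path. Cancellation is only
 * distinguished in the log message, not in the control flow. */
static enum demo_outcome demo_check_job(int error_code)
{
    if (error_code < 0) {
        fprintf(stderr, "job failed: code %d%s\n", error_code,
                error_code == -DEMO_ECANCELED
                    ? " (cancelled, possibly due to timeout)"
                    : "");
        return DEMO_FAILED;
    }
    return DEMO_OK;
}
```

With this shape, a job expired by the timer and a job that failed outright
both return DEMO_FAILED to the caller, which is the behaviour described above.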
I'm curious as to why the jobs are timing out in the first place. I wonder if
the server-side request scheduler is stalling something for longer than our
default client-side timeout in some cases.
For example, maybe a getattr operation is waiting on a big I/O operation to
finish before it can continue. The client-side job timeout is currently set
to 30 seconds, so if an I/O operation (or a series of them that got queued
before the hypothetical getattr) takes longer than that to finish, it
would cause a job timeout on the client side.
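The arithmetic behind that scenario can be sketched as follows. This is only
an illustration: demo_would_time_out() is a hypothetical helper, the durations
are made up, and it makes the pessimistic assumption that everything queued
ahead of the operation runs one after another. Only the 30-second figure comes
from the current client setting.

```c
#include <stddef.h>

/* Client-side job timeout mentioned above, in seconds. */
#define DEMO_CLIENT_JOB_TIMEOUT_SECS 30.0

/* Returns 1 if an operation that needs own_secs of service, queued behind
 * n operations with the given service times, would finish after the client
 * timeout. Assumes the queued operations run strictly one after another. */
static int demo_would_time_out(const double *queued_secs, size_t n,
                               double own_secs)
{
    double total = own_secs;
    for (size_t i = 0; i < n; i++) {
        total += queued_secs[i];
    }
    return total > DEMO_CLIENT_JOB_TIMEOUT_SECS;
}
```

For instance, a 0.5 second getattr queued behind I/O operations taking 12 and
15 seconds finishes at 27.5 seconds and survives; add one more 8 second I/O
ahead of it and it finishes at 35.5 seconds, past the timeout.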
In this case, it is an I/O operation that is timing out. But that could
happen fairly easily too if a meta operation on a datafile (such as checking
file size) got queued _between_ a couple of I/O operations, therefore
serializing some of the items in the request scheduler. I/O operations are
only allowed to operate concurrently with other I/O operations; they get
queued behind anything else.
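A toy model of that queueing rule, to show how one small meta operation can
push a later I/O past the timeout. This is not the request scheduler code:
the types and demo_schedule() are hypothetical, all operations are assumed to
target the same datafile, and concurrent I/O operations are assumed not to
slow each other down.

```c
#include <stddef.h>

enum demo_kind { DEMO_IO, DEMO_META };

struct demo_op {
    enum demo_kind kind;
    double secs;            /* assumed service time once the op starts */
};

/* Fills finish[i] with the modeled completion time of ops[i], taking ops in
 * arrival order. Consecutive I/O ops start together; a meta op waits for
 * everything before it, and later ops wait for the meta op. */
static void demo_schedule(const struct demo_op *ops, size_t n, double *finish)
{
    double now = 0.0;       /* time at which the next batch may start */
    size_t i = 0;

    while (i < n) {
        if (ops[i].kind == DEMO_META) {
            now += ops[i].secs;
            finish[i] = now;
            i++;
        } else {
            double longest = 0.0;
            size_t j = i;
            /* Run the whole run of adjacent I/O ops concurrently. */
            while (j < n && ops[j].kind == DEMO_IO) {
                finish[j] = now + ops[j].secs;
                if (ops[j].secs > longest) {
                    longest = ops[j].secs;
                }
                j++;
            }
            now += longest;
            i = j;
        }
    }
}
```

In this model two 20 second I/O operations queued back to back both finish at
20 seconds. Put a 1 second meta operation between them and the second I/O
finishes at 41 seconds, which is past a 30 second client timeout.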
-Phil