[PVFS2-developers] job timeouts

neillm at mcs.anl.gov neillm at mcs.anl.gov
Mon Jul 12 11:21:45 EDT 2004


On Mon, Jul 12, 2004 at 10:59:55AM +0000, Phil Carns wrote:
> I don't think that this should be a problem.  When a job gets
> cancelled via the job timeout mechanism, it pops out with a negative
> error code, just like if the job had failed altogether.  It's just
> that the error code happens to be -PVFS_ECANCELED.  I don't really
> think it should have to be handled as a special case.  We should
> maybe test this out some more by setting the job timeout to
> something extremely low (like 1 second or something) and watch what
> it does when cancellations happen frequently.

Ah, yes of course -- thanks for the reminder!
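
As a rough illustration of the point above (not the actual PVFS2 code),
a completion handler that treats every negative code the same way needs
no extra branch for cancellation.  The constant values and function
names below are hypothetical stand-ins; the real PVFS2 error values are
not reproduced here.

```c
#include <assert.h>

/* Hypothetical placeholder values, NOT the real PVFS2 definitions. */
enum { SKETCH_EIO = 5, SKETCH_ECANCELED = 125 };

enum next_step { STEP_CONTINUE = 0, STEP_CLEANUP = 1 };

/*
 * Decide what to do after a job completes.  Any negative error code
 * takes the same cleanup path, so a job cancelled by the timeout
 * mechanism (-SKETCH_ECANCELED) is handled exactly like one that
 * failed outright (-SKETCH_EIO).
 */
static enum next_step after_job(int error_code)
{
    if (error_code < 0) {
        return STEP_CLEANUP;   /* failure of any kind, including cancel */
    }
    return STEP_CONTINUE;      /* zero or positive: job succeeded */
}
```

Calling after_job(-SKETCH_ECANCELED) and after_job(-SKETCH_EIO) both
yield STEP_CLEANUP, which is the behavior the low-timeout test would
exercise.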

> In this case, it is an I/O operation that is timing out.  But that could 
> happen fairly easily too if a meta operation on a datafile (such as checking 
> file size) got queued _between_ a couple of I/O operations, therefore 
> serializing some of the items in the request scheduler.  I/O operations are 
> only allowed to operate concurrently with other I/O operations; they get 
> queued behind anything else.

You're right.  Keep in mind that Pete's commented out the
TROVE_AIO_THREADED define as well (right?), which slows down I/O
*significantly* on the server.  This could be taking longer than we
think when heavily loaded, since we rarely test in this configuration.
Also, all I/O requests are serialized in trove in this mode --
regardless of the request scheduler.  Given this, I think it's okay to
start by assuming these timeouts really are being tripped on the
server due to delay.
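
For reference, the request scheduler rule Phil describes can be
sketched roughly as follows.  This is a simplified model of one
per-datafile queue, assuming only the rule stated above (I/O may
overlap with other I/O, everything else is serialized); the type and
function names are made up for illustration and are not the real
scheduler API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation kinds for a single datafile's queue. */
enum op_kind { OP_IO, OP_META };

/*
 * Return 1 if queue[idx] may be serviced now, 0 if it must wait.
 * The head of the queue can always run.  An I/O operation further
 * back can run only if everything ahead of it is also I/O; any
 * non-I/O operation waits until it reaches the head.
 */
static int can_run(const enum op_kind *queue, size_t len, size_t idx)
{
    size_t i;

    assert(idx < len);
    if (idx == 0) {
        return 1;
    }
    if (queue[idx] != OP_IO) {
        return 0;
    }
    for (i = 0; i < idx; i++) {
        if (queue[i] != OP_IO) {
            return 0;   /* queued behind a non-I/O operation */
        }
    }
    return 1;
}
```

With a queue of { OP_IO, OP_META, OP_IO }, only the first entry is
runnable: the size check waits for the first I/O, and the second I/O
waits for the size check.  With trove serializing I/O as well, even
the { OP_IO, OP_IO } case would not overlap in practice.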

-Neill.

