[PVFS2-developers] job timeouts
neillm at mcs.anl.gov
Mon Jul 12 11:21:45 EDT 2004
On Mon, Jul 12, 2004 at 10:59:55AM +0000, Phil Carns wrote:
> I don't think that this should be a problem. When a job gets
> cancelled via the job timeout mechanism, it pops out with a negative
> error code, just like if the job had failed altogether. It's just
> that the error code happens to be -PVFS_ECANCELED. I don't really
> think it should have to be handled as a special case. We should
> maybe test this out some more by setting the job timeout to
> something extremely low (like 1 second or something) and watch what
> it does when cancellations happen frequently.
Ah, yes of course -- thanks for the reminder!
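For reference, the handling Phil describes amounts to something like the
rough sketch below. The names job_failed, job_status_str, EX_ECANCELED and
EX_EIO are hypothetical stand-ins made up for illustration (the real
PVFS_ECANCELED value lives in the PVFS2 headers); the point is only that a
cancelled job takes the same "negative error code" path as any other
failure.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
 * real PVFS2 error values. */
#define EX_ECANCELED 125
#define EX_EIO       5

/* A job has failed whenever its error code is negative. A job
 * cancelled by the timeout mechanism is not a special case. */
static int job_failed(int error_code)
{
    return error_code < 0;
}

/* Optional: distinguish a cancellation for logging purposes only. */
static const char *job_status_str(int error_code)
{
    assert(error_code <= 0); /* completed jobs report 0 or a negative code */
    if (error_code == 0)
        return "ok";
    if (error_code == -EX_ECANCELED)
        return "failed (cancelled by timeout)";
    return "failed";
}
```

A caller would just check job_failed() and clean up as usual; the string
helper is only there to make cancellations visible in logs.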
> In this case, it is an I/O operation that is timing out. But that could
> happen fairly easily too if a meta operation on a datafile (such as checking
> file size) got queued _between_ a couple of I/O operations, therefore
> serializing some of the items in the request scheduler. I/O operations are
> only allowed to operate concurrently with other I/O operations; they get
> queued behind anything else.
You're right. Keep in mind that Pete's commented out the
TROVE_AIO_THREADED define as well (right?), which slows down I/O
*significantly* on the server. This could be taking longer than we
think when heavily loaded, since we rarely test in this configuration.
Also, all I/O requests are serialized in trove in this mode --
regardless of the request scheduler. Given this, I think it's okay to
start by assuming these timeouts really are being tripped on the
server due to delay.
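
To make the request scheduler rule Phil mentions concrete, here is a
minimal sketch of the ordering check for operations queued on one handle.
It is an illustration under assumptions, not the actual request scheduler
code: op_kind, OP_IO, OP_META and may_run are made-up names, and the queue
is modelled as a plain array in arrival order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation classes for illustration. */
enum op_kind { OP_IO, OP_META };

/* Returns 1 if the operation at position idx may run now, given the
 * operations queued ahead of it on the same handle. The head of the
 * queue always runs. An I/O operation may also run if everything
 * ahead of it is I/O; anything else waits its turn. */
static int may_run(const enum op_kind *queue, size_t idx)
{
    size_t i;

    assert(queue != NULL);
    if (idx == 0)
        return 1;
    if (queue[idx] != OP_IO)
        return 0;
    for (i = 0; i < idx; i++) {
        if (queue[i] != OP_IO)
            return 0; /* a non-I/O operation ahead serializes us */
    }
    return 1;
}
```

With a queue of { I/O, meta, I/O }, only the first I/O runs; the second
I/O is stuck behind the meta operation, which is the serialization case
described above.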
-Neill.