[PVFS2-developers] job timeouts
pw at osc.edu
Mon Jul 12 16:22:21 EDT 2004
pcarns at parl.clemson.edu wrote on Mon, 12 Jul 2004 10:59 +0000:
> I'm curious as to why the jobs are timing out in the first place. I wonder if
> the server side request scheduler is stalling something for longer than our
> default client side timeout in some cases.
> For example, maybe a getattr operation is waiting on a big I/O operation to
> finish before it can continue. The client side job timeout is currently set
> to 30 seconds, so if an I/O operation (or a series of them that got queued
> before the hypothetical getattr) take longer than that to finish then it
> would cause a job timeout on the client side.
Looking more carefully at the client log, I can't see much of a pattern,
but I don't think it is this hypothetical getattr interleaving scenario.
The code does just this:
MPI_File_write(, ..., 1500 MB, ..)
and it is only 1 client and 1 server.
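
For comparison, the getattr scenario quoted above amounts to simple queueing arithmetic. Here is a minimal sketch of it, assuming the server runs queued operations strictly in order and the client posted them all at the same moment. The function name and the service times are hypothetical, and this is not PVFS2 code; only the 30 second timeout comes from the thread.

```c
#include <stddef.h>

/* From the thread: the client-side job timeout is currently 30 seconds. */
#define CLIENT_JOB_TIMEOUT_SECS 30.0

/*
 * Hypothetical model, not PVFS2 code. The server is assumed to run queued
 * operations one after another, and the client is assumed to have posted
 * all of them at time 0. service_secs[i] is how long operation i takes
 * once it reaches the head of the queue.
 *
 * Returns 1 if operation idx finishes after timeout_secs, 0 if it finishes
 * in time, and -1 if idx is out of range.
 */
int queued_op_times_out(const double *service_secs, size_t n_ops,
                        size_t idx, double timeout_secs)
{
    if (idx >= n_ops)
        return -1;

    /* Completion time is the sum of everything queued up to and including idx. */
    double finish = 0.0;
    for (size_t i = 0; i <= idx; i++)
        finish += service_secs[i];

    return finish > timeout_secs ? 1 : 0;
}
```

With three made-up 14 second I/O operations queued ahead of a 0.01 second getattr, the getattr finishes at about 42 seconds and so times out on the client, even though it is cheap on its own.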
The verbose client log shows a long series of 256 kB writes (each done as 5
or 6 TCP writes underneath), each followed immediately by a testcontext
that completes an earlier request, all at a steady pace. But around the
time it completes request #5385, it times out on a much earlier one, #36.
Then later, after completing #9950, it notices that #6083 had failed. These
are all I/O write operations, part of the same single pvfs2_write call.
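
One way to picture that pattern: a request stays pending while later ones keep completing, and it is only flagged once its age passes the timeout. Here is a rough sketch of such a check. The struct, the function name, and the timestamps used with it are hypothetical and are not taken from PVFS2 or from the log.

```c
#include <stddef.h>

/* Hypothetical bookkeeping record; not a PVFS2 structure. */
struct pending_req {
    int id;            /* request number as it would appear in a log */
    double posted_at;  /* seconds since start when the request was posted */
    int done;          /* nonzero once the request has completed */
};

/*
 * Returns the id of the oldest request that is still pending and whose age
 * at time `now` exceeds timeout_secs, or -1 if there is no such request.
 */
int oldest_expired_req(const struct pending_req *reqs, size_t n,
                       double now, double timeout_secs)
{
    int found = -1;
    double oldest = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].done)
            continue;                       /* completed requests never expire */
        if (now - reqs[i].posted_at <= timeout_secs)
            continue;                       /* still within the timeout window */
        if (found == -1 || reqs[i].posted_at < oldest) {
            found = reqs[i].id;
            oldest = reqs[i].posted_at;
        }
    }
    return found;
}
```

If #36 was posted at 1.0 s and never completed, a check made at 31.5 s with a 30 second timeout reports #36, no matter how many later requests have finished in the meantime.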
I put the verbose client log at http://www.osc.edu/~pw/client.log.bz2
but it is large: 3.1 MB. The second half of the file is a read test
that runs immediately after the write discussed above. You can see
that read activity does not start while the write-related requests
are still being cancelled.
Neil is right, I can't run with TROVE_AIO_THREADED yet due to the AIO
issues with (old) Red Hat glibc. So maybe all this is yet again due to
broken Red Hat. Assure me, though, that someone can run MPI perf
correctly on fairly large files.