[PVFS-developers] PVFS kernel patch for "Invalid Argument"
Rob Ross
rross at mcs.anl.gov
Fri Jul 30 13:00:51 EDT 2004
Hi David,
I understand the problem that you are describing, but I am not completely
satisfied with the solution.
I actually implemented (what I think was) a solution to this problem in
the pvfsd for the iod restart case. You guys use kpvfsd though, right?
Or is this happening with pvfsd also?
Anyway, the solution there was to redo the open when an I/O failure
occurred. It's very heavyweight, but it hides the failure nicely.
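The "redo the open on I/O failure" idea can be sketched as follows. This is only an illustration under stated assumptions, not the pvfsd code: a plain POSIX file stands in for a PVFS file, and pread_with_reopen() is a hypothetical name invented for this example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not taken from pvfsd. A local POSIX file stands in
 * for a PVFS file to show the shape of "redo the open when I/O fails".
 * On success *fd may have been replaced by a freshly opened descriptor. */
ssize_t pread_with_reopen(int *fd, const char *path, void *buf,
                          size_t len, off_t off)
{
    ssize_t n = pread(*fd, buf, len, off);

    if (n >= 0)
        return n;                   /* common case: nothing to hide */

    /* The handle is no longer usable. Drop it first (ignoring errors, it
     * may already be closed) so a reused descriptor number cannot be
     * closed by mistake, then open the file again. */
    if (*fd >= 0)
        close(*fd);
    *fd = open(path, O_RDONLY);
    if (*fd < 0)
        return -1;                  /* reopen failed: report the error */

    /* Retry exactly once with the fresh handle. */
    return pread(*fd, buf, len, off);
}
```

The caller passes an explicit offset so the retry resumes at the same position; the cost is a full open on every failure, which is why this is heavyweight.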
The problem with just reconnecting to an iod is that it doesn't actually
solve the open file problem, which is to say that even if you hide the
connection failure the iod has lost the capability, so trying to perform
I/O isn't going to work anyway, or at least generally it shouldn't.
You could just use a nonblocking select() on the socket to detect closed
sockets before starting to build up the linked lists. Of course there's
still a window of time in there during which the iod could die. If you
were going to implement such a select() I would prefer to see it as a
sockio function that would check for errors on a single socket.
A select() check would also avoid the additional iod traffic from the NOOP
requests, which is a nice benefit. The iods are already getting NOOPs on
first open from the iodcomm code.
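A minimal sketch of a single-socket check of this kind, assuming a POSIX sockets environment. The name sock_check_alive() is hypothetical; it is not an existing sockio function, and the return convention is an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper, not part of the existing sockio code.
 * Returns 0 if the socket shows no sign of being closed, -1 if the peer
 * has closed it or an error is pending. Never blocks. */
int sock_check_alive(int fd)
{
    fd_set rfds;
    char c;
    int ret;
    ssize_t n;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    do {
        struct timeval tv = { 0, 0 };   /* zero timeout: poll only */

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        ret = select(fd + 1, &rfds, NULL, NULL, &tv);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -1;      /* select() itself failed, e.g. bad descriptor */
    if (ret == 0)
        return 0;       /* nothing pending: no evidence of a close */

    /* Readable means data, EOF, or a pending error. Peek one byte
     * without consuming it to tell these cases apart. */
    n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0)
        return -1;      /* peer closed the connection */
    if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
        return -1;      /* e.g. ECONNRESET */
    return 0;
}
```

A result of 0 only means the socket looked healthy at the moment of the call; the iod can still die between this check and the first real send.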
The send_single_req() stuff in David's mgr patch looks promising; I will
spend some more time looking at that.
Thanks; let's work on this some more.
Rob
On Thu, 29 Jul 2004, David S Metheny wrote:
> After restarting either the Manager or IOD (or both), the first attempt
> to access a file from a client that already had established socket
> connections to the IOD(s) will fail with an "Invalid Argument" error. Any
> subsequent connections will work correctly.
>
> Two scenarios result in the IOD(s) closing all open socket connections:
>
> (1) The Manager process is stopped, resulting in a "shutdown" request
> being sent to all IODs
> (2) The IOD process(es) are stopped.
>
> Both cases result in the client now having "stale" socket connections to the
> IOD(s).
>
> Due to the nature of the socket communication between a client and an
> IOD, a stale socket cannot be detected by the client until a "receive" is
> performed on the socket. Furthermore, because requests are sent to
> the IOD using a linked list of "send" and "receive" operations, the closed
> socket cannot be detected by the client until the first "receive" operation
> is performed. Any "send" operations that have already been processed are
> now lost.
>
> This "linked list" method of client/IOD communication makes it difficult
> to implement retry logic without significantly restructuring the code.
>
> The solution is to have the client send a NOOP to each IOD when opening
> a file. If the socket to the IOD is stale, the error is detected when
> receiving the ACK from the NOOP request, resulting in the stale socket being
> closed. A new socket will be opened the next time the client attempts to
> communicate with the IOD.
>
> This solution has the following attributes:
>
> (1) It handles the case where new files are opened after restarting
> either the Manager or IOD processes.
> (2) It does not create significant overhead. These NOOP requests are
> only sent when a file is being opened, not when any actual IO is being
> performed.
> (3) Any files already open at the time the Manager or IOD processes are
> restarted will still behave as they have always done.
>
> This patch addresses the "Invalid Argument" in the kernel module code, and a
> patch for the PVFS library will be posted soon.
>
>