[PVFS-developers] PVFS kernel patch for "Invalid Argument"

david.s.metheny at conwaycorp.net
Fri Jul 30 14:24:26 EDT 2004


Yes, we use kpvfsd. 

Quoting Rob Ross <rross at mcs.anl.gov>:

> Hi David,
> 
> I understand the problem that you are describing, but I am not completely 
> satisfied with the solution.
> 
> I actually implemented (what I think was) a solution to this problem in 
> the pvfsd for the iod restart case.  You guys use kpvfsd though, right?  
> Or is this happening with pvfsd also?
> 
> Anyway, the solution there was to redo the open when an I/O failure 
> occurred.  It's very heavyweight, but it hides the failure nicely.
> 
> The problem with just reconnecting to an iod is that it doesn't actually
> solve the open file problem.  Even if you hide the connection failure, the
> iod has lost the capability, so trying to perform I/O isn't going to work
> anyway, or at least generally it shouldn't.
> 
> You could just use a nonblocking select() on the socket to detect closed 
> sockets before starting to build up the linked lists.  Of course there's 
> still a window of time in there during which the iod could die.  If you 
> were going to implement such a select() I would prefer to see it as a 
> sockio function that would check for errors on a single socket.
> 
> This also cuts down on this additional iod traffic, which is a nice 
> benefit.  The iods are already getting NOOPs on first open from the 
> iodcomm code.
> 
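
A minimal sketch of the single-socket check suggested above, assuming plain
POSIX sockets on Linux.  The name sock_is_stale() is hypothetical and is not
an existing sockio function.  It polls with a zero-timeout select() and, if
the socket is readable, peeks one byte to tell pending data apart from a
close by the peer.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the PVFS sockio code).
 * Returns 0 if the socket shows no sign of being closed,
 *         1 if the peer has closed it or an error is pending,
 *        -1 if the check itself could not be performed.
 */
int sock_is_stale(int fd)
{
    fd_set rfds;
    struct timeval tv;
    char byte;
    ssize_t n;
    int ret;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    do {
        /* select() may modify both arguments, so reset them on each try */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = 0;      /* zero timeout: poll, never block */
        tv.tv_usec = 0;
        ret = select(fd + 1, &rfds, NULL, NULL, &tv);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;           /* not readable: no EOF or error queued */

    /* Readable means either data is waiting or the peer closed; peek. */
    n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0)
        return 1;           /* orderly close by the peer */
    if (n < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
            return 0;       /* nothing conclusive, treat as alive */
        return 1;           /* e.g. ECONNRESET */
    }
    return 0;               /* data pending, connection still open */
}
```

A caller would run this on each iod socket before building the request
lists and drop any socket for which it returns 1.  As noted above, the iod
can still die between the check and the first send, so this narrows the
window rather than closing it.
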
> The send_single_req() stuff in David's mgr patch looks promising; I will 
> spend some more time looking at that.
> 
> Thanks; let's work on this some more.
> 
> Rob
> 
> On Thu, 29 Jul 2004, David S Metheny wrote:
> 
> >     After restarting either the Manager or IOD (or both), the first attempt
> > to access a file from a client that already had established socket
> > connections to the IOD(s) will fail with an "Invalid Argument" error.  Any
> > subsequent connections will work correctly.
> >  
> > Two scenarios result in the IOD(s) closing all open socket connections:
> >  
> >    (1) The Manager process is stopped, resulting in a "shutdown" request
> > being sent to all IODs
> >    (2) The IOD process(es) are stopped.
> >  
> > Both cases result in the client now having "stale" socket connections to the
> > IOD(s).
> >  
> >     Due to the nature of the socket communication between a client and an
> > IOD, a stale socket cannot be detected by the client until a "receive" is
> > performed on the socket.  Furthermore, because requests are sent to the IOD
> > using a linked list of "send" and "receive" operations, the closed socket
> > cannot be detected by the client until the first "receive" operation is
> > performed.  Any "send" operations that have already been processed are
> > now lost.
> >  
> >     This "linked list" method of client/IOD communication makes it difficult
> > to implement retry logic without significantly restructuring the code.
> >  
> >     The solution is to have the client send a NOOP to each IOD when opening
> > a file.  If the socket to the IOD is stale, the error is detected when
> > receiving the ACK from the NOOP request, resulting in the stale socket being
> > closed.  A new socket will be opened the next time the client attempts to
> > communicate with the IOD.
> >  
> >     This solution has the following attributes:
> >  
> >     (1) It handles the case where new files are opened after restarting
> > either the Manager or IOD processes.
> >     (2) It does not create significant overhead.  These NOOP requests are
> > only sent when a file is being opened, not when any actual IO is being
> > performed.
> >     (3) Any files already open at the time the Manager or IOD processes are
> > restarted will still behave as they have always done.
> >  
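
The NOOP-on-open idea described above can be sketched roughly as follows.
This is an illustration only, not the code from the patch: struct probe_msg,
PROBE_NOOP and probe_iod_socket() are hypothetical stand-ins for the real
PVFS request and ack structures, and the sketch assumes a blocking stream
socket.  On any send or receive failure the descriptor is closed and set to
-1 so that the next attempt to talk to the iod opens a fresh connection.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real request/ack layout. */
enum { PROBE_NOOP = 0 };
struct probe_msg { int type; int status; };

/* Send exactly len bytes; MSG_NOSIGNAL avoids SIGPIPE on a dead peer. */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes; a 0-byte read means the peer closed. */
static int recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n == 0)
            return -1;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Probe one iod connection at open time.
 * Returns 0 if the NOOP was acknowledged.  Returns -1 if the socket
 * turned out to be stale; *fdp is then closed and set to -1.
 */
int probe_iod_socket(int *fdp)
{
    struct probe_msg req = { PROBE_NOOP, 0 };
    struct probe_msg ack;

    if (*fdp < 0)
        return -1;
    if (send_all(*fdp, &req, sizeof(req)) == 0 &&
        recv_all(*fdp, &ack, sizeof(ack)) == 0 &&
        ack.status == 0)
        return 0;

    close(*fdp);    /* stale: drop it so the caller reconnects later */
    *fdp = -1;
    return -1;
}
```

Because the receive here blocks, a real implementation would presumably
want a timeout for an iod that is reachable but not answering.
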
> > This patch addresses the "Invalid Argument" in the kernel module code, and a
> > patch for the PVFS library will be posted soon.
> >  
> > 



