[PVFS-developers] Recovering from an IOD failure

Rob Ross rross at mcs.anl.gov
Tue Feb 10 18:22:02 EST 2004


Hi Don, I see you're digging around again :).

On Tue, 10 Feb 2004, Porter Don wrote:

> I am going to apologize in advance for the length of this email, but
> hopefully it reflects the depth of the "rabbit hole" I have been traveling
> down :)

No sweat.

> 1) When a file is opened/created and an iod is down, the manager sends a
> close to all iods and deletes the metadata file, but not the data file on
> the iods it got to.  This is a potential source of the ever-so-annoying
> .saveme files.  
> 
> The bigger problem, however, is that on the first open of a file that
> already exists, if an iod has failed, the manager does _nothing_ to close
> the newly opened file descriptors on the iods it did reach.  These never
> get closed unless the manager dies, resulting in a file descriptor leak
> (albeit a rather small one).
> 
> In both cases, it seems like a little cleanup logic in do_open would easily
> solve the problem.

Yeah, this seems like a reasonably straightforward thing to fix.
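
A minimal sketch of the kind of cleanup described in (1), not the actual
do_open code.  Everything here is hypothetical: struct iod, iod_open,
iod_close, iod_unlink and open_on_all_iods are stand-ins for whatever
requests the manager really sends, and a "down" iod is simulated with a flag.

```c
#include <stddef.h>

/* Hypothetical view of one iod as seen from the manager. */
struct iod {
    int up;          /* 0 simulates a failed iod */
    int file_open;   /* 1 while the iod holds the file open */
    int has_data;    /* 1 while the iod holds a data file */
};

/* Stand-ins for the open/close/unlink requests sent to an iod. */
int iod_open(struct iod *i, int create)
{
    if (!i->up)
        return -1;
    i->file_open = 1;
    if (create)
        i->has_data = 1;
    return 0;
}

void iod_close(struct iod *i)  { i->file_open = 0; }
void iod_unlink(struct iod *i) { i->has_data = 0; }

/* Open (or create) the file on every iod.  If one iod fails, undo the
 * work on the iods already reached: always close, and remove the data
 * file only when this call created it.  Returns 0 on success, -1 on
 * failure. */
int open_on_all_iods(struct iod *iods, size_t n, int create)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (iod_open(&iods[i], create) == 0)
            continue;
        for (j = 0; j < i; j++) {
            iod_close(&iods[j]);
            if (create)
                iod_unlink(&iods[j]);
        }
        return -1;
    }
    return 0;
}
```

The same rollback loop covers both cases from (1): the create path also
removes the partial data files, and the open-existing path only closes.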

> 2) In mgr.c/send_req, if the manager had an open socket connection that
> dies, there is no retry logic.  It seems like the manager ought to at least
> try once to reestablish the connection on an EPIPE.  This would primarily
> help the case where an iod died and came back up between requests.

Sounds good.  Not like the mgr can do much else then anyway :).
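
A rough sketch of a single retry on EPIPE, under stated assumptions: struct
iod_conn, fake_connect, fake_send and send_req_retry are hypothetical names,
and the fake transport only simulates a stale socket.  Real code would call
connect()/send(), close() the dead descriptor, and make sure SIGPIPE is
ignored (or MSG_NOSIGNAL is used) so that the write fails with EPIPE instead
of killing the process.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-iod connection state kept by the manager. */
struct iod_conn {
    int fd;        /* -1 when no connection is open */
    int sends;     /* bookkeeping for the fake transport below */
    int connects;
};

/* Fake transport: pretends to reconnect and hands back a made-up fd. */
int fake_connect(struct iod_conn *c)
{
    c->connects++;
    return 100 + c->connects;
}

/* Fake transport: the first send fails with EPIPE (stale socket),
 * every later send succeeds. */
ssize_t fake_send(struct iod_conn *c, const void *buf, size_t len)
{
    (void)buf;
    if (c->sends++ == 0) {
        errno = EPIPE;
        return -1;
    }
    return (ssize_t)len;
}

/* Send a request; on EPIPE, reconnect and retry exactly once.
 * Returns 0 on success, -1 on failure. */
int send_req_retry(struct iod_conn *c, const void *buf, size_t len)
{
    int attempt;

    for (attempt = 0; attempt < 2; attempt++) {
        ssize_t n;

        if (c->fd < 0) {
            c->fd = fake_connect(c);
            if (c->fd < 0)
                return -1;
        }
        n = fake_send(c, buf, len);
        if (n == (ssize_t)len)
            return 0;
        if (n >= 0 || errno != EPIPE)
            return -1;      /* short write or other error: no retry */
        c->fd = -1;         /* real code would close() the dead socket */
    }
    return -1;
}
```

Capping the loop at two attempts keeps the manager from spinning on an iod
that is still down.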

> 3) The big one is that when iods come back up, say after a power loss, their
> state is out of sync with the rest of the cluster.  If a client tries to
> submit an io request to a newly bounced iod (with the rest not bounced), the
> iod will not have a file with that inode/cap open and squash the connection
> (as it does anytime it thinks it is getting a bogus request).  The net
> result will be that the client will have to reopen the file on the manager
> and all iods (hopefully closing the old one first).  Because all files are
> open on all iods, the loss of a single iod means that the state of the
> entire cluster will have to be reset en masse, albeit not necessarily all at
> once.
> 
> This is a lot of overhead and lost time in the case that an iod node can be
> brought back up quickly enough that all connections haven't timed out or
> given up.
> 
> My thought was to add the manager's name to the iod conf file and to add a
> manager call to get the state (which inodes are open, with what caps, with
> what permissions) on startup (before servicing any requests).
> 
> Pros/Cons:
> 
> -slower startup time, but if it is a clean start, the return should be
> relatively quick because there should be little traffic.  If it is an
> unclean start, it will still be quicker for one iod to get itself
> synchronized than for everyone else to resynchronize.
> 
> -If the manager is not responding, it can assume no state and go on.  I
> really don't know that a newly started iod with no running manager would
> really do much anyway, though.
> 
> -The only challenge will be that there is probably a good bit of logic in
> the current client programs (lib and kernel module) that will go ahead and
> close/reopen the file if it detects a failure.  This will need to be taken
> into account to avoid any race conditions.  Hopefully the fact that all
> servers are single threaded will work to our advantage here :)
> 
> -This _MUST_ be done before the iod even starts accepting socket connections
> from anyone, otherwise this is begging for a race condition where the
> manager is waiting on something from the iod and the iod is waiting for
> something from the manager.  If there is no connection, the manager's calls
> to iods will fail and move on.
> 
> -To slightly challenge those who are trying to do something malicious, the
> manager could only allow this command to be run for sockets from machines in
> the iodtab file.

I agree that this can be a problem.
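
On the last point above (only honoring the state query from machines in the
iodtab file), a small sketch of the address check.  It assumes the iodtab
entries have already been reduced to IPv4 literals; the real file format and
any hostname resolution are not handled here, and peer_in_iodtab is a
hypothetical helper, not an existing PVFS function.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>

/* Return 1 if the peer address matches one of the dotted-quad entries,
 * 0 otherwise.  Entries that do not parse as IPv4 literals are skipped. */
int peer_in_iodtab(struct in_addr peer, const char **entries, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct in_addr a;

        if (inet_pton(AF_INET, entries[i], &a) != 1)
            continue;
        if (a.s_addr == peer.s_addr)
            return 1;
    }
    return 0;
}
```

The manager would call this with the address returned by accept() or
getpeername() before answering the state request.  As noted above, this only
slightly raises the bar, since source addresses can be spoofed.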

For pvfsd I handled this problem by closing/reopening; that worked nicely,
and applications never see it.  So only libpvfs/mpi-io applications should 
see this.

The approach that you outline above does have the advantage of efficiently
"catching up" the server on what is going on.  It simplifies the
client-side piece to just reconnecting, which is nice.

The alternative that I've had in mind for a while was just implementing
the retry logic in libpvfs.  As a first cut, just the close/reopen
would do.  More efficient approaches (e.g. clients forwarding cap
information to the newly started iod) have security implications worse
than the ones for your solution.
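
For illustration only, the close/reopen retry could look roughly like the
sketch below.  None of these names come from libpvfs: struct client_file,
iod_epoch and the fake_* functions are hypothetical and just simulate an iod
that forgets open files when it restarts.

```c
/* Simulated iod "boot epoch"; it increases each time the iod restarts,
 * which invalidates every handle opened before the restart. */
int iod_epoch = 1;

/* Hypothetical client-side file handle. */
struct client_file {
    int open_epoch;   /* epoch at which this handle was opened, 0 if closed */
    int reopens;      /* how many times the retry path reopened the file */
};

int fake_open(struct client_file *f)
{
    f->open_epoch = iod_epoch;
    return 0;
}

void fake_close(struct client_file *f)
{
    f->open_epoch = 0;
}

/* The request fails if the iod restarted after the file was opened. */
int fake_read(struct client_file *f)
{
    return f->open_epoch == iod_epoch ? 0 : -1;
}

/* Try the request; on failure, close and reopen the file and retry
 * exactly once.  Returns 0 on success, -1 on failure. */
int read_with_reopen(struct client_file *f)
{
    if (fake_read(f) == 0)
        return 0;
    fake_close(f);
    if (fake_open(f) < 0)
        return -1;
    f->reopens++;
    return fake_read(f);
}
```

The application only ever calls read_with_reopen, so a bounced iod shows up
as one slower request rather than an error.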

I don't have a strong feeling about this, and honestly we're not likely to 
take this on here any time soon due to resource constraints.  So if you 
wanted to go after the approach you suggest, I think that would be fine.

I do feel like server failures should be infrequent, so simplicity of the 
solution should take precedence over efficiency.  Is this something that 
you guys are running into a lot?

Rob


