[PVFS2-developers] yesterdays checkins
pw at osc.edu
Tue Mar 29 10:55:53 EST 2005
I should probably mention what that was yesterday. Our goal here at OSC
is for pvfs2 clients to be robust with respect to server failures.
These might be Ethernet switch problems, node hardware problems, or disk
errors. When a server fails, the client should wait indefinitely until
the server comes back up, then finish the request. That's the computing
model we try to present to users.
To this end, I made changes in msgpairarray and sys-io to restart just
the failed part(s) of a set of transfers. Instead of retrying the entire
set (usually some number of messages for each server), the code now
continues processing results from good servers and retries only the bad
ones. This also reduces redundant traffic to the good servers during a
failure.
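To illustrate the idea of retrying only the failed part(s), here is a minimal sketch in C. It is not the actual msgpairarray or sys-io code: the names msg_pair, pair_state, send_fn, run_pairs and flaky_send are all hypothetical, and the real state machines are considerably more involved. The sketch only shows the control flow: completed pairs are skipped on later passes, so good servers see no repeated traffic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one request/response pair sent to one server. */
enum pair_state { PAIR_PENDING, PAIR_DONE, PAIR_FAILED };

struct msg_pair {
    int server_id;          /* which server this pair targets */
    enum pair_state state;  /* outcome of the most recent attempt */
    int attempts;           /* how many times this pair has been sent */
};

/* Caller-supplied transfer function (assumption): returns true on success. */
typedef bool (*send_fn)(struct msg_pair *pair, void *ctx);

/*
 * Send every pair once, then re-send only the pairs that failed, for at
 * most retry_limit additional passes. Pairs that already completed are
 * never re-sent. Returns the number of pairs still failed at the end.
 */
size_t run_pairs(struct msg_pair *pairs, size_t n, int retry_limit,
                 send_fn send, void *ctx)
{
    size_t failed = 0;

    for (int pass = 0; pass <= retry_limit; pass++) {
        failed = 0;
        for (size_t i = 0; i < n; i++) {
            if (pairs[i].state == PAIR_DONE)
                continue;               /* good server: leave it alone */
            pairs[i].attempts++;
            if (send(&pairs[i], ctx)) {
                pairs[i].state = PAIR_DONE;
            } else {
                pairs[i].state = PAIR_FAILED;
                failed++;
            }
        }
        if (failed == 0)
            break;                      /* everything completed */
    }
    return failed;
}

/*
 * Example stub for experimenting: server 1 is "down" until it has been
 * tried more than *ctx times; every other server always succeeds.
 */
bool flaky_send(struct msg_pair *pair, void *ctx)
{
    int down_for = *(int *)ctx;
    return !(pair->server_id == 1 && pair->attempts <= down_for);
}
```

With three pairs for servers 0, 1 and 2 and server 1 down for its first two attempts, run_pairs sends to servers 0 and 2 exactly once and to server 1 three times. A real client would also wait between passes rather than retrying immediately.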
Previously there were many paths whereby the application would see the
failures and exit. Now they are contained in the pvfs2 client code
until the retry count is reached. I likely missed some spots though, so
if you have a repeatable scenario where a client exits when a server
dies, let me know.
This all works along with the previous HA work. My changes just give
you a longer window to get your backup server in place, if that is what
The one change I did not check in from our tree is
#define PVFS2_CLIENT_RETRY_LIMIT (INT_MAX)
That makes each client try forever (well, 187 years). The CVS tree
limits the number of retries to 5. Maybe this should be a configure
option. Not sure where to document this either.
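One possible way to make the limit a build-time option is sketched below. This is only an assumption about how it could be wired up, not what the CVS tree does: the #ifndef guard keeps the default of 5 mentioned above and lets a build override it by passing -DPVFS2_CLIENT_RETRY_LIMIT=INT_MAX in CFLAGS. The helper retry_allowed is hypothetical and exists only to show the macro in use.

```c
#include <limits.h>   /* INT_MAX, for builds that want to retry "forever" */

/*
 * Default retry limit, used unless the build defines its own value,
 * e.g. CFLAGS="-DPVFS2_CLIENT_RETRY_LIMIT=INT_MAX".
 */
#ifndef PVFS2_CLIENT_RETRY_LIMIT
#define PVFS2_CLIENT_RETRY_LIMIT 5
#endif

/* Hypothetical helper: nonzero if the client may retry once more. */
static inline int retry_allowed(int retries_so_far)
{
    return retries_so_far < PVFS2_CLIENT_RETRY_LIMIT;
}
```

A configure switch would then only need to add the -D flag, and the default behavior stays unchanged for everyone else.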