[PVFS-developers] Relocatable Metadata

Rob Ross rross@mcs.anl.gov
Wed, 24 Sep 2003 16:18:13 -0500 (CDT)


On Wed, 24 Sep 2003, Porter Don wrote:

> I have been looking at setting up redundancy and/or failover for pvfs nodes.
> Obviously the use of the local filesystem's inode numbers prevents direct
> copying of metadata files to another machine, lest a new file there get
> the same inode number an old one had.

Looking at the bigger picture, how are you planning on keeping these two 
copies synchronized?

> I was wondering what reasons went into the decision to use fs inodes as the
> iod indices other than easy management of the used/free indices?  

I was looking for an easy way to get unique values.  The decision was made 
a very long time ago, well before I thought that this stuff was going to 
actually be *used* anywhere :(.

> I have been experimenting with writing some code to dole out indices when
> files are created and keep a table of used/free indices.  Upon cursory
> inspection, this seems to work - allowing metadata (and the table) to be
> copied to another machine.

Yes, that should work fine.
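
For illustration only, a minimal sketch of such a used/free index table,
assuming indices are plain positive integers and the table is saved as a
small JSON blob alongside the metadata. The class and method names here
(IndexTable, allocate, release, dumps, loads) are hypothetical and are not
part of PVFS.

```python
import json


class IndexTable:
    """Hypothetical used/free index table; illustrative only, not PVFS code."""

    def __init__(self):
        self.next_unused = 1   # smallest index that has never been handed out
        self.free = []         # released indices that may be reused

    def allocate(self):
        """Return a unique index, reusing a released one when available."""
        if self.free:
            return self.free.pop()
        index = self.next_unused
        self.next_unused += 1
        return index

    def release(self, index):
        """Mark an index as free again (e.g. when its file is removed)."""
        never_issued = index < 1 or index >= self.next_unused
        if never_issued or index in self.free:
            raise ValueError("index %r is not currently allocated" % index)
        self.free.append(index)

    def dumps(self):
        """Serialize the table so it can be copied with the metadata."""
        return json.dumps({"next_unused": self.next_unused, "free": self.free})

    @classmethod
    def loads(cls, text):
        """Rebuild a table from the output of dumps()."""
        data = json.loads(text)
        table = cls()
        table.next_unused = data["next_unused"]
        table.free = list(data["free"])
        return table
```

Because the indices no longer depend on the local filesystem, a table
restored with loads() on another machine hands out the same values the
original would have.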

I've seen another solution proposed in a paper that involved using a hash
of the file name (since we don't have links this is a 1 to 1 mapping), but
they didn't seem to handle directory renames in any reasonable way.

I like your solution better.
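
To make the rename problem concrete, here is a small sketch, assuming the
index is derived from a hash of the full path. The function name and the
choice of SHA-1 are assumptions for illustration; the paper's actual scheme
may differ.

```python
import hashlib


def index_from_path(path):
    """Hypothetical: derive a 64-bit index from a hash of the full path."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# The same path always maps to the same index ...
before = index_from_path("/pvfs/old/data.bin")

# ... but renaming the parent directory changes the path, and therefore the
# index, of every file underneath it, even though the file data is untouched.
after = index_from_path("/pvfs/new/data.bin")
```

Every file below a renamed directory would need its index (and whatever is
stored under that index) updated, which is the case that needs careful
handling.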

> I suppose my question, then, is: are there design considerations that I am
> missing in this approach, and how does everyone feel about such an idea?

Again, I think it's fine.  Be very careful in the meta directory though; 
that code is extremely fragile.

> Granted, it would slow the manager down some, but perhaps easier
> backup/redundancy might be worth the trade.

Sure.  You can preallocate a few of them anyway, then have a simple 
recovery scheme for figuring out if preallocated ones were used or not at 
startup to handle failures.  No reason why it can't be reasonably fast.
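
Roughly, and only as a sketch under stated assumptions: reserve indices in
batches with one durable write per batch, hand them out from memory, and at
startup compare what was reserved against the indices the metadata files
actually reference. The names below (PreallocatingAllocator, store, recover)
are hypothetical, and a plain dict stands in for durable storage.

```python
class PreallocatingAllocator:
    """Hypothetical batch preallocation of indices; not PVFS code."""

    def __init__(self, store, batch=64):
        self.store = store    # dict standing in for a durable on-disk record
        self.batch = batch    # number of indices reserved per durable write
        self.pool = []        # reserved indices not yet handed out

    def allocate(self):
        """Hand out an index from memory, reserving a new batch if needed."""
        if not self.pool:
            self._reserve_batch()
        return self.pool.pop(0)

    def _reserve_batch(self):
        start = self.store.get("reserved_up_to", 0) + 1
        end = start + self.batch - 1
        self.store["reserved_up_to"] = end   # the only durable write per batch
        self.pool = list(range(start, end + 1))

    def recover(self, indices_in_use):
        """At startup, rebuild the pool from what the metadata references.

        Any index at or below the reserved limit that no metadata file uses
        is treated as free, whether it was never handed out or was released.
        """
        limit = self.store.get("reserved_up_to", 0)
        used = set(indices_in_use)
        self.pool = [i for i in range(1, limit + 1) if i not in used]
```

The common path touches only memory; the cost is a scan of the metadata at
startup, which this sketch takes as the indices_in_use argument.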

> Further, the time for a few calculations is nothing compared to the time
> it takes for data to travel on the network anyway.  Also, to represent
> the available inodes on a 73 GB ext2 disk (for instance), we are only
> talking about around 950k (~1%).

Sounds a-ok.  This is a pretty big change though, so I'll probably not 
jump on integrating it right away -- I'd like to see it in use a little 
more first.

Regards,

Rob