[PVFS-developers] Relocatable Metadata

Bob Arctor curious@pb194.luban.sdi.tpnet.pl
Wed, 5 Nov 2003 11:54:48 +0100


On Wed, 24 Sep 2003 16:16:23 -0500
Porter Don <PorterDE@mercury.hendrix.edu> wrote:

> I have been looking at setting up redundancy and/or failover for pvfs nodes.
> Obviously the use of the local filesystem's inode numbers prevent direct
> copying of metadata files to another machine, lest there be new file with
> the same inode an old one had


these might be silly ideas, but they might work for those who need failover other than a backup of the whole filesystem (that is, for those who use pvfs to get big storage space + speed, not speed only). the backup mechanism needs to be inode-independent...

machine A is connected to the network as a 'proxy', and machine B is connected to the network only via machine A.
   pvfs would be set up on machines B and A as a special failover cluster, so when B requests data from the net, machine A caches all the data and passes it on to machine B. machine B then stores the data as usual and sends an ACK back to the network, as usual. machine A has to relay the ACK packets too, but when they time out it will 'take over' and act as an independent node.
 advantages:
	* fast caching - machine A can respond quickly (in async mode with respect to machine B), holding in its local buffers the data that the slower machine B is still storing.
	* tolerance of false alarms due to overload - machine A may take over not only during a real failure of machine B, but also when machine B is merely too slow to answer requests. if machine B is not failing but simply too slow to store data, such a 'pair' can recover during cluster idle time (when B is able to catch up with the changes on machine A and copy the results), or the too-slow machine can be replaced with a faster one.
	* possibility of replacing B with e.g. B+B[1]+B[2]+B[...] clusters 'on the fly' - thus a possibility to increase the speed of the backup storage, and so far the only way to expand a cluster of pvfs machines on the fly. this allows an increase only in performance, not in overall storage size.
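
the relay / take-over / recover cycle described above could be sketched roughly like this. it is only a toy model in Python, not pvfs code: the names `ProxyNode`, `Backend`, `ack_timeout` and the "ack-from-A" / "ack-from-B" strings are all made up for illustration, and the ACK delay is simulated as a number instead of being measured on a real network.

```python
from dataclasses import dataclass, field


class Backend:
    """Machine B (hypothetical model): stores data and reports its ACK delay."""

    def __init__(self, delay=0.1, alive=True):
        self.delay = delay    # simulated seconds until B's ACK arrives
        self.alive = alive    # False models a dead machine B
        self.store = {}

    def put(self, key, data):
        if not self.alive:
            return None       # no ACK at all
        self.store[key] = data
        return self.delay


@dataclass
class ProxyNode:
    """Machine A (hypothetical model): relays to B, takes over on ACK timeout."""

    ack_timeout: float = 1.0                      # assumed value, in seconds
    standalone: bool = False                      # True once A has taken over
    cache: dict = field(default_factory=dict)     # everything A has seen
    pending: list = field(default_factory=list)   # writes B still has to catch up on

    def write(self, key, data, backend):
        self.cache[key] = data
        # While standalone, A does not bother B at all.
        delay = None if self.standalone else backend.put(key, data)
        if delay is None or delay > self.ack_timeout:
            # Missing or late ACK: A answers the network itself.
            self.standalone = True
            self.pending.append(key)
            return "ack-from-A"
        return "ack-from-B"

    def resync(self, backend):
        """Idle time: copy the writes B missed, then go back to relaying."""
        for key in self.pending:
            backend.store[key] = self.cache[key]
        self.pending.clear()
        self.standalone = False
```

note that in this sketch a merely slow B is handled the same way as a dead one: A takes over, remembers which writes are pending, and `resync` later brings B up to date, which matches the 'pair recovers during idle time' case.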

   imagine you have 400M RAM + a 100+100+100+100 G RAID array for each A node, and 100M RAM + a cheap 100G IDE drive for each B node.
as you can see, node B will 'fail' when the cluster reaches ~1/4 of its full load (node B will be 'full' at 100G while node A still has 300G of free space on its fast RAID array). you can add new B[2] nodes to your cluster while it is still up and running.
nodes B will also 'fail' from time to time when you put too much load on the cluster (e.g. during bursts of new data being saved to the cluster). if you find this annoying (e.g. some program needs a lot of data to be processed), you can add new B[2] nodes which will form a RAID array, or try adding more RAM to the B nodes, which will increase the clusterwide 'disk cache' size, and your program will run uninterrupted during the upgrade!
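
the numbers in this example can be checked with a few lines of Python. this is just the arithmetic, under the assumptions that the four 100G disks in A's array simply add up to 400G, that B mirrors every byte A holds, and that extra B nodes pool their space; the function name is made up.

```python
import math


def mirror_fill_fraction(a_capacity_gb, b_capacity_gb):
    """Fraction of A's capacity in use when a full mirror on B runs out of space."""
    return min(1.0, b_capacity_gb / a_capacity_gb)


a_cap = 4 * 100   # A node: 100+100+100+100 G, assumed to add up
b_cap = 100       # B node: one cheap 100G drive

frac = mirror_fill_fraction(a_cap, b_cap)
free_on_a = a_cap - b_cap                  # space left on A when B is full
b_nodes_needed = math.ceil(a_cap / b_cap)  # B nodes needed to mirror all of A

print(frac, free_on_a, b_nodes_needed)  # 0.25 300 4
```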

 drawbacks:
	* if machine A fails, the cluster will stop for a while, until a new machine A is put in place and the data from machine B is copied to it. in ideal conditions (a 'safe' cluster - not overloaded and in async mode) no data will be lost, as all data is also mirrored by machine B. if all 'B' machines in the cluster were connected to a 'failsafe' network (e.g. a wireless network, unused in 'normal' conditions), a 'disconnected' machine B could broadcast its content to a 'new AB pair' from the backup pool. this way no extra cables would be needed for backup, even in e.g. 1000+ cluster pools. as soon as the new AB pair is working, the wireless channel would be freed.