[PVFS-developers] Article in iX - mgr bottleneck

Olaf Doschke Olaf.Doschke at t-online.de
Tue May 25 20:59:00 EDT 2004


Hello all,

I have read an interesting article about PVFS (version 2) in the German magazine
iX (issue 6/2004). Although it could give only a very brief description of PVFS
compared to what any PVFS developer knows about the system, one thing it
mentioned is that the management node (mgr) tends to be the bottleneck when it
has to manage the metadata of files stored on many io-nodes (iod).

Is it true that you are working on parallelizing this component? Did I
understand iX correctly?

If so, I have some ideas and suggestions on this problem. Perhaps this will at
least initiate a little discussion about that problem:

If some client does an ls of a PVFS directory, it is indispensable that the mgr
node knows everything that needs to be known about the files in that dir. As I
understand it, the PVFS strategy so far is that the mgr learns the metadata of a
file that is to be written before it is written to the io-nodes. Perhaps this is
also needed because the mgr is not only told to which iods the file is
distributed, but also manages this distribution of the file parts and tells the
client (or its BMI) which parts of the file to send to which iods. But does the
mgr really need to know everything at once, or wouldn't it be sufficient if it
knew just in time?

In your description you also write:
"By having a daemon which atomically operates on file metadata we avoid many of 
the shortcomings of storage area network approaches, which have to implement 
complex locking schemes to ensure that metadata stays consistent in the face of 
multiple accesses."

This is a fair point. What about a hierarchical system of mgr nodes then,
delegating knowledge and requests?

It's comparable to a company's management: the bigger a company is, the more
layers of management, submanagement etc. sit above the layer of the "real
workers". Although the highest management tends to be misinformed about what's
going on at the base, computers have the advantage of passing information to the
highest management node without information loss ;-).

Some more concrete ideas would be: 

1. Dividing responsibility hierarchically

This could be done according to the hierarchical tree of dirs and subdirs. It is
comparable to secretaries who are each responsible for staff whose surnames
begin with certain letters, e.g. A-D, E-H etc. This way, each client would know
which node to send its requests to initially, without bothering the (main) mgr
for that.
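
To make the idea concrete, here is a minimal sketch of how a client could pick
the responsible mgr from the path alone. The table SUBTREE_OWNERS, the node
names and the function responsible_mgr are purely hypothetical and not part of
PVFS; the sketch only assumes that every client holds a copy of the table.

```python
from pathlib import PurePosixPath

# Hypothetical assignment of directory subtrees to mgr nodes.
SUBTREE_OWNERS = {
    "/": "mgr0",
    "/home": "mgr1",
    "/home/projects": "mgr2",
    "/scratch": "mgr3",
}


def responsible_mgr(path, owners=SUBTREE_OWNERS):
    """Return the mgr that owns the deepest registered subtree containing path."""
    current = PurePosixPath(path)
    # Walk from the path itself up to the root; the first hit is the most
    # specific (longest) matching subtree.
    for candidate in (current, *current.parents):
        owner = owners.get(str(candidate))
        if owner is not None:
            return owner
    raise LookupError(f"no mgr registered for {path}")
```

With this table, a request for /home/projects/a/b.dat would go straight to
mgr2, and anything outside the listed subtrees falls back to mgr0.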

2. Load balancing

You could also make it more flexible: there could be some load balancing for
areas of responsibility.

A very natural way of load balancing would be that each mgr node accepts all
requests up to a certain number and answers additional requests with "I'm busy,
please ask someone else or ask later." Alternatively, it could pass the request
on to other mgr nodes and ask a mgr that is not busy to communicate with the
origin of the request.
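
A rough sketch of this accept-or-pass-on behaviour, just to illustrate the
idea: the class MgrNode, its capacity limit and the one-hop forwarding rule are
my own assumptions and have nothing to do with the actual PVFS mgr.

```python
class MgrNode:
    """Hypothetical mgr that accepts requests up to a fixed capacity."""

    def __init__(self, name, capacity, peers=None):
        self.name = name
        self.capacity = capacity      # maximum number of concurrent requests
        self.active = 0               # requests currently being handled
        self.peers = peers or []      # other mgr nodes to pass requests to

    def handle(self, request, hops=0):
        """Return the name of the mgr that accepted the request, or None
        if everybody asked was busy ("please ask later")."""
        if self.active < self.capacity:
            self.active += 1
            return self.name
        # Busy: pass the request on, but only once, so that two busy
        # nodes do not bounce it back and forth forever.
        if hops >= 1:
            return None
        for peer in self.peers:
            accepted_by = peer.handle(request, hops + 1)
            if accepted_by is not None:
                return accepted_by
        return None

    def finish(self):
        """Mark one request as done, freeing capacity."""
        if self.active > 0:
            self.active -= 1
```

The answer None stands for "I'm busy, please ask later"; the caller would then
retry after a while or try another mgr on its own.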

Another reason for passing on requests, especially requests for storing data,
could be that no disk space is available on that node. If that happens, the node
could also be locked for such requests. This would be done by spreading the
information that a certain mgr or iod is not available for further write
operations to all other iods/mgrs.

3. Asynchronous updating of metadata

When a file is written for the first time, only the iod really needs to know
which part(s) of which file(s) it has to store, and that it holds those part(s)
(and who the owner is and who has read/write/execute rights on them). It could
then inform the (highest) mgr node afterwards that it has that file or certain
parts of it.

There is of course a need to plan the storage, dividing the file so that it is
stored fastest. But the time spent planning is lost. So why not choose at random
where to store the parts of the file, perhaps based on past experience of
bandwidth and/or available space?
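
Such a random choice "based on experience" could look roughly like the sketch
below. The stats dictionary, its keys and the weight (observed bandwidth times
free space) are only assumptions for illustration; any other weighting would do
as well.

```python
import random


def choose_iods(stats, parts, rng=random):
    """Pick one iod per file part at random, preferring iods that were
    fast in the past and still have free space.

    stats maps an iod name to {"bandwidth": ..., "free_bytes": ...};
    both the structure and the weighting are hypothetical.
    """
    # Full iods are not candidates at all.
    names = [name for name, s in stats.items() if s["free_bytes"] > 0]
    if not names:
        raise RuntimeError("no iod has free space left")
    weights = [stats[n]["bandwidth"] * stats[n]["free_bytes"] for n in names]
    # random.choices draws with replacement, so one iod may get several parts.
    return rng.choices(names, weights=weights, k=parts)
```

No planning round trip is needed here: the client (or the first iod) can make
the decision locally from statistics it already has.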

And what if someone does an ls for a dir the (main) mgr does not know? The mgr
could then spread this request to sub-mgr nodes or even iods to gather that
information just in time. Consistency is preserved, because at any time only one
node knows the information (or a part of the information) about a file. Once
that information is passed to the main mgr, it is deleted on the iod or sub-mgr,
so in the end only the main mgr knows it. So I'm not talking about redundant
spreading of the metadata, but a "lazy load" into the main mgr node, so that the
mgr bottleneck only affects how fast the mgr itself learns things, not how fast
requests are handled.
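
A small sketch of this "lazy load", under the assumption that every iod or
sub-mgr keeps a list of metadata it has not handed over yet. The classes Holder
and MainMgr and all their methods are hypothetical, not PVFS code.

```python
def in_directory(path, directory):
    """True if path is a direct entry of directory."""
    prefix = directory.rstrip("/") + "/"
    return path.startswith(prefix) and "/" not in path[len(prefix):]


class Holder:
    """An iod or sub-mgr holding metadata the main mgr does not know yet."""

    def __init__(self):
        self.pending = {}  # path -> metadata

    def record_write(self, path, meta):
        self.pending[path] = meta

    def hand_over(self, directory):
        """Return the pending entries of directory and delete them locally,
        so that only one node knows them at any time."""
        moved = {p: m for p, m in self.pending.items()
                 if in_directory(p, directory)}
        for path in moved:
            del self.pending[path]
        return moved


class MainMgr:
    def __init__(self, holders):
        self.holders = holders
        self.metadata = {}

    def ls(self, directory):
        # Gather whatever the holders still have for this dir, just in time.
        for holder in self.holders:
            self.metadata.update(holder.hand_over(directory))
        return sorted(p for p in self.metadata if in_directory(p, directory))
```

Writes only touch a Holder; the main mgr pays the cost of collecting the
metadata the first time somebody actually asks for the directory.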

I know this will mostly accelerate write requests, but delegation of requests 
could also work for read requests.


I think that's enough ideas for now...


My interest in PVFS comes from having had an idea for something quite similar
myself: a network backup system that spreads file parts over a grid of computers
(on the internet or only in a LAN). I had opted for a P2P system that does it
roughly this way:

a) Quite similar to file sharing tools, you'd have a file backup folder (not a
sharing folder). The difference from a file sharing tool is that you don't wait
for a request for a certain file; instead each P2P net node pushes the files it
needs to back up to other nodes.

b) You only push _parts_ of files to other nodes, giving no node the full file
you want to back up (well, I assume there are considerably more than 2 nodes in
the net). For security you could also scramble the file parts and/or encrypt the
file or its parts, before or after scrambling, so that no one but you could
restore, decrypt and unscramble the file.

c) To keep the backup reachable despite nodes switching on and off, I thought
about spreading each file part to more than one node. There could be an
expiration date for how long a file (or a part of it) needs to be stored. Only
up to that date are the file parts spread in the net, not only by yourself but
also by the nodes that received a file part.

d) In small nets this mechanism would tend towards a situation where each node
has all parts of each backed-up file (or as many parts as are allowed under
"rule" b). A mechanism to avoid that would be to give each file part a time to
live, e.g. only allowing copies up to a certain depth (e.g. only up to a copy of
a copy of a copy). In addition, a node could be allowed to push file parts to
other nodes without counting down that TTL when it is scheduled to disconnect
from the net after it has backed up all of its own files and/or isn't on/online
all the time.

e) If a file changes or is deleted, that does not mean anything to the file
parts stored at nodes: they don't store the file, they store just a backup of
it, which doesn't need any maintenance apart from being deleted when the
expiration date is reached. So a deleted file is simply not backed up any more
and a changed file is handled as a new one.

f) Identification of the origin and of the file part is handled with IDs; a
globally unique ID would be preferable. So to restore files a node needs to know
its own GUID and the GUID of a file or of all the file parts of the file. A
request for restoring that file would have to contain all this information,
which could be stored in a file that itself can of course be part of the backup,
so in the end you'd just need a few bytes with the IDs of that file safely
stored on a client or floppy disks or a USB stick or whatever.
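
Points b), d) and f) can be sketched together in a few lines. Everything here
(the function names, the dictionary layout of a part, the default TTL of 3) is
only an illustration of the idea; encryption, scrambling and the expiration
date are left out.

```python
import uuid


def split_for_backup(data, part_size, node_id, copy_ttl=3):
    """Split data into parts with their own GUIDs and return
    (parts, manifest). The manifest is the small record the owner
    has to keep safe in order to restore the file later."""
    file_id = uuid.uuid4()
    parts = []
    for index, offset in enumerate(range(0, len(data), part_size)):
        parts.append({
            "part_id": uuid.uuid4(),
            "file_id": file_id,
            "origin": node_id,
            "index": index,
            "ttl": copy_ttl,  # how many more copy generations are allowed
            "payload": data[offset:offset + part_size],
        })
    manifest = {
        "origin": node_id,
        "file_id": file_id,
        "part_ids": [part["part_id"] for part in parts],
    }
    return parts, manifest


def copy_for_forwarding(part):
    """Return a copy with the TTL counted down, or None when the part
    must not be copied any further."""
    if part["ttl"] <= 0:
        return None
    return {**part, "ttl": part["ttl"] - 1}


def restore(manifest, found_parts):
    """Reassemble the file from parts found in the net, in any order."""
    by_id = {part["part_id"]: part for part in found_parts}
    missing = [pid for pid in manifest["part_ids"] if pid not in by_id]
    if missing:
        raise LookupError(f"{len(missing)} part(s) not found in the net")
    return b"".join(by_id[pid]["payload"] for pid in manifest["part_ids"])
```

The manifest contains only GUIDs, so it stays tiny no matter how large the
file is.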

It's quite a different idea from PVFS, because its main aspect is to safely
store redundant backups of files and recover them from the scattered nodes only
when you need a backup. This way I don't need much metadata; in particular, a
file spread into a backup net this way does not need to be maintained. Also,
restoring files (reading files) is not guaranteed to be fast in this P2P backup
net, since you need to post a search to the net, because there is no mgr that
knows where the parts of a file are. But my main goal is not read
performance...

Of course a shortcoming of this is not only how slow a restore could be, but
also that there is no way of ensuring that a file can be restored from the net
at all. This is much more easily achieved if the net is a LAN, where you can be
sure that most of the nodes are on and online when you need a backup back, than
if you depend on getting a file back this way from the internet. So I'm thinking
of integrating the idea of some metadata stored at some mgr nodes...

Perhaps this also raises some ideas for PVFS.

Finally:
Please don't accuse me of forcing you to do some work for me. I have just
suggested some ideas, without saying this is the only way to do it or the best
way. In particular, I wrote the second part about the P2P net backup only to
inform you of the background for my ideas. I've spent some time thinking about
distributed storage. If it doesn't fit into your concept, it doesn't. Feel free
to make use of all these ideas and feel free to ask me if something isn't clear.

Bye, Olaf Doschke.