[PVFS-developers] Article in iX - mgr bottleneck
Rob Ross
rross at mcs.anl.gov
Tue May 25 17:52:47 EDT 2004
On 25 May 2004, Olaf Doschke wrote:
> I have read an interesting article about PVFS (version 2) in the German magazine
> iX (issue 6/2004). Although they could only give a very brief description of
> PVFS compared to what any PVFS developer knows about the system, one thing
> they mentioned is that the management node (mgr) tends to be the bottleneck
> when it has to manage the metadata of files stored on many I/O nodes (iod).
>
> Is it true that you are working on parallelizing this component? Do I
> understand this correctly from iX?
The mgr can indeed be a bottleneck under some workloads, although I think
recent patches have helped quite a bit by better implementing the
scatter/gather of iod messages (thanks Stuart!).
[snip]
One thing that I will mention right away is that PVFS1 is mostly being
bugfixed at this point. PVFS2 is the place to look for active development
and application of interesting new algorithms to address scalability and
performance.
> Some more concrete ideas would be:
>
> 1. Dividing responsibility hierarchically
It's difficult to do this in a way that leads to good distributions --
think about a big parallel application dropping lots of files in a single
directory, for example.
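
To make that concern concrete, here is a minimal Python sketch. It is not
PVFS code: both placement rules, the function names, the server count, and the
path pattern are assumptions made up for illustration. It compares a purely
hierarchical rule (the parent directory picks the server) with a rule that
hashes the full path, for 1000 files dropped into one directory.

```python
import hashlib
from collections import Counter


def _bucket(text, n_servers):
    """Map a string to a server index with a stable hash."""
    digest = hashlib.md5(text.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_servers


def by_directory(path, n_servers):
    """Hypothetical hierarchical rule: the parent directory picks the server."""
    parent = path.rsplit("/", 1)[0]
    return _bucket(parent, n_servers)


def by_full_path(path, n_servers):
    """Hypothetical flat rule: the full path picks the server."""
    return _bucket(path, n_servers)


# A parallel application writing one output file per rank into one directory.
paths = [f"/scratch/run42/out.{rank:04d}" for rank in range(1000)]

dir_load = Counter(by_directory(p, 4) for p in paths)
path_load = Counter(by_full_path(p, 4) for p in paths)

print("by directory:", dict(dir_load))   # all files on a single server
print("by full path:", dict(path_load))  # files spread over the servers
```

Under the directory rule every file lands on the same server, which is the
poor distribution described above.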
> 2. Load balancing
>
> A very natural way of load balancing would be for each mgr node to accept all
> requests up to a certain amount and answer additional requests with "I'm busy,
> please ask someone else or ask later." Alternatively, it could pass the request
> to other mgr nodes itself and ask those mgrs to communicate with the origin of
> the request if they are not busy.
This would require access to the same metadata from more than one mgr.
That is a very difficult problem that opens up all sorts of consistency
issues.
I completely agree that avoiding servers that have no more free space
would be a good thing to do. We don't do that right now in PVFS2. We do,
however, allow for arbitrary distribution of metadata to a set of servers
(rather than one). Our "arbitrary" distribution is random at the moment,
but I believe that Thomas Ludwig et al. are planning some more
interesting work in this area.
> 3. Asynchronous updating of metadata
>
> The first time a file is written, only the iod really needs to know what
> part(s) of which file(s) it has to store and that it has those part(s) of the
> file (and who the owner is and who has read/write/execute rights on it). It
> could then inform the (highest) mgr node afterwards that it has that file or
> certain parts of it.
This won't work under the existing systems, because there needs to be a
way for some other client to read from the file immediately. Thus the
metadata must be available immediately. Also, deferring metadata writes
might lead to situations where someone is writing to a file that doesn't
appear in the file system space at all (not in "ls" for example). What
would the semantics be for more than one concurrent create in that case?
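
The concurrent-create question can be shown with a toy model in Python. This
is only a sketch under stated assumptions, not how PVFS works: the `Mgr` and
`Iod` classes, their methods, and the "first registration wins" rule are all
hypothetical.

```python
class Mgr:
    """Toy metadata manager that only learns about files when iods report."""

    def __init__(self):
        self.names = {}  # file name -> iod that registered it first

    def register(self, name, owner):
        """Return True for the first registration, False on a conflict."""
        if name in self.names:
            return False
        self.names[name] = owner
        return True


class Iod:
    """Toy I/O daemon that accepts creates locally and reports them later."""

    def __init__(self, ident):
        self.ident = ident
        self.pending = []

    def create(self, name):
        # Succeeds immediately, without consulting the mgr.
        self.pending.append(name)
        return True

    def flush(self, mgr):
        """Report deferred creates; map each name to its registration result."""
        results = {name: mgr.register(name, self.ident) for name in self.pending}
        self.pending = []
        return results


mgr = Mgr()
iod0, iod1 = Iod("iod0"), Iod("iod1")
iod0.create("/data/out")      # both creates "succeed"
iod1.create("/data/out")
print(mgr.names)              # the file is not visible to "ls" yet
print(iod0.flush(mgr))        # first report is accepted
print(iod1.flush(mgr))        # second report conflicts after the fact
```

Both clients were told their create succeeded, and the conflict only shows up
once the metadata is finally written.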
> There is of course the need to plan the storage, dividing the file so that
> it is stored fastest. But the time spent planning is lost. So why not make a
> random choice of where to store parts of the file, perhaps based on past
> experience of bandwidth and/or available space?
Definitely using feedback to direct future file layout has potential, and
I imagine that Thomas et al. will be doing just that sort of thing.
Perhaps we'll piggyback free space/traffic information on all requests for
this purpose.
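
One way such feedback could be used is sketched below in Python. Everything
here is an assumption for illustration, not a PVFS2 design: the `ServerStats`
class, the idea that each response carries a free-space figure, and the
choice to weight servers by free bytes and skip full ones.

```python
import random


class ServerStats:
    """Client-side view of server free space, updated from responses."""

    def __init__(self, free_bytes):
        # free_bytes: dict mapping server name -> last known free bytes
        self.free_bytes = dict(free_bytes)

    def observe(self, server, free_bytes):
        """Record the free-space value piggybacked on a server response."""
        self.free_bytes[server] = free_bytes

    def pick(self, count, rng=random):
        """Choose up to `count` distinct servers, weighted by free space.

        Servers reporting no free space are never chosen.
        """
        candidates = {s: f for s, f in self.free_bytes.items() if f > 0}
        chosen = []
        while candidates and len(chosen) < count:
            servers = list(candidates)
            weights = [candidates[s] for s in servers]
            server = rng.choices(servers, weights=weights, k=1)[0]
            chosen.append(server)
            del candidates[server]
        return chosen


stats = ServerStats({"iod0": 500, "iod1": 0, "iod2": 100})
stats.observe("iod2", 250)    # a later response reports more free space
print(stats.pick(2))          # two of iod0/iod2, never the full iod1
```

The layout stays random, as it is today, but the randomness is biased by what
the client has recently heard from the servers.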
> And if someone does an ls for a dir the (main) mgr does not know? The mgr could
> then spread this request to sub-mgr nodes or even iods to gather that
> information just in time. Consistency is thereby given, because at any time
> only one node knows the information (or part of the information) about a file.
> If that information is passed to the main mgr, it is deleted on the iod or
> sub-mgr, and in the end only the main mgr knows it. So I'm not talking about
> redundant spreading of the metadata, but a "lazy load" into the main mgr node,
> so that the mgr bottleneck only affects how fast it learns things itself, not
> how fast requests are handled.
I see what you're saying, but you must realize that detecting
*nonexistent* files is an important part of everyday file system
operation. The approach you have outlined would (as I understand it)
result in this request for information from sub-mgrs on every ls and
create, which are two *very* common operations.
On the other hand, perhaps I'm completely missing something. If you could
make this scheme work, it would definitely be an interesting new approach!
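
The cost of nonexistent names under a lazy-load scheme can be seen in a small
Python model. This is a sketch of my reading of the proposal, not an
implementation of it: `SubMgr`, `MainMgr`, and the query counter are
hypothetical names introduced only to count messages.

```python
class SubMgr:
    """Toy sub-manager holding metadata the main mgr has not loaded yet."""

    def __init__(self, names):
        self.names = set(names)
        self.queries = 0  # number of lookups this node had to answer

    def take(self, name):
        """Hand the entry to the main mgr and forget it locally."""
        self.queries += 1
        if name in self.names:
            self.names.remove(name)
            return True
        return False


class MainMgr:
    """Toy main manager that lazily pulls entries from sub-managers."""

    def __init__(self, subs):
        self.subs = subs
        self.known = set()

    def exists(self, name):
        if name in self.known:
            return True           # already loaded, no fan-out needed
        for sub in self.subs:
            if sub.take(name):
                self.known.add(name)
                return True
        return False              # every sub-manager had to be asked


subs = [SubMgr(["a"]), SubMgr(["b"]), SubMgr([])]
main = MainMgr(subs)
main.exists("b")                  # found after asking two sub-managers
main.exists("b")                  # cached, no further queries
main.exists("missing")            # asks all three sub-managers
main.exists("missing")            # asks all three again
print(sum(s.queries for s in subs))
```

A name that exists is fetched once and then cached, but a name that does not
exist triggers the full fan-out on every check, which is what a create has to
do before it can proceed.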
[snip]
> My interest in PVFS comes from having had an idea for something quite
> similar: a net backup system that spreads file parts over a grid of computers
> (on the internet or only a LAN). I had opted for a P2P system that does
> it roughly this way:
[snip]
Your system sounds a lot like the Farsite project from Microsoft:
http://research.microsoft.com/sn/Farsite/
It's definitely an interesting area of work.
> It's quite a different idea from PVFS, because its main aim is to safely
> store redundant backups of files and recover them from the spread nodes only
> when you need a backup. This way I don't need much metadata; in particular, a
> file spread into the backup net this way does not need to be maintained. Also,
> the restoration of files (reading of files) is not guaranteed to be fast in
> this P2P backup net, since you need to post a search to the net, because there
> is no mgr that knows where the parts of a file are. But my main goal is not
> read performance...
How would you handle consistency management?
> Perhaps this also raises some ideas for PVFS.
>
> Finally:
> Please don't accuse me of forcing you to do some work for me. I just suggested
> some ideas, not saying this is the only way to do it or the best way. In
> particular, I wrote the second part about the P2P net backup just to inform you
> of the background for my ideas. I've spent some time thinking about distributed
> storage. If it doesn't fit into your concept, it doesn't. Feel free to make use
> of all these ideas and feel free to ask me if something isn't clear.
I certainly wouldn't accuse you of forcing us to do anything. Thanks
much for the comments. I wish you well in your P2P work.
Regards,
Rob