[Pvfs2-users] PVFS2 vs. lustre for HPC
dmlb2000 at gmail.com
Mon May 14 14:37:08 EDT 2007
On 5/14/07, Michael Sternberg <sternberg at anl.gov> wrote:
> I am designing a relatively small HPC cluster for use in materials
> modeling and experimental data analysis. It will be used in a
> research setting, so while a few target applications are known, the
> eventual set will likely extend much further. The file system use
> will be primarily for traditional "home directory" I/O, not so much
> MPI I/O, though the latter may see a larger presence in the
> application arena.
I have some experience testing both lustre and pvfs2 with wide
striping runs (>=256 nodes). I work at PNNL and we tend to have a
history with lustre but have been exploring other filesystems and will
continue to do so in the future.
> For the server side, I need failover and extensibility, which is
> where NFS falls short (in addition to its lame concurrency). Of
> course, I'm looking at lustre and PVFS2. In the past, it seemed (to
> me) that PVFS(1) wasn't quite suitable to step in for NFS, but PVFS2
> seems to do that just fine now, and it does MPI I/O well by design.
> PVFS2 also seems to have a somewhat more open development process
> than lustre, is easier to configure, and perhaps therefore does not
> need the rather expensive support contracts. The closest to an
> honest assessment seems to be a recent thread on this mailing list,
Yeah, lustre's resistance to the open-source model of development in the
past is well known, but as long as the DOE wants open source, lustre
will remain, and continues to be, an open-source project.
> ... and one "nearby":
> Are the opinions expressed therein the current consensus? Are there
> more resources available to address this question (it's got to be a
> FAQ)? Google gets one hit for lustre on pvfs.org, and that's for the
> SC05 Agenda ;-)
> On Tue Dec 12 09:01:52 PST 2006 Robert Latham robl at mcs.anl.gov wrote:
> > What Lustre has done a much better job of than we have
> > is documenting the HA process. This is one of our (PVFS) areas of
> > focus in the near-term.
> > We may not have documented the process in enough detail, but one can
> > definitely set up PVFS servers with links to shared storage and make
> > use of things like IP takeover to deliver resiliency in the face of
> > disk failure, and have had this ability for several years now (PVFS
> > users can check out 'pvfs2-ha.pdf' in our source for a starting
> > point).
> That's dated June, 2004 -- any updates? BTW, how difficult would
> it be to put that pdf directly on the web? ;-)
Well I can give you my take on both pvfs2 and lustre (current versions
of both are pvfs-2.6.3 and lustre-1.6.0).
Configuration and Management:
I'm honestly more used to lustre in this area, since most of the
large-scale filesystems in the lab run lustre. With lustre 1.6 the
configuration files went away; there's a (semi-)new server type (mgs)
that manages the file system configuration and does so on the fly as
data servers and metadata servers come online and go offline. So a
total file system failure occurs when the mgs goes down (I think;
there's probably a way to recover the mgs, but I don't know of one
yet). However, the running file system and clients continue to
operate; you just can't connect new clients.
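To make the mgs role concrete, here's a rough sketch of bringing up a lustre 1.6 filesystem the config-file-free way. The device names, hostname (mgs), and fsname (testfs) are all made up for illustration; check the lustre 1.6 manual for your setup.

```shell
# Format and start the management server (mgs) -- the piece that now
# holds the configuration that used to live in config files:
mkfs.lustre --mgs /dev/sda1
mount -t lustre /dev/sda1 /mnt/mgs

# Metadata and object storage targets register with the mgs when mounted,
# so the configuration is assembled on the fly:
mkfs.lustre --fsname=testfs --mdt --mgsnode=mgs@tcp0 /dev/sdb1
mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 /dev/sdc1

# Clients only need to know where the mgs is:
mount -t lustre mgs@tcp0:/testfs /mnt/testfs
```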
pvfs still uses config files; however (iirc), it can update them
across the cluster when changes are made. There isn't a third type of
server that manages the entire configuration of the cluster; there are
only the metadata servers and the data servers.
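For comparison, the pvfs2 config-file workflow looks roughly like this (paths are hypothetical, and depending on the version pvfs2-genconfig may also emit a per-server config file alongside fs.conf):

```shell
# Generate the filesystem config interactively; it asks for the
# metadata and data server hostnames:
pvfs2-genconfig /etc/pvfs2/fs.conf

# Copy the same config to every server, then on each one:
pvfs2-server /etc/pvfs2/fs.conf -f   # one-time: create the storage space
pvfs2-server /etc/pvfs2/fs.conf      # start the server daemon
```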
Both can perform failover.
pvfs uses userspace processes for the servers and works over openib
and infiniband networks just fine. Also, since pvfs is a userspace
process on the server side, the underlying file system is important to
tune appropriately. I found xfs to be very tunable, and it works well
as the file system under pvfs.
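As an example of the kind of tuning meant here, these are common xfs knobs for a server workload; the device, mount point, and values are illustrative, not the settings used at PNNL.

```shell
# Larger metadata log at mkfs time:
mkfs.xfs -f -l size=64m /dev/sdd1

# Skip atime updates and give the log more in-memory buffers:
mount -o noatime,logbufs=8 /dev/sdd1 /pvfs2-storage
```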
The exact opposite goes for lustre. Lustre is a kernel-space
filesystem for both servers and clients, and it has built-in support
for quadrics and infiniband. Since it's kernel-based, the file system
used on the disks is lustre-specific (their modified version of ext3).
Both operate over tcp just fine as well.
Software Architecture Differences.
pvfs is a completely userspace implementation on the server side and a
partial kernel/userspace implementation on the client side. This is
advantageous for kernel development, since the filesystem can work
without having to 'fit' the filesystem software into the server's
kernel. You can also keep a rather 'stock' distribution of linux and
not have to maintain your own kernel. From what I've noticed, pvfs
doesn't do much caching of files in its server processes; it leaves
that up to the underlying file system. The system footprint is also
rather small: since there's only one server process, it doesn't take
the server's system over, and you can renice the server process to
give more time to other things.
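Since the server is just an ordinary process, standard priority tools apply. A minimal sketch, using a stand-in process (on a real node you'd target the daemon instead, e.g. `renice 10 -p "$(pidof pvfs2-server)"`):

```shell
sleep 30 &             # stand-in for the pvfs2-server process
pid=$!

renice 10 -p "$pid"    # lower its priority so other work wins the CPU

ps -o pid,ni,comm -p "$pid"   # confirm the new nice value
kill "$pid"
```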
lustre is a completely kernel-based file system on both the client and
the server side; however, the kernel modules on the client don't need
patches to the default kernel. The lustre servers do need a patched
kernel. This is nice since you don't have as many context switches to
userspace, and the software stays relatively out of the way, so
incoming data can get to disk faster. However, lustre does do caching,
so things don't quite go directly to disk. Also, the lustre servers
are fairly dedicated to serving lustre; you can configure lustre to
spawn multiple service threads (50 is about what we run) so that
performance keeps up under load.
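The thread count is set as a module parameter on the servers; a sketch of the modprobe.conf fragment (the ost_num_threads option name is from memory, so verify it against your lustre version's documentation):

```
# /etc/modprobe.conf on the OSS nodes: spawn 50 OST service threads
options ost ost_num_threads=50
```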
When Problems Show Up.
pvfs is a userspace process; if it dies, you can just restart it with
some debugging flags and grab a core file to see what went wrong.
Also, the pvfs developers are rather quick on the draw to help.
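The restart-and-grab-a-core workflow is the ordinary userspace one; a sketch, assuming core dumps are enabled for the server's user and with a hypothetical config path (the exact debugging flags vary by pvfs2 version, so check pvfs2-server's help output):

```shell
ulimit -c unlimited              # allow core files before restarting
pvfs2-server /etc/pvfs2/fs.conf  # restart the server

# ...after a crash, load the core file into gdb for a backtrace:
gdb "$(which pvfs2-server)" core
```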
lustre is kernel-based, so if the software fails, the worst case is
that the system just freezes and you've got a kernel panic or oops
somewhere in a console log that you have to find. Also, PNNL has a
relationship with the lustre folks, so we actually get fairly good
support; we know some of them and can get help when we need it, but we
do pay for that part ;).
Honestly, they are about the same in the testing that we've done
here. However, our testing is fairly specific, and we mostly deal with
wide striping across the file system.
That's my $0.02 about it.
- David Brown