[Pvfs2-developers] PVFS2 & Hadoop

Sam Lang slang at mcs.anl.gov
Sat Nov 3 15:30:17 EST 2007


Hi Murali,

Overall, the description of HDFS doesn't seem all that compelling.   
Its targeted at certain write-once, read-only workloads, with a heavy  
emphasis on fault-tolerance and load-balancing.  It pretty much  
tosses POSIX consistency for its workloads, and assumes that a single  
client will perform the creation and writing of the file, after  
which, the file will be read-only.  So its able to solve replication  
and data caching pretty easily.  The description talks about  
supporting an append-to-file operation at some point in the future.   
Data is striped (and replicated) over multiple IO nodes, but it uses  
blocks instead of objects to manipulate stripes.  The IO nodes send  
heartbeat messages to the metadata node for failover.

All metadata is stored on a single node, introducing a single point  
of failure, and although you can replicate the metadata on that node  
to avoid metadata corruption, all file metadata operations go through  
that node.  Further, the metadata node communicates with the IO nodes  
for metadata operations (block allocations, etc.).  They assume their  
workloads won't include heavy metadata operations or small access.   
All accesses are assumed to be in multiples of 64MB.

Both HDFS and Ceph seem to focus on load balancing of data  
distribution _within_ a cluster, talking about distances between  
racks and rows of racks and etc.  I guess the clusters common in  
industry differ from the ones we're used to seeing, where we're  
trying to distribute data on pretty much all the storage hardware we  
can get our hands on to improve performance, not just get the data as  
close to the computation as possible.  I guess that has something to  
do with the IO patterns being independent rather than collective.

The metadata operations are stored in a transaction log, but it  
doesn't look like its being used to perform rollbacks on failure.   
 From the design doc:

"The Namenode keeps an image of the entire file system namespace and  
file Blockmap in memory.  This key matadata item is designed to be  
compact, such that a Namenode with 4GB of RAM is plenty to support a  
huge number of files and directories."

Sounds like something Bill Gates would say...

Cheers,
-sam


On Oct 16, 2007, at 12:32 PM, Murali Vilayannur wrote:

> Hi Folks
> Have any of you guys looked at Hadoop and HDFS?
> Hadoop is a distributed computing infrastructure with special
> map&reduce constructs similar to what Google proposed in OSDI04.
> HDFS is their backend cluster file system.
>
> http://wiki.apache.org/lucene-hadoop-data/attachments/ 
> HadoopPresentations/attachments/HDFSDescription.pdf
> Question that I have is how many HPC apps can be rewritten using the
> M&R programming model and whether it makes sense to integrate with the
> Hadoop API to get a larger sample space of apps that can run well on
> pvfs2 other than MPI based ones.?
> Any thoughts?
> If I recall, Avery or RobR  had already done some research on this
> aspect.. or maybe I
> heard from someone else..? Anyhow, it would be good to know from
> someone who know
> more about this.
> thanks,
> Murali
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>



More information about the Pvfs2-developers mailing list