[Pvfs2-users] pvfs2 2.7.0 server process failures

Sam Lang slang at mcs.anl.gov
Mon Nov 19 12:50:22 EST 2007


Hi Ian,

The log doesn't include any errors, so I have to assume the server is
crashing before writing anything to the log.  Is the server compiled with
debug symbols?  Is there a core dump on the node where the server
died?  If so, can you send it to me?  You might need to re-configure
and recompile the source with debugging symbols enabled:

make clean
CFLAGS=-g ./configure --enable-strict ....
make
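
If no core file shows up after a crash, the shell that starts the server
may have core dumps disabled.  A minimal sketch for checking that, assuming
a POSIX shell; the gdb paths in the comment are placeholders, not real
locations on your nodes:

```shell
# Report the core file size limit for the current shell.
core_limit=$(ulimit -c)
if [ "$core_limit" = "0" ]; then
    echo "core dumps disabled; run 'ulimit -c unlimited' before starting the server"
else
    echo "core dump size limit: $core_limit"
fi

# Once a core file exists, a backtrace can be pulled out with gdb
# (both paths below are hypothetical placeholders):
#   gdb -batch -ex bt /path/to/pvfs2-server /path/to/core
```

Run it from the same shell or init script that launches the server, since
the limit is inherited by child processes.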

Thanks,
-sam

On Nov 19, 2007, at 11:15 AM, Ian E. Morgan wrote:

> I have been investigating pvfs2 for use on a small 10-node cluster,
> and have been having some random failures while simply copying data
> into the filesystem.
>
> 10 servers each sharing 400GB into a 3.6TiB filesystem. Mounted on all
> 10 nodes via pvfs2 kernel module and pvfs2-client. During heavy
> writing to the FS, one instance of pvfs2server (at random) will
> typically die after anywhere from 10-30 minutes.
>
> Each node is handling both data/metadata. During an earlier config
> where only one server handled metadata, it was that one metadata
> node's server that crashed, so I suspect it's related to the metadata
> handling as opposed to the data.
>
> At the moment, my testing has simply been copying 2TiB of data into
> the PVFS2 volume, by having all 10 nodes copy their local data into
> the shared volume.
>
> These nodes have been pretty rock solid until I tried running a
> clustered filesystem. I was having all sorts of trouble with GlusterFS, so
> I have been on a hunt for something more stable.
>
> On the advice of "robl", I have enabled 'pvfs2-set-debugmask -m
> /mnt/rfd verbose'.
>
> Once a node failed again:
>
> 	<robl>	iemorgan: the last few lines should be enough :>
> 	<iemorgan>	the end of the server log for the node that failed is:
> 	<iemorgan>	[D 11/19 11:31] *** starting delayed ops if any (state is
> LIST_PROC_ALLPOSTED)
> [D 11/19 11:31] lebf_encode_rel
> [D 11/19 11:31] op_queue add: 0xb4218780
> [D 11/19 11:31] [BMI CONTROL]: BMI_set_info: set_info: 135678016 option: 6
> [D 11/19 11:31] [BMI CONTROL]: BMI_set_info: searching for ref 135678016
> [D 11/19 11:31] flowproto-multiqueue trove_write_callback_fn, error_code: 0, flow: 0x8158420.
> [D 11/19 11:31] [BMI CONTROL]: BMI_set_info: decremented ref 135678016 to: 0
> [D 11/19 11:31] DBPF I/O ops in progress: 0
> [D 11/19 11:31] flowproto completing 0x8158420
> 	<robl>	iemorgan: huh. ok, this all looks cryptic-but-normal to me.
> might be time to bring in the big guns
> (pvfs2-users at beowulf-underground.org mailing list)
>
> So I attach a good-sized chunk of the tail of the server log from the
> failed node. The log continues right up until the server process
> died.
>
> I hope someone can help narrow down the problem, then maybe we can
> fine-tune the debugmask to a more specific area of interest or
> identify/resolve the problem outright.
>
>
> -- 
> Ian Morgan
> Software Developer
> Teledyne Controls Simulation Ltd.
> 1-5480 Canotek Rd.
> Ottawa, ON  K1J 9H5
> 613-749-6980 x354
> <b9.svr.log.gz>
> _______________________________________________
> Pvfs2-users mailing list
> Pvfs2-users at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
