[PVFS-users] pvfs 1.6.3 causing possible kernel crash

Rob Ross rross at mcs.anl.gov
Tue Dec 21 08:18:41 EST 2004


Hi Shawn,

Can you send us the configure output from the pvfs-kernel build?  I'd like 
to know what options it picked.

Also, when you load the module, are you passing a "buffer=" argument?  
If you could grab the line out of dmesg that is printed at 
module load time, that would be helpful.

If this is a PVFS problem, it is most likely in the kernel code; otherwise 
you shouldn't get hangs.

I don't understand why you're getting much traffic at all across the
kernel interface with the FLASH code; I think that it is likely that
something is misconfigured there in the I/O stack.  Can you try prepending
a "pvfs:" to your file names and see if that helps?  Alternatively, you
could create an /etc/pvfstab describing the mount point and not mount the
file system at all, eliminating the pvfs-kernel piece from the equation
entirely.
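
As a minimal sketch of the "pvfs:" suggestion above: the prefix is simply prepended to the file name string before it is handed to the I/O layer (for example, as the file name in an MPI-IO open call). The helper name, the prefix list, and the example path below are all hypothetical and only for illustration; they are not part of PVFS, ROMIO, or FLASH.

```python
# Hypothetical helper (not part of PVFS, ROMIO, or FLASH): build a file name
# that carries an explicit file-system prefix such as "pvfs:".

# Assumption: the three file systems named in this thread (pvfs, ufs, nfs).
KNOWN_PREFIXES = ("pvfs:", "ufs:", "nfs:")


def with_fs_prefix(path, fs="pvfs"):
    """Return path with "<fs>:" prepended, unless it already has a known prefix."""
    if path.startswith(KNOWN_PREFIXES):
        return path  # already prefixed; leave it alone
    return fs + ":" + path


# The example path is an assumption; substitute a real file under the PVFS
# directory that the application writes to.
print(with_fs_prefix("/pvfs/flash/checkpoint_0001"))
```

The helper leaves an already-prefixed name untouched, so it is safe to apply to every output file name in one place.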

Rob


On Mon, 20 Dec 2004, Shawn Needham wrote:

> Greetings,
> 
> In the past week I have been attempting to run our FLASH application on 
> our cluster using pvfs-1.6.3.
> 
> We have been experiencing crashes on one of our compute nodes that 
> appear to be caused by software with direct access to the kernel. The 
> nodes freeze: you can no longer ping the interface for the Gb 
> switch, the terminal is blacked out and unresponsive to keyboard input, 
> and no diagnostic information is written to the log files.  The only 2 
> modules loaded on our compute nodes are gm (for Myrinet's MPICH-gm) 
> and pvfs, which we are running as a service over the Gb switch.
> 
> We have seen the crash occur with the exact same symptoms when trying to 
> run our application on a purely Gb based configuration (mpich/hdf5 p4 
> configuration and pvfs running over Gb based switch) and with the mixed 
> configuration described above. In fact, the pure Gb configuration has 
> performed much worse, and the crashes have always happened much sooner 
> in terms of the amount of progress in our runs.  Some of the Myrinet runs 
> have run to completion (though not consistently, even when repeating 
> the exact same run), but I never had any success with the Gb based runs, 
> during which Myrinet had no services running and its module was not 
> loaded.
> 
> I was wondering if you have heard of pvfs causing problems similar to 
> these. Also if one of the developers would like to have an account on 
> the machine to investigate this further that would be tremendous. I have 
> been working with developers at Myrinet for the past week and their 
> ability to log on has proved beneficial. I'm open to trying pvfs2 if the 
> kernel module is more robust than pvfs-1.6.3 but would like to at least 
> get your insights into the current problem we are having with this 
> installation.
> 
> I'm not sure if this is relevant, but on two of our earliest runs some 
> corrupted files were produced right before our runs crashed. These files 
> were written to directories in the pvfs file system.
> 
> I just installed a new kernel this afternoon, rebuilt all the 
> software, and scrapped and rebuilt all the old pvfs directories, and my 
> Myrinet based run ended up dying in the same manner as the others.
> 
> I've also appended some information about the hardware/software 
> configuration of our machine.
> 
> Thanks,
> Shawn
> 
> Here's the software/hardware information.
> 
> Hardware: 17-node (1 master/16 compute) cluster from Aspen Systems with 
> the Myrinet M3-E32 switch and a Gb switch (24-port HP ProCurve 2724, 
> J4897a).
> Nodes: E7505 chipset, dual 2.8 GHz Xeon processors, 4 GB memory. PCIXE-2 
> dual-channel Myrinet cards in 133 MHz PCI slots, Intel Corp. 
> 82545EM Gigabit Ethernet Controller. Nvidia 5950 GeForce FX Ultra.
> 
> We are running a stock 2.4.27 kernel from kernel.org. The currently 
> running kernel is 2.4.27smp-64GB-noAGPsupport. We initially installed 
> RH9 on this system and tried to take out as many of the Red Hat features 
> as possible, but system parameters are still consistent with RH values. 
> Here's the /proc/version info.
> 
> [root at ellipse0 ~]# cat /proc/version
> Linux version 2.4.27smp-64gb-noagp (root at ellipse0) (gcc version 3.2.2 
> 20030222 (Red Hat Linux 3.2.2-5)) #1 SMP Mon Dec 20 12:14:14 CST 2004
> 
> Myrinet configuration
> 
> gm-2.1.6     (linux-2.4 - ia32)
> pvfs-1.6.3   (running over Gb switch)
> mpich-1.2.6..13b-gm
>     * ifc 8.1 (Version 8.1  (l_fc_p_8.1.018))
>     * icc 8.1 (Version 8.1  (l_cc_p_8.1.021))
> hdf5-1.6.2 (built against this mpich)
> 
> Gb configuration (all services running over Gb based switch):
> 
> pvfs-1.6.3
> mpich-1.2.6-p4
>     ifc 8.1      (Version 8.1    Build 20040803Z Package ID: 
> l_fc_p_8.1.018)
>     gcc 3.2.2 (gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5))
> hdf5-1.6.2-p4 (built against mpich-1.2.6-p4)
> 
> I have configured both MPICH libraries with ROMIO support for the 
> following file systems (pvfs, ufs, nfs).
> 
> 

