[PVFS-users] pvfs 1.6.3 causing possible kernel crash

Shawn Needham shawn at flash.uchicago.edu
Mon Dec 20 22:47:16 EST 2004


Greetings,

In the past week I have been attempting to run our FLASH application on 
our cluster using pvfs-1.6.3.

We have been experiencing crashes on one of our compute nodes that 
appear to be caused by software with direct access to the kernel. The 
nodes freeze: you can no longer ping the interface on the Gb switch, 
the terminal is blacked out and unresponsive to keyboard input, 
and no diagnostic information is written to the log files.  The only 2 
modules loaded on our compute nodes are gm (for Myrinet's MPICH-gm) 
and pvfs, which we are running as a service over the Gb switch.

We have seen the crash occur with the exact same symptoms when trying to 
run our application on a purely Gb based configuration (mpich/hdf5 p4 
configuration and pvfs running over Gb based switch) and with the mixed 
configuration described above. In fact, the pure Gb configuration has 
performed much worse, and the crashes have always happened much sooner 
in terms of the amount of progress in our runs.  Some of the Myrinet runs 
have run to completion (though not consistently, even when repeating 
the exact same run), but I never had any success with the Gb based runs, 
during which Myrinet did not have any services running and its 
module was not loaded.

I was wondering if you have heard of pvfs causing problems similar to 
these. Also if one of the developers would like to have an account on 
the machine to investigate this further that would be tremendous. I have 
been working with developers at Myrinet for the past week and their 
ability to log on has proved beneficial. I'm open to trying pvfs2 if the 
kernel module is more robust than pvfs-1.6.3 but would like to at least 
get your insights into the current problem we are having with this 
installation.

I'm not sure if this is relevant, but on two of our earliest runs some 
corrupted files were produced right before our runs crashed. These files 
were written to directories in the pvfs file system.

I just installed a new kernel this afternoon, rebuilt all the 
software, and scrapped and rebuilt all the old pvfs directories, and my 
Myrinet based run ended up dying in the same manner as the others.

I've also appended some information about the hardware/software 
configuration of our machine.

Thanks,
Shawn

Here's the software/hardware information.

Hardware: 17 node (1 master/16 compute) cluster from Aspen Systems with 
the Myrinet M3-E32 switch and a Gb switch (24 port HP Procurve 2724, 
J4897a).
Nodes: e7505 chipset, dual 2.8GHz Xeon processors, 4GB memory. PCIXE-2 
dual channel Myrinet cards in 133MHz PCI slots, Intel Corp. 
82545EM Gigabit Ethernet Controller. Nvidia 5950 Geforce FX Ultra.

We are running a stock 2.4.27 kernel from kernel.org. The currently 
running kernel is a 2.4.27smp-64GB-noAGPsupport. We initially installed 
RH9 on this system and tried to take as many of the Red Hat features out 
as possible, but system parameters are still consistent with RH values. 
Here's /proc/version info.

[root@ellipse0 ~]# cat /proc/version
Linux version 2.4.27smp-64gb-noagp (root@ellipse0) (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)) #1 SMP Mon Dec 20 12:14:14 CST 2004

Myrinet configuration

gm-2.1.6     (linux-2.4 - ia32)
pvfs-1.6.3   (running over Gb switch)
mpich-1.2.6..13b-gm
    ifc 8.1 (Version 8.1  (l_fc_p_8.1.018))
    icc 8.1 (Version 8.1  (l_cc_p_8.1.021))
hdf5-1.6.2 (built against this mpich)

Gb configuration (all services running over Gb based switch):

pvfs-1.6.3
mpich-1.2.6-p4
    ifc 8.1 (Version 8.1    Build 20040803Z Package ID: l_fc_p_8.1.018)
    gcc 3.2.2 (gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5))
hdf5-1.6.2-p4 (built against mpich-1.2.6-p4)

I have configured both MPICH libraries with ROMIO file support for the 
following file systems: pvfs, ufs, nfs.



