[PVFS-users] pvfs 1.6.3 causing possible kernel crash
shawn at flash.uchicago.edu
Mon Dec 20 22:47:16 EST 2004
In the past week I have been attempting to run our FLASH application on
our cluster using pvfs-1.6.3.
We have been experiencing crashes on one of our compute nodes that
appear to be caused by software with direct access to the kernel. The
nodes are freezing: you can no longer ping the interface for the Gb
switch, the terminal is blacked out and unresponsive to keyboard input,
and no diagnostic information is written to the log files. The only two
modules loaded on our compute nodes are gm (for Myrinet's MPICH-gm)
and pvfs, which we are running as a service over the Gb switch.
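For anyone wanting to confirm the module list on a node before a run, a minimal sketch in Python that reads /proc/modules (the first field of each line is the module name). The function names and the default expected set of gm and pvfs are illustrative assumptions taken from the description above, not part of any pvfs or gm tooling.

```python
def loaded_modules(path="/proc/modules"):
    """Return the names of loaded kernel modules, in file order."""
    names = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines
                names.append(fields[0])
    return names


def unexpected_modules(path="/proc/modules", expected=("gm", "pvfs")):
    """Return loaded modules that are not in the expected set, sorted."""
    return sorted(set(loaded_modules(path)) - set(expected))
```

On a node matching the description above, unexpected_modules() should return an empty list.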
We have seen the crash occur with the exact same symptoms when trying to
run our application on a purely Gb based configuration (mpich/hdf5 p4
configuration and pvfs running over Gb based switch) and with the mixed
configuration described above. Actually the pure Gb configuration has
performed much worse, and the crashes have always happened much sooner
in terms of the amount of progress in our runs. Some of the Myrinet runs
have run to completion (though this hasn't been consistent when trying
the exact same run), but I never had any success with the Gb based runs,
during which Myrinet did not have any services running and its
module was not loaded.
I was wondering if you have heard of pvfs causing problems similar to
these. Also if one of the developers would like to have an account on
the machine to investigate this further that would be tremendous. I have
been working with developers at Myrinet for the past week and their
ability to log on has proved beneficial. I'm open to trying pvfs2 if the
kernel module is more robust than pvfs-1.6.3 but would like to at least
get your insights into the current problem we are having with this
I'm not sure if this is relevant, but on two of our earliest runs some
corrupted files were produced right before our runs crashed. These files
were written to directories in the pvfs file system.
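One way to look for this kind of corruption independently of FLASH would be a write-and-read-back test against a directory on the pvfs file system. The following is only a sketch in Python under stated assumptions: the function name, file name, and sizes are hypothetical, and a matching digest only shows that the data read back on the same node equals what was written.

```python
import hashlib
import os


def write_and_verify(directory, size_mb=4, chunk=1 << 20):
    """Write size_mb MiB of random data into directory, read it back,
    and return True if the SHA-1 digests of both passes match."""
    path = os.path.join(directory, "verify_%d.bin" % os.getpid())

    written = hashlib.sha1()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            block = os.urandom(chunk)
            written.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out before reading it back

    read_back = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            read_back.update(block)

    os.remove(path)  # clean up the scratch file
    return written.hexdigest() == read_back.hexdigest()
```

It would be called with the path of a directory on the pvfs mount, which is site specific and not given here.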
I just installed a new kernel this afternoon and rebuilt all the
software, scrapped all the old pvfs directories and rebuilt them, and my
Myrinet based run ended up dying in the same manner as the others.
I've also appended some information about the hardware/software
configuration of our machine.
Here's the software/hardware information.
Hardware: 17-node (1 master/16 compute) cluster from Aspen Systems with
the Myrinet M3-E32 switch and a Gb switch (24-port HP Procurve 2724).
Nodes: e7505 chipset, dual 2.8GHz Xeon processors, 4GB memory. PCIXE-2
dual channel Myrinet cards, cards are in 133MHz PCI slots, Intel Corp.
82545EM Gigabit Ethernet Controller. Nvidia 5950 Geforce FX Ultra.
We are running a stock 2.4.27 kernel from kernel.org. The currently
running kernel is a 2.4.27smp-64GB-noAGPsupport. We initially installed
RH9 on this system and tried to strip out as many of the RedHat features
as possible, but system parameters are still consistent with RH values.
Here's /proc/version info.
[root@ellipse0 ~]# cat /proc/version
Linux version 2.4.27smp-64gb-noagp (root@ellipse0) (gcc version 3.2.2
20030222 (Red Hat Linux 3.2.2-5)) #1 SMP Mon Dec 20 12:14:14 CST 2004
gm-2.1.6 (linux-2.4 - ia32)
pvfs-1.6.3 (running over Gb switch)
mpich-1.2.6..13b-gm
  * ifc 8.1 (Version 8.1 (l_fc_p_8.1.018))
  * icc 8.1 (Version 8.1 (l_cc_p_8.1.021))
hdf5-1.6.2 (built against this mpich)
Gb configuration (all services running over Gb based switch):
ifc 8.1 (Version 8.1 Build 20040803Z Package ID:
gcc 3.2.2 (gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5))
hdf5-1.6.2-p4 (built against mpich-1.2.6-p4)
I have configured both MPICH libraries with ROMIO file support for the
following file systems (pvfs, ufs, nfs).