[Pvfs2-users] unexplainable data corruption

Emmanuel Florac eflorac at intellique.com
Fri May 30 06:40:07 EDT 2008

On a PVFS2 cluster (2 machines, 30 TB each) I'm having data corruption
problems. I don't really understand what's happening, because it shows
up at random and not on every file. The main symptom is that the md5
checksum of a file sometimes changes from one read to the next.
Apparently it gets worse as time goes by: just after I restarted the
cluster, it hardly occurred; after a week, most files appeared as
complete garbage. It occurs more frequently on very big files (from 4
to 30 GB) than on smaller ones.

(I'm using dd to read from the filesystem because md5sum reads very
slowly from the pvfs mount.)

cluster2:/mnt/cluster/BAD# for i in 1 2 3 4 5 6 ; do dd if=ES14429 bs=1M 2>/dev/null | md5sum ; done
ca02ca0b5814bba6d8a9528d9f624c64  -
ca02ca0b5814bba6d8a9528d9f624c64  -
ca02ca0b5814bba6d8a9528d9f624c64  -
98c9d2849cadc9578cfa056fe620a070  -
ca02ca0b5814bba6d8a9528d9f624c64  -
ca02ca0b5814bba6d8a9528d9f624c64  -
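
For longer runs, the same loop can be wrapped so that it tallies how
often each checksum shows up. This is only a minimal sketch:
count_checksums is a made-up helper name (not part of PVFS2), and it
assumes GNU dd, md5sum, sort and uniq are available.

```shell
# Read a file several times and tally the distinct checksums seen.
# count_checksums is a hypothetical helper name, not a PVFS2 tool.
count_checksums() {
    file=$1
    runs=${2:-6}          # number of reads, defaults to 6
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Same read path as above: dd with 1M blocks piped into md5sum.
        dd if="$file" bs=1M 2>/dev/null | md5sum | cut -d' ' -f1
        i=$((i + 1))
    done | sort | uniq -c | sort -rn   # most frequent checksum first
}
```

Called as "count_checksums ES14429 20", it prints one line per distinct
checksum with its count; a file that reads back identically every time
gives a single line.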

As you can see, the file appears correct 4 or 5 times out of 6. It
usually cycles through 2 or 3 different checksums, so the errors are
somewhat consistent. What is going on? There isn't a single message
coming from either pvfs2 client or server, nothing in dmesg, I don't
When I'm running the same script on the other server (cluster1) the
problem is much rarer. Actually right now I keep checksumming this file
from cluster1 again and again without any error.
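
One way to narrow down where a bad read differs would be to checksum
the file in fixed-size chunks and compare the listings from two reads
(or from the two servers). The following is just a sketch under
assumptions: chunk_sums is a hypothetical helper name, the 64 MB default
chunk size is arbitrary, and it relies on GNU dd's skip= and count=
options with bs=1M.

```shell
# Print "<chunk index> <md5>" for each fixed-size chunk of a file.
# chunk_sums is a hypothetical helper name; chunk size is given in MB.
chunk_sums() {
    file=$1
    chunk_mb=${2:-64}
    size=$(($(wc -c < "$file")))                  # file size in bytes
    chunk_bytes=$((chunk_mb * 1024 * 1024))
    # Number of chunks, rounding up for a partial last chunk.
    n=$(( (size + chunk_bytes - 1) / chunk_bytes ))
    i=0
    while [ "$i" -lt "$n" ]; do
        # skip and count are in units of bs (1M), so skip i*chunk_mb blocks.
        sum=$(dd if="$file" bs=1M skip=$((i * chunk_mb)) count="$chunk_mb" \
                 2>/dev/null | md5sum | cut -d' ' -f1)
        echo "$i $sum"
        i=$((i + 1))
    done
}
```

Saving two runs (for example "chunk_sums ES14429 > run1" and again into
run2) and running diff on them would list the chunk indexes whose
content changed between the reads.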

Does anyone have any idea about what may be going on? Can it be a
network error? RAM is ECC in these machines, disks are set up in RAID-6.

Emmanuel Florac

