[Pvfs2-users] PVFS 2.6.2 intermittent cmp/diff failure

Mark Van De Vyver mvyver at gmail.com
Fri Mar 2 21:14:54 EST 2007


Hi Rob,
Thanks for taking time to look at this.

I'm still wondering if the warnings I saw when compiling PVFS 2.6.2
are related (see my earlier post on upgrading from 1.5.1 to 2.6.2).

> Are you seeing any block device errors in your dmesg/syslog output?
> Given the fsck that you had to perform previously, it is possible that
> there's a problem with the local disk or FS.

I'm not sure what lay behind those disk errors - I've changed a power
supply and haven't seen any more.  I've also completely rebuilt the
cluster since then - so that PVFS 2.6.2 was an upgrade from a fresh
install of 1.5.1.

No errors in dmesg, messages or syslog - I now see some in the
pvfs2-client.log (attached, see below for a description of what I do on
the machines during this time).

> What local FS are you using, by the way?

ext3 on RAID 0; all nodes are partitioned identically.

To elaborate/clarify my earlier post:

I take 3 DVDs that have previously been copied to the PVFS area: DVD A,
B, C. The files on these all passed the diff verification after they
were copied.

I boot the cluster - the two PVFS2 I/O compute nodes start fine,
but the frontend hangs when loading the PVFS2 client (the server loads
[ OK ]).  After a hard reset and a filesystem check, the frontend
boots fine.

Load DVD-A on to the frontend and run the script
copy-taq-dvd-monitor.sh, which calls copy-taq-dvd.sh (both attached).
Two binary files fail the verification check; they are deleted,
copied and verified - both pass.
Now all the files on DVD-A have been verified once.

Reinsert DVD-A on the frontend and rerun the script - all files pass
the verification - no files need to be deleted/recopied/verified.
Now all the files on the PVFS2 area have been _twice_ verified as the
same as the DVD.
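
A minimal sketch of the delete/copy/verify step described above. This is
not the attached copy-taq-dvd.sh (its contents are not shown here); the
function name verify_or_recopy and the status words it prints are made
up for illustration:

```shell
#!/bin/sh
# Hypothetical helper, not the attached script: verify an existing copy
# against its source, and delete/re-copy/re-verify once if it differs.
verify_or_recopy() {
    src=$1
    dst=$2
    status=COPIED
    if [ -f "$dst" ]; then
        # Existing copy: compare it byte for byte with the source.
        if cmp -s "$src" "$dst"; then
            echo "OK $dst"
            return 0
        fi
        # Mismatch: remove the bad copy before copying again.
        status=RECOPIED
        rm -f "$dst"
    fi
    cp "$src" "$dst" || return 2
    # Verify the fresh copy immediately.
    if cmp -s "$src" "$dst"; then
        echo "$status $dst"
        return 0
    fi
    echo "FAILED $dst"
    return 1
}

# Example (paths as used in this thread):
#   for f in /dvd/*; do
#       verify_or_recopy "$f" "/mnt/pvfs2/$(basename "$f")"
#   done
```

Logging OK/COPIED/RECOPIED/FAILED per file makes it easier to count how
often a previously verified file fails its first check.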

Insert DVD-B and DVD-C on the two nodes and DVD-A on the frontend, run
the same script on all three machines.
Now 3 files on DVD-A fail the initial verification.  One of these is
the same file that failed the first 'run'.
DVD-B and C have 3-4 binary files that fail.
All failed files are deleted/copied/verified and pass the verification.

Stop the script running on the two compute nodes (DVD-B, C), restart
the script on the frontend to re-process DVD-A.
The diff fails, reporting a 'broken pipe'; attempts to restart the
script result in the `cp` operation for this file hanging.
At this point I stopped.

I have 3 terminals open showing:
tail -f /var/log/messages
tail -f /var/log/dmsg
tail -f /var/log/syslog
I see no output to these files throughout this exercise.
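
Since `tail -f` only shows lines written after it starts, the existing
log contents can be searched as well. A rough sketch: the patterns below
are assumptions about what block-device and ext3 errors commonly look
like in kernel logs, not an exhaustive list, and the function name
scan_for_disk_errors is hypothetical:

```shell
#!/bin/sh
# Hypothetical filter: read log text on stdin and print only lines that
# look like block-device or ext3 errors. The patterns are assumptions.
scan_for_disk_errors() {
    grep -iE 'I/O error|EXT3-fs error|journal.*abort'
}

# Example:
#   dmesg | scan_for_disk_errors
#   scan_for_disk_errors < /var/log/messages
```

No output from the filter only means none of these patterns matched, not
that the disks are healthy.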

The pvfs2-client.log files from each machine are attached.

It seems this problem occurs when PVFS2 is under load?
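
One way to probe this without the DVD in the loop is to read the same
PVFS2 file several times and compare checksums: if the file on disk is
unchanged but the checksums differ, the read path is returning
inconsistent data. A sketch under assumptions: the function name
read_is_stable is made up, and any caching between reads could mask a
problem, so a STABLE result proves little on its own:

```shell
#!/bin/sh
# Hypothetical check: checksum the same file several times and report
# whether every pass returned identical data.
read_is_stable() {
    file=$1
    passes=${2:-3}
    first=$(cksum < "$file") || return 2
    i=1
    while [ "$i" -lt "$passes" ]; do
        next=$(cksum < "$file") || return 2
        if [ "$next" != "$first" ]; then
            echo "UNSTABLE $file (pass $((i + 1)): $next, pass 1: $first)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "STABLE $file"
}

# Example: run on each node at the same time to put PVFS2 under load.
#   read_is_stable /mnt/pvfs2/file1 5
```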

Hope this helps.
Mark

> Thanks,
>
> Rob
>
> Mark Van De Vyver wrote:
> > Thanks Steve,
> > I don't see any problem until I run the diff or cmp and even then
> > these indicate the files are identical if the cmp is run _immediately_
> > after the file copy.
> > cmp and diff only indicate a difference when a file is 'checked' after
> > some other files have been copied-checked.
> >
> > The files are from the NYSE trade and quote (TAQ) DVD's, so they are
> > text stored as binary.
> >
> > You might be able to try the following with a dozen or so large binary
> > files, I have approx 300-400GB stored in the PVFS area.
> >
> > Ideally the following should be run on two or more PVFS2 servers at
> > the same time, apply this to several DVD's that have not been copied
> > to the PVFS area, then reapply the script to the same DVD's after they
> > have been copied.
> > The following is a slightly simplified version of my script - here I
> > don't delete and re-copy when an existing file fails the cmp
> > verification:
> >
> > # untested script start
> > for src in /dvd/*large.bin
> >  do
> >      # Strip the /dvd/ prefix to get the bare file name
> >      fn=`basename "${src}"`
> >      if [ -f "/mnt/pvfs2/${fn}" ]
> >        then
> >          # This should 'fail' more frequently than the cmp in the else clause
> >          cmp "${src}" "/mnt/pvfs2/${fn}"
> >          if [ $? != 0 ]
> >            then
> >              echo "Preexisting copy not exact - more frequent and random?"
> >          fi
> >        else
> >          cp "${src}" "/mnt/pvfs2/${fn}"
> >          cmp "${src}" "/mnt/pvfs2/${fn}"
> >          if [ $? != 0 ]
> >            then
> >              echo "    Initial copy not exact - less frequent and random"
> >          fi
> >      fi
> >  done
> > # untested script end
> >
> > Regards
> > Mark
> >
> > On 3/2/07, Steve <steve at bov.nu> wrote:
> >> My setup is a little different in that at the moment I have 2 I/O services
> >> running on one box, a metadata on another and a client/samba server on a
> >> third. I have moved in the data via samba. We have copied in mp3's and
> >> avi/mpg's as well as large ISO's plus software exe's. Surely after several
> >> weeks of use we would notice some problem ?
> >>
> >> I do have another box set up as a client that happens to have a dvd ROM
> >> drive in it.
> >>
> >> What type of files ? A vob ?
> >>
> >> What sequence of commands would I need to run to test your problem ?
> >>
> >> If I get a little spare time I could try for you ?
> >>
> >> Steve
> >>
> >> -------Original Message-------
> >>
> >> From: Mark Van De Vyver
> >> Date: 02/03/2007 08:18:11
> >> To: Steve
> >> Subject: Re: [Pvfs2-users] PVFS 2.6.2 intermittent cmp/diff failure
> >>
> >> Hi Steve,
> >>
> >> > Not sure if this helps any but I have copied over 500gb of media files to
> >> > pvfs2 running on old dell's 533 to 866 CPU with very little ram running on
> >> > caos3 beta 3. Although I haven't done any checks other than using the media
> >> > I haven't noticed any problems.
> >>
> >> The failures might be spurious....?
> >>
> >> > Could you have problems with the dvd device ?
> >>
> >> I doubt it - but it may not be impossible?
> >> This happens with the DVD drives on all three nodes, and when I just
> >> have one node 'working' the diff/cmp failures either don't occur or
> >> occur very, very rarely. Start all three nodes 'working' and I see roughly
> >> 1 out of 2 binary files fail the initial diff/cmp check, but very very
> >> few (one every couple of DVD's) fail the cmp/diff check immediately
> >> after the copy is done.....
> >>
> >> Thanks
> >> Mark
> >>
> >> > -------Original Message-------
> >> >
> >> > From: Mark Van De Vyver
> >> > Date: 02/03/2007 03:26:40
> >> > To: pvfs2-users at beowulf-underground.org
> >> > Subject: [Pvfs2-users] PVFS 2.6.2 intermittent cmp/diff failure
> >> >
> >> > Hi,
> >> >
> >> > This is a follow up on an earlier email where I reported that PVFS
> >> > 1.5.1 failed to copy binary files from several DVD's.
> >> >
> >> > I'm running a 3 node Rocks 4.2.1 Cluster, CentOS4.4, x86_64, nodes are
> >> > connected via an unmanaged switch.
> >> >
> >> > I have reinstalled the Rocks Cluster (all nodes), including the PVFS2 Roll.
> >> > The cluster is set up with the frontend as the metadata server and the
> >> > other two nodes are PVFS2 I/O servers and clients. The /mnt/pvfs2
> >> > area is on a 3 disk RAID 0 partition formatted as ext3.
> >> > After installing I ran the test steps in the "PVFS2 Quick Start
> >> > Guide". The test steps ran without error.
> >> > I upgraded to PVFS 2.6.2 on all nodes and re-ran the test steps, again
> >> > no errors or problems.
> >> >
> >> > I build PVFS 2.6.2 with the following:
> >> >
> >> > ./configure --with-kernel=</path/to/kernel26/>
> >> > --enable-kernel-sendfile --prefix=/usr/local/pvfs2/
> >> > then type
> >> > make all
> >> > make kmod_install
> >> > make install
> >> >
> >> > On each node I have a script that lists the files on the DVD disc
> >> > loaded on that node.
> >> > Each file is copied if it does not exist on the HDD (PVFS area) and
> >> > the copy is immediately verified:
> >> >
> >> > cp /dvd/file1 /mnt/pvfs2/file1
> >> > cmp /dvd/file1 /mnt/pvfs2/file1
> >> >
> >> > `cmp` does not report any error.
> >> > This has been done for 60-70 DVDs.
> >> >
> >> > If I insert a DVD that has previously been copied my script finds that
> >> > a file exists in the PVFS area and does a `cmp` with the DVD file; if
> >> > the file fails this comparison the file is deleted, copied, verified
> >> > (cmp).
> >> >
> >> > I notice that frequently and randomly the previously copied files will
> >> > fail the _initial_ `cmp` check if more than one node is 'active', i.e.
> >> > processing a DVD.
> >> > Once deleted and copied the second `cmp` check is passed.
> >> >
> >> > Some details:
> >> > The files do not fail the `cmp` check immediately after being copied -
> >> > only when checking a previously copied file.
> >> > The `cmp` result indicates a different byte at which the files differ.
> >> > Re-inserting the same dvd several times results in different files
> >> > failing the first `cmp` check.
> >> > The second check (immediately after the copy is finished) is always passed.
> >> > This occurs rarely, if at all (i.e. I haven't noticed it), when only
> >> > one node is processing a DVD.
> >> > This only occurs with binary files - which are relatively large, 200MB - 2 GB.
> >> > This never occurs with text files - which are also small, 100's of KB.
> >> > The pvfs2-client.log file is empty on each node.
> >> > I have tried using diff and experience the same results.
> >> >
> >> >
> >>
> >> > This is similar to an error I was seeing in PVFS 1.5.1 - hence the
> >>
> >> >
> >>
> >> > Upgrade. I've also changed my previous script which `dd` copied the
> >>
> >> >
> >>
> >> > DVD to memory (approx 8GB), then wrote this ISO file to the PVFS2 area
> >>
> >> >
> >>
> >> > - this worked fine for initial copies, but failed for re-copies. At
> >>
> >> >
> >>
> >> > That time I wasn't verifiying the copy, so it was the copy to the
> >>
> >> >
> >>
> >> > PVFS2 area that failed.....
> >>
> >> >
> >>
> >> >
> >>
> >> >
> >>
> >> > Finally, on one occasion when manually running `cmp` on a file I
> >>
> >> >
> >>
> >> > Noticed the following sequence.
> >>
> >> >
> >>
> >> > Cmp file1 file2 (pass)
> >>
> >> >
> >>
> >> > Cmp file1 file2 (pass)
> >>
> >> >
> >>
> >> > Difffile1 file2 (fail)
> >>
> >> >
> >>
> >> > Cmp file1 file2 (fail)
> >>
> >> >
> >>
> >> >
> >>
> >> >
> >>
> >> > Is this known behavior with a known workaround/configuration setting?
> >> > The behavior I see made me guess a caching or network issue (there are
> >> > no other machines on the cluster network).
> >> > Can anyone suggest PVFS configuration settings that will make PVFS more
> >> > robust?
> >> >
> >> > I'm not a programmer or linux guru - I just spent this summer
> >> > converting from winxp...
> >> > I'm happy to explore some possible fixes, but don't assume too much :)
> >> >
> >> > Thanks in advance
> >> > Mark
> >> >
> >> > _______________________________________________
> >> > Pvfs2-users mailing list
> >> > Pvfs2-users at beowulf-underground.org
> >> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> >> >
> >>
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: copy-taq-dvd-monitor.sh
Type: application/x-sh
Size: 818 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070303/66f6dfd8/copy-taq-dvd-monitor-0001.sh
-------------- next part --------------
A non-text attachment was scrubbed...
Name: copy-taq-dvd.sh
Type: application/x-sh
Size: 1866 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070303/66f6dfd8/copy-taq-dvd-0001.sh
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pvfs2-client.log.frontend
Type: application/octet-stream
Size: 7879 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070303/66f6dfd8/pvfs2-client.log-0003.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pvfs2-client.log.compute-0-0
Type: application/octet-stream
Size: 1626 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070303/66f6dfd8/pvfs2-client.log-0004.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pvfs2-client.log.compute-0-1
Type: application/octet-stream
Size: 1626 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-users/attachments/20070303/66f6dfd8/pvfs2-client.log-0005.obj

