[PVFS-users] pvfs mgr crash
Franco M. Bladilo
bladilo at rice.edu
Fri Jul 1 13:17:58 EDT 2005
Rob,
I recreated it from scratch:
1. Killed all iods and mgr.
2. Deleted /pvfs-meta on io1 and /pvfs-data on all io nodes.
3. Re-mounted /pvfs-meta on the new drive, cd'ed into it and ran
mkmgrconf.
4. Restarted iods and mgr.
I've been creating some files with dd, but here's another example:
[root@n11 shared.scratch]# mount | grep pvfs
io1:/pvfs-meta on /shared.scratch type pvfs (rw)
[root@n11 /]# cd /shared.scratch/
[root@n11 shared.scratch]# echo "test file" > testfile2
[root@n11 shared.scratch]# cat testfile2
test file
and in the mgr log I see this:
[W 07/01 12:15] (md_stat.c,58) md_stat: lstat on /pvfs-meta/testfile2
failed
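
To see which paths these warnings cover and how often each one repeats, something like the following could be run against the mgr log. It is only a sketch: it assumes each entry occupies a single line in the log file itself (unlike the wrapped copies pasted in this thread), and the function name lstat_failures is made up for illustration.

```sh
# Print every path named in an "lstat on ... failed" warning, with a count.
# Assumption: one log entry per line, in the format shown in this message.
lstat_failures() {
    # $1: path to the mgr log file
    sed -n 's/^.*md_stat: lstat on \(.*\) failed$/\1/p' "$1" | sort | uniq -c
}
```

Called as `lstat_failures /tmp/mgrlog.oGrxlw` on io1, it would print one line per distinct path, which makes it easier to compare against what `ls -l` shows in /pvfs-meta.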
Thanks
Franco.
Rob Ross wrote:
> Hi Franco,
>
> Did you just recreate from scratch (good), or did you copy things over
> (bad)?
>
> How are you creating the files?
>
> Thanks,
>
> Rob
>
> Franco M. Bladilo wrote:
>
>> Rob,
>> I finally re-created the pvfs partition; this time I dedicated a
>> 36GB SCSI drive only to /pvfs-meta, but now I'm seeing these messages
>> in the mgr log for every new file that is created:
>>
>> [root@io1 tmp]# tail -f mgrlog.oGrxlw
>> [W 07/01 05:15] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/uddin/41886.management/Gau-12975.int failed
>> [W 07/01 05:15] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/uddin/41886.management/Gau-12975.int failed
>> [W 07/01 05:15] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/uddin/41886.management/Gau-12975.d2e failed
>> [W 07/01 05:15] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/uddin/41886.management/Gau-12975.d2e failed
>> [W 07/01 06:40] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/epilogue_41894.management.log failed
>> [W 07/01 10:36] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/epilogue_41896.management.log failed
>> [W 07/01 10:54] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/epilogue_41897.management.log failed
>> [W 07/01 11:27] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/epilogue_41898.management.log failed
>> [W 07/01 11:32] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/file-24GB failed
>> [W 07/01 11:33] (md_stat.c,58) md_stat: lstat on /pvfs-meta/dmesg
>> failed
>> [W 07/01 11:35] (md_stat.c,58) md_stat: lstat on
>> /pvfs-meta/epilogue_41899.management.log failed
>>
>> The filesystem seems to be working fine, but are these messages
>> something to be alarmed about?
>>
>> Here's the pvfs dirs information:
>>
>> [root@io1 /]# ls -ld pvfs-meta/
>> drwxr-xr-x 5 root root 4096 Jul 1 11:35 pvfs-meta/
>> [root@io1 /]# ls -ld pvfs-data/
>> drwx------ 103 nobody nobody 12288 Jun 29 11:04 pvfs-data/
>> [root@io1 /]# mount | grep pvfs
>> /dev/sdb1 on /pvfs-meta type ext3 (rw)
>> /dev/md0 on /pvfs-data type ext3 (rw)
>> io1:/pvfs-meta on /shared.scratch/ type pvfs (rw)
>>
>> Thanks
>>
>> Franco.
>>
>> Rob Ross wrote:
>>
>>> Hi Franco,
>>>
>>> Well, that's an easy thing for me to suggest, but I don't know how
>>> much data you have on there.
>>>
>>> What I believe has happened is that someone attempted to create new
>>> files on the file system while the local metadata file system was
>>> full. The mgr was unable to store the metadata because there was no
>>> space, but it didn't remove the metadata files as it should have
>>> (that is a bug). So there are some partial files that never really
>>> got data into them sitting out there causing trouble.
>>>
>>> Another option for cleanup would be to remove all the zero-length
>>> files in the metadata directory (while the mgr is not running).
>>> Then delete all the saved data files off the iods; you'll want to
>>> check for them off and on for the next couple of weeks as more will
>>> be created.
>>>
>>> That should clean up the file system for the most part. It's
>>> possible that you'll have some bunk directories for the same reason,
>>> which are a little more difficult to clean up.
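
The zero-length check described above could be sketched roughly as follows. This is a read-only listing, not a cleanup script: it removes nothing, and the function name list_empty_meta_files is hypothetical. It assumes the metadata directory is an ordinary local directory such as /pvfs-meta.

```sh
# List zero-length regular files under a metadata directory.
# Read-only: nothing is deleted. Review the output by hand, and stop
# the mgr before removing any of the listed files.
list_empty_meta_files() {
    # $1: metadata directory, e.g. /pvfs-meta
    find "$1" -type f -size 0 -print
}
```

For example, `list_empty_meta_files /pvfs-meta > /tmp/empty-meta.txt` would save the candidate list for inspection before anything is touched.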
>>>
>>> Let me know if we can help more,
>>>
>>> Rob
>>>
>>> Franco M. Bladilo wrote:
>>>
>>>> Rob,
>>>> It's possible that the / partition may have temporarily filled up;
>>>> we can't tell for sure since we weren't monitoring free space on
>>>> this machine.
>>>> At this point, would you recommend re-creating the pvfs partition
>>>> and starting from scratch?
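
Since free space was not being monitored, a small periodic check of the file system holding the metadata might help catch this next time. The following is a minimal sketch under assumptions: the function names and the default 90% threshold are arbitrary choices for illustration, and it relies on the portable `df -P` output format, where the fifth column of the second line is the capacity.

```sh
# Print the use percentage (integer, no % sign) of the file system holding $1.
fs_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a warning line when usage is at or above a threshold.
# $1: a path on the file system to check
# $2: threshold in percent (defaults to 90, an arbitrary value)
warn_if_full() {
    used=$(fs_use_percent "$1")
    [ -n "$used" ] || return 1      # df produced no usable output
    if [ "$used" -ge "${2:-90}" ]; then
        echo "WARNING: $1 is ${used}% full"
    fi
}
```

Run from cron as, say, `warn_if_full /pvfs-meta 90`, it stays silent until the threshold is reached. Mount points containing spaces would shift the columns and are not handled here.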
>>>>
>>>> Thanks
>>>> Franco.
>>>>
>>>> Rob Ross wrote:
>>>>
>>>>> Hi Franco,
>>>>>
>>>>> Ok, well, the file system running out of space might have done
>>>>> it. Is it possible that the filesystem used for metadata was out
>>>>> of space for some period of time?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Rob
>>>>>
>>>>> Franco M. Bladilo wrote:
>>>>>
>>>>>> Rob,
>>>>>> All the affected files are empty (zero size), for example:
>>>>>>
>>>>>> -rwxr--r-- 1 root root 0 Jun 27 06:37
>>>>>> epilogue_41172.management.log
>>>>>> -rwxr--r-- 1 root root 0 Jun 26 23:36
>>>>>> epilogue_41260.management.log
>>>>>>
>>>>>> The filesystem is ext3; it resides on a local SCSI disk and is
>>>>>> mounted as /. There are no suspicious kernel messages and the
>>>>>> hard drive looks healthy; it passes all smartctl tests.
>>>>>>
>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>> /dev/sda6 980M 516M 415M 56% /
>>>>>>
>>>>>> The only issue I recall, from weeks ago, was that the pvfs fs became
>>>>>> full and some pvfs clients crashed; I restarted pvfsd on the
>>>>>> affected nodes and the problem cleared.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Franco.
>>>>>>
>>>>>>
>>>>>> Rob Ross wrote:
>>>>>>
>>>>>>> Hi Franco,
>>>>>>>
>>>>>>> First, thanks for the very thorough bug report.
>>>>>>>
>>>>>>> Typically those error messages mean exactly what they say: that
>>>>>>> someone has been messing with the metadata files. It could also
>>>>>>> indicate something is amiss with that local file system.
>>>>>>>
>>>>>>> It would be helpful for you to "ls" a few of those files and see
>>>>>>> what size they are. The mgr uses a very simple check, and it is
>>>>>>> unlikely that it is incorrectly reporting that the files are
>>>>>>> somehow messed up.
>>>>>>>
>>>>>>> If there is any data at all in the files, it would be helpful to
>>>>>>> see what it is.
>>>>>>>
>>>>>>> It might be a good idea to run a fsck on the file system holding
>>>>>>> your metadata. Something seems wrong there. Is that just a
>>>>>>> single, local disk?
>>>>>>>
>>>>>>> The iods are behaving as they should I think. Let's concentrate
>>>>>>> on that mgr for now.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> Franco M. Bladilo wrote:
>>>>>>>
>>>>>>>> After working flawlessly for almost 3 months, we are having
>>>>>>>> problems with the pvfs mgr. This is the output in mgr.log:
>>>>>>>>
>>>>>>>> [I 06/27 16:17] ----- Log Level Changing -----
>>>>>>>> [I 06/27 16:17] Current Logging Level includes :
>>>>>>>> [I 06/27 16:17] New Logging Level includes : CRITICAL WARNING
>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct
>>>>>>>> size. This is usually due to running a newer mgr on an old
>>>>>>>> PVFS file system or someone mucking with the files in the
>>>>>>>> metadata directory (which they should not do). Aborting!
>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>> errno : [22]
>>>>>>>> errno msg : [Invalid argument]
>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>>>>>>>> /pvfs-meta/epilogue_41260.management.log is not the correct
>>>>>>>> size. This is usually due to running a newer mgr on an old
>>>>>>>> PVFS file system or someone mucking with the files in the
>>>>>>>> metadata directory (which they should not do). Aborting!
>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>> errno : [22]
>>>>>>>> errno msg : [Invalid argument]
>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct
>>>>>>>> size. This is usually due to running a newer mgr on an old
>>>>>>>> PVFS file system or someone mucking with the files in the
>>>>>>>> metadata directory (which they should not do). Aborting!
>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>> errno : [22]
>>>>>>>> errno msg : [Invalid argument]
>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>>>>>>>> /pvfs-meta/epilogue_41260.management.log is not the correct
>>>>>>>> size. This is usually due to running a newer mgr on an old
>>>>>>>> PVFS file system or someone mucking with the files in the
>>>>>>>> metadata directory (which they should not do). Aborting!
>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>> errno : [22]
>>>>>>>> errno msg : [Invalid argument]
>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct
>>>>>>>> size. This is usually due to running a newer mgr on an old
>>>>>>>> PVFS file system or someone mucking with the files in the
>>>>>>>> metadata directory (which they should not do). Aborting!
>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>> errno : [22]
>>>>>>>> errno msg : [Invalid argument]
>>>>>>>> ...
>>>>>>>> It continues with hundreds of file entries until it reaches
>>>>>>>> this point and crashes:
>>>>>>>>
>>>>>>>> [C 06/28 04:06] (metaio.c,88) meta_open: Metadata file
>>>>>>>> /pvfs-meta/dtabakov/50-50-LF-c200-r0.2-2.5-by-0.1-f-0.1-1.0-by-0.1-initAcpt-crap/new-s-50-r-2.30-f-0.20--120-of-200
>>>>>>>> is not the correct size. This is usually due to running a
>>>>>>>> newer mgr on an old PVFS file system or someone mucking with
>>>>>>>> the files in the metadata directory (which they should not
>>>>>>>> do). Aborting!
>>>>>>>> [C 06/28 04:06] (md_stat.c,118) md_stat, meta_open
>>>>>>>> errno : [22]
>>>>>>>> errno msg : [Invalid argument]
>>>>>>>> [C 06/28 04:06] (mgr.c,2576) Received signal=[11]
>>>>>>>> [C 06/28 04:06] (mgr.c,2578)
>>>>>>>> OPEN FILES:
>>>>>>>> [C 06/28 04:06] (mgr.c,2585) Current working directory: [/]
>>>>>>>> [C 06/28 04:06] (mgr.c,2587) pid: [30259]
>>>>>>>> [C 06/28 04:06] (mgr.c,2594) rlim_cur (RLIMIT_CORE): [0]
>>>>>>>> [C 06/28 04:06] (mgr.c,2595) rlim_max (RLIMIT_CORE): [-1]
>>>>>>>>
>>>>>>>> After restarting the mgr, any operation on the pvfs-mounted
>>>>>>>> filesystem complains about corrupted/nonexistent files:
>>>>>>>>
>>>>>>>> [root@io1 shared.scratch]# ls -la
>>>>>>>> ls: epilogue_41172.management.log: Invalid argument
>>>>>>>> ls: epilogue_41260.management.log: Invalid argument
>>>>>>>> total 308521
>>>>>>>> drwxrwxrwx 1 root root 20480 Jun 28 10:13 .
>>>>>>>> drwxr-xr-x 24 root root 4096 May 16 14:05 ..
>>>>>>>> drwxr-xr-x 1 juanp scisim 4096 Jun 10 02:50
>>>>>>>> 40045.management
>>>>>>>> drwxr-xr-x 1 juanp scisim 4096 Jun 11 06:43
>>>>>>>> 40046.management
>>>>>>>>
>>>>>>>> Here's the log on the iods when the crash happened:
>>>>>>>>
>>>>>>>> [root@io1 tmp]# cat iolog.0Vr60K
>>>>>>>>
>>>>>>>> [I 06/27 16:18] ----- Log Level Changing -----
>>>>>>>> [I 06/27 16:18] Current Logging Level includes :
>>>>>>>> [I 06/27 16:18] New Logging Level includes : CRITICAL WARNING
>>>>>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>>>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>>>>>> [W 06/27 16:31] (iod.c,697) open: 064/f49958.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 17:25] (iod.c,697) open: 066/f49960.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 17:50] (iod.c,697) open: 067/f49961.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 18:15] (iod.c,697) open: 069/f49963.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 18:36] (iod.c,697) open: 070/f49964.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 18:41] (iod.c,697) open: 072/f49966.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 18:56] (iod.c,697) open: 073/f49967.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 19:01] (iod.c,697) open: 074/f49968.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/27 19:05] (iod.c,697) open: 076/f49970.0 exists (flags =
>>>>>>>> c2); saving
>>>>>>>> [W 06/28 04:06] (iod.c,289) socket=[5] hung up
>>>>>>>>
>>>>>>>> There were no hardware failures, and all clients, iods and the mgr
>>>>>>>> run the same pvfs version (1.6.3) on ia64-based systems.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
--
Franco Bladilo
Linux/HPCC Administrator
Research Computing Group
Rice University
bladilo at rice.edu