[PVFS-users] pvfs mgr crash

Rob Ross rross at mcs.anl.gov
Fri Jul 1 15:39:37 EDT 2005


I'm guessing those messages are due to the shell trying to see if the 
file exists (to remove it) before starting the redirection.  You should 
ignore them.
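
A minimal sketch of that guess, in Python rather than shell or PVFS code. The helper name `stat_then_create` is made up for illustration, and it is an assumption that the mgr logs its warning whenever an lstat arrives for a metadata file that does not exist yet. The point is only that a failed probe followed by a successful create is normal:

```python
import os
import tempfile


def stat_then_create(path):
    """Probe a path with lstat, then create/overwrite it, as a shell might for '>'.

    Returns True if the probe found an existing file, False otherwise.
    """
    try:
        os.lstat(path)
        existed = True
    except FileNotFoundError:
        # Harmless case: the file simply is not there yet.
        existed = False
    with open(path, "w") as f:
        f.write("test file\n")
    return existed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "testfile2")
        print(stat_then_create(target))  # probe before the file exists
        print(stat_then_create(target))  # probe after it has been created
```

The first probe fails and the file is still written correctly, which matches the `echo`/`cat` transcript quoted below.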

Regards,

Rob

Franco M. Bladilo wrote:
> Rob,
> I recreated it from scratch:
> 1. Killed all iods and mgr.
> 2. Deleted /pvfs-meta on io1 and /pvfs-data on all io nodes.
> 3. Re-mounted /pvfs-meta on the new drive, cd'ed into it, and ran 
> mkmgrconf.
> 4. Restarted iods and mgr.
> 
> I've been creating some files with dd, but here's another example:
> [root@n11 shared.scratch]# mount | grep pvfs
> io1:/pvfs-meta on /shared.scratch type pvfs (rw)
> [root@n11 /]# cd /shared.scratch/
> [root@n11 shared.scratch]# echo "test file" > testfile2
> [root@n11 shared.scratch]# cat testfile2
> test file
> 
> and in the mgr log I see this:
> [W 07/01 12:15] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/testfile2 failed
> 
> Thanks
> 
> Franco.
> 
> 
> Rob Ross wrote:
> 
>> Hi Franco,
>>
>> Did you just recreate from scratch (good), or did you copy things over 
>> (bad)?
>>
>> How are you creating the files?
>>
>> Thanks,
>>
>> Rob
>>
>> Franco M. Bladilo wrote:
>>
>>> Rob,
>>> I finally re-created the pvfs partition. This time I dedicated a 
>>> 36GB SCSI drive to /pvfs-meta, but now I'm seeing these messages 
>>> in the mgr log for every new file that is created:
>>>
>>> [root@io1 tmp]# tail -f mgrlog.oGrxlw
>>> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/uddin/41886.management/Gau-12975.int failed
>>> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/uddin/41886.management/Gau-12975.int failed
>>> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/uddin/41886.management/Gau-12975.d2e failed
>>> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/uddin/41886.management/Gau-12975.d2e failed
>>> [W 07/01 06:40] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/epilogue_41894.management.log failed
>>> [W 07/01 10:36] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/epilogue_41896.management.log failed
>>> [W 07/01 10:54] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/epilogue_41897.management.log failed
>>> [W 07/01 11:27] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/epilogue_41898.management.log failed
>>> [W 07/01 11:32] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/file-24GB failed
>>> [W 07/01 11:33] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/dmesg failed
>>> [W 07/01 11:35] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/epilogue_41899.management.log failed
>>>
>>> The filesystem seems to be working fine, but are these messages 
>>> something to be alarmed about?
>>>
>>> Here's the pvfs dirs information:
>>>
>>> [root@io1 /]# ls -ld pvfs-meta/
>>> drwxr-xr-x    5 root     root         4096 Jul  1 11:35 pvfs-meta/
>>> [root@io1 /]# ls -ld pvfs-data/
>>> drwx------  103 nobody   nobody      12288 Jun 29 11:04 pvfs-data/
>>> [root@io1 /]# mount | grep pvfs
>>> /dev/sdb1 on /pvfs-meta type ext3 (rw)
>>> /dev/md0 on /pvfs-data type ext3 (rw)
>>> io1:/pvfs-meta on /shared.scratch/ type pvfs (rw)
>>>
>>> Thanks
>>>
>>> Franco.
>>>
>>> Rob Ross wrote:
>>>
>>>> Hi Franco,
>>>>
>>>> Well, that's an easy thing for me to suggest, but I don't know how 
>>>> much data you have on there.
>>>>
>>>> What I believe has happened is that someone attempted to create new 
>>>> files on the file system while the local metadata file system was 
>>>> full.  The mgr was unable to store the metadata because there was no 
>>>> space, but it didn't remove the metadata files as it should have 
>>>> (that is a bug).  So there are some partial files that never really 
>>>> got data into them sitting out there causing trouble.
>>>>
>>>> Another option for cleanup would be to remove all the zero-length 
>>>> files in the metadata directory (while the mgr is not running).  
>>>> Then delete all the saved data files off the iods; you'll want to 
>>>> check for them off and on for the next couple of weeks as more will 
>>>> be created.
>>>>
>>>> That should clean up the file system for the most part.  It's 
>>>> possible that you'll have some bunk directories for the same reason, 
>>>> which are a little more difficult to clean up.
>>>>
>>>> Let me know if we can help more,
>>>>
>>>> Rob
>>>>
>>>> Franco M. Bladilo wrote:
>>>>
>>>>> Rob,
>>>>> It's possible that the / partition may have temporarily filled up; 
>>>>> we can't tell for sure since we weren't monitoring free space on 
>>>>> this machine.
>>>>> At this point, would you recommend re-creating the pvfs partition 
>>>>> and starting from scratch?
>>>>>
>>>>> Thanks
>>>>> Franco.
>>>>>
>>>>> Rob Ross wrote:
>>>>>
>>>>>> Hi Franco,
>>>>>>
>>>>>> Ok, well, the file system running out of space might have done 
>>>>>> it.  Is it possible that the filesystem used for metadata was out 
>>>>>> of space for some period of time?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Rob
>>>>>>
>>>>>> Franco M. Bladilo wrote:
>>>>>>
>>>>>>> Rob,
>>>>>>> All the affected files are empty (zero size), for example:
>>>>>>>
>>>>>>> -rwxr--r--    1 root     root            0 Jun 27 06:37 epilogue_41172.management.log
>>>>>>> -rwxr--r--    1 root     root            0 Jun 26 23:36 epilogue_41260.management.log
>>>>>>>
>>>>>>> The filesystem is ext3; it resides on a local SCSI disk and is 
>>>>>>> mounted as /. There are no suspicious kernel messages and the 
>>>>>>> hard drive looks healthy; it passes all smartctl tests.
>>>>>>>
>>>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>>>> /dev/sda6             980M  516M  415M  56% /
>>>>>>>
>>>>>>> The only issue I recall was weeks ago, when the pvfs fs became 
>>>>>>> full and some pvfs clients crashed. I restarted pvfsd on the 
>>>>>>> affected nodes and the problem cleared.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Franco.
>>>>>>>
>>>>>>>
>>>>>>> Rob Ross wrote:
>>>>>>>
>>>>>>>> Hi Franco,
>>>>>>>>
>>>>>>>> First, thanks for the very thorough bug report.
>>>>>>>>
>>>>>>>> Typically those error messages mean exactly what they say: that 
>>>>>>>> someone has been messing with the metadata files.  It could also 
>>>>>>>> indicate something is amiss with that local file system.
>>>>>>>>
>>>>>>>> It would be helpful for you to "ls" a few of those files and see 
>>>>>>>> what size they are.  The mgr uses a very simple check, and it is 
>>>>>>>> unlikely that it is incorrectly reporting that the files are 
>>>>>>>> somehow messed up.
>>>>>>>>
>>>>>>>> If there is any data at all in the files, it would be helpful to 
>>>>>>>> see what it is.
>>>>>>>>
>>>>>>>> It might be a good idea to run a fsck on the file system holding 
>>>>>>>> your metadata.  Something seems wrong there.  Is that just a 
>>>>>>>> single, local disk?
>>>>>>>>
>>>>>>>> The iods are behaving as they should I think.  Let's concentrate 
>>>>>>>> on that mgr for now.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Rob
>>>>>>>>
>>>>>>>> Franco M. Bladilo wrote:
>>>>>>>>
>>>>>>>>> After working flawlessly for almost 3 months, we are having 
>>>>>>>>> problems with the pvfs mgr. This is the output in mgr.log:
>>>>>>>>>
>>>>>>>>> [I 06/27 16:17] ----- Log Level Changing -----
>>>>>>>>> [I 06/27 16:17] Current Logging Level includes :
>>>>>>>>> [I 06/27 16:17] New     Logging Level includes : CRITICAL  WARNING
>>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct 
>>>>>>>>> size.  This is usually due to running a newer mgr on an old 
>>>>>>>>> PVFS file system or someone mucking with the files in the 
>>>>>>>>> metadata directory (which they should not do).  Aborting!
>>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>>>        errno     : [22]
>>>>>>>>>        errno msg : [Invalid argument]
>>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>>>> /pvfs-meta/epilogue_41260.management.log is not the correct 
>>>>>>>>> size.  This is usually due to running a newer mgr on an old 
>>>>>>>>> PVFS file system or someone mucking with the files in the 
>>>>>>>>> metadata directory (which they should not do).  Aborting!
>>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>>>        errno     : [22]
>>>>>>>>>        errno msg : [Invalid argument]
>>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct 
>>>>>>>>> size.  This is usually due to running a newer mgr on an old 
>>>>>>>>> PVFS file system or someone mucking with the files in the 
>>>>>>>>> metadata directory (which they should not do).  Aborting!
>>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>>>        errno     : [22]
>>>>>>>>>        errno msg : [Invalid argument]
>>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>>>> /pvfs-meta/epilogue_41260.management.log is not the correct 
>>>>>>>>> size.  This is usually due to running a newer mgr on an old 
>>>>>>>>> PVFS file system or someone mucking with the files in the 
>>>>>>>>> metadata directory (which they should not do).  Aborting!
>>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>>>        errno     : [22]
>>>>>>>>>        errno msg : [Invalid argument]
>>>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct 
>>>>>>>>> size.  This is usually due to running a newer mgr on an old 
>>>>>>>>> PVFS file system or someone mucking with the files in the 
>>>>>>>>> metadata directory (which they should not do).  Aborting!
>>>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>>>        errno     : [22]
>>>>>>>>>        errno msg : [Invalid argument]
>>>>>>>>> ...
>>>>>>>>> It continues with hundreds of file entries until it reaches 
>>>>>>>>> this point and crashes:
>>>>>>>>>
>>>>>>>>> [C 06/28 04:06] (metaio.c,88) meta_open: Metadata file 
>>>>>>>>> /pvfs-meta/dtabakov/50-50-LF-c200-r0.2-2.5-by-0.1-f-0.1-1.0-by-0.1-initAcpt-crap/new-s-50-r-2.30-f-0.20--120-of-200 
>>>>>>>>> is not the correct size.  This is usually due to running a 
>>>>>>>>> newer mgr on an old PVFS file system or someone mucking with 
>>>>>>>>> the files in the metadata directory (which they should not 
>>>>>>>>> do).  Aborting!
>>>>>>>>> [C 06/28 04:06] (md_stat.c,118) md_stat, meta_open
>>>>>>>>>        errno     : [22]
>>>>>>>>>        errno msg : [Invalid argument]
>>>>>>>>> [C 06/28 04:06] (mgr.c,2576) Received signal=[11]
>>>>>>>>> [C 06/28 04:06] (mgr.c,2578)
>>>>>>>>> OPEN FILES:
>>>>>>>>> [C 06/28 04:06] (mgr.c,2585) Current working directory: [/]
>>>>>>>>> [C 06/28 04:06] (mgr.c,2587) pid: [30259]
>>>>>>>>> [C 06/28 04:06] (mgr.c,2594) rlim_cur (RLIMIT_CORE): [0]
>>>>>>>>> [C 06/28 04:06] (mgr.c,2595) rlim_max (RLIMIT_CORE): [-1]
>>>>>>>>>
>>>>>>>>> After restarting the mgr, any operation on the pvfs-mounted 
>>>>>>>>> filesystem complains about corrupted/nonexistent files:
>>>>>>>>>
>>>>>>>>> [root@io1 shared.scratch]# ls -la
>>>>>>>>> ls: epilogue_41172.management.log: Invalid argument
>>>>>>>>> ls: epilogue_41260.management.log: Invalid argument
>>>>>>>>> total 308521
>>>>>>>>> drwxrwxrwx    1 root     root        20480 Jun 28 10:13 .
>>>>>>>>> drwxr-xr-x   24 root     root         4096 May 16 14:05 ..
>>>>>>>>> drwxr-xr-x    1 juanp    scisim       4096 Jun 10 02:50 40045.management
>>>>>>>>> drwxr-xr-x    1 juanp    scisim       4096 Jun 11 06:43 40046.management
>>>>>>>>>
>>>>>>>>> Here's the iod log from when the crash happened:
>>>>>>>>>
>>>>>>>>> [root@io1 tmp]# cat iolog.0Vr60K
>>>>>>>>>
>>>>>>>>> [I 06/27 16:18] ----- Log Level Changing -----
>>>>>>>>> [I 06/27 16:18] Current Logging Level includes :
>>>>>>>>> [I 06/27 16:18] New     Logging Level includes : CRITICAL  WARNING
>>>>>>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>>>>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>>>>>>> [W 06/27 16:31] (iod.c,697)  open: 064/f49958.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 17:25] (iod.c,697)  open: 066/f49960.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 17:50] (iod.c,697)  open: 067/f49961.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 18:15] (iod.c,697)  open: 069/f49963.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 18:36] (iod.c,697)  open: 070/f49964.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 18:41] (iod.c,697)  open: 072/f49966.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 18:56] (iod.c,697)  open: 073/f49967.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 19:01] (iod.c,697)  open: 074/f49968.0 exists (flags = c2); saving
>>>>>>>>> [W 06/27 19:05] (iod.c,697)  open: 076/f49970.0 exists (flags = c2); saving
>>>>>>>>> [W 06/28 04:06] (iod.c,289) socket=[5] hung up
>>>>>>>>>
>>>>>>>>> There were no hardware failures, and all clients, iods, and the mgr 
>>>>>>>>> run the same pvfs version (1.6.3) on ia64-based systems.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>> _______________________________________________
>>>> PVFS-users mailing list
>>>> PVFS-users at www.beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs-users

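The cleanup suggested in the quoted messages above (removing the zero-length files from the metadata directory while the mgr is not running) starts with finding those files. A minimal sketch in Python, under stated assumptions: `find_zero_length_files` is a hypothetical helper, not part of PVFS, and it only lists candidates without deleting anything, so the list can be reviewed before any file is removed.

```python
import os
import stat
import tempfile


def find_zero_length_files(meta_dir):
    """Return sorted paths of regular files under meta_dir whose size is 0 bytes."""
    empty = []
    for root, _dirs, files in os.walk(meta_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # Demo on a throwaway directory; on a real system the argument would be
    # the metadata directory (e.g. /pvfs-meta in this thread), with the mgr stopped.
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "epilogue_demo.log"), "w").close()  # zero-length
        with open(os.path.join(d, "healthy_demo"), "w") as f:
            f.write("metadata")  # non-empty, must not be listed
        for path in find_zero_length_files(d):
            print(path)
```

Keeping the listing and the removal as separate steps is deliberate: a metadata directory may hold files that are not ordinary PVFS metadata files, so the output should be checked by hand first.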
