[PVFS-users] pvfs mgr crash

Rob Ross rross at mcs.anl.gov
Fri Jul 1 13:03:22 EDT 2005


Hi Franco,

Did you just recreate from scratch (good), or did you copy things over 
(bad)?

How are you creating the files?

Thanks,

Rob

Franco M. Bladilo wrote:
> Rob,
> I finally re-created the pvfs partition; this time I dedicated a 36GB 
> SCSI drive to /pvfs-meta, but now I'm seeing these messages in the 
> mgr log for every new file that is created:
> 
> [root@io1 tmp]# tail -f mgrlog.oGrxlw
> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/uddin/41886.management/Gau-12975.int failed
> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/uddin/41886.management/Gau-12975.int failed
> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/uddin/41886.management/Gau-12975.d2e failed
> [W 07/01 05:15] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/uddin/41886.management/Gau-12975.d2e failed
> [W 07/01 06:40] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/epilogue_41894.management.log failed
> [W 07/01 10:36] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/epilogue_41896.management.log failed
> [W 07/01 10:54] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/epilogue_41897.management.log failed
> [W 07/01 11:27] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/epilogue_41898.management.log failed
> [W 07/01 11:32] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/file-24GB 
> failed
> [W 07/01 11:33] (md_stat.c,58)  md_stat: lstat on /pvfs-meta/dmesg failed
> [W 07/01 11:35] (md_stat.c,58)  md_stat: lstat on 
> /pvfs-meta/epilogue_41899.management.log failed
> 
> The filesystem seems to be working fine, but are these messages something 
> to be alarmed about?
> 
> Here's the pvfs dirs information:
> 
> [root@io1 /]# ls -ld pvfs-meta/
> drwxr-xr-x    5 root     root         4096 Jul  1 11:35 pvfs-meta/
> [root@io1 /]# ls -ld pvfs-data/
> drwx------  103 nobody   nobody      12288 Jun 29 11:04 pvfs-data/
> [root@io1 /]# mount | grep pvfs
> /dev/sdb1 on /pvfs-meta type ext3 (rw)
> /dev/md0 on /pvfs-data type ext3 (rw)
> io1:/pvfs-meta on /shared.scratch/ type pvfs (rw)
> 
> Thanks
> 
> Franco.
> 
> Rob Ross wrote:
> 
>> Hi Franco,
>>
>> Well, that's an easy thing for me to suggest, but I don't know how 
>> much data you have on there.
>>
>> What I believe has happened is that someone attempted to create new 
>> files on the file system while the local metadata file system was 
>> full.  The mgr was unable to store the metadata because there was no 
>> space, but it didn't remove the metadata files as it should have (that 
>> is a bug).  So there are some partial files that never really got data 
>> into them sitting out there causing trouble.
>>
>> Another option for cleanup would be to remove all the zero-length 
>> files in the metadata directory (while the mgr is not running).  Then 
>> delete all the saved data files off the iods; you'll want to check for 
>> them off and on for the next couple of weeks as more will be created.
>>
>> That should clean up the file system for the most part.  It's possible 
>> that you'll have some bunk directories for the same reason, which are 
>> a little more difficult to clean up.
>>
>> Let me know if we can help more,
>>
>> Rob
>>
>> Franco M. Bladilo wrote:
>>
>>> Rob,
>>> It's possible that the / partition temporarily filled up; we 
>>> can't tell for sure since we weren't monitoring free space on this 
>>> machine.
>>> At this point, would you recommend re-creating the pvfs partition and 
>>> starting from scratch?
>>>
>>> Thanks
>>> Franco.
>>>
>>> Rob Ross wrote:
>>>
>>>> Hi Franco,
>>>>
>>>> Ok, well, the file system running out of space might have done it.  
>>>> Is it possible that the filesystem used for metadata was out of 
>>>> space for some period of time?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>> Franco M. Bladilo wrote:
>>>>
>>>>> Rob,
>>>>> All the affected files are empty (zero size), for example:
>>>>>
>>>>> -rwxr--r--    1 root     root            0 Jun 27 06:37 
>>>>> epilogue_41172.management.log
>>>>> -rwxr--r--    1 root     root            0 Jun 26 23:36 
>>>>> epilogue_41260.management.log
>>>>>
>>>>> The filesystem is ext3; it resides on a local SCSI disk and is 
>>>>> mounted as /. There are no suspicious kernel messages and the hard 
>>>>> drive looks healthy; it passes all smartctl
>>>>> tests.
>>>>>
>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>> /dev/sda6             980M  516M  415M  56% /
>>>>>
>>>>> The only issue I recall, from weeks ago, was that the pvfs fs became 
>>>>> full and some pvfs clients crashed. I restarted pvfsd on the affected 
>>>>> nodes and the
>>>>> problem cleared.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Franco.
>>>>>
>>>>>
>>>>> Rob Ross wrote:
>>>>>
>>>>>> Hi Franco,
>>>>>>
>>>>>> First, thanks for the very thorough bug report.
>>>>>>
>>>>>> Typically those error messages mean exactly what they say: that 
>>>>>> someone has been messing with the metadata files.  It could also 
>>>>>> indicate something is amiss with that local file system.
>>>>>>
>>>>>> It would be helpful for you to "ls" a few of those files and see 
>>>>>> what size they are.  The mgr uses a very simple check, and it is 
>>>>>> unlikely that it is incorrectly reporting that the files are 
>>>>>> somehow messed up.
>>>>>>
>>>>>> If there is any data at all in the files, it would be helpful to 
>>>>>> see what it is.
>>>>>>
>>>>>> It might be a good idea to run a fsck on the file system holding 
>>>>>> your metadata.  Something seems wrong there.  Is that just a 
>>>>>> single, local disk?
>>>>>>
>>>>>> The iods are behaving as they should I think.  Let's concentrate 
>>>>>> on that mgr for now.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Rob
>>>>>>
>>>>>> Franco M. Bladilo wrote:
>>>>>>
>>>>>>> After working flawlessly for almost 3 months, we are having 
>>>>>>> problems with the pvfs mgr. This is the output in mgr.log:
>>>>>>>
>>>>>>> [I 06/27 16:17] ----- Log Level Changing -----
>>>>>>> [I 06/27 16:17] Current Logging Level includes :
>>>>>>> [I 06/27 16:17] New     Logging Level includes : CRITICAL  WARNING
>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct 
>>>>>>> size.  This is usually due to running a newer mgr on an old PVFS 
>>>>>>> file system or someone mucking with the files in the metadata 
>>>>>>> directory (which they should not do).  Aborting!
>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>        errno     : [22]
>>>>>>>        errno msg : [Invalid argument]
>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>> /pvfs-meta/epilogue_41260.management.log is not the correct 
>>>>>>> size.  This is usually due to running a newer mgr on an old PVFS 
>>>>>>> file system or someone mucking with the files in the metadata 
>>>>>>> directory (which they should not do).  Aborting!
>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>        errno     : [22]
>>>>>>>        errno msg : [Invalid argument]
>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct 
>>>>>>> size.  This is usually due to running a newer mgr on an old PVFS 
>>>>>>> file system or someone mucking with the files in the metadata 
>>>>>>> directory (which they should not do).  Aborting!
>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>        errno     : [22]
>>>>>>>        errno msg : [Invalid argument]
>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>> /pvfs-meta/epilogue_41260.management.log is not the correct 
>>>>>>> size.  This is usually due to running a newer mgr on an old PVFS 
>>>>>>> file system or someone mucking with the files in the metadata 
>>>>>>> directory (which they should not do).  Aborting!
>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>        errno     : [22]
>>>>>>>        errno msg : [Invalid argument]
>>>>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>>>>> /pvfs-meta/epilogue_41172.management.log is not the correct 
>>>>>>> size.  This is usually due to running a newer mgr on an old PVFS 
>>>>>>> file system or someone mucking with the files in the metadata 
>>>>>>> directory (which they should not do).  Aborting!
>>>>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>>>>        errno     : [22]
>>>>>>>        errno msg : [Invalid argument]
>>>>>>> ...
>>>>>>> It continues with hundreds of file entries until it reaches this 
>>>>>>> point and crashes:
>>>>>>>
>>>>>>> [C 06/28 04:06] (metaio.c,88) meta_open: Metadata file 
>>>>>>> /pvfs-meta/dtabakov/50-50-LF-c200-r0.2-2.5-by-0.1-f-0.1-1.0-by-0.1-initAcpt-crap/new-s-50-r-2.30-f-0.20--120-of-200 
>>>>>>> is not the correct size.  This is usually due to running a newer 
>>>>>>> mgr on an old PVFS file system or someone mucking with the files 
>>>>>>> in the metadata directory (which they should not do).  Aborting!
>>>>>>> [C 06/28 04:06] (md_stat.c,118) md_stat, meta_open
>>>>>>>        errno     : [22]
>>>>>>>        errno msg : [Invalid argument]
>>>>>>> [C 06/28 04:06] (mgr.c,2576) Received signal=[11]
>>>>>>> [C 06/28 04:06] (mgr.c,2578)
>>>>>>> OPEN FILES:
>>>>>>> [C 06/28 04:06] (mgr.c,2585) Current working directory: [/]
>>>>>>> [C 06/28 04:06] (mgr.c,2587) pid: [30259]
>>>>>>> [C 06/28 04:06] (mgr.c,2594) rlim_cur (RLIMIT_CORE): [0]
>>>>>>> [C 06/28 04:06] (mgr.c,2595) rlim_max (RLIMIT_CORE): [-1]
>>>>>>>
>>>>>>> After restarting the mgr, any operation on the pvfs-mounted 
>>>>>>> filesystem complains about corrupted/non-existent files:
>>>>>>>
>>>>>>> [root@io1 shared.scratch]# ls -la
>>>>>>> ls: epilogue_41172.management.log: Invalid argument
>>>>>>> ls: epilogue_41260.management.log: Invalid argument
>>>>>>> total 308521
>>>>>>> drwxrwxrwx    1 root     root        20480 Jun 28 10:13 .
>>>>>>> drwxr-xr-x   24 root     root         4096 May 16 14:05 ..
>>>>>>> drwxr-xr-x    1 juanp    scisim       4096 Jun 10 02:50 
>>>>>>> 40045.management
>>>>>>> drwxr-xr-x    1 juanp    scisim       4096 Jun 11 06:43 
>>>>>>> 40046.management
>>>>>>>
>>>>>>> Here's the log on the iods from when the crash happened:
>>>>>>>
>>>>>>> [root@io1 tmp]# cat iolog.0Vr60K
>>>>>>>
>>>>>>> [I 06/27 16:18] ----- Log Level Changing -----
>>>>>>> [I 06/27 16:18] Current Logging Level includes :
>>>>>>> [I 06/27 16:18] New     Logging Level includes : CRITICAL  WARNING
>>>>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>>>>> [W 06/27 16:31] (iod.c,697)  open: 064/f49958.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 17:25] (iod.c,697)  open: 066/f49960.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 17:50] (iod.c,697)  open: 067/f49961.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 18:15] (iod.c,697)  open: 069/f49963.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 18:36] (iod.c,697)  open: 070/f49964.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 18:41] (iod.c,697)  open: 072/f49966.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 18:56] (iod.c,697)  open: 073/f49967.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 19:01] (iod.c,697)  open: 074/f49968.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/27 19:05] (iod.c,697)  open: 076/f49970.0 exists (flags = 
>>>>>>> c2); saving
>>>>>>> [W 06/28 04:06] (iod.c,289) socket=[5] hung up
>>>>>>>
>>>>>>> There were no hardware failures, and all clients, iods and the mgr 
>>>>>>> run the same pvfs version (1.6.3) on an ia64-based system.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>> _______________________________________________
>> PVFS-users mailing list
>> PVFS-users at www.beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs-users
>>
>> .
>>
> 
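
The cleanup suggested earlier in the thread starts with finding the zero-length files in the metadata directory. A minimal sketch of that first step follows. It only lists candidates and deletes nothing. The function name list_zero_length_meta is hypothetical, and the default path /pvfs-meta is simply the metadata directory shown in the mount output above.

```shell
#!/bin/sh
# Sketch only: list (never delete) zero-length regular files under a
# PVFS metadata directory so they can be reviewed by hand.
# Assumption: the function name is made up for this example, and the
# default directory /pvfs-meta is taken from the thread.
list_zero_length_meta() {
    meta_dir=${1:-/pvfs-meta}
    # -type f  : regular files only (directories are skipped)
    # -size 0  : files that occupy zero blocks, i.e. empty files
    find "$meta_dir" -type f -size 0 -print
}
```

A call such as `list_zero_length_meta /pvfs-meta > /tmp/zero-length.txt` would save the list for inspection. As noted in the thread, any actual removal should happen only while the mgr is not running, and only after the list has been checked.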
> 
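
Since the suspected trigger was the metadata filesystem running out of space while nobody was watching it, a small periodic check could help catch that condition. The sketch below uses only `df -P` and `awk`. The function names and the 90 percent default threshold are assumptions made for illustration, not anything PVFS provides.

```shell
#!/bin/sh
# Sketch only: warn when the filesystem holding a path is nearly full.
# Assumption: function names and the default threshold are hypothetical.

# Print the use percentage (digits only) of the filesystem holding $1.
fs_use_percent() {
    # df -P prints one line per filesystem in the portable format;
    # field 5 is the capacity column, e.g. "56%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage is below the threshold, 1 (with a warning on
# stderr) when it is at or above it, and 2 if usage cannot be read.
check_meta_space() {
    path=$1
    threshold=${2:-90}
    used=$(fs_use_percent "$path")
    case $used in
        ''|*[!0-9]*) return 2 ;;
    esac
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: filesystem holding $path is ${used}% full" >&2
        return 1
    fi
    return 0
}
```

Running `check_meta_space /pvfs-meta 90` from cron on the mgr node is one way this might be used; the warning goes to stderr so it would show up in cron mail.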

