[PVFS-users] pvfs mgr crash

Franco M. Bladilo bladilo at rice.edu
Tue Jun 28 16:53:09 EDT 2005


Rob,
It's possible that the / partition temporarily filled up; we can't 
tell for sure, since we weren't monitoring free space on this machine.
At this point, would you recommend re-creating the pvfs partition and 
starting from scratch?
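
For reference, here is a minimal sketch of how the two checks discussed in this thread could be scripted: listing zero-length files under the metadata directory and reporting free space on the filesystem that holds it. It is not part of PVFS. The function names find_empty_files and free_fraction are made up for illustration, and the /pvfs-meta path is simply the one that appears in the mgr.log quoted below.

```python
import os
import shutil
import stat


def find_empty_files(root):
    """Return a sorted list of zero-length regular files under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or cannot be stat'ed; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                empty.append(path)
    return sorted(empty)


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on path's filesystem."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


if __name__ == "__main__":
    meta_dir = "/pvfs-meta"  # assumption: path taken from the quoted mgr.log
    if os.path.isdir(meta_dir):
        for path in find_empty_files(meta_dir):
            print("zero-length metadata file:", path)
        print("free space: {:.0%}".format(free_fraction(meta_dir)))
    else:
        print("metadata directory not found:", meta_dir)
```

Running something like this from cron would at least show whether the metadata filesystem gets close to full again. It only reads file sizes and does not modify anything in the metadata directory.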

Thanks
Franco.
 
Rob Ross wrote:

> Hi Franco,
>
> Ok, well, the file system running out of space might have done it.  Is 
> it possible that the filesystem used for metadata was out of space for 
> some period of time?
>
> Thanks,
>
> Rob
>
> Franco M. Bladilo wrote:
>
>> Rob,
>> All the affected files are empty (zero size), for example:
>>
>> -rwxr--r--    1 root     root            0 Jun 27 06:37 epilogue_41172.management.log
>> -rwxr--r--    1 root     root            0 Jun 26 23:36 epilogue_41260.management.log
>>
>> The filesystem is ext3; it resides on a local SCSI disk and is 
>> mounted as /. There are no suspicious kernel messages, and the hard 
>> drive looks healthy: it passes all smartctl tests.
>>
>> Filesystem            Size  Used Avail Use% Mounted on
>> /dev/sda6             980M  516M  415M  56% /
>>
>> The only issue I recall is that, weeks ago, the pvfs fs became full 
>> and some pvfs clients crashed. I restarted pvfsd on the affected 
>> nodes and the problem cleared.
>>
>> Thanks
>>
>> Franco.
>>
>>
>> Rob Ross wrote:
>>
>>> Hi Franco,
>>>
>>> First, thanks for the very thorough bug report.
>>>
>>> Typically those error messages mean exactly what they say: that 
>>> someone has been messing with the metadata files.  It could also 
>>> indicate something is amiss with that local file system.
>>>
>>> It would be helpful for you to "ls" a few of those files and see 
>>> what size they are.  The mgr uses a very simple check, and it is 
>>> unlikely that it is incorrectly reporting that the files are somehow 
>>> messed up.
>>>
>>> If there is any data at all in the files, it would be helpful to see 
>>> what it is.
>>>
>>> It might be a good idea to run a fsck on the file system holding 
>>> your metadata.  Something seems wrong there.  Is that just a single, 
>>> local disk?
>>>
>>> The iods are behaving as they should I think.  Let's concentrate on 
>>> that mgr for now.
>>>
>>> Regards,
>>>
>>> Rob
>>>
>>> Franco M. Bladilo wrote:
>>>
>>>> After working flawlessly for almost 3 months, we are having problems 
>>>> with the pvfs mgr. This is the output in mgr.log:
>>>>
>>>> [I 06/27 16:17] ----- Log Level Changing -----
>>>> [I 06/27 16:17] Current Logging Level includes :
>>>> [I 06/27 16:17] New     Logging Level includes : CRITICAL  WARNING
>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>> /pvfs-meta/epilogue_41172.management.log is not the correct size.  
>>>> This is usually due to running a newer mgr on an old PVFS file 
>>>> system or someone mucking with the files in the metadata directory 
>>>> (which they should not do).  Aborting!
>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>        errno     : [22]
>>>>        errno msg : [Invalid argument]
>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>> /pvfs-meta/epilogue_41260.management.log is not the correct size.  
>>>> This is usually due to running a newer mgr on an old PVFS file 
>>>> system or someone mucking with the files in the metadata directory 
>>>> (which they should not do).  Aborting!
>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>        errno     : [22]
>>>>        errno msg : [Invalid argument]
>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>> /pvfs-meta/epilogue_41172.management.log is not the correct size.  
>>>> This is usually due to running a newer mgr on an old PVFS file 
>>>> system or someone mucking with the files in the metadata directory 
>>>> (which they should not do).  Aborting!
>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>        errno     : [22]
>>>>        errno msg : [Invalid argument]
>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>> /pvfs-meta/epilogue_41260.management.log is not the correct size.  
>>>> This is usually due to running a newer mgr on an old PVFS file 
>>>> system or someone mucking with the files in the metadata directory 
>>>> (which they should not do).  Aborting!
>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>        errno     : [22]
>>>>        errno msg : [Invalid argument]
>>>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file 
>>>> /pvfs-meta/epilogue_41172.management.log is not the correct size.  
>>>> This is usually due to running a newer mgr on an old PVFS file 
>>>> system or someone mucking with the files in the metadata directory 
>>>> (which they should not do).  Aborting!
>>>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>>>>        errno     : [22]
>>>>        errno msg : [Invalid argument]
>>>> ...
>>>> It continues with hundreds of file entries until it reaches this 
>>>> point and crashes:
>>>>
>>>> [C 06/28 04:06] (metaio.c,88) meta_open: Metadata file 
>>>> /pvfs-meta/dtabakov/50-50-LF-c200-r0.2-2.5-by-0.1-f-0.1-1.0-by-0.1-initAcpt-crap/new-s-50-r-2.30-f-0.20--120-of-200 
>>>> is not the correct size.  This is usually due to running a newer 
>>>> mgr on an old PVFS file system or someone mucking with the files in 
>>>> the metadata directory (which they should not do).  Aborting!
>>>> [C 06/28 04:06] (md_stat.c,118) md_stat, meta_open
>>>>        errno     : [22]
>>>>        errno msg : [Invalid argument]
>>>> [C 06/28 04:06] (mgr.c,2576) Received signal=[11]
>>>> [C 06/28 04:06] (mgr.c,2578)
>>>> OPEN FILES:
>>>> [C 06/28 04:06] (mgr.c,2585) Current working directory: [/]
>>>> [C 06/28 04:06] (mgr.c,2587) pid: [30259]
>>>> [C 06/28 04:06] (mgr.c,2594) rlim_cur (RLIMIT_CORE): [0]
>>>> [C 06/28 04:06] (mgr.c,2595) rlim_max (RLIMIT_CORE): [-1]
>>>>
>>>> After restarting the mgr, any operation on the pvfs mounted 
>>>> filesystem complains about corrupted/non-existent files:
>>>>
>>>> [root@io1 shared.scratch]# ls -la
>>>> ls: epilogue_41172.management.log: Invalid argument
>>>> ls: epilogue_41260.management.log: Invalid argument
>>>> total 308521
>>>> drwxrwxrwx    1 root     root        20480 Jun 28 10:13 .
>>>> drwxr-xr-x   24 root     root         4096 May 16 14:05 ..
>>>> drwxr-xr-x    1 juanp    scisim       4096 Jun 10 02:50 40045.management
>>>> drwxr-xr-x    1 juanp    scisim       4096 Jun 11 06:43 40046.management
>>>>
>>>> Here's the log from one of the iods at the time of the crash:
>>>>
>>>> [root@io1 tmp]# cat iolog.0Vr60K
>>>>
>>>> [I 06/27 16:18] ----- Log Level Changing -----
>>>> [I 06/27 16:18] Current Logging Level includes :
>>>> [I 06/27 16:18] New     Logging Level includes : CRITICAL  WARNING
>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>>>> [W 06/27 16:31] (iod.c,697)  open: 064/f49958.0 exists (flags = c2); saving
>>>> [W 06/27 17:25] (iod.c,697)  open: 066/f49960.0 exists (flags = c2); saving
>>>> [W 06/27 17:50] (iod.c,697)  open: 067/f49961.0 exists (flags = c2); saving
>>>> [W 06/27 18:15] (iod.c,697)  open: 069/f49963.0 exists (flags = c2); saving
>>>> [W 06/27 18:36] (iod.c,697)  open: 070/f49964.0 exists (flags = c2); saving
>>>> [W 06/27 18:41] (iod.c,697)  open: 072/f49966.0 exists (flags = c2); saving
>>>> [W 06/27 18:56] (iod.c,697)  open: 073/f49967.0 exists (flags = c2); saving
>>>> [W 06/27 19:01] (iod.c,697)  open: 074/f49968.0 exists (flags = c2); saving
>>>> [W 06/27 19:05] (iod.c,697)  open: 076/f49970.0 exists (flags = c2); saving
>>>> [W 06/28 04:06] (iod.c,289) socket=[5] hung up
>>>>
>>>> There were no hardware failures, and all clients, iods, and the mgr 
>>>> run the same pvfs version (1.6.3) on an ia64-based system.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks in advance,
>>>>
>>>
>>> .
>>>
>>
>>
>
> .
>


-- 
Franco Bladilo
Linux/HPCC Administrator
Research Computing Group
Rice University
bladilo at rice.edu



