[PVFS-users] pvfs mgr crash
Franco M. Bladilo
bladilo at rice.edu
Tue Jun 28 15:16:11 EDT 2005
Rob,
All of the affected files are empty (zero size). For example:
-rwxr--r-- 1 root root 0 Jun 27 06:37 epilogue_41172.management.log
-rwxr--r-- 1 root root 0 Jun 26 23:36 epilogue_41260.management.log
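In case it is useful for checking the rest of the metadata directory, here is a minimal sketch that lists every zero-length regular file under a directory tree. The function name and the default path /pvfs-meta (taken from the mgr log quoted below) are assumptions for illustration, and the check only reports size; it cannot tell whether a given empty file is actually damaged.

```python
import os
import stat
import sys


def zero_length_files(root):
    """Yield paths of regular files under root whose size is 0 bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                yield path


if __name__ == "__main__":
    # Assumed default: the metadata directory named in the mgr log.
    root = sys.argv[1] if len(sys.argv) > 1 else "/pvfs-meta"
    for path in zero_length_files(root):
        print(path)
```

This only reads file metadata and does not modify anything in the directory.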
The filesystem is ext3; it resides on a local SCSI disk and is mounted
as /. There are no suspicious kernel messages, and the hard drive looks
healthy: it passes all smartctl tests.
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda6   980M  516M  415M   56%   /
The only issue I recall is that, a few weeks ago, the PVFS file system
became full and some PVFS clients crashed. I restarted pvfsd on the
affected nodes and the problem cleared.
Thanks
Franco.
Rob Ross wrote:
> Hi Franco,
>
> First, thanks for the very thorough bug report.
>
> Typically those error messages mean exactly what they say: that
> someone has been messing with the metadata files. It could also
> indicate something is amiss with that local file system.
>
> It would be helpful for you to "ls" a few of those files and see what
> size they are. The mgr uses a very simple check, and it is unlikely
> that it is incorrectly reporting that the files are somehow messed up.
>
> If there is any data at all in the files, it would be helpful to see
> what it is.
>
> It might be a good idea to run a fsck on the file system holding your
> metadata. Something seems wrong there. Is that just a single, local
> disk?
>
> The iods are behaving as they should I think. Let's concentrate on
> that mgr for now.
>
> Regards,
>
> Rob
>
> Franco M. Bladilo wrote:
>
>> After working flawlessly for almost 3 months, we are having problems
>> with the pvfs mgr. This is the output in mgr.log:
>>
>> [I 06/27 16:17] ----- Log Level Changing -----
>> [I 06/27 16:17] Current Logging Level includes :
>> [I 06/27 16:17] New Logging Level includes : CRITICAL WARNING
>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>> [C 06/27 16:18] (mgr.c,376) socket=[5] closed
>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>> /pvfs-meta/epilogue_41172.management.log is not the correct size.
>> This is usually due to running a newer mgr on an old PVFS file system
>> or someone mucking with the files in the metadata directory (which
>> they should not do). Aborting!
>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>> errno : [22]
>> errno msg : [Invalid argument]
>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>> /pvfs-meta/epilogue_41260.management.log is not the correct size.
>> This is usually due to running a newer mgr on an old PVFS file system
>> or someone mucking with the files in the metadata directory (which
>> they should not do). Aborting!
>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>> errno : [22]
>> errno msg : [Invalid argument]
>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>> /pvfs-meta/epilogue_41172.management.log is not the correct size.
>> This is usually due to running a newer mgr on an old PVFS file system
>> or someone mucking with the files in the metadata directory (which
>> they should not do). Aborting!
>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>> errno : [22]
>> errno msg : [Invalid argument]
>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>> /pvfs-meta/epilogue_41260.management.log is not the correct size.
>> This is usually due to running a newer mgr on an old PVFS file system
>> or someone mucking with the files in the metadata directory (which
>> they should not do). Aborting!
>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>> errno : [22]
>> errno msg : [Invalid argument]
>> [C 06/27 16:18] (metaio.c,88) meta_open: Metadata file
>> /pvfs-meta/epilogue_41172.management.log is not the correct size.
>> This is usually due to running a newer mgr on an old PVFS file system
>> or someone mucking with the files in the metadata directory (which
>> they should not do). Aborting!
>> [C 06/27 16:18] (md_stat.c,118) md_stat, meta_open
>> errno : [22]
>> errno msg : [Invalid argument]
>> ...
>> It continues with hundreds of file entries until it reaches this
>> point and crashes:
>>
>> [C 06/28 04:06] (metaio.c,88) meta_open: Metadata file
>> /pvfs-meta/dtabakov/50-50-LF-c200-r0.2-2.5-by-0.1-f-0.1-1.0-by-0.1-initAcpt-crap/new-s-50-r-2.30-f-0.20--120-of-200
>> is not the correct size. This is usually due to running a newer mgr
>> on an old PVFS file system or someone mucking with the files in the
>> metadata directory (which they should not do). Aborting!
>> [C 06/28 04:06] (md_stat.c,118) md_stat, meta_open
>> errno : [22]
>> errno msg : [Invalid argument]
>> [C 06/28 04:06] (mgr.c,2576) Received signal=[11]
>> [C 06/28 04:06] (mgr.c,2578)
>> OPEN FILES:
>> [C 06/28 04:06] (mgr.c,2585) Current working directory: [/]
>> [C 06/28 04:06] (mgr.c,2587) pid: [30259]
>> [C 06/28 04:06] (mgr.c,2594) rlim_cur (RLIMIT_CORE): [0]
>> [C 06/28 04:06] (mgr.c,2595) rlim_max (RLIMIT_CORE): [-1]
>>
>> After restarting the mgr, any operation on the pvfs mounted
>> filesystem complains about corrupted/non-existent files:
>>
>> [root@io1 shared.scratch]# ls -la
>> ls: epilogue_41172.management.log: Invalid argument
>> ls: epilogue_41260.management.log: Invalid argument
>> total 308521
>> drwxrwxrwx 1 root root 20480 Jun 28 10:13 .
>> drwxr-xr-x 24 root root 4096 May 16 14:05 ..
>> drwxr-xr-x 1 juanp scisim 4096 Jun 10 02:50 40045.management
>> drwxr-xr-x 1 juanp scisim 4096 Jun 11 06:43 40046.management
>>
>> Here's the log on the iods from when the crash happened:
>>
>> [root@io1 tmp]# cat iolog.0Vr60K
>>
>> [I 06/27 16:18] ----- Log Level Changing -----
>> [I 06/27 16:18] Current Logging Level includes :
>> [I 06/27 16:18] New Logging Level includes : CRITICAL WARNING
>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>> [W 06/27 16:18] (iod.c,289) socket=[5] hung up
>> [W 06/27 16:31] (iod.c,697) open: 064/f49958.0 exists (flags = c2); saving
>> [W 06/27 17:25] (iod.c,697) open: 066/f49960.0 exists (flags = c2); saving
>> [W 06/27 17:50] (iod.c,697) open: 067/f49961.0 exists (flags = c2); saving
>> [W 06/27 18:15] (iod.c,697) open: 069/f49963.0 exists (flags = c2); saving
>> [W 06/27 18:36] (iod.c,697) open: 070/f49964.0 exists (flags = c2); saving
>> [W 06/27 18:41] (iod.c,697) open: 072/f49966.0 exists (flags = c2); saving
>> [W 06/27 18:56] (iod.c,697) open: 073/f49967.0 exists (flags = c2); saving
>> [W 06/27 19:01] (iod.c,697) open: 074/f49968.0 exists (flags = c2); saving
>> [W 06/27 19:05] (iod.c,697) open: 076/f49970.0 exists (flags = c2); saving
>> [W 06/28 04:06] (iod.c,289) socket=[5] hung up
>>
>> There were no hardware failures, and all clients, iods, and the mgr
>> run the same pvfs version (1.6.3) on ia64-based systems.
>>
>> Any ideas?
>>
>> Thanks in advance,
>>
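The mgr.log excerpt quoted above repeats the same paths many times. A minimal sketch for reducing such a log to the unique affected paths follows; it assumes the message wording shown in the excerpt, that it is run on the raw log (without the ">>" quote markers), and that the paths contain no whitespace. The function name is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches the meta_open message from the excerpt. \s+ also spans a line
# break, so a path wrapped onto the next line is still captured.
PATTERN = re.compile(
    r"meta_open: Metadata file\s+(\S+)\s+is not the correct size"
)


def affected_paths(log_text):
    """Count how often each metadata path is reported as the wrong size."""
    return Counter(PATTERN.findall(log_text))


if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            counts = affected_paths(fh.read())
        # Most frequently reported paths first.
        for path, count in counts.most_common():
            print(f"{count:6d}  {path}")
```

Run against a copy of mgr.log, this prints one line per distinct path with the number of times it was reported.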
--
Franco Bladilo
Linux/HPCC Administrator
Research Computing Group
Rice University
bladilo at rice.edu