[Pvfs2-users] I/O server won't start

Kevin Harms harms at alcf.anl.gov
Tue Feb 9 17:02:06 EST 2010


Eric,

   So I take it the "repaired" database allowed the pvfs2-server to
start? Based on this, it looks like it may have hit a fatal error
soon afterward, since the pvfs2-fsck command could not connect to it.
What does the pvfs-2 server log say?

kevin

On Feb 9, 2010, at 2:28 PM, Eric J. Walter wrote:

>
> Kevin,
>
> Hi, I have done what you have said and repeated the db_dump and  
> db_load.
>
> The db_verify of dataspace_attributes.db produces no errors and the  
> pvfs2-server starts with no
> errors.  Unfortunately, the clients can't seem to communicate with  
> the servers after mounting:
>
> >>> /share/apps/pvfs-2.8.1/bin/pvfs2-fsck -v -m /mnt/pvfs2
> [E 15:20:09.068943] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 12.
> [E 15:20:09.069756] Warning: msgpair failed to ib://pvfs-2:3335, will retry: Connection timed out
> [E 15:20:09.069808] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Connection timed out
> [E 15:20:09.069829] *** Non-BMI failure.
> [E 15:20:09.069859] ERROR: could not initialize any file systems in /etc/pvfs2tab.
> PVFS_util_init_defaults: No such device (error class: 0)
>
> The same thing happens for any command (e.g. pvfs2-ls, pvfs2-statfs,
> etc.)
>
> Perhaps there is something I am missing?
>
> Eric
>
>
> Kevin Harms wrote:
>> Eric,
>>
>>  I'm not sure exactly what is wrong with your .db, but to use
>> db_load, it needs to be modified to add the keys back in the
>> correct "sorted" order, where "sorted" means the order PVFS
>> expects. You need to modify db_load.c to something like this:
>>
>> if ((ret = dbp->set_bt_compare(dbp, PINT_trove_dbpf_ds_attr_compare)) != 0) {
>>        dbp->err(dbp, ret, "DB->set_bt_compare");
>>        goto err;
>> }
>>
>> Then paste the PINT_trove_dbpf_ds_attr_compare function and
>> associated data structure definitions into the db_load.c source as
>> well. You should get db_load.c from the particular version of
>> bdb you're using.
>>
>> kevin
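
Why the comparison function matters can be illustrated with a minimal sketch. This is not the PVFS source: it assumes each key in dataspace_attributes.db is a single 64-bit handle, and it uses a hypothetical key_buf struct in place of Berkeley DB's DBT so that it compiles without db.h. The names handle_compare and bytewise_compare are made up for this example. Berkeley DB's default B-tree ordering compares keys byte by byte, so keys reloaded under that default can end up in a different order than a numeric comparison would produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Berkeley DB's DBT (data pointer plus length). */
struct key_buf {
    const void *data;
    size_t size;
};

/* Numeric ordering, ASSUMING every key is exactly one 64-bit handle. */
static int handle_compare(const struct key_buf *a, const struct key_buf *b)
{
    uint64_t ha, hb;

    assert(a->size == sizeof ha && b->size == sizeof hb);
    /* memcpy avoids unaligned reads from the key buffers. */
    memcpy(&ha, a->data, sizeof ha);
    memcpy(&hb, b->data, sizeof hb);

    if (ha == hb)
        return 0;
    return (ha < hb) ? -1 : 1;
}

/* Lexical byte ordering; shorter keys sort first when one is a prefix. */
static int bytewise_compare(const struct key_buf *a, const struct key_buf *b)
{
    size_t n = (a->size < b->size) ? a->size : b->size;
    int c = memcmp(a->data, b->data, n);

    if (c != 0)
        return (c < 0) ? -1 : 1;
    if (a->size == b->size)
        return 0;
    return (a->size < b->size) ? -1 : 1;
}
```

On a little-endian machine the handle 1 is stored as bytes 01 00 00 ... and the handle 256 as 00 01 00 ..., so bytewise_compare puts 256 before 1 while handle_compare puts 1 before 256. A database loaded under one ordering and opened under the other will appear to have its items out of order.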
>>
>> On Feb 8, 2010, at 7:16 PM, Eric J. Walter wrote:
>>
>>>
>>>
>>> Hi,
>>>
>>> I have a problem starting up an I/O node.  It is one of 3 servers that
>>> we run v2.8.1 on over InfiniBand.  It is not used for metadata.  After
>>> finding a file which had '?--?--?' like permissions, I decided to restart
>>> the pvfs servers and remount all of the clients.  Now, one of the three
>>> I/O nodes can't start its pvfs2-server.  The other two start correctly.
>>>
>>> Here is the server log from the problem server:
>>>
>>> [D 02/08 19:40] PVFS2 Server version 2.8.1 starting.
>>> [E 02/08 19:40] dbpf_dspace_iterate_handles_op_svc: Invalid argument
>>> [E 02/08 19:40] Error adding handle range 1537228672809129303-3074457345618258602,6148914691236517203-7686143364045646502 to filesystem pvfs2-fs
>>> [E 02/08 19:40] Error: Could not initialize server interfaces; aborting.
>>> [E 02/08 19:40] Error: Could not initialize server; aborting.
>>>
>>> I am also using db4-4.2.52-7.1 of the DB software.  Reading through
>>> previous mailing list discussions, I found that running db_recover on
>>> the .db files (after backing them up) could be helpful.  The only .db
>>> file which has any problems with verify is dataspace_attributes.db on
>>> the problem I/O node.  Here is what it reports:
>>>
>>>>> # db_verify -o dataspace_attributes.db
>>> db_verify: Page 865: item 57 of unrecognizable type
>>> db_verify: Page 865: gap between items at offset 1376
>>> db_verify: Page 865: item order check unsafe: skipping
>>> db_verify: DB->verify: dataspace_attributes.db: DB_VERIFY_BAD: Database verification failed
>>>
>>> So I tried db_recover -v in the same directory and in the directory
>>> above (I am not sure where to run it) and all I get is:
>>>
>>> db_recover: Finding last valid log LSN: file: 1 offset 28
>>>
>>> and a small binary file named "log.0000000001".
>>>
>>> This step seems to do nothing, i.e. the db_verify report doesn't change
>>> after this.
>>>
>>> I have also tried db_dump -r followed by db_load and this also does not
>>> change the db_verify output.
>>>
>>> Is there anything else I can do except wipe the filesystem and rebuild?
>>>
>>> Thanks for any help I can get.
>>>
>>> Eric J. Walter
>>> Department of Physics
>>> College of William and Mary
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> Pvfs2-users at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>
>
