[Pvfs2-users] I/O server won't start

Eric J. Walter ejwalt at wm.edu
Wed Feb 17 16:08:05 EST 2010


Dear Kevin,

Thanks a lot.  With your suggestion, and by commenting out the error 
exits in a few places in the code, I was able to retrieve
essentially all of the data from my PVFS partition. 

Thanks again to you, Phil and the mailing list archives.

Regards,

Eric





Kevin Harms wrote:
> Eric,
>
>   I discussed it with Phil and it looks like the dataspace has a 
> handle that isn't part of the defined handle range in the config file. 
> Here are a couple of possible fixes.
>
>   In trove_check_handle_ranges() you could have it just continue after 
> printing the error. That should still give you a shot at recovering your 
> data. Another method would be to tweak some code to try to delete the 
> handle from the db itself.
>
> kevin
>
>
> On Feb 9, 2010, at 4:24 PM, Eric J. Walter wrote:
>
>>
>> Hi Kevin,
>>
>> Yes, it appears that the "repair" of the database allowed the server 
>> to start-up.  Here is what happens when I start it with
>> "EventLogging" set to "all" in the fs.conf file:
>>
>> [D 02/09 17:10] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE 
>> (DSPACE_ITERATE_HANDLES)
>> [D 02/09 17:10] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE 
>> (DSPACE_ITERATE_HANDLES) (ret: 1)
>> [D 02/09 17:10] op_queue add: 0x61fdb0
>> [D 02/09 17:10] could not remove handle 3074457345618248967
>> [D 02/09 17:10] op_queue add: 0x61fdb0
>>
>> This repeats over and over again until I stop the server.
>>
>> Only the "broken" server log says this.  The other logs just have:
>>
>> [P 02/09 17:16] Start times (hr:min:sec):  17:16:49.553  17:16:48.533  17:16:47.513  17:16:46.493  17:16:45.473  17:16:44.452
>> [P 02/09 17:16] Intervals (hr:min:sec)  :  00:00:01.020  00:00:01.020  00:00:01.020  00:00:01.020  00:00:01.020  00:00:01.021
>> [P 02/09 17:17] bytes read              :             0             0             0             0             0             0
>> [P 02/09 17:17] bytes written           :             0             0             0             0             0             0
>> [P 02/09 17:17] metadata reads          :             0             0             0             0             0             0
>> [P 02/09 17:17] metadata writes         :             0             0             0             0             0             0
>> [P 02/09 17:17] metadata dspace ops     :             0             0             0             0             0             0
>> [P 02/09 17:17] metadata keyval ops     :             2             2             2             2             2             2
>> [P 02/09 17:17] request scheduler       :             0             0             0             0             0             0
>> [D 02/09 17:17] [SM Exiting]: (0x6a04c0) perf_update_sm:do_work 
>> (error code: 0), (action: DEFERRED)
>> [D 02/09 17:17] [SM Entering]: (0x6a1830) job_timer_sm:do_work 
>> (status: 0)
>>
>> Thanks again,
>>
>> Eric
>>
>>
>>
>> Kevin Harms wrote:
>>> Eric,
>>>
>>>  So I take it the "repaired" database allowed the pvfs2-server to 
>>> start? Based on this, it looks like it perhaps suffered a fatal error 
>>> soon after, since the pvfs2-fsck command could not connect to it. What 
>>> does the pvfs2-server log say?
>>>
>>> kevin
>>>
>>> On Feb 9, 2010, at 2:28 PM, Eric J. Walter wrote:
>>>
>>>>
>>>> Kevin,
>>>>
>>>> Hi, I did what you suggested and repeated the db_dump and 
>>>> db_load.
>>>>
>>>> The db_verify of dataspace_attributes.db produces no errors and the 
>>>> pvfs2-server starts with no errors.  Unfortunately, the clients can't 
>>>> seem to communicate with the servers after mounting:
>>>>
>>>> >>> /share/apps/pvfs-2.8.1/bin/pvfs2-fsck -v -m /mnt/pvfs2
>>>> [E 15:20:09.068943] job_time_mgr_expire: job time out: cancelling 
>>>> bmi operation, job_id: 12.
>>>> [E 15:20:09.069756] Warning: msgpair failed to ib://pvfs-2:3335, 
>>>> will retry: Connection timed out
>>>> [E 15:20:09.069808] *** msgpairarray_completion_fn: msgpair to 
>>>> server [UNKNOWN] failed: Connection timed out
>>>> [E 15:20:09.069829] *** Non-BMI failure.
>>>> [E 15:20:09.069859] ERROR: could not initialize any file systems in 
>>>> /etc/pvfs2tab.
>>>> PVFS_util_init_defaults: No such device (error class: 0)
>>>>
>>>> The same thing happens for any command (e.g. pvfs2-ls, pvfs2-statfs, 
>>>> etc.)
>>>>
>>>> Perhaps there is something I am missing?
>>>>
>>>> Eric
>>>>
>>>>
>>>> Kevin Harms wrote:
>>>>> Eric,
>>>>>
>>>>> I'm not sure exactly what is wrong with your .db, but to use 
>>>>> db_load, it needs to be modified to add the keys back in the 
>>>>> correct "sorted" order, where "sorted" means the order PVFS 
>>>>> expects. You need to modify db_load.c to something like this:
>>>>>
>>>>> if ((ret = dbp->set_bt_compare(dbp, 
>>>>> PINT_trove_dbpf_ds_attr_compare)) != 0) {
>>>>>       dbp->err(dbp, ret, "DB->set_bt_compare");
>>>>>       goto err;
>>>>> }
>>>>>
>>>>> Then paste the PINT_trove_dbpf_ds_attr_compare function and 
>>>>> associated data structure definitions into the db_load.c source as 
>>>>> well. You should get db_load.c from the particular version of 
>>>>> bdb you're using.
>>>>>
>>>>> kevin
>>>>>
>>>>> On Feb 8, 2010, at 7:16 PM, Eric J. Walter wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a problem starting up an I/O node.  It is one of 3 servers 
>>>>>> that we run v2.8.1 on over InfiniBand.  It is not used for 
>>>>>> metadata.  After finding a file which had '?--?--?'-like 
>>>>>> permissions, I decided to restart the pvfs servers and remount all 
>>>>>> of the clients.  Now, one of the three I/O nodes can't start its 
>>>>>> pvfs2-server.  The other two start correctly.
>>>>>>
>>>>>> Here is the server log from the problem server:
>>>>>>
>>>>>> [D 02/08 19:40] PVFS2 Server version 2.8.1 starting.
>>>>>> [E 02/08 19:40] dbpf_dspace_iterate_handles_op_svc: Invalid argument
>>>>>> [E 02/08 19:40] Error adding handle range
>>>>>> 1537228672809129303-3074457345618258602,6148914691236517203-7686143364045646502
>>>>>> to filesystem pvfs2-fs
>>>>>> [E 02/08 19:40] Error: Could not initialize server interfaces; 
>>>>>> aborting.
>>>>>> [E 02/08 19:40] Error: Could not initialize server; aborting.
>>>>>>
>>>>>> I am also using db4-4.2.52-7.1 of the DB software.  Reading 
>>>>>> through previous mailing list discussions, I found that running 
>>>>>> db_recover on the .db files (after backing them up) could be 
>>>>>> helpful.  The only .db file which has any problems with verify is 
>>>>>> dataspace_attributes.db on the problem I/O node.  Here is what it 
>>>>>> reports:
>>>>>>
>>>>>>>> # db_verify -o dataspace_attributes.db
>>>>>> db_verify: Page 865: item 57 of unrecognizable type
>>>>>> db_verify: Page 865: gap between items at offset 1376
>>>>>> db_verify: Page 865: item order check unsafe: skipping
>>>>>> db_verify: DB->verify: dataspace_attributes.db: DB_VERIFY_BAD: 
>>>>>> Database
>>>>>> verification failed
>>>>>>
>>>>>> So I tried db_recover -v in the same directory and in the directory
>>>>>> above (I am not sure where to run it) and all I get is:
>>>>>>
>>>>>> db_recover: Finding last valid log LSN: file: 1 offset 28
>>>>>>
>>>>>> and a small binary file named "log.0000000001".
>>>>>>
>>>>>> This step seems to do nothing, i.e. the db_verify report doesn't 
>>>>>> change after this.
>>>>>>
>>>>>> I have also tried db_dump -r followed by db_load, and this also 
>>>>>> does not change the db_verify output.
>>>>>>
>>>>>> Is there anything else I can do except wipe the filesystem and 
>>>>>> rebuild?
>>>>>>
>>>>>> Thanks for any help I can get.
>>>>>>
>>>>>> Eric J. Walter
>>>>>> Department of Physics
>>>>>> College of William and Mary
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pvfs2-users mailing list
>>>>>> Pvfs2-users at beowulf-underground.org
>>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>>
>>>>
>>>
>>
>


