[Pvfs2-users] I/O server won't start

Kevin Harms harms at alcf.anl.gov
Wed Feb 10 17:16:29 EST 2010


Eric,

   I discussed it with Phil and it looks like the dataspace has a  
handle that isn't part of the defined handle range in the config file.  
Here are a couple of possible fixes.

   In trove_check_handle_ranges() you could have it just continue  
after printing the error. That should still give you a shot at recovering  
your data. Another method would be to tweak some code to try to delete  
the handle from the db itself.
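
   To illustrate the "print the error and continue" idea, here is a  
rough standalone sketch. It is not the PVFS code: handle_in_ranges() and  
report_stray_handles() are hypothetical names, and the range string is  
assumed to use the same "first-last,first-last" form that shows up in the  
server log.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a PVFS function): returns 1 if `handle` lies in
 * any "first-last" extent of a comma-separated range string, 0 if it lies
 * in none of them, and -1 if the string cannot be parsed. */
int handle_in_ranges(uint64_t handle, const char *ranges)
{
    const char *p = ranges;

    while (*p != '\0') {
        char *end;
        uint64_t first, last;

        first = strtoull(p, &end, 10);
        if (end == p || *end != '-')
            return -1;              /* expected "first-" */
        p = end + 1;

        last = strtoull(p, &end, 10);
        if (end == p)
            return -1;              /* expected "last" */

        if (handle >= first && handle <= last)
            return 1;

        if (*end == ',')
            end++;                  /* move on to the next extent */
        else if (*end != '\0')
            return -1;              /* unexpected trailing character */
        p = end;
    }
    return 0;
}

/* Hypothetical helper: report every handle that is not inside the ranges,
 * but keep scanning instead of bailing out on the first one.
 * Returns the number of handles reported. */
size_t report_stray_handles(const uint64_t *handles, size_t n,
                            const char *ranges)
{
    size_t stray = 0;

    for (size_t i = 0; i < n; i++) {
        if (handle_in_ranges(handles[i], ranges) != 1) {
            fprintf(stderr, "handle %" PRIu64 " is outside the ranges %s\n",
                    handles[i], ranges);
            stray++;                /* note it and continue */
        }
    }
    return stray;
}
```

   The point is only that the scan keeps going after a bad handle, so the  
server gets far enough to let you copy data off.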

kevin


On Feb 9, 2010, at 4:24 PM, Eric J. Walter wrote:

>
> Hi Kevin,
>
> Yes, it appears that the "repair" of the database allowed the server  
> to start up.  Here is what happens when I start it with
> "EventLogging" set to "all" in the fs.conf file:
>
> [D 02/09 17:10] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
> [D 02/09 17:10] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
> [D 02/09 17:10] op_queue add: 0x61fdb0
> [D 02/09 17:10] could not remove handle 3074457345618248967
> [D 02/09 17:10] op_queue add: 0x61fdb0
>
> This repeats over and over again until I stop the server.
>
> Only the "broken" server log says this.  The other logs just have:
>
> [P 02/09 17:16] Start times (hr:min:sec):  17:16:49.553  17:16:48.533  17:16:47.513  17:16:46.493  17:16:45.473  17:16:44.452
> [P 02/09 17:16] Intervals (hr:min:sec)  :  00:00:01.020  00:00:01.020  00:00:01.020  00:00:01.020  00:00:01.020  00:00:01.021
> [P 02/09 17:17] bytes read              :             0             0             0             0             0             0
> [P 02/09 17:17] bytes written           :             0             0             0             0             0             0
> [P 02/09 17:17] metadata reads          :             0             0             0             0             0             0
> [P 02/09 17:17] metadata writes         :             0             0             0             0             0             0
> [P 02/09 17:17] metadata dspace ops     :             0             0             0             0             0             0
> [P 02/09 17:17] metadata keyval ops     :             2             2             2             2             2             2
> [P 02/09 17:17] request scheduler       :             0             0             0             0             0             0
> [D 02/09 17:17] [SM Exiting]: (0x6a04c0) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
> [D 02/09 17:17] [SM Entering]: (0x6a1830) job_timer_sm:do_work (status: 0)
>
> Thanks again,
>
> Eric
>
>
>
> Kevin Harms wrote:
>> Eric,
>>
>>  So I take it the "repaired" database allowed the pvfs2-server to  
>> start? Based on this it looks like perhaps it suffered a fatal  
>> error soon after, since the pvfs2-fsck command could not connect to it.  
>> What does the pvfs-2 server log say?
>>
>> kevin
>>
>> On Feb 9, 2010, at 2:28 PM, Eric J. Walter wrote:
>>
>>>
>>> Kevin,
>>>
>>> Hi, I have done what you have said and repeated the db_dump and  
>>> db_load.
>>>
>>> The db_verify of dataspace_attributes.db produces no errors and  
>>> the pvfs2-server starts with no
>>> errors.  Unfortunately, the clients can't seem to communicate with  
>>> the servers after mounting:
>>>
>>> >>> /share/apps/pvfs-2.8.1/bin/pvfs2-fsck -v -m /mnt/pvfs2
>>> [E 15:20:09.068943] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 12.
>>> [E 15:20:09.069756] Warning: msgpair failed to ib://pvfs-2:3335, will retry: Connection timed out
>>> [E 15:20:09.069808] *** msgpairarray_completion_fn: msgpair to server [UNKNOWN] failed: Connection timed out
>>> [E 15:20:09.069829] *** Non-BMI failure.
>>> [E 15:20:09.069859] ERROR: could not initialize any file systems in /etc/pvfs2tab.
>>> PVFS_util_init_defaults: No such device (error class: 0)
>>> This same thing happens for any command (e.g. pvfs2-ls, pvfs2-statfs,
>>> etc.)
>>>
>>> Perhaps there is something I am missing?
>>>
>>> Eric
>>>
>>>
>>> Kevin Harms wrote:
>>>> Eric,
>>>>
>>>> I'm not sure what is wrong with your .db exactly but to use  
>>>> db_load, it needs to be modified to add the keys back in the  
>>>> correct "sorted" order. Where "sorted" means in the order PVFS  
>>>> expects. You need to modify db_load.c to something like this:
>>>>
>>>> if ((ret = dbp->set_bt_compare(dbp,  
>>>> PINT_trove_dbpf_ds_attr_compare)) != 0) {
>>>>       dbp->err(dbp, ret, "DB->set_bt_compare");
>>>>       goto err;
>>>> }
>>>>
>>>> Then paste the PINT_trove_dbpf_ds_attr_compare function and  
>>>> associated data structure definitions into the db_load.c source  
>>>> as well. You should get db_load.c from the particular  
>>>> version of bdb you're using.
>>>>
>>>> kevin
>>>>
>>>> On Feb 8, 2010, at 7:16 PM, Eric J. Walter wrote:
>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a problem starting up an I/O node.  It is one of 3 servers that
>>>>> we run v2.8.1 on over InfiniBand.  It is not used for metadata.  After
>>>>> finding a file which had '?--?--?' like permissions, I decided to
>>>>> restart the pvfs servers and remount all of the clients.  Now, one of
>>>>> the three I/O nodes can't start its pvfs2-server.  The other two start
>>>>> correctly.
>>>>>
>>>>> Here is the server log from the problem server:
>>>>>
>>>>> [D 02/08 19:40] PVFS2 Server version 2.8.1 starting.
>>>>> [E 02/08 19:40] dbpf_dspace_iterate_handles_op_svc: Invalid argument
>>>>> [E 02/08 19:40] Error adding handle range
>>>>> 1537228672809129303-3074457345618258602,6148914691236517203-7686143364045646502
>>>>> to filesystem pvfs2-fs
>>>>> [E 02/08 19:40] Error: Could not initialize server interfaces; aborting.
>>>>> [E 02/08 19:40] Error: Could not initialize server; aborting.
>>>>>
>>>>> I am also using db4-4.2.52-7.1 of the DB software.  Reading through the
>>>>> previous mailing list discussions, I found that running db_recover on
>>>>> the .db files (after backing them up) could be helpful.  The only .db
>>>>> file which has any problems with verify is dataspace_attributes.db on
>>>>> the problem I/O node.  Here is what it reports:
>>>>>
>>>>> # db_verify -o dataspace_attributes.db
>>>>> db_verify: Page 865: item 57 of unrecognizable type
>>>>> db_verify: Page 865: gap between items at offset 1376
>>>>> db_verify: Page 865: item order check unsafe: skipping
>>>>> db_verify: DB->verify: dataspace_attributes.db: DB_VERIFY_BAD: Database verification failed
>>>>>
>>>>> So I tried db_recover -v in the same directory and in the directory
>>>>> above (I am not sure where to run it) and all I get is:
>>>>>
>>>>> db_recover: Finding last valid log LSN: file: 1 offset 28
>>>>>
>>>>> and a small binary file named "log.0000000001".
>>>>>
>>>>> This step seems to do nothing, i.e. the db_verify report doesn't change
>>>>> after this.
>>>>>
>>>>> I have also tried db_dump -r followed by db_load and this also does not
>>>>> change the db_verify output.
>>>>>
>>>>> Is there anything else I can do except wipe the filesystem and rebuild?
>>>>>
>>>>> Thanks for any help I can get.
>>>>>
>>>>> Eric J. Walter
>>>>> Department of Physics
>>>>> College of William and Mary
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pvfs2-users mailing list
>>>>> Pvfs2-users at beowulf-underground.org
>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>>
>>
>
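
A note on the db_load change quoted above: Berkeley DB's default btree  
comparison is a byte-by-byte lexical one, so integer keys stored in native  
byte order on a little-endian machine do not come out in numeric order.  
That is presumably why db_load needs the PVFS comparison function. Below is  
a rough sketch of the difference, assuming the key is a single 64-bit  
handle. struct key_view is a stand-in for BDB's DBT, and both function  
names are hypothetical; use the real PINT_trove_dbpf_ds_attr_compare from  
your PVFS source tree, not this.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the data/size part of a Berkeley DB DBT (assumption:
 * only these two fields matter for comparing keys). */
struct key_view {
    const void *data;
    uint32_t    size;
};

/* Numeric ordering on 64-bit keys, the kind of ordering a custom
 * btree comparison function would impose. Returns -1, 0 or 1. */
int compare_handles_numeric(const struct key_view *a,
                            const struct key_view *b)
{
    uint64_t ha, hb;

    /* memcpy avoids any alignment assumptions about the key buffers */
    memcpy(&ha, a->data, sizeof ha);
    memcpy(&hb, b->data, sizeof hb);

    if (ha == hb)
        return 0;
    return (ha < hb) ? -1 : 1;
}

/* Lexical byte ordering, shorter key first on a tie; this mirrors the
 * documented default btree behaviour. Returns <0, 0 or >0. */
int compare_bytes_lexical(const struct key_view *a,
                          const struct key_view *b)
{
    uint32_t n = (a->size < b->size) ? a->size : b->size;
    int c = memcmp(a->data, b->data, n);

    if (c != 0)
        return c;
    return (a->size > b->size) - (a->size < b->size);
}
```

On a little-endian host the two functions disagree for keys such as 1 and  
256, which is the kind of mismatch that leaves a reloaded database sorted  
differently from what the server expects.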
