[Pvfs2-developers] server crash on startup with millions of files
Sam Lang
slang at mcs.anl.gov
Thu Mar 1 11:04:44 EST 2007
On Mar 1, 2007, at 10:00 AM, Sam Lang wrote:
>
> On Mar 1, 2007, at 9:52 AM, Phil Carns wrote:
>
>> Sam Lang wrote:
>>> On Feb 28, 2007, at 6:54 AM, Phil Carns wrote:
>>>> I know that you guys still have some ongoing discussion about the
>>>> long range design for tracking handles, but I have another item
>>>> about the current implementation that might be of interest.
>>>>
>>>> Most of the remaining startup performance problem (after Sam's
>>>> optimization patches) appears to be a result of how the db is
>>>> ordered. If I modify the attr db's comparison function so that it
>>>> has a "<" rather than ">", then all of the preads during startup go
>>>> in order through the db rather than backwards. This takes the
>>>> startup time on a cold db down to just 34 seconds. Previously it
>>>> was 2 minutes 22 seconds.
>>>>
>>>> It still could be faster, but that seems to be the biggest part of
>>>> the time. I imagine the rest of it is just the access size (4 KB at
>>>> a time) that might be tunable through some Berkeley DB settings.
>>>>
>>>> The downside of making that particular change to the comparison
>>>> method is that it breaks storage space compatibility.
>>>>
>>>> I wonder if it might be possible to accomplish the same thing in
>>>> the current db format by modifying iterate_handles() to just run
>>>> the cursor backwards (using DB_PREV instead of DB_NEXT)? That
>>>> wouldn't hurt storage space compatibility (if it works), but I
>>>> don't know if it makes any difference to callers of that function
>>>> what order the handles come out in.
>>> It doesn't matter to the caller. You'll also need to set the
>>> cursor to the last position in the db with DB_LAST. Does
>>> DB_PREV work with DB_MULTIPLE though? It's not clear from the
>>> above: does the improvement to 34 seconds occur with MULTIPLE or
>>> without?
>>>
>>> I mentioned previously that the dspace db gets opened with the
>>> RECNUM flag. I don't think that's necessary, and removing it
>>> will invariably improve performance, but we need a way to return
>>> the position for iterate_handles. The easiest thing to do is
>>> turn PVFS_ds_position into a uint64_t (currently it's only
>>> uint32_t). That breaks interfaces and protocols though.
>>
>> I don't know if the PREV approach would work with MULTIPLE or
>> not. The 34 second times (with inverted comparison function) were
>> run with your MULTIPLE patches applied. I didn't try it without
>> the patches.
>
> I couldn't find anything in the Berkeley DB docs about DB_MULTIPLE_KEY
> and DB_PREV not being allowed, but when I tried it, it returned an
> error about illegal flag combinations.
I guess the doc does say it's not allowed:
------
The DB_MULTIPLE_KEY flag may only be used with the DB_CURRENT,
DB_FIRST, DB_GET_BOTH, DB_GET_BOTH_RANGE, DB_NEXT, DB_NEXT_DUP,
DB_NEXT_NODUP, DB_SET, DB_SET_RANGE, and DB_SET_RECNO options.
------
It seems strange that they have that restriction. Most of the btree
implementation seems symmetric except for that.
-sam
> So our option is to either use DB_PREV without DB_MULTIPLE (no
> storage format changes), or change the comparison function and
> storage format so that we can use DB_NEXT with DB_MULTIPLE_KEY.
>
> Checking the storage format version and providing the appropriate
> comparison function wouldn't be hard though, and wouldn't require
> any "migration" of the old to new format. Older formats wouldn't
> benefit from the performance improvements though.
>
> -sam
>
>>
>> -Phil
>>
>
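
A minimal sketch of the version check suggested above: pick the handle
comparison function from the storage format version, so that old
collections keep their existing order and need no migration. Everything
named here is hypothetical (handle_t, handle_cmp_legacy,
handle_cmp_inverted, FIRST_INVERTED_FORMAT_VERSION, select_handle_cmp),
and the legacy direction is an assumption. The comparators use a
qsort-style signature so the sketch stays self-contained; Berkeley DB's
btree comparison callback takes DBT arguments instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the 64-bit handle type. */
typedef uint64_t handle_t;

/* qsort-style comparator type, used only to keep this sketch standalone. */
typedef int (*handle_cmp_fn)(const void *, const void *);

/* Hypothetical: first storage format version that uses the inverted order. */
#define FIRST_INVERTED_FORMAT_VERSION 2

/* Assumed legacy order: larger handles sort first. */
static int handle_cmp_legacy(const void *a, const void *b)
{
    handle_t x = *(const handle_t *)a;
    handle_t y = *(const handle_t *)b;
    return (x < y) - (x > y);
}

/* Inverted order: smaller handles sort first. */
static int handle_cmp_inverted(const void *a, const void *b)
{
    handle_t x = *(const handle_t *)a;
    handle_t y = *(const handle_t *)b;
    return (x > y) - (x < y);
}

/* Choose the comparator that matches the on-disk format, so an existing
 * collection is always opened with the order it was written in. */
static handle_cmp_fn select_handle_cmp(int storage_format_version)
{
    assert(storage_format_version >= 1);
    if (storage_format_version >= FIRST_INVERTED_FORMAT_VERSION)
        return handle_cmp_inverted;
    return handle_cmp_legacy;
}
```

The selection has to happen before the db is opened, since a btree must
always be accessed with the same comparison function it was built with.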