[Pvfs2-developers] server crash on startup with millions of files

Sam Lang slang at mcs.anl.gov
Thu Mar 1 11:04:44 EST 2007


On Mar 1, 2007, at 10:00 AM, Sam Lang wrote:

>
> On Mar 1, 2007, at 9:52 AM, Phil Carns wrote:
>
>> Sam Lang wrote:
>>> On Feb 28, 2007, at 6:54 AM, Phil Carns wrote:
>>>> I know that you guys still have some ongoing discussion about the
>>>> long range design for tracking handles, but I have another item
>>>> about the current implementation that might be of interest.
>>>>
>>>> Most of the remaining startup performance problem (after Sam's
>>>> optimization patches) appears to be a result of how the db is
>>>> ordered.  If I modify the attr db's comparison function so that it
>>>> has a "<" rather than ">", then all of the preads during startup go
>>>> in order through the db rather than backwards.  This takes the
>>>> startup time on a cold db down to just 34 seconds.  Previously it
>>>> was 2 minutes 22 seconds.
>>>>
>>>> It still could be faster, but that seems to be the biggest part of
>>>> the time.  I imagine the rest of it is just the access size (4 KB at
>>>> a time) that might be tunable through some Berkeley DB settings.
>>>>
>>>> The downside of making that particular change to the comparison
>>>> method is that it breaks storage space compatibility.
>>>>
>>>> I wonder if it might be possible to accomplish the same thing in the
>>>> current db format by modifying iterate_handles() to just run the
>>>> cursor backwards (using DB_PREV instead of DB_NEXT)?  That wouldn't
>>>> hurt storage space compatibility (if it works), but I don't know if
>>>> it makes any difference to callers of that function what order the
>>>> handles come out in.
>>> It doesn't matter to the caller.  You'll also need to set the cursor
>>> to the last position in the db with DB_LAST.  Does DB_PREV work with
>>> DB_MULTIPLE though?  It's not clear from the above: does the
>>> improvement to 34 seconds occur with MULTIPLE or without?
>>>
>>> I mentioned previously that the dspace db gets opened with the RECNUM
>>> flag.  I don't think that's necessary, and removing it will
>>> invariably improve performance, but we need a way to return the
>>> position for iterate_handles.  The easiest thing to do is turn
>>> PVFS_ds_position into a uint64_t (currently it's only uint32_t).
>>> That breaks interfaces and protocols though.
>>
>> I don't know if the PREV approach would work with MULTIPLE or not.
>> The 34 second times (with inverted comparison function) were run
>> with your MULTIPLE patches applied.  I didn't try it without the
>> patches.
>
> I couldn't find anything in the Berkeley DB docs about DB_MULTIPLE_KEY
> and DB_PREV not being allowed, but when I tried it, it returned an
> error about Illegal flag combinations.

I guess the doc does say it's not allowed:

------
The DB_MULTIPLE_KEY flag may only be used with the DB_CURRENT,  
DB_FIRST, DB_GET_BOTH, DB_GET_BOTH_RANGE, DB_NEXT, DB_NEXT_DUP,  
DB_NEXT_NODUP, DB_SET, DB_SET_RANGE, and DB_SET_RECNO options.
------

It seems strange that they have that restriction.  Most of the btree  
implementation seems symmetric except for that.
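
For what it's worth, here is a minimal sketch (not PVFS2 or Berkeley DB
code) of why inverting the comparison function gets the same visiting
order as a backwards cursor walk.  A sorted array stands in for the
btree, and the names cmp_gt, cmp_lt, and forward_equals_reverse are made
up for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical ">"-style comparison: larger handles sort first. */
int cmp_gt(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Hypothetical "<"-style comparison: smaller handles sort first. */
int cmp_lt(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Returns 1 if a forward walk under cmp_lt visits the handles in the
 * same order as a backward walk under cmp_gt, 0 if it does not, and
 * -1 on allocation failure.
 */
int forward_equals_reverse(const uint64_t *handles, size_t n)
{
    uint64_t *by_gt, *by_lt;
    size_t i;
    int same = 1;

    if (n == 0)
        return 1;

    by_gt = malloc(n * sizeof(*by_gt));
    by_lt = malloc(n * sizeof(*by_lt));
    if (!by_gt || !by_lt) {
        free(by_gt);
        free(by_lt);
        return -1;
    }
    memcpy(by_gt, handles, n * sizeof(*by_gt));
    memcpy(by_lt, handles, n * sizeof(*by_lt));

    /* Sorting stands in for the order the btree keeps its keys in. */
    qsort(by_gt, n, sizeof(*by_gt), cmp_gt);
    qsort(by_lt, n, sizeof(*by_lt), cmp_lt);

    /* Forward index i under cmp_lt vs. backward index under cmp_gt. */
    for (i = 0; i < n; i++) {
        if (by_lt[i] != by_gt[n - 1 - i]) {
            same = 0;
            break;
        }
    }

    free(by_gt);
    free(by_lt);
    return same;
}
```

The point is only that the two orders are mirror images, so a forward
walk with the inverted comparison sees the keys in the order a backward
walk would with the existing one.  It says nothing about on-disk layout.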

-sam

>   So our options are either to use DB_PREV without DB_MULTIPLE (no
> storage format changes), or to change the comparison function and
> storage format so that we can use DB_NEXT with DB_MULTIPLE_KEY.
>
> Checking the storage format version and providing the appropriate  
> comparison function wouldn't be hard though, and wouldn't require  
> any "migration" of the old to new format.  Older formats wouldn't  
> benefit from the performance improvements though.
>
> -sam
>
>>
>> -Phil
>>
>
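
The version check described in the quoted message could look roughly
like the sketch below.  The version number, the typedef, and all of the
function names are hypothetical and not taken from the PVFS2 tree; in
practice the chosen function would be handed to the db when it is opened
rather than called directly:

```c
#include <stdint.h>

/* Hypothetical comparison signature over two handle values. */
typedef int (*handle_cmp_fn)(uint64_t a, uint64_t b);

/* Hypothetical: first storage format version that uses "<" ordering. */
#define SKETCH_INVERTED_CMP_VERSION 2u

/* Stand-in for the existing ">"-style ordering: larger handles first. */
int handle_cmp_legacy(uint64_t a, uint64_t b)
{
    return (a < b) - (a > b);
}

/* Stand-in for the inverted "<"-style ordering: smaller handles first. */
int handle_cmp_inverted(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}

/*
 * Pick the comparison function that matches the format the storage
 * space was created with.  Old storage spaces keep their old ordering,
 * so no migration is needed; only new ones get the inverted ordering.
 */
handle_cmp_fn select_handle_cmp(uint32_t storage_format_version)
{
    if (storage_format_version >= SKETCH_INVERTED_CMP_VERSION)
        return handle_cmp_inverted;
    return handle_cmp_legacy;
}
```

With this shape, a storage space created under the old format keeps
working unchanged, it just doesn't see the faster startup.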


