[Pvfs2-developers] handle ledger

Pete Wyckoff pw at osc.edu
Tue Jan 29 14:42:17 EST 2008


slang at mcs.anl.gov wrote on Tue, 29 Jan 2008 13:32 -0600:
> On Jan 28, 2008, at 6:43 PM, Pete Wyckoff wrote:
>> slang at mcs.anl.gov wrote on Mon, 28 Jan 2008 16:38 -0600:
>>> Attached patch disables the handle ledger.  For those not familiar, the
>>> handle ledger is an in-memory structure that maintains allocated handles
>>> for a given server.  I'm disabling it because reading the entire database
>>> each time the server loads is extremely expensive for large filesystems.
>>> Instead of choosing a handle from the ledger, the patch picks one randomly.
>>> This means we have to deal with collisions now, but because of our large
>>> handle space, they only occur every 100 billion times or so.
>>>
>>> I didn't blow away the handle allocation code entirely...I just disabled
>>> the calls that we had been using to invoke the handle ledger, and added
>>> some functionality that picks a random handle from a given range.  In the
>>> dspace code, I modified the create function to continue up to 32 times if a
>>> collision with an already existing handle occurs.
>>
>> Great change.  Never liked that myself either.  Some comments.
>>
>>> diff -u -a -p -r1.152 dbpf-dspace.c
>>> --- src/io/trove/trove-dbpf/dbpf-dspace.c	8 Nov 2007 21:48:22 -0000	1.152
>>> +++ src/io/trove/trove-dbpf/dbpf-dspace.c	28 Jan 2008 21:55:49 -0000
>> [..]
>>> +    } while(ret != DB_NOTFOUND && ++attempts > MAX_HANDLE_ALLOC_ATTEMPTS);
>>
>> Uh, maybe <.
>
> Are you arguing for increasing the max number of attempts, or just retrying
> forever?

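Maybe I misunderstand the termination condition for the loop.  You
want it to keep trying until attempts gets up to a certain value.
The comparison is just backwards; it should be <.  As written,
++attempts > MAX_HANDLE_ALLOC_ATTEMPTS is false on the first pass, so
the loop gives up after a single try even when there was a collision.
If I'm spacing and you're sure this is right, ignore me.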

>>> +    rfd = open("/dev/urandom", O_RDONLY, 0);
>>> +    if(rfd < 0)
>>> +    {
>>> +        return -PVFS_EINVAL;
>>> +    }
>>
>> Painted ourselves into a linux-specific corner here.  Maybe have the
>> usual time() etc. srand option here too if open fails.
>>
>>> +    random_r(&trove_handle_random_data, &r1);
>>> +    i = r1 % extent_array->extent_count;
>>
>> May want a feature test for this.  Not sure if POSIX has gotten
>> itself into all the OSes on which people may run servers.
>
> Right, I was concerned with making sure I got a good seed here.  It needs 
> to generate both a very large random sequence from the seed, as well as not 
> pick the same seed over and over on server startup.  Using initstate_r with 
> an array size of 256 makes the values returned by random_r much more 
> random, and passing the current time ensures that the seed will be 
> different on each server startup.
>
> If we use the more primitive forms of getting a random number, it's just 
> more likely to get repeated values for handles. Is that acceptable?  Does 
> it become the user's problem his random handle values aren't so random?

Yeah.  It will just run through the same set of allocated handles,
taking a long time that first time for people with lousy RNGs.  Then
it will fall into an unallocated space and continue normally.  As
long as there is a configure test for random_r, we can fall back to
lrand48() and friends or even ancient srand/rand.  The /dev/urandom
test must be at runtime, with graceful fallback to a seed made up of
hostname[0:255] | time() << 29 | coll_id << 63 | ... any other
random stuff you can get your hands on in that routine easily.
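
A rough sketch of that runtime test with a graceful fallback might look
like the following.  Everything here is an assumption for illustration:
get_seed() and mix_seed() are hypothetical names, the seed is 32 bits
wide, and the shift-and-OR expression above is replaced by a simple
FNV-1a style mix of hostname, time, pid and collection id.

```c
#define _POSIX_C_SOURCE 200112L  /* for gethostname() */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Deterministic FNV-1a style mixer (a choice made for this sketch). */
static uint32_t mix_seed(const char *host, uint32_t now,
                         uint32_t pid, uint32_t coll_id)
{
    const uint32_t words[3] = { now, pid, coll_id };
    uint32_t h = 2166136261u;  /* FNV offset basis */
    size_t i;

    for (i = 0; host[i] != '\0'; i++)
        h = (h ^ (unsigned char)host[i]) * 16777619u;
    for (i = 0; i < 3; i++)
        h = (h ^ words[i]) * 16777619u;
    return h;
}

/* Tries to read a seed from `path` (e.g. "/dev/urandom") at runtime.
 * Returns 1 if the seed came from the device, 0 if the open or read
 * failed and the seed was built from hostname, time, pid and coll_id. */
static int get_seed(const char *path, uint32_t coll_id, uint32_t *seed)
{
    char host[256] = "";
    int fd;

    assert(path != NULL && seed != NULL);

    fd = open(path, O_RDONLY);
    if (fd >= 0) {
        ssize_t n = read(fd, seed, sizeof(*seed));

        close(fd);
        if (n == (ssize_t)sizeof(*seed))
            return 1;
    }

    /* Fallback: no usable random device on this system. */
    if (gethostname(host, sizeof(host) - 1) != 0)
        host[0] = '\0';
    *seed = mix_seed(host, (uint32_t)time(NULL), (uint32_t)getpid(), coll_id);
    return 0;
}
```

The result could then seed whichever generator the configure test
selected.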

I thought about proposing just doing linear allocation.  Find the
highest handle, allocate +1 on that.  That's what we do with the
OSDs, using a 1-element cache to remember the last handle allocated.
This works nicely until you first fill up your handle space and have
to wrap, then can go bad if you hit a run of undeleted old handles.

I've no idea what the cost is to run the RNG.  Presumably it is very
fast.  In which case just doing it all the time like you have it is
perfect.

		-- Pete

