[Pvfs2-developers] server crash on startup with millions of files

Phil Carns pcarns at wastedcycles.org
Fri Feb 23 10:16:36 EST 2007


I get an error about DB_BUFFER_SMALL being undefined in this patch.
Should it have the same #ifdefs wrapped around it as are currently in
dbpf-keyval.c?  That file uses ENOMEM if DB_BUFFER_SMALL isn't defined.
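
A minimal sketch of that kind of guard, for illustration only.  The name
DBPF_BUFFER_TOO_SMALL and the helper function are hypothetical and are not
taken from dbpf-keyval.c or from the patch; <db.h> is left out so the
fragment compiles on its own.

```c
#include <errno.h>

/* Real code would include <db.h> before this point; it is omitted here so
 * the sketch builds without Berkeley DB installed. */

/* Hypothetical name: maps to DB_BUFFER_SMALL when the installed Berkeley DB
 * defines it, and falls back to ENOMEM otherwise. */
#ifdef DB_BUFFER_SMALL
#define DBPF_BUFFER_TOO_SMALL DB_BUFFER_SMALL
#else
#define DBPF_BUFFER_TOO_SMALL ENOMEM
#endif

/* Returns 1 if a db get failed only because the caller's buffer was too
 * small, 0 for any other return code. */
static int dbpf_buffer_too_small(int db_ret)
{
    return db_ret == DBPF_BUFFER_TOO_SMALL;
}
```

With something like this, the new iterate code could test one symbol
regardless of which Berkeley DB version is installed.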

-Phil



Phil Carns wrote:
> Thanks Sam!  We will give these patches a try and report back.
> 
> -Phil
> 
> Sam Lang wrote:
> 
>>
>> Hi Phil,
>>
>> Attached mult.patch implements iterating over the dspace db using  
>> DB_MULTIPLE_KEY.  This may allow for the db get call to do larger  
>> reads from your SAN.  I was seeing slightly better performance with  
>> local disk after creating 20K files in a fresh storage space.  Doing  
>> strace doesn't show fewer mmaps or larger reads though, so I'm not  
>> sure how berkeley db pulls in its pages.  Anyway, if it helps improve  
>> performance for you guys, I can clean it up a bit and commit it.  I  
>> don't think anything uses dspace_iterate_handles besides that ledger  
>> handle management code.
>>
>> You can fiddle the MAX_NUM_VERIFY_HANDLE_COUNT value to set how many  
>> handles to get at a time.  Right now it's set to 4096.  Keep in mind  
>> that this requires a much larger buffer allocated in  
>> dbpf_dspace_iterate_handles_op_svc, since we have to get keys and  
>> values, so essentially we do a get with a buffer that's  
>> 4096*(sizeof(handle) + sizeof(stored_attr)), which ends up being about 300K.
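
The buffer arithmetic above can be sketched as follows.  The types here are
hypothetical stand-ins: an 8-byte handle and a 64-byte attribute record are
assumptions chosen only so the total lands near the quoted 300K, and the
real definitions live in the PVFS2 trove headers.  The round-up reflects the
Berkeley DB requirement that bulk-get buffers be a multiple of 1024 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real handle and stored-attribute types. */
typedef uint64_t example_handle;
struct example_stored_attr { unsigned char bytes[64]; };

/* Mirrors the 4096 value mentioned for MAX_NUM_VERIFY_HANDLE_COUNT. */
#define EXAMPLE_HANDLE_COUNT 4096

/* Bytes needed to hold `count` key/value pairs, rounded up to the next
 * multiple of 1024 bytes. */
static size_t bulk_buffer_size(size_t count)
{
    size_t raw = count *
        (sizeof(example_handle) + sizeof(struct example_stored_attr));
    return (raw + 1023) / 1024 * 1024;
}
```

Under these assumed sizes, bulk_buffer_size(EXAMPLE_HANDLE_COUNT) comes to
288 KiB.  The bulk format also keeps some bookkeeping inside the buffer, so
a real allocation would likely want extra headroom on top of this.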
>>
>> I also attached a patch (server-start.patch) that prints out the  
>> start message as well as ready message after server initialization  
>> has completed.  If you set the Logstamp to usec, you'll be able to  
>> see the time it takes to initialize the server.  Also, this might  
>> help in knowing when you can mount the clients, although, hopefully  
>> at some point we'll be able to add the zero-conf stuff and then we  
>> can return EAGAIN or something.
>>
>> I'm not sure it's time to replace the ledger code.  It seems to work  
>> ok, and to fix the slowness you're seeing would mean switching to  
>> some kind of range tree that could be serialized to disk so that we  
>> wouldn't have to iterate through the entire dspace db on startup.   
>> That opens up the possibility of the dspace db and the ledger-on-disk  
>> getting out of sync, which I'd rather avoid.
>>
>> We could hand out new handles by choosing one randomly, and then  
>> checking if it's in the DB, getting rid of the need for a ledger  
>> entirely, but I assume this idea was already scratched to avoid the  
>> potential costs at creation time, especially as the filesystem grows.
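
A rough sketch of the random-pick idea, to make the creation-time cost
concrete.  Everything here is hypothetical: the callback stands in for a
dspace db lookup, rand() stands in for a proper random source, and none of
the names come from PVFS2.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical lookup hook: returns nonzero if the handle already exists
 * in the dspace db. */
typedef int (*handle_in_use_fn)(uint64_t handle, void *ctx);

/* Toy stand-in for the db lookup: every even handle counts as taken. */
static int even_is_taken(uint64_t handle, void *ctx)
{
    (void)ctx;
    return handle % 2 == 0;
}

/* Try up to max_tries random candidates in [lo, hi].  Stores a free handle
 * in *out and returns 0, or returns -1 if every candidate was in use. */
static int pick_random_handle(uint64_t lo, uint64_t hi,
                              handle_in_use_fn in_use, void *ctx,
                              int max_tries, uint64_t *out)
{
    if (hi < lo)
        return -1;
    uint64_t span = hi - lo + 1;   /* wraps to 0 for the full 64-bit range */
    for (int i = 0; i < max_tries; i++) {
        /* Combine two rand() calls; not uniform, but enough for a sketch. */
        uint64_t r = ((uint64_t)rand() << 31) ^ (uint64_t)rand();
        uint64_t candidate = (span == 0) ? r : lo + r % span;
        if (!in_use(candidate, ctx)) {
            *out = candidate;
            return 0;
        }
    }
    return -1;
}
```

Each retry is one more db lookup, so the expected number of lookups per
create goes up as the handle range fills, which is the cost mentioned above.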
>>
>> -sam
>>
>>
>>
>> On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
>>
>>> Robert Latham wrote:
>>>
>>>> On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
>>>>
>>>>> Oh, and one other detail; the memory usage of the servers looks  
>>>>> fine during startup, so this doesn't appear to be a memory leak.   
>>>>> There is quite a bit of CPU work, but I am guessing that is just  
>>>>> berkeley db keeping busy in the iteration function.
>>>>
>>>>
>>>> How long does it take to scan 1.4 million files on startup?
>>>> ==rob
>>>
>>>
>>>
>>> That's an interesting issue :)
>>>
>>> A few observations:
>>>
>>> - we were looking at this on SAN; the results may be different on  
>>> local disks
>>>
>>> - the db files are on the order of 500 MB for this particular setup
>>>
>>> - the time to scan varies depending on if the db files are hot in  
>>> the Linux buffer cache
>>>
>>> If we start the daemon right after killing another one that just did
>>> the same scan, then the process is CPU intensive, but fast (about 5
>>> seconds).  If we unmount/mount the SAN between the two runs so that
>>> the buffer cache is cleared, then it is very slow (about 5 minutes).
>>>
>>> An interesting trick is to use dd with a healthy buffer size to read
>>> the .db files and throw the output into /dev/null before starting
>>> the servers.  This only takes a few seconds, and makes it so that
>>> the scan consistently finishes in just a few seconds as well.  I
>>> think the reason is just that it forces the db data into the Linux
>>> buffer cache using an efficient access pattern so that berkeley db
>>> doesn't have to wait on disk latency for whatever small accesses it
>>> is performing.
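
The same warm-up can be sketched in C: read the file front to back in large
chunks and discard the data.  The function name and the chunk size are
assumptions for illustration; this is not code from PVFS2, and it does not
retry on EINTR.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sequentially read `path` in chunks of `chunk` bytes and throw the data
 * away, so the pages end up in the OS buffer cache.  Returns the number of
 * bytes read, or -1 on error. */
static long long prewarm_file(const char *path, size_t chunk)
{
    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        total += n;

    close(fd);
    free(buf);
    return n < 0 ? -1 : total;
}
```

Calling this on each .db file with a chunk of a few megabytes before the
scan would play the role of the dd run described above.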
>>>
>>> This seems to indicate that berkeley db's access pattern generated  
>>> by PVFS2 for this case isn't very friendly, at least to SANs that  
>>> aren't specifically tuned for it.
>>>
>>> The 5 minute scan time is a problem, because it makes it hard to  
>>> tell when you will actually be able to mount the file system after  
>>> the daemons appear to have started.  We would be happy to try out  
>>> any optimizations here :)
>>>
>>> -Phil
>>>
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>
>>
> 
> 


