[Pvfs2-developers] server crash on startup with millions of files

Sam Lang slang at mcs.anl.gov
Fri Feb 23 09:46:55 EST 2007


On Feb 23, 2007, at 9:16 AM, Phil Carns wrote:

> I get an error about DB_BUFFER_SMALL being undefined in this patch.  
> Should it have the same #ifdefs wrapped around it as are currently  
> in dbpf-keyval.c?  It is using ENOMEM if DB_BUFFER_SMALL isn't  
> defined.

Yeah.  The patch needs a bit of cleanup if it works.  I assumed you'd  
be using one of the newer versions of Berkeley DB, as I'm not sure  
when DB_MULTIPLE was added.
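
The guard Phil describes could look roughly like the sketch below. It assumes, as the thread says, that the fallback is ENOMEM when the installed db.h does not define DB_BUFFER_SMALL. The names DBPF_BUFFER_SMALL_ERR and dbpf_is_buffer_small are made up for illustration and are not taken from the PVFS2 tree.

```c
#include <assert.h>
#include <errno.h>

/* In real code <db.h> would be included before this point, so that
 * DB_BUFFER_SMALL is visible whenever the installed Berkeley DB defines it.
 * DBPF_BUFFER_SMALL_ERR is a hypothetical name used only in this sketch. */
#ifdef DB_BUFFER_SMALL
#define DBPF_BUFFER_SMALL_ERR DB_BUFFER_SMALL
#else
#define DBPF_BUFFER_SMALL_ERR ENOMEM
#endif

/* Returns nonzero when a db call failed because the caller's buffer
 * was too small to hold the result. */
static int dbpf_is_buffer_small(int db_ret)
{
    return db_ret == DBPF_BUFFER_SMALL_ERR;
}
```

With a wrapper like this, the bulk-get path can test the return code the same way on old and new library versions.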

-sam

>
> -Phil
>
>
>
> Phil Carns wrote:
>> Thanks Sam!  We will give these patches a try and report back.
>> -Phil
>> Sam Lang wrote:
>>>
>>> Hi Phil,
>>>
>>> The attached mult.patch implements iterating over the dspace db using
>>> DB_MULTIPLE_KEY.  This may allow the db get call to do larger reads
>>> from your SAN.  I was seeing slightly better performance on local
>>> disk after creating 20K files in a fresh storage space.  strace
>>> doesn't show fewer mmaps or larger reads though, so I'm not sure how
>>> Berkeley DB pulls in its pages.  Anyway, if it helps improve
>>> performance for you guys, I can clean it up a bit and commit it.  I
>>> don't think anything uses dspace_iterate_handles besides that ledger
>>> handle management code.
>>>
>>> You can fiddle with the MAX_NUM_VERIFY_HANDLE_COUNT value to set how
>>> many handles to get at a time.  Right now it's set to 4096.  Keep in
>>> mind that this requires a much larger buffer allocated in
>>> dbpf_dspace_iterate_handles_op_svc, since we have to get keys and
>>> values, so essentially we do a get with a buffer that's
>>> 4096 * (sizeof(handle) + sizeof(stored_attr)), which ends up being
>>> about 300K.
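
The buffer arithmetic above can be sketched as follows. The record sizes are assumptions chosen only to make the numbers concrete (an 8-byte handle and a 64-byte stored attribute record); the real PVFS2 types may differ. The round-up helper reflects the Berkeley DB documentation's requirement that bulk-get buffers be a multiple of 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NUM_VERIFY_HANDLE_COUNT 4096

/* ASSUMED sizes, for illustration only: a 64-bit handle and a 64-byte
 * stored attribute record. These are stand-ins, not the PVFS2 types. */
typedef uint64_t example_handle_t;
typedef struct { unsigned char bytes[64]; } example_stored_attr_t;

/* Raw payload needed to hold handle_count key/value pairs. */
static size_t bulk_get_payload_size(size_t handle_count)
{
    return handle_count *
           (sizeof(example_handle_t) + sizeof(example_stored_attr_t));
}

/* Round a size up to the next multiple of 1024 bytes. */
static size_t round_up_to_1k(size_t n)
{
    return (n + 1023) / 1024 * 1024;
}
```

Under these assumed sizes, 4096 pairs need 4096 * 72 = 294912 bytes (288 KiB), in line with the "about 300K" figure. The bulk buffer also carries some per-item bookkeeping, so a real allocation would likely need extra headroom.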
>>>
>>> I also attached a patch (server-start.patch) that prints out the
>>> start message as well as a ready message after server initialization
>>> has completed.  If you set the Logstamp to usec, you'll be able to
>>> see the time it takes to initialize the server.  This might also
>>> help in knowing when you can mount the clients, although hopefully
>>> at some point we'll be able to add the zero-conf stuff and then we
>>> can return EAGAIN or something.
>>>
>>> I'm not sure it's time to replace the ledger code.  It seems to work
>>> ok, and fixing the slowness you're seeing would mean switching to
>>> some kind of range tree that could be serialized to disk, so that we
>>> wouldn't have to iterate through the entire dspace db on startup.
>>> That opens up the possibility of the dspace db and the on-disk
>>> ledger getting out of sync, which I'd rather avoid.
>>>
>>> We could hand out new handles by choosing one randomly and then
>>> checking if it's in the DB, getting rid of the need for a ledger
>>> entirely, but I assume this idea was already scratched to avoid the
>>> potential costs at creation time, especially as the filesystem
>>> grows.
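
A minimal sketch of the random-handle idea, to make the creation-time cost concrete. The in-memory `used` array is a stand-in for a lookup in the dspace db, and every name here is hypothetical. Each probe would be one db get in a real server.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for "is this handle already in the dspace db?".
 * A real server would do a db lookup instead of an array read. */
typedef struct {
    const unsigned char *used;  /* used[h] != 0 means handle h is taken */
    uint64_t range;             /* valid handles are 0 .. range-1 */
} example_handle_space;

/* Picks random handles until a free one is found or max_tries runs out.
 * On success returns 0 and stores the handle and the number of probes.
 * On failure returns -1 and stores max_tries as the probe count. */
static int pick_random_free_handle(const example_handle_space *hs,
                                   unsigned max_tries,
                                   uint64_t *handle_out,
                                   unsigned *probes_out)
{
    assert(hs != NULL && hs->range > 0);
    assert(handle_out != NULL && probes_out != NULL);

    for (unsigned i = 1; i <= max_tries; i++) {
        uint64_t candidate = (uint64_t)rand() % hs->range;
        if (!hs->used[candidate]) {
            *handle_out = candidate;
            *probes_out = i;
            return 0;
        }
    }
    *probes_out = max_tries;
    return -1;
}
```

With uniformly random picks, the expected number of probes is 1 / (1 - f), where f is the fraction of the handle range already in use, so the cost per create rises as the range fills up.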
>>>
>>> -sam
>>>
>>>
>>>
>>> On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
>>>
>>>> Robert Latham wrote:
>>>>
>>>>> On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
>>>>>
>>>>>> Oh, and one other detail: the memory usage of the servers looks
>>>>>> fine during startup, so this doesn't appear to be a memory leak.
>>>>>> There is quite a bit of CPU work, but I am guessing that is just
>>>>>> Berkeley DB keeping busy in the iteration function.
>>>>>
>>>>>
>>>>> How long does it take to scan 1.4 million files on startup?
>>>>> ==rob
>>>>
>>>>
>>>>
>>>> That's an interesting issue :)
>>>>
>>>> A few observations:
>>>>
>>>> - we were looking at this on SAN; the results may be different on
>>>>   local disks
>>>>
>>>> - the db files are on the order of 500 MB for this particular setup
>>>>
>>>> - the time to scan varies depending on whether the db files are hot
>>>>   in the Linux buffer cache
>>>>
>>>> If we start the daemon right after killing another one that just did
>>>> the same scan, then the process is CPU intensive but fast (about 5
>>>> seconds).  If we unmount/mount the SAN between the two runs so that
>>>> the buffer cache is cleared, then it is very slow (about 5 minutes).
>>>>
>>>> An interesting trick is to use dd with a healthy buffer size to read
>>>> the .db files and throw the output into /dev/null before starting
>>>> the servers.  This only takes a few seconds, and makes the scan
>>>> consistently finish in just a few seconds as well.  I think the
>>>> reason is just that it forces the db data into the Linux buffer
>>>> cache using an efficient access pattern, so that Berkeley DB doesn't
>>>> have to wait on disk latency for whatever small accesses it is
>>>> performing.
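
The dd trick above amounts to one large sequential read of each .db file. A rough C equivalent is sketched below, assuming a POSIX system. The function name warm_file_cache is hypothetical, and the chunk size is left to the caller since the thread does not say what buffer size was used.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Reads the whole file sequentially in chunk_size pieces and discards the
 * data, much like `dd if=<path> of=/dev/null bs=<chunk_size>`.
 * Returns the number of bytes read, or -1 on error. */
static long long warm_file_cache(const char *path, size_t chunk_size)
{
    assert(path != NULL && chunk_size > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    char *buf = malloc(chunk_size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk_size)) > 0) {
        total += n;  /* data is thrown away; only the page cache matters */
    }
    if (n < 0) {
        perror("read");
        total = -1;
    }

    free(buf);
    close(fd);
    return total;
}
```

Calling this once per .db file in the storage space before starting the server should have the same effect as the dd run, since the subsequent small Berkeley DB reads would then be served from the buffer cache.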
>>>>
>>>> This seems to indicate that the Berkeley DB access pattern generated
>>>> by PVFS2 for this case isn't very friendly, at least to SANs that
>>>> aren't specifically tuned for it.
>>>>
>>>> The 5 minute scan time is a problem, because it makes it hard to
>>>> tell when you will actually be able to mount the file system after
>>>> the daemons appear to have started.  We would be happy to try out
>>>> any optimizations here :)
>>>>
>>>> -Phil
>>>>
>>>> _______________________________________________
>>>> Pvfs2-developers mailing list
>>>> Pvfs2-developers at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>
>>>
>