[Pvfs2-developers] server crash on startup with millions of files
Phil Carns
pcarns at wastedcycles.org
Fri Feb 23 10:16:36 EST 2007
I get an error about DB_BUFFER_SMALL being undefined in this patch.
Should it have the same #ifdefs wrapped around it as are currently in
dbpf-keyval.c? It is using ENOMEM if DB_BUFFER_SMALL isn't defined.
-Phil
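
[Editorial sketch, not part of the patch] The guard asked about above could look roughly like the following. This is only a minimal illustration under assumptions: the macro name ITER_BUFFER_SMALL and the helper iter_buffer_too_small are hypothetical, and the actual #ifdefs in dbpf-keyval.c may be arranged differently. The only behavior taken from the message is the fallback to ENOMEM when DB_BUFFER_SMALL is not defined.

```c
#include <errno.h>

/* In real code <db.h> would be included before this point. It is left
 * out here so the sketch builds without Berkeley DB installed, which
 * means the ENOMEM fallback branch is the one that gets compiled. */

#ifdef DB_BUFFER_SMALL
#define ITER_BUFFER_SMALL DB_BUFFER_SMALL   /* hypothetical macro name */
#else
#define ITER_BUFFER_SMALL ENOMEM            /* fallback described in the message */
#endif

/* Hypothetical helper: returns 1 when a db get return code means the
 * caller-supplied buffer was too small, 0 otherwise. */
static int iter_buffer_too_small(int db_ret)
{
    return db_ret == ITER_BUFFER_SMALL;
}
```

With a guard like this, the iteration code can test for the "buffer too small" case through one name regardless of which Berkeley DB headers are in use.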
Phil Carns wrote:
> Thanks Sam! We will give these patches a try and report back.
>
> -Phil
>
> Sam Lang wrote:
>
>>
>> Hi Phil,
>>
>> Attached mult.patch implements iterating over the dspace db using
>> DB_MULTIPLE_KEY. This may allow for the db get call to do larger
>> reads from your SAN. I was seeing slightly better performance with
>> local disk after creating 20K files in a fresh storage space. Doing
>> strace doesn't show fewer mmaps or larger reads though, so I'm not
>> sure how berkeley db pulls in its pages. Anyway, if it helps improve
>> performance for you guys, I can clean it up a bit and commit it. I
>> don't think anything uses dspace_iterate_handles besides that ledger
>> handle management code.
>>
>> You can fiddle the MAX_NUM_VERIFY_HANDLE_COUNT value to set how many
>> handles to get at a time. Right now it's set to 4096. Keep in mind
>> that this requires a much larger buffer allocated in
>> dbpf_dspace_iterate_handles_op_svc, since we have to get keys and
>> values, so essentially we do a get with a buffer that's 4096*(sizeof
>> (handle) + sizeof(stored_attr)), which ends up being about 300K.
>>
>> I also attached a patch (server-start.patch) that prints out the
>> start message as well as ready message after server initialization
>> has completed. If you set the Logstamp to usec, you'll be able to
>> see the time it takes to initialize the server. Also, this might
>> help in knowing when you can mount the clients, although, hopefully
>> at some point we'll be able to add the zero-conf stuff and then we
>> can return EAGAIN or something.
>>
>> I'm not sure it's time to replace the ledger code. It seems to work
>> ok, and to fix the slowness you're seeing would mean switching to
>> some kind of range tree that could be serialized to disk so that we
>> wouldn't have to iterate through the entire dspace db on startup.
>> That opens up the possibility of the dspace db and the ledger-on-disk
>> getting out of sync, which I'd rather avoid.
>>
>> We could hand out new handles by choosing one randomly, and then
>> checking if it's in the DB, getting rid of the need for a ledger
>> entirely, but I assume this idea was already scratched to avoid the
>> potential costs at creation time, especially as the filesystem grows.
>>
>> -sam
>>
>>
>>
>> On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
>>
>>> Robert Latham wrote:
>>>
>>>> On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
>>>>
>>>>> Oh, and one other detail; the memory usage of the servers looks
>>>>> fine during startup, so this doesn't appear to be a memory leak.
>>>>> There is quite a bit of CPU work, but I am guessing that is just
>>>>> berkeley db keeping busy in the iteration function.
>>>>
>>>>
>>>> How long does it take to scan 1.4 million files on startup?
>>>> ==rob
>>>
>>>
>>>
>>> That's an interesting issue :)
>>>
>>> A few observations:
>>>
>>> - we were looking at this on SAN; the results may be different on
>>> local disks
>>>
>>> - the db files are on the order of 500 MB for this particular setup
>>>
>>> - the time to scan varies depending on if the db files are hot in
>>> the Linux buffer cache
>>>
>>> If we start the daemon right after killing another one that just did
>>> the same scan, then the process is CPU intensive, but fast (about 5
>>> seconds). If we unmount/mount the SAN between the two runs so that
>>> the buffer cache is cleared, then it is very slow (about 5 minutes).
>>>
>>> An interesting trick is to use dd with a healthy buffer size to read
>>> the .db files and throw the output into /dev/null before starting
>>> the servers. This only takes a few seconds, and makes it so that
>>> the scan consistently finishes in just a few seconds as well. I
>>> think the reason is just that it forces the db data into the Linux
>>> buffer cache using an efficient access pattern so that berkeley db
>>> doesn't have to wait on disk latency for whatever small accesses it
>>> is performing.
>>>
>>> This seems to indicate that berkeley db's access pattern generated
>>> by PVFS2 for this case isn't very friendly, at least to SANs that
>>> aren't specifically tuned for it.
>>>
>>> The 5 minute scan time is a problem, because it makes it hard to
>>> tell when you will actually be able to mount the file system after
>>> the daemons appear to have started. We would be happy to try out
>>> any optimizations here :)
>>>
>>> -Phil
>>>
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>
>>
>
>
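
[Editorial sketch] The buffer arithmetic in Sam's message, 4096*(sizeof(handle) + sizeof(stored_attr)) coming to about 300K, can be worked through as below. The sizes are assumptions for illustration only: an 8-byte handle and a 64-byte stored attribute are stand-ins chosen to land near the quoted figure, not values taken from the PVFS2 source, and bulk_buffer_size is a hypothetical helper. The rounding step reflects Berkeley DB's documented requirement that a bulk-retrieval buffer be a multiple of 1024 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NUM_VERIFY_HANDLE_COUNT 4096   /* value quoted in the thread */

/* Hypothetical stand-ins for the real key and value types. */
typedef uint64_t sketch_handle_t;                        /* assumed 8 bytes  */
struct sketch_stored_attr { unsigned char bytes[64]; };  /* assumed 64 bytes */

/* Bytes needed to hold `count` key/value pairs, rounded up to the next
 * multiple of 1024 as bulk retrieval buffers require. */
static size_t bulk_buffer_size(size_t count, size_t key_size, size_t val_size)
{
    size_t raw = count * (key_size + val_size);
    return (raw + 1023) / 1024 * 1024;
}
```

Under these assumed sizes, bulk_buffer_size(MAX_NUM_VERIFY_HANDLE_COUNT, sizeof(sketch_handle_t), sizeof(struct sketch_stored_attr)) gives 4096 * 72 = 294912 bytes, or 288 KiB, in line with the "about 300K" estimate. Berkeley DB also uses part of a bulk buffer for its own bookkeeping, so a single get may return somewhat fewer pairs than this raw product suggests.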