[Pvfs2-developers] server crash on startup with millions of files
Sam Lang
slang at mcs.anl.gov
Fri Feb 23 09:46:55 EST 2007
On Feb 23, 2007, at 9:16 AM, Phil Carns wrote:
> I get an error about DB_BUFFER_SMALL being undefined in this patch.
> Should it have the same #ifdefs wrapped around it as are currently
> in dbpf-keyval.c? It is using ENOMEM if DB_BUFFER_SMALL isn't
> defined.
Yeah. The patch needs a bit of cleanup if it works. I assumed you'd
be using one of the newer versions of Berkeley DB, as I'm not sure
when DB_MULTIPLE was added.
-sam
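
The fallback Phil describes can be sketched as a small compatibility macro. This is only an illustration, not the code in dbpf-keyval.c: EXAMPLE_BUFFER_SMALL and is_buffer_too_small are hypothetical names, and a real build would include <db.h> ahead of the #ifdef so that Berkeley DB versions that define DB_BUFFER_SMALL can supply it.

```c
#include <errno.h>

/* In a real build, <db.h> would be included here so that Berkeley DB
 * releases that define DB_BUFFER_SMALL can supply it. It is left out
 * so this sketch compiles without the library installed. */

#ifdef DB_BUFFER_SMALL
#define EXAMPLE_BUFFER_SMALL DB_BUFFER_SMALL
#else
/* Fallback mentioned in the thread: use ENOMEM when the constant
 * is not defined. */
#define EXAMPLE_BUFFER_SMALL ENOMEM
#endif

/* Hypothetical helper: nonzero if a get call reported that the
 * caller-supplied buffer was too small. */
int is_buffer_too_small(int db_ret)
{
    return db_ret == EXAMPLE_BUFFER_SMALL;
}
```

Keeping the version check in one macro means the iteration code can test a single value regardless of which Berkeley DB release it is built against.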
>
> -Phil
>
>
>
> Phil Carns wrote:
>> Thanks Sam! We will give these patches a try and report back.
>> -Phil
>> Sam Lang wrote:
>>>
>>> Hi Phil,
>>>
>>> Attached mult.patch implements iterating over the dspace db
>>> using DB_MULTIPLE_KEY. This may allow for the db get call to do
>>> larger reads from your SAN. I was seeing slightly better
>>> performance with local disk after creating 20K files in a fresh
>>> storage space. Doing strace doesn't show fewer mmaps or larger
>>> reads though, so I'm not sure how Berkeley DB pulls in its
>>> pages. Anyway, if it helps improve performance for you guys, I
>>> can clean it up a bit and commit it. I don't think anything
>>> uses dspace_iterate_handles besides that ledger handle
>>> management code.
>>>
>>> You can fiddle with the MAX_NUM_VERIFY_HANDLE_COUNT value to set
>>> how many handles to get at a time. Right now it's set to 4096.
>>> Keep in mind that this requires a much larger buffer allocated in
>>> dbpf_dspace_iterate_handles_op_svc, since we have to get keys
>>> and values, so essentially we do a get with a buffer that's
>>> 4096 * (sizeof(handle) + sizeof(stored_attr)), which ends up
>>> being about 300K.
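
The buffer arithmetic above can be checked with a short sketch. The types here are stand-ins, not the PVFS2 definitions: example_handle_t assumes an 8-byte handle and struct example_stored_attr assumes a 64-byte stored attribute record, chosen only so that the total lands near the quoted figure.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: handles are 8 bytes wide. */
typedef uint64_t example_handle_t;

/* Assumption: a stored attribute record of 64 bytes. The real
 * structure in the source tree may have a different size. */
struct example_stored_attr {
    unsigned char bytes[64];
};

/* Bytes needed to hold `count` key/value pairs fetched in one
 * bulk get: count * (key size + value size). */
size_t bulk_buffer_size(size_t count)
{
    return count * (sizeof(example_handle_t) +
                    sizeof(struct example_stored_attr));
}
```

Under these assumed sizes, bulk_buffer_size(4096) comes to 294912 bytes (288 KiB), in line with the "about 300K" estimate.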
>>>
>>> I also attached a patch (server-start.patch) that prints out the
>>> start message as well as ready message after server
>>> initialization has completed. If you set the Logstamp to usec,
>>> you'll be able to see the time it takes to initialize the
>>> server. Also, this might help in knowing when you can mount the
>>> clients, although, hopefully at some point we'll be able to add
>>> the zero-conf stuff and then we can return EAGAIN or something.
>>>
>>> I'm not sure it's time to replace the ledger code. It seems to
>>> work ok, and to fix the slowness you're seeing would mean
>>> switching to some kind of range tree that could be serialized to
>>> disk so that we wouldn't have to iterate through the entire
>>> dspace db on startup. That opens up the possibility of the
>>> dspace db and the ledger-on-disk getting out of sync, which I'd
>>> rather avoid.
>>>
>>> We could hand out new handles by choosing one randomly, and then
>>> checking if it's in the DB, getting rid of the need for a ledger
>>> entirely, but I assume this idea was already scratched to avoid
>>> the potential costs at creation time, especially as the
>>> filesystem grows.
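
The random-handle idea can be sketched as follows. This is not PVFS2 code: pick_free_handle, example_set, and example_set_contains are hypothetical names, the in-memory array stands in for a lookup against the dspace db, and the xorshift generator is only a simple placeholder for a random source. The retry loop is where the creation-time cost mentioned above would show up, since each collision means another lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Callback answering "is this handle already taken?". In a real
 * server this would be a database lookup. */
typedef int (*in_use_fn)(uint64_t handle, void *ctx);

/* Simple xorshift PRNG; *state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Try up to max_tries random candidates in [lo, hi] (lo <= hi).
 * Returns 0 and stores the handle in *out on success, -1 if every
 * candidate tried was already in use. */
int pick_free_handle(uint64_t lo, uint64_t hi, in_use_fn in_use, void *ctx,
                     uint64_t *rng_state, unsigned max_tries, uint64_t *out)
{
    uint64_t span = hi - lo + 1; /* wraps to 0 for the full 64-bit range */
    for (unsigned i = 0; i < max_tries; i++) {
        uint64_t r = xorshift64(rng_state);
        uint64_t candidate = span ? lo + r % span : r;
        if (!in_use(candidate, ctx)) {
            *out = candidate;
            return 0;
        }
    }
    return -1;
}

/* Stand-in for the database: a plain array of used handles. */
struct example_set {
    const uint64_t *handles;
    size_t count;
};

int example_set_contains(uint64_t handle, void *ctx)
{
    const struct example_set *set = ctx;
    for (size_t i = 0; i < set->count; i++)
        if (set->handles[i] == handle)
            return 1;
    return 0;
}
```

As the handle range fills up, the expected number of lookups per allocation grows, which is the trade-off against scanning the whole db once at startup.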
>>>
>>> -sam
>>>
>>>
>>>
>>> On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
>>>
>>>> Robert Latham wrote:
>>>>
>>>>> On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
>>>>>
>>>>>> Oh, and one other detail; the memory usage of the servers
>>>>>> looks fine during startup, so this doesn't appear to be a
>>>>>> memory leak. There is quite a bit of CPU work, but I am
>>>>>> guessing that is just Berkeley DB keeping busy in the
>>>>>> iteration function.
>>>>>
>>>>>
>>>>> How long does it take to scan 1.4 million files on startup?
>>>>> ==rob
>>>>
>>>>
>>>>
>>>> That's an interesting issue :)
>>>>
>>>> A few observations:
>>>>
>>>> - we were looking at this on SAN; the results may be different
>>>> on local disks
>>>>
>>>> - the db files are on the order of 500 MB for this particular setup
>>>>
>>>> - the time to scan varies depending on if the db files are hot
>>>> in the Linux buffer cache
>>>>
>>>> If we start the daemon right after killing another one that
>>>> just did the same scan, then the process is CPU intensive, but
>>>> fast (about 5 seconds). If we unmount/mount the SAN between
>>>> the two runs so that the buffer cache is cleared, then it is
>>>> very slow (about 5 minutes).
>>>>
>>>> An interesting trick is to use dd with a healthy buffer size to
>>>> read the .db files and throw the output into /dev/null before
>>>> starting the servers. This only takes a few seconds, and makes
>>>> it so that the scan consistently finishes in just a few seconds
>>>> as well. I think the reason is just that it forces the db data
>>>> into the Linux buffer cache using an efficient access pattern
>>>> so that Berkeley DB doesn't have to wait on disk latency for
>>>> whatever small accesses it is performing.
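
The dd trick amounts to one sequential pass over each .db file with large reads, discarding the data. A minimal C sketch of the same idea is below; prewarm_file is a hypothetical name, and whether the pages actually stay in the buffer cache depends on the kernel and on available memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read the whole file sequentially in chunk_size pieces and throw
 * the data away, so the kernel pulls it through the buffer cache.
 * Returns the number of bytes read, or -1 on error. */
long long prewarm_file(const char *path, size_t chunk_size)
{
    if (chunk_size == 0)
        return -1;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char *buf = malloc(chunk_size);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }

    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, chunk_size, fp)) > 0)
        total += (long long)n;

    int failed = ferror(fp);
    free(buf);
    fclose(fp);
    return failed ? -1 : total;
}
```

Calling this on each .db file with a chunk size of a few megabytes before starting the server would mirror the dd run described above.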
>>>>
>>>> This seems to indicate that Berkeley DB's access pattern
>>>> generated by PVFS2 for this case isn't very friendly, at least
>>>> to SANs that aren't specifically tuned for it.
>>>>
>>>> The 5 minute scan time is a problem, because it makes it hard
>>>> to tell when you will actually be able to mount the file system
>>>> after the daemons appear to have started. We would be happy to
>>>> try out any optimizations here :)
>>>>
>>>> -Phil
>>>>
>>>> _______________________________________________
>>>> Pvfs2-developers mailing list
>>>> Pvfs2-developers at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>
>>>
>