[Pvfs2-developers] server crash on startup with millions of files

Sam Lang slang at mcs.anl.gov
Thu Feb 22 18:54:12 EST 2007


Hi Phil,

The attached mult.patch implements iterating over the dspace db using
DB_MULTIPLE_KEY.  This may allow the db get call to do larger
reads from your SAN.  I was seeing slightly better performance with
local disk after creating 20K files in a fresh storage space.  Running
strace doesn't show fewer mmaps or larger reads though, so I'm not
sure how Berkeley DB pulls in its pages.  Anyway, if it helps improve
performance for you guys, I can clean it up a bit and commit it.  I
don't think anything uses dspace_iterate_handles besides the ledger
handle management code.

You can adjust the MAX_NUM_VERIFY_HANDLE_COUNT value to set how many
handles to get at a time.  Right now it's set to 4096.  Keep in mind
that this requires a much larger buffer allocated in
dbpf_dspace_iterate_handles_op_svc, since we have to get both keys and
values.  Essentially we do a get with a buffer that's
4096 * (sizeof(handle) + sizeof(stored_attr)), which ends up being
about 300K.
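
As a rough check on that figure, the arithmetic can be sketched as below.
The 8-byte handle and 64-byte stored attribute sizes are assumptions chosen
for illustration, not values taken from the PVFS2 headers, and the sketch
ignores any per-entry bookkeeping the bulk buffer format may add.

```python
# Back-of-the-envelope size of the buffer needed for one bulk get.
# ASSUMPTION: HANDLE_SIZE and STORED_ATTR_SIZE are illustrative guesses;
# substitute the real sizeof() results from your build.
MAX_NUM_VERIFY_HANDLE_COUNT = 4096
HANDLE_SIZE = 8
STORED_ATTR_SIZE = 64


def bulk_buffer_bytes(count, key_size, value_size):
    """Bytes needed to hold `count` key/value pairs back to back."""
    return count * (key_size + value_size)


buf = bulk_buffer_bytes(MAX_NUM_VERIFY_HANDLE_COUNT, HANDLE_SIZE, STORED_ATTR_SIZE)
print(f"{buf} bytes (~{buf / 1024:.0f} KiB)")
```

Under those assumed sizes the result lands just under 300K, and halving
MAX_NUM_VERIFY_HANDLE_COUNT halves the buffer.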

I also attached a patch (server-start.patch) that prints a start
message, as well as a ready message once server initialization
has completed.  If you set the Logstamp to usec, you'll be able to
see the time it takes to initialize the server.  Also, this might  
help in knowing when you can mount the clients, although, hopefully  
at some point we'll be able to add the zero-conf stuff and then we  
can return EAGAIN or something.

I'm not sure it's time to replace the ledger code.  It seems to work
ok, and to fix the slowness you're seeing would mean switching to  
some kind of range tree that could be serialized to disk so that we  
wouldn't have to iterate through the entire dspace db on startup.   
That opens up the possibility of the dspace db and the ledger-on-disk  
getting out of sync, which I'd rather avoid.

We could hand out new handles by choosing one at random and then
checking whether it's in the DB, getting rid of the need for a ledger
entirely, but I assume this idea was already scratched to avoid the
potential costs at creation time, especially as the filesystem grows.
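
To make that creation-time cost concrete, here is a minimal sketch of the
random-probe idea.  It is not PVFS2 code: the set stands in for a dspace db
lookup, and allocate_handle, average_probes and the handle range are
hypothetical names and values made up for the example.

```python
import random


def allocate_handle(used, lo, hi, rng=random, max_probes=1000):
    """Pick a random handle in [lo, hi] that is not already in `used`.

    `used` is a set standing in for a db lookup (an assumption for this
    sketch).  Returns (handle, number_of_probes).
    """
    for probes in range(1, max_probes + 1):
        candidate = rng.randint(lo, hi)
        if candidate not in used:
            used.add(candidate)
            return candidate, probes
    raise RuntimeError("no free handle found within max_probes")


def average_probes(fill_fraction, space=100_000, trials=2_000, seed=0):
    """Average lookups per allocation when `fill_fraction` of the range is used."""
    rng = random.Random(seed)
    used = set(rng.sample(range(space), int(space * fill_fraction)))
    total = 0
    for _ in range(trials):
        handle, probes = allocate_handle(used, 0, space - 1, rng)
        used.discard(handle)  # release it so the fill level stays constant
        total += probes
    return total / trials


for fill in (0.1, 0.5, 0.9):
    print(f"fill={fill:.0%}: about {average_probes(fill):.2f} lookups per create")
```

Each probe succeeds with probability (1 - f) when a fraction f of the handle
range is in use, so the expected number of lookups per create is 1 / (1 - f).
That is cheap in a sparse range, but every probe is a db lookup, and the
count climbs as the range fills.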

-sam

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mult.patch
Type: application/octet-stream
Size: 5769 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20070222/d27df28a/mult.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: server-start.patch
Type: application/octet-stream
Size: 5180 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20070222/d27df28a/server-start.obj
-------------- next part --------------

On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:

> Robert Latham wrote:
>> On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
>>> Oh, and one other detail; the memory usage of the servers looks  
>>> fine during startup, so this doesn't appear to be a memory leak.   
>>> There is quite a bit of CPU work, but I am guessing that is just  
>>> Berkeley DB keeping busy in the iteration function.
>> How long does it take to scan 1.4 million files on startup?
>> ==rob
>
> That's an interesting issue :)
>
> A few observations:
>
> - we were looking at this on SAN; the results may be different on  
> local disks
>
> - the db files are on the order of 500 MB for this particular setup
>
> - the time to scan varies depending on if the db files are hot in  
> the Linux buffer cache
>
> If we start the daemon right after killing another one that just  
> did the same scan, then the process is CPU intensive, but fast  
> (about 5 seconds).  If we unmount/mount the SAN between the two  
> runs so that the buffer cache is cleared, then it is very slow  
> (about 5 minutes).
>
> An interesting trick is to use dd with a healthy buffer size to  
> read the .db files and throw the output into /dev/null before  
> starting the servers.  This only takes a few seconds, and makes it  
> so that the scan consistently finishes in just a few seconds as  
> well.  I think the reason is just that it forces the db data into  
> the Linux buffer cache using an efficient access pattern so that  
> Berkeley DB doesn't have to wait on disk latency for whatever small  
> accesses it is performing.
>
> This seems to indicate that Berkeley DB's access pattern generated  
> by PVFS2 for this case isn't very friendly, at least to SANs that  
> aren't specifically tuned for it.
>
> The 5 minute scan time is a problem, because it makes it hard to  
> tell when you will actually be able to mount the file system after  
> the daemons appear to have started.  We would be happy to try out  
> any optimizations here :)
>
> -Phil
>
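
The dd trick described in the quoted message above amounts to one sequential
pass over each .db file with large reads, discarding the data.  A minimal
sketch of the same idea follows; warm_cache and the 4 MiB block size are
assumptions for illustration, and the demo uses a temporary file rather than
a real storage space.

```python
import os
import tempfile


def warm_cache(path, block_size=4 * 1024 * 1024):
    """Read `path` front to back in large blocks and throw the data away.

    The only purpose is to pull the file into the OS buffer cache with a
    sequential access pattern.  Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Demo on a throwaway file; point warm_cache at the real .db files instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 1_000_000)
    demo_path = tmp.name
print(f"read {warm_cache(demo_path)} bytes")
os.remove(demo_path)
```

Whether this helps depends on the files fitting in memory; if the buffer
cache is smaller than the db files, the early pages may be evicted before
the server scan reaches them.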


