[Pvfs2-developers] server crash on startup with millions of files

Sam Lang slang at mcs.anl.gov
Fri Feb 23 13:12:14 EST 2007


On Feb 23, 2007, at 12:16 PM, Phil Carns wrote:

> I have a little bit more information to add to the puzzle.  I just  
> ran the newly patched server with strace -ff -o and then dug  
> through the output a little bit.
>
> The attributes db is 472272896 bytes, according to ls -al.
>
> The strace output has a lot of pread operations in it.  Oddly  
> enough, they seem to go backwards through this db.  If I grep/cut  
> out the offsets from all of the pread operations, at some point  
> they go through a pattern that looks like this:
>
> 472244224
> 472207360
> 472170496
> 472133632
> 472096768
> 472059904
> 472023040
> 471986176
> 471949312
> ...
> 442368
> 405504
> 368640
> 331776
> 294912
> 258048
> 221184
> 184320
> 147456
> 110592
> 73728
> 36864
>
> Then, it seems to start over again, this time with a smaller  
> difference between the offsets:
>
> 472268800
> 472264704
> 472260608
> 472256512
> 472252416
> 472248320
> 472240128
> 472236032
> 472231936
> ...
> 57344
> 53248
> 49152
> 45056
> 40960
> 32768
> 28672
> 24576
> 20480
> 16384
> 12288
>
> Altogether there are over 100,000 preads.  I assume since they  
> start at nearly the size of the attributes db that this is what  
> they are reading.
>
> It seems odd that it would go through the entire file backwards  
> twice. I'm guessing that probably isn't very friendly to the  
> whatever caching/prefetching/etc. is going in in the kernel and  
> storage devices.
>
> I don't know what would cause this, unless it is somehow related to  
> the access method and/or comparison functions being used.  If so,  
> maybe it could be overcome with a secondary index that is somehow  
> laid out more favorably for cursors?  I'm grasping at berkeley db  
> voodoo at this point :)

Yeah that is odd.  Setting the cursor for each call to  
iterate_handles may be the reason for it starting over.  Do you know  
how many times it starts over?  The number of times iterate_handles  
is called will be (# of files / 4096).

Maybe it has to do with setting the iterator with the RECNUM flag,  
which we set so that we can keep track of positions over the  
iterate_handles call.  Since we already use the handles to sort the  
entries, maybe the two are conflicting with each other.  The berkeley  
db doc does mention that RECNUM will hinder performance, but only on  
writes:

--
Configuring a Btree for record numbers should not be done lightly.  
While often useful, it may significantly slow down the speed at which  
items can be stored into the database, and can severely impact  
application throughput. Generally it should be avoided in trees with  
a need for high write concurrency.
--

If we could return the handle as the position, we could get rid of  
the RECNUM flag and set the cursor with the last handle, but the  
position field is only uint32_t.  Its really annoying that we only  
use the first 32 bits of the PVFS_handle right now too.  Can we  
change that PVFS_ds_position type to be 64 bit?

-sam

>
> I also have one more data point.  I mentioned in an earlier email  
> that simply "warming up" the db files in the buffer cache before  
> starting the servers was enough to reduce the startup time to a few  
> seconds.  For some reason, that is not necessarily true on the  
> environment that I am now testing on.  Now I see this:
>
> - stock server, cached db files: 1 minute, 46 seconds average
>
> - patched server, cached db files: 7 seconds average
>
> The speedup from pre-caching isn't nearly as dramatic here unless I  
> use Sam's new and improved code.
>
> -Phil
>
>
> Phil Carns wrote:
>> Ok, I have tried several iterations both with and without these  
>> patches.  The test system is again using a SAN, this time with a  
>> dataspace_attributes.db file of about 451 MB on a particular  
>> server. I'm not sure how many files are on the file system; I just  
>> cranked out files on it until the db file looked big enough to get  
>> good measurements on the startup time.  I was able to turn on the  
>> "trove,server" logging mask along with the "usec" timestamp to see  
>> the scan time on both versions without any logging occuring during  
>> the actual scan itself.
>> for example:
>> [D 10:00:46.541646] dbpf collection 752900094 - Setting collection  
>> handle ranges to 4-536870914,4294967292-4831838202
>> [D 10:04:19.414723] dbpf collection 752900094 - Setting  
>> HIGH_WATERMARK to -1
>> If I unmount between each server start, the original version takes  
>> an average of 3 minutes, 17 seconds to complete the scan.
>> The patched version takes an average of 2 minutes, 22 seconds to  
>> complete the same scan.
>> This is definitely a big improvement- almost 30% in my test case.
>> -Phil
>> Phil Carns wrote:
>>> Thanks Sam!  We will give these patches a try and report back.
>>>
>>> -Phil
>>>
>>> Sam Lang wrote:
>>>
>>>>
>>>> Hi Phil,
>>>>
>>>> Attached mult.patch implements iterating over the dspace db  
>>>> using  DB_MULTIPLE_KEY.  This may allow for the db get call to  
>>>> do larger  reads from your SAN.  I was seeing slightly better  
>>>> performance with  local disk after creating 20K files in a fresh  
>>>> storage space.  Doing  strace doesn't show fewer mmaps or larger  
>>>> reads though, so I'm not  sure how berkeley db pulls in its  
>>>> pages.  Anyway, if it helps improve  performance for you guys, I  
>>>> can clean it up a bit and commit it.  I  don't think anything  
>>>> uses dspace_iterate_handles besides that ledger  handle  
>>>> management code.
>>>>
>>>> You can fiddle the MAX_NUM_VERIFY_HANDLE_COUNT value to set how  
>>>> many  handles to get at a time.  Right now its set to 4096.   
>>>> Keep in mind  that this requires a much larger buffer allocated  
>>>> in  dbpf_dspace_iterate_handles_op_svc, since we have to get  
>>>> keys and  values, so essentially we do a get with a buffer  
>>>> that's 4096*(sizeof (handle) + sizeof(stored_attr)), which ends  
>>>> up being about 300K.
>>>>
>>>> I also attached a patch (server-start.patch) that prints out  
>>>> the  start message as well as ready message after server  
>>>> initialization  has completed.  If you set the Logstamp to usec,  
>>>> you'll be able to  see the time it takes to initialize the  
>>>> server.  Also, this might  help in knowing when you can mount  
>>>> the clients, although, hopefully  at some point we'll be able to  
>>>> add the zero-conf stuff and then we  can return EAGAIN or  
>>>> something.
>>>>
>>>> I'm not sure its time to replace the ledger code.  It seems to  
>>>> work  ok, and to fix the slowness you're seeing would mean  
>>>> switching to  some kind of range tree that could be serialized  
>>>> to disk so that we  wouldn't have to iterate through the entire  
>>>> dspace db on startup.   That opens up the possibility of the  
>>>> dspace db and the ledger-on-disk  getting out of sync, which I'd  
>>>> rather avoid.
>>>>
>>>> We could hand out new handles by choosing one randomly, and  
>>>> then  checking if its in the DB, getting rid of the need for a  
>>>> ledger  entirely, but I assume this idea was already scratched  
>>>> to avoid the  potential costs at creation time, especially as  
>>>> the filesystem grows.
>>>>
>>>> -sam
>>>>
>>>>
>>>>
>>>> On Feb 20, 2007, at 11:23 AM, Phil Carns wrote:
>>>>
>>>>> Robert Latham wrote:
>>>>>
>>>>>> On Tue, Feb 20, 2007 at 07:29:16AM -0500, Phil Carns wrote:
>>>>>>
>>>>>>> Oh, and one other detail; the memory usage of the servers  
>>>>>>> looks  fine during startup, so this doesn't appear to be a  
>>>>>>> memory leak.   There is quite a bit of CPU work, but I am  
>>>>>>> guessing that is just  berkeley db keeping busy in the  
>>>>>>> iteration function.
>>>>>>
>>>>>>
>>>>>>
>>>>>> How long does it take to scan 1.4 million files on startup?
>>>>>> ==rob
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> That's an interesting issue :)
>>>>>
>>>>> A few observations:
>>>>>
>>>>> - we were looking at this on SAN; the results may be different  
>>>>> on  local disks
>>>>>
>>>>> - the db files are on the order of 500 MB for this particular  
>>>>> setup
>>>>>
>>>>> - the time to scan varies depending on if the db files are hot  
>>>>> in  the Linux buffer cache
>>>>>
>>>>> If we start the daemon right after killing another one that  
>>>>> just  did the same scan, then the process is CPU intensive, but  
>>>>> fast  (about 5 seconds).  If we unmount/mount the SAN between  
>>>>> the two  runs so that the buffer cache is cleared, then it is  
>>>>> very slow  (about 5 minutes).
>>>>>
>>>>> An interesting trick is to use dd with a healthy buffer size  
>>>>> to  read the .db files and throw the output into /dev/null  
>>>>> before  starting the servers.  This only takes a few seconds,  
>>>>> and makes it  so that the scan consistently finishes in just a  
>>>>> few seconds as  well.  I think the reason is just that it  
>>>>> forces the db data into  the Linux buffer cache using an  
>>>>> efficient access pattern so that  berkeley db doesn't have to  
>>>>> wait on disk latency for whatever small  accesses it is  
>>>>> performing.
>>>>>
>>>>> This seems to indicate that berkeley db's access pattern  
>>>>> generated  by PVFS2 for this case isn't very friendly, at least  
>>>>> to SANs that  aren't specifically tuned for it.
>>>>>
>>>>> The 5 minute scan time is a problem, because it makes it hard  
>>>>> to  tell when you will actually be able to mount the file  
>>>>> system after  the daemons appear to have started.  We would be  
>>>>> happy to try out  any optimizations here :)
>>>>>
>>>>> -Phil
>>>>>
>>>>> _______________________________________________
>>>>> Pvfs2-developers mailing list
>>>>> Pvfs2-developers at beowulf-underground.org
>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2- 
>>>>> developers
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>



More information about the Pvfs2-developers mailing list