[Pvfs2-developers] patch: alternate AIO implementation

Phil Carns pcarns at wastedcycles.org
Thu Aug 10 18:25:17 EDT 2006


Sam Lang wrote:
> 
> This differs from what Julian did quite a bit I guess.  You've  
> essentially re-written the lio_listio call?  

Pretty much- we were just looking for a way to work around or at least 
confirm a specific limitation in the lio_listio() implementation rather 
than addressing higher levels of the I/O path.  I guess it really isn't 
necessarily a limitation, its just that the implied ordering requirement 
per fd isn't really helpful for what Trove needs to do.

> Does it make sense to  
> factor that out so that it could be used as a separate implementation  
> for others (LD_PRELOAD=libfastaio.so :-))?  I guess the only  advantage 
> pvfs would gain from that would be the performance  comparisons of the 
> two different lio_listio implementations from all  the aio tests out 
> already out there.  A disadvantage might be that  you can't switch 
> between one and the other dynamically at runtime.

I would imagine this is possible, but probably more work than I have 
time to tinker with :)  You could probably even play some trick to make 
it controllable at runtime (looking at the value of a global variable if 
defined, or maybe latching hold of the aio_init() function to control it).

> Julian's implementation changes the actual dbpf_bstream_rw_list call,  
> so its one level up from yours.  With the performance gains you see  
> from spawning a thread on demand instead of pooling, maybe its time  to 
> get rid of the dbpf queue altogether and start spawning a thread  per 
> trove operation?  The advantage we get from queuing being that we  can 
> do our own scheduling, although your numbers are impressive, so  maybe 
> that's not such a good idea.

I really don't know.  The problem here was that glibc wasn't really 
honoring the decision even when trove decided to let something through 
the queue :)  I don't know what the implications would be for berkeley 
db or other operations.  I also don't know that nptl is ubiquitous 
enough to make spawning that many threads the default behavior.

At any rate, I am definitely interested in looking at Julian's 
implementation when it is ready for evaluation.  It may be that his 
stuff renders the lio_listio() workaround a moot point, although we will 
definitely be using this approach in the meantime.

-Phil

> 
> -sam
> 
> On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
> 
>> Background:
>>
>> We have been a little suspicious of the posix aio performance on  some of
>> our servers. After digging in the glibc code a little, we found a
>> possible problem. Glibc's aio will spawn up to 16 threads by default,
>> but will never assign more than a single thread to a given fd. That
>> thread will then service all operations on that fd sequentially  using a
>> FIFO queue. This means that if several clients are performing I/O  to the
>> same datafile, then all of their I/O requests get pushed to the disk
>> sequentially (and probably not in order by offset).
>>
>> Patch:
>>
>> This patch replaces the lio_listio() calls with a macro called
>> LIO_LISTIO(). You can then toggle what this macro does by using a
>> config file option "TroveAltAIOMode yes|no". If the option is not
>> specified (or is set to no) then the normal code path is taken. If the
>> option is enabled, then it looks at the arguments. If the operation is
>> a single buffer read or write, then it immediately spawns a new  detached
>> thread, services the opertion using p{read/write}, triggers a callback
>> function, and exits. More complex operations are sent to the usual
>> lio_listio() route.
>>
>> This idea is to basically try to get the requests off to the kernel as
>> quickly as possible without queueing so that the kernel can sort  out how
>> to best service them. Trove doesn't care about ordering at that level.
>>
>> Drawbacks:
>>
>> - This option/implementation is only reasonable for systems with NPTL,
>> because of the low thread spawning overhead. Non-NPTL systems will
>> probably find the cost to be higher. As a side note, we tried an
>> implementation that kept a pool of threads and sent operations to  those
>> threads, but we found that the overhead of synchronization and  signaling
>> in this approach was (surprisingly) much higher than the cost of just
>> creating brand new threads on every operation that did not require
>> synchronization.
>> - This implementation only helps contiguous reads or writes as they
>> appear to Trove. You could extend it to work for other patterns by  just
>> doing a series of preads and pwrites to work down the list of buffers,
>> but we did not handle this case.
>>
>> Results:
>>
>> We didn't see a big gain from this approach at first, but since  then we
>> have taken care of some other bottlenecks that make the improvement  more
>> obvious. It also seems that the performance boost varies quite a bit
>> depending on the type of system you run it on. We have some new  servers
>> (results shown below) that benefitted greatly from this optimization.
>>
>> The numbers below show the results from a setup with 16 servers and a
>> variable number of clients and number of processes per client. The
>> benchmark is performing a read only access pattern with 100 MB  buffers.
>> All clients are accessing the same file 40 GB file (we rotate among
>> several to avoid caching). The file is divided into contiguous  regions,
>> one per each process.  We are using local hardware raid at each  
>> server, and gigabit ethernet for communication.
>>
>> Before optimization:
>> client nodes x processes per node - MB/s aggregate throughput
>> --------------------------------------------------------------
>>
>> 1 x 1 - 97.8
>> 1 x 2 - 110.4
>> 1 x 5 - 111.1
>> 12 x 1 - 195.8
>> 12 x 2 - 138.8
>> 25 x 1 - 160.4
>> 25 x 2 - 178.0
>>
>> After optimization:
>> client nodes x processes per node - MB/s aggregate throughput
>> --------------------------------------------------------------
>> 1 x 1 - 93.4
>> 1 x 2 - 109.2
>> 1 x 5 - 108.9
>> 12 x 1 - 443.1
>> 12 x 2 - 502.6
>> 25 x 1 - 496.7
>> 25 x 2 - 550.7
>>
>> To confirm the cause of the problem, we performed a variation on the
>> test where each client read an independent file, rather than the  clients
>> all hitting the same file. Running this benchmark with 12 client  
>> nodes (one process per node) resulted in a consistent 430 MB/s of  
>> aggregate
>> throughput regardless of whether the new AIO path was used or not.  This
>> seems to confirm that the problem is a result of the sequential  queueing
>> that the normal AIO implementation does when multiple requests hit the
>> same file.
>>
>> For these particular machines we were able to double or triple the  read
>> throughput for a parallel application that shared one large file. I am
>> fairly sure that not all of our machines demonstrate this problem  to 
>> such a drastic degree, but we will probably be testing some  other 
>> setups later to get a better idea.
>>


More information about the Pvfs2-developers mailing list