[Pvfs2-developers] patch: alternate AIO implementation
Phil Carns
pcarns at wastedcycles.org
Thu Aug 10 18:25:17 EDT 2006
Sam Lang wrote:
>
> This differs from what Julian did quite a bit I guess. You've
> essentially re-written the lio_listio call?
Pretty much- we were just looking for a way to work around or at least
confirm a specific limitation in the lio_listio() implementation rather
than addressing higher levels of the I/O path. I guess it really isn't
necessarily a limitation, its just that the implied ordering requirement
per fd isn't really helpful for what Trove needs to do.
> Does it make sense to
> factor that out so that it could be used as a separate implementation
> for others (LD_PRELOAD=libfastaio.so :-))? I guess the only advantage
> pvfs would gain from that would be the performance comparisons of the
> two different lio_listio implementations from all the aio tests out
> already out there. A disadvantage might be that you can't switch
> between one and the other dynamically at runtime.
I would imagine this is possible, but probably more work than I have
time to tinker with :) You could probably even play some trick to make
it controllable at runtime (looking at the value of a global variable if
defined, or maybe latching hold of the aio_init() function to control it).
> Julian's implementation changes the actual dbpf_bstream_rw_list call,
> so its one level up from yours. With the performance gains you see
> from spawning a thread on demand instead of pooling, maybe its time to
> get rid of the dbpf queue altogether and start spawning a thread per
> trove operation? The advantage we get from queuing being that we can
> do our own scheduling, although your numbers are impressive, so maybe
> that's not such a good idea.
I really don't know. The problem here was that glibc wasn't really
honoring the decision even when trove decided to let something through
the queue :) I don't know what the implications would be for berkeley
db or other operations. I also don't know that nptl is ubiquitous
enough to make spawning that many threads the default behavior.
At any rate, I am definitely interested in looking at Julian's
implementation when it is ready for evaluation. It may be that his
stuff renders the lio_listio() workaround a moot point, although we will
definitely be using this approach in the meantime.
-Phil
>
> -sam
>
> On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
>
>> Background:
>>
>> We have been a little suspicious of the posix aio performance on some of
>> our servers. After digging in the glibc code a little, we found a
>> possible problem. Glibc's aio will spawn up to 16 threads by default,
>> but will never assign more than a single thread to a given fd. That
>> thread will then service all operations on that fd sequentially using a
>> FIFO queue. This means that if several clients are performing I/O to the
>> same datafile, then all of their I/O requests get pushed to the disk
>> sequentially (and probably not in order by offset).
>>
>> Patch:
>>
>> This patch replaces the lio_listio() calls with a macro called
>> LIO_LISTIO(). You can then toggle what this macro does by using a
>> config file option "TroveAltAIOMode yes|no". If the option is not
>> specified (or is set to no) then the normal code path is taken. If the
>> option is enabled, then it looks at the arguments. If the operation is
>> a single buffer read or write, then it immediately spawns a new detached
>> thread, services the opertion using p{read/write}, triggers a callback
>> function, and exits. More complex operations are sent to the usual
>> lio_listio() route.
>>
>> This idea is to basically try to get the requests off to the kernel as
>> quickly as possible without queueing so that the kernel can sort out how
>> to best service them. Trove doesn't care about ordering at that level.
>>
>> Drawbacks:
>>
>> - This option/implementation is only reasonable for systems with NPTL,
>> because of the low thread spawning overhead. Non-NPTL systems will
>> probably find the cost to be higher. As a side note, we tried an
>> implementation that kept a pool of threads and sent operations to those
>> threads, but we found that the overhead of synchronization and signaling
>> in this approach was (surprisingly) much higher than the cost of just
>> creating brand new threads on every operation that did not require
>> synchronization.
>> - This implementation only helps contiguous reads or writes as they
>> appear to Trove. You could extend it to work for other patterns by just
>> doing a series of preads and pwrites to work down the list of buffers,
>> but we did not handle this case.
>>
>> Results:
>>
>> We didn't see a big gain from this approach at first, but since then we
>> have taken care of some other bottlenecks that make the improvement more
>> obvious. It also seems that the performance boost varies quite a bit
>> depending on the type of system you run it on. We have some new servers
>> (results shown below) that benefitted greatly from this optimization.
>>
>> The numbers below show the results from a setup with 16 servers and a
>> variable number of clients and number of processes per client. The
>> benchmark is performing a read only access pattern with 100 MB buffers.
>> All clients are accessing the same file 40 GB file (we rotate among
>> several to avoid caching). The file is divided into contiguous regions,
>> one per each process. We are using local hardware raid at each
>> server, and gigabit ethernet for communication.
>>
>> Before optimization:
>> client nodes x processes per node - MB/s aggregate throughput
>> --------------------------------------------------------------
>>
>> 1 x 1 - 97.8
>> 1 x 2 - 110.4
>> 1 x 5 - 111.1
>> 12 x 1 - 195.8
>> 12 x 2 - 138.8
>> 25 x 1 - 160.4
>> 25 x 2 - 178.0
>>
>> After optimization:
>> client nodes x processes per node - MB/s aggregate throughput
>> --------------------------------------------------------------
>> 1 x 1 - 93.4
>> 1 x 2 - 109.2
>> 1 x 5 - 108.9
>> 12 x 1 - 443.1
>> 12 x 2 - 502.6
>> 25 x 1 - 496.7
>> 25 x 2 - 550.7
>>
>> To confirm the cause of the problem, we performed a variation on the
>> test where each client read an independent file, rather than the clients
>> all hitting the same file. Running this benchmark with 12 client
>> nodes (one process per node) resulted in a consistent 430 MB/s of
>> aggregate
>> throughput regardless of whether the new AIO path was used or not. This
>> seems to confirm that the problem is a result of the sequential queueing
>> that the normal AIO implementation does when multiple requests hit the
>> same file.
>>
>> For these particular machines we were able to double or triple the read
>> throughput for a parallel application that shared one large file. I am
>> fairly sure that not all of our machines demonstrate this problem to
>> such a drastic degree, but we will probably be testing some other
>> setups later to get a better idea.
>>
More information about the Pvfs2-developers
mailing list