[Pvfs2-developers] patch: alternate AIO implementation

Phil Carns pcarns at wastedcycles.org
Fri Sep 1 12:50:52 EDT 2006


Hi Sam,

There is actually a bug in this patch, but I don't have a patch to 
correct it yet because I am not sure what the best approach is.

The fundamental problem is that on Linux you have to define _GNU_SOURCE 
to get the correct definition for pread and pwrite when using 
O_LARGEFILE support.  We are not currently defining _GNU_SOURCE in 
pvfs2, so as a workaround I just put the prototypes for pread/pwrite 
in the .c file.  Unfortunately, this doesn't actually work right on 
32-bit systems that are compiled with large file support.  Those 
hard-coded prototypes don't pick up the 64-bit offset changes, and 
therefore the code path fails when you access beyond the 2 GB boundary 
on a single data file.
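
To make the 2 GB boundary concrete: an offset parameter that is a 32-bit
signed type cannot represent anything past 2^31 - 1 bytes.  The sketch
below only illustrates that arithmetic.  The helper name is made up for
this note and nothing in it is taken from dbpf-bstream.c.

```c
#include <stdint.h>

/* Hypothetical helper (not part of pvfs2): returns 1 if a file offset can
 * be represented by a 32-bit signed off_t, and 0 otherwise. */
int fits_in_32bit_off_t(int64_t offset)
{
    return offset >= 0 && offset <= INT32_MAX;
}
```

The last byte below 2 GB (offset 2147483647) fits; the next one
(2147483648) does not, and that is where a 32-bit offset path would be
expected to start failing.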

I have no idea why glibc/Linux makes you define _GNU_SOURCE for 
pread/pwrite.  I would have assumed they would be much more standard 
than that.  Most OSes give you those prototypes by default.

Anyway... there are a couple of options.  I would be happy to implement 
any of these:

- modify configure.in to define _GNU_SOURCE on Linux systems.  This is 
very clean.  The only minor downside is that it might allow developers 
to accidentally commit non-portable code that would previously have 
caused a compile error.

- change pread/pwrite in dbpf-bstream.c to explicitly be 
pread64/pwrite64.  The downside is that without _GNU_SOURCE defined we 
still have to give our own prototypes for pread64/pwrite64.

- define _GNU_SOURCE only in dbpf-bstream.c.  This might be the least 
intrusive option.
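
As a rough sketch of what the third option could look like at the top of
a single source file: the feature macros go before the first #include,
and a compile-time check guards against a 32-bit off_t.  This is an
illustration under assumptions, not the actual dbpf-bstream.c change.
The function name and the temporary file path are invented here.  The
pread(2) man page lists _XOPEN_SOURCE >= 500 as sufficient for these
prototypes, and _GNU_SOURCE implies it.

```c
/* Feature macros must precede every #include; glibc evaluates them the
 * first time <features.h> is pulled in. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Fail the build if off_t is still 32 bits wide. */
_Static_assert(sizeof(off_t) == 8, "off_t must be 64 bits");

/* Hypothetical self-check: write one byte at 3 GiB into a sparse temporary
 * file and read it back.  Returns 0 on success, -1 on failure. */
int check_large_offset_io(void)
{
    char path[] = "/tmp/lfs-check-XXXXXX";
    const off_t offset = (off_t)3 << 30;    /* 3 GiB, past the 2 GB mark */
    char out = 'x';
    char in = 0;
    int ok;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                           /* removed once fd is closed */

    ok = pwrite(fd, &out, 1, offset) == 1 &&
         pread(fd, &in, 1, offset) == 1 &&
         in == out;
    close(fd);
    return ok ? 0 : -1;
}
```

Because the file is sparse, the check should need very little actual
disk space on filesystems that support holes.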

Any suggestions?

-Phil

Sam Lang wrote:
> 
> Hi Phil,
> 
> I went ahead and committed this patch to trunk.  The changes are
> relatively small and you've demonstrated good perf improvements out of
> them!  In the longer term I'm going to try to merge Julian's threaded
> implementation with O_DIRECT support to trunk at some point as well, so
> that we can still have some control over grouping and scheduling
> operations.
> 
> -sam
> 
> On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
> 
>> Background:
>>
>> We have been a little suspicious of the posix aio performance on some of
>> our servers. After digging in the glibc code a little, we found a
>> possible problem. Glibc's aio will spawn up to 16 threads by default,
>> but will never assign more than a single thread to a given fd. That
>> thread will then service all operations on that fd sequentially using a
>> FIFO queue. This means that if several clients are performing I/O to the
>> same datafile, then all of their I/O requests get pushed to the disk
>> sequentially (and probably not in order by offset).
>>
>> Patch:
>>
>> This patch replaces the lio_listio() calls with a macro called
>> LIO_LISTIO(). You can then toggle what this macro does by using a
>> config file option "TroveAltAIOMode yes|no". If the option is not
>> specified (or is set to no) then the normal code path is taken. If the
>> option is enabled, then it looks at the arguments. If the operation is
>> a single buffer read or write, then it immediately spawns a new detached
>> thread, services the operation using p{read/write}, triggers a callback
>> function, and exits. More complex operations are sent to the usual
>> lio_listio() route.
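
A minimal sketch of the thread-per-operation idea described above, for
readers who want to see its shape.  It is not the code from the patch.
Every name in it (alt_aio_submit, struct alt_aio_op, alt_aio_notify_pipe)
is hypothetical, the fallback to lio_listio() for complex operations is
left out, and the pipe-based callback exists only so that a caller has
something simple to wait on.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* All names here are hypothetical; none are taken from the patch. */
typedef void (*alt_aio_callback)(ssize_t result, int error, void *user_ptr);

struct alt_aio_op {
    int fd;
    void *buf;
    size_t count;
    off_t offset;
    int is_write;
    alt_aio_callback callback;
    void *user_ptr;
};

/* Thread body: one blocking pread/pwrite, then the callback, then exit. */
static void *alt_aio_thread(void *arg)
{
    struct alt_aio_op *op = arg;
    ssize_t ret;

    if (op->is_write)
        ret = pwrite(op->fd, op->buf, op->count, op->offset);
    else
        ret = pread(op->fd, op->buf, op->count, op->offset);

    op->callback(ret, ret < 0 ? errno : 0, op->user_ptr);
    free(op);
    return NULL;
}

/* Start one detached thread for a single contiguous read or write.
 * Returns 0 on success or an errno-style code if no thread was started. */
int alt_aio_submit(int fd, void *buf, size_t count, off_t offset,
                   int is_write, alt_aio_callback callback, void *user_ptr)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ret;
    struct alt_aio_op *op = malloc(sizeof(*op));

    if (op == NULL)
        return ENOMEM;
    op->fd = fd;
    op->buf = buf;
    op->count = count;
    op->offset = offset;
    op->is_write = is_write;
    op->callback = callback;
    op->user_ptr = user_ptr;

    ret = pthread_attr_init(&attr);
    if (ret != 0) {
        free(op);
        return ret;
    }
    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (ret == 0)
        ret = pthread_create(&tid, &attr, alt_aio_thread, op);
    pthread_attr_destroy(&attr);
    if (ret != 0)
        free(op);               /* the thread never took ownership */
    return ret;
}

/* Example callback: user_ptr points at the write end of a pipe, and the
 * result is sent through it so that a waiter can block in read(). */
void alt_aio_notify_pipe(ssize_t result, int error, void *user_ptr)
{
    int wfd = *(int *)user_ptr;
    ssize_t n = write(wfd, &result, sizeof(result));

    (void)error;
    (void)n;
}
```

The submitting side shares no lock or condition variable with the worker;
whatever completion handling is needed lives entirely in the callback.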
>>
>> The idea is basically to get the requests off to the kernel as
>> quickly as possible without queueing so that the kernel can sort out how
>> to best service them. Trove doesn't care about ordering at that level.
>>
>> Drawbacks:
>>
>> - This option/implementation is only reasonable for systems with NPTL,
>> because of the low thread spawning overhead. Non-NPTL systems will
>> probably find the cost to be higher. As a side note, we tried an
>> implementation that kept a pool of threads and sent operations to those
>> threads, but we found that the overhead of synchronization and signaling
>> in this approach was (surprisingly) much higher than the cost of just
>> creating brand new threads on every operation that did not require
>> synchronization.
>> - This implementation only helps contiguous reads or writes as they
>> appear to Trove. You could extend it to work for other patterns by just
>> doing a series of preads and pwrites to work down the list of buffers,
>> but we did not handle this case.
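
The extension mentioned in the last drawback (working down a list of
buffers with a series of preads) could look roughly like the sketch
below.  The patch does not implement this, so the segment structure and
function name are assumptions made purely for illustration.  A write
version would be the same loop with pwrite.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one contiguous region to read. */
struct io_segment {
    void *buf;      /* destination buffer */
    size_t len;     /* number of bytes wanted */
    off_t offset;   /* file offset of the region */
};

/* Read every segment in order with pread().  Returns the total number of
 * bytes read, or -1 on error with errno set by pread().  Stops early and
 * returns the partial total if end of file is reached. */
ssize_t pread_segments(int fd, const struct io_segment *segs, int nsegs)
{
    ssize_t total = 0;
    int i;

    for (i = 0; i < nsegs; i++) {
        size_t done = 0;

        while (done < segs[i].len) {
            ssize_t n = pread(fd, (char *)segs[i].buf + done,
                              segs[i].len - done,
                              segs[i].offset + (off_t)done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted: retry the same range */
                return -1;
            }
            if (n == 0)
                return total;       /* end of file */
            done += (size_t)n;
            total += n;
        }
    }
    return total;
}
```

Run inside a single worker thread, this still issues the regions one
after another, so it would only help with getting the whole list to the
kernel without going through the AIO queue.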
>>
>> Results:
>>
>> We didn't see a big gain from this approach at first, but since then we
>> have taken care of some other bottlenecks that make the improvement more
>> obvious. It also seems that the performance boost varies quite a bit
>> depending on the type of system you run it on. We have some new servers
>> (results shown below) that benefitted greatly from this optimization.
>>
>> The numbers below show the results from a setup with 16 servers and a
>> variable number of clients and number of processes per client. The
>> benchmark is performing a read-only access pattern with 100 MB buffers.
>> All clients are accessing the same 40 GB file (we rotate among
>> several to avoid caching). The file is divided into contiguous regions,
>> one per process.  We are using local hardware RAID at each
>> server, and gigabit ethernet for communication.
>>
>> Before optimization:
>> client nodes x processes per node - MB/s aggregate throughput
>> --------------------------------------------------------------
>>
>> 1 x 1 - 97.8
>> 1 x 2 - 110.4
>> 1 x 5 - 111.1
>> 12 x 1 - 195.8
>> 12 x 2 - 138.8
>> 25 x 1 - 160.4
>> 25 x 2 - 178.0
>>
>> After optimization:
>> client nodes x processes per node - MB/s aggregate throughput
>> --------------------------------------------------------------
>> 1 x 1 - 93.4
>> 1 x 2 - 109.2
>> 1 x 5 - 108.9
>> 12 x 1 - 443.1
>> 12 x 2 - 502.6
>> 25 x 1 - 496.7
>> 25 x 2 - 550.7
>>
>> To confirm the cause of the problem, we performed a variation on the
>> test where each client read an independent file, rather than the clients
>> all hitting the same file. Running this benchmark with 12 client
>> nodes (one process per node) resulted in a consistent 430 MB/s of
>> aggregate throughput regardless of whether the new AIO path was used or
>> not. This seems to confirm that the problem is a result of the sequential
>> queueing that the normal AIO implementation does when multiple requests
>> hit the same file.
>>
>> For these particular machines we were able to double or triple the read
>> throughput for a parallel application that shared one large file. I am
>> fairly sure that not all of our machines demonstrate this problem to
>> such a drastic degree, but we will probably be testing some other
>> setups later to get a better idea.
>>
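
For reference, the per-configuration speedup can be worked out directly
from the two tables.  The small program below does only that arithmetic.
The throughput figures are copied from the tables above, and the function
and array names are invented for this note.

```c
#include <stdio.h>

#define NUM_CONFIGS 7

/* client nodes x processes per node, in the same order as the tables */
static const char *const config[NUM_CONFIGS] = {
    "1 x 1", "1 x 2", "1 x 5", "12 x 1", "12 x 2", "25 x 1", "25 x 2"
};

/* Aggregate read throughput in MB/s, copied from the tables above. */
static const double before[NUM_CONFIGS] = {
    97.8, 110.4, 111.1, 195.8, 138.8, 160.4, 178.0
};
static const double after[NUM_CONFIGS] = {
    93.4, 109.2, 108.9, 443.1, 502.6, 496.7, 550.7
};

/* Ratio of "after" to "before" throughput for configuration i. */
double speedup(int i)
{
    return after[i] / before[i];
}

/* Print one line per configuration with its speedup factor. */
void print_speedups(void)
{
    int i;

    for (i = 0; i < NUM_CONFIGS; i++)
        printf("%-7s %6.1f -> %6.1f MB/s  (%.2fx)\n",
               config[i], before[i], after[i], speedup(i));
}
```

The single-client rows come out slightly below 1x (roughly 0.96x to
0.99x), while the 12- and 25-client rows land between roughly 2.3x and
3.6x.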

