[Pvfs2-developers] tuning the 2.6 kernels for write performance
pcarns at wastedcycles.org
Fri Mar 24 10:26:25 EST 2006
As far as the AIO stuff goes, we kicked around an idea here that didn't
really help the workloads that we were looking at in this case, but it
may help something else.
If you look at what aio does, it spawns off threads for each fd up to a
limit that is tunable via the aio_init() call (using the aio_threads
field) but that defaults to 16. A given thread will run through each of its
aiocb arrays calling pread or pwrite as appropriate. Any time it
finishes an array it spawns a new thread in detached mode to trigger the
callback function that the caller told it to use for notification.
There is a certain amount of locking, queueing, etc. associated with
doing all of this stuff. Threads time out and exit after a while, then
get re-spawned when someone posts more aio work.
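For reference, bumping that thread-pool limit looks something like this (a minimal sketch; glibc's aio_init() is a GNU extension, and tune_aio_threads is just a made-up wrapper name):

```c
#define _GNU_SOURCE
#include <aio.h>
#include <string.h>

/* Made-up helper: raise the glibc aio thread-pool limit.  Must be
 * called before any other aio function.  aio_threads defaults to 16;
 * aio_num is a hint for how many simultaneous requests to expect. */
static void tune_aio_threads(int nthreads)
{
    struct aioinit init;

    memset(&init, 0, sizeof(init));
    init.aio_threads = nthreads;
    init.aio_num     = 64;
    aio_init(&init);
}
```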
What you could do instead (and what we tried) was to implement a
replacement for lio_listio that is much simpler. Instead of the above
mechanism, when you call lio_listio it just immediately spawns off a
detached thread, which needs no locks or queues. The thread does the
preads or pwrites as needed, then invokes the callback function itself.
We could get away with something like this in PVFS2 because:
- we don't care what order the reads/writes get serviced at the Trove
level, so no need to queue for semantic reasons (the request scheduler
already provides the semantics we require).
- it doesn't cause any extra thread use that wasn't already there; the
normal aio implementation already spawns a new thread for every callback
(and thread creation is relatively cheap these days anyway with NPTL)
- Trove already limits the max number of AIO's in progress to 16, so
there isn't any danger of spawning too many threads
- we don't care about the other notification methods (signals, polling
etc.) and don't use any other significant aio api functions
I guess really at that point there isn't any particular reason to even
bother mimicking the aio API, except that it makes it easy to plug into
the existing trove code.
Our try at this was just a quick hack (someone would have to tinker more
to make sure it propagates error codes, handles array sizes > 1, etc.).
For what we were looking at it wasn't really any faster than normal AIO,
so we shelved the idea for now. I still think it might be interesting
for some workload or another if someone took the time to implement it
right and do more testing.
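For anyone curious, the core of the idea boils down to something like this (a rough sketch reconstructed from the description above, not our actual hack; the names my_lio_listio and lio_arg are made up, error-code propagation is omitted, and the caller has to keep the aiocb list alive until the callback fires):

```c
#define _GNU_SOURCE
#include <aio.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* One detached worker thread per my_lio_listio() call: do the I/O with
 * pread/pwrite, then invoke the caller's SIGEV_THREAD callback
 * directly.  No queues, no shared locks. */
struct lio_arg {
    struct aiocb    **list;
    int               nent;
    struct sigevent   sig;
};

static void *lio_worker(void *p)
{
    struct lio_arg *a = p;

    for (int i = 0; i < a->nent; i++) {
        struct aiocb *cb = a->list[i];

        if (cb == NULL || cb->aio_lio_opcode == LIO_NOP)
            continue;
        if (cb->aio_lio_opcode == LIO_WRITE)
            pwrite(cb->aio_fildes, (const void *)cb->aio_buf,
                   cb->aio_nbytes, cb->aio_offset);
        else
            pread(cb->aio_fildes, (void *)cb->aio_buf,
                  cb->aio_nbytes, cb->aio_offset);
        /* real code would save results for aio_error()/aio_return() */
    }

    /* notify the caller ourselves instead of spawning another thread */
    if (a->sig.sigev_notify == SIGEV_THREAD)
        a->sig.sigev_notify_function(a->sig.sigev_value);

    free(a);
    return NULL;
}

/* stand-in for lio_listio(LIO_NOWAIT, list, nent, sig) */
int my_lio_listio(struct aiocb *list[], int nent, struct sigevent *sig)
{
    pthread_t       t;
    pthread_attr_t  attr;
    struct lio_arg *a = malloc(sizeof(*a));

    if (a == NULL)
        return -1;
    a->list = list;
    a->nent = nent;
    a->sig  = *sig;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (pthread_create(&t, &attr, lio_worker, a) != 0) {
        pthread_attr_destroy(&attr);
        free(a);
        return -1;
    }
    pthread_attr_destroy(&attr);
    return 0;
}
```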
Avery Ching wrote:
> Phil, I've done some tests for noncontiguous I/O comparing the
> lio_listio, aio_read/aio_write, and normal read/write. In cases where
> there are a lot of noncontiguous regions, lio_listio and aio tend to
> really fall behind. At least 1 order of magnitude slower than normal
> read/write.
> On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:
>>Nice Phil. I saw this exact same sort of stalling eight years ago on
>>grendel at Clemson! But we didn't have alternative schedulers and the
>>like to play with at the time.
>>It might be worth our time to explore the dirty_ratio value a little
>>more in the context of both I/O and metadata tests. Perhaps once the
>>DBPF changes are merged in we can spend some time on this?
>>Phil Carns wrote:
>>>This whole issue started off while trying to debug the PVFS2
>>>stall/timeout problem that ended up being caused by the ext3 reservation
>>>bug... but we found some interesting things along the way.
>>>One of the things we noticed while looking at the problem is that
>>>occasionally a Trove write operation would take much longer than
>>>expected; essentially stalling all I/O for a while. So we wrote some
>>>small benchmark programs to look at the issue outside of PVFS2. These
>>>benchmarks (in the cases shown here) write 8 G of data, 256K at a time.
>>>They show the stall also. We ended up changing some PVFS2 timeouts to
>>>avoid the problem (see earlier email).
>>>We then started trying to figure out why the writes stall sometimes,
>>>because that seemed like a bad thing regardless of whether the timeouts
>>>could handle it or if the kernel bug was fixed :)
>>>These tests look at three possibilities:
>>>A. Is the AIO interface causing delays?
>>>B. Is the linux kernel waiting too long to start writing out its
>>>dirty pages?
>>>C. Is the linux kernel disk scheduler appropriate for PVFS2?
>>>To test A:
>>>The benchmark can run in 2 modes. The first uses AIO (as in PVFS2),
>>>allowing a maximum of 16 concurrent writes at a time. The second doesn't
>>>use AIO or threads at all, but instead does each write one at a time
>>>with the pwrite() function.
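(For reference, the pwrite mode boils down to a loop along these lines; this is a sketch with illustrative sizes and output, not the actual benchmark source:)

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (256 * 1024)   /* 256K writes, as in the benchmark */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Write `total` bytes to `path` in CHUNK-sized pwrite()s, tracking the
 * worst-case and average time of any single write plus the total
 * run time (all in seconds). */
int write_bench(const char *path, long long total)
{
    char *buf = calloc(1, CHUNK);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    double max = 0.0, sum = 0.0, start = now_sec();
    long long n = 0;

    if (buf == NULL || fd < 0)
        return -1;

    for (long long off = 0; off < total; off += CHUNK, n++) {
        double t0 = now_sec();

        if (pwrite(fd, buf, CHUNK, off) != CHUNK)
            return -1;
        double dt = now_sec() - t0;
        sum += dt;
        if (dt > max)
            max = dt;
    }

    printf("max %f avg %f total %f\n",
           max, sum / (n ? n : 1), now_sec() - start);
    free(buf);
    close(fd);
    return 0;
}
```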
>>>To test B:
>>>We can change this behavior by adjusting the /proc/sys/vm/dirty* files.
>>>They are documented in the Documentation/filesystems/proc.txt file in
>>>the linux kernel source. The only one that really ended up being
>>>interesting for us (after trial and error) is the dirty_ratio file. The
>>>explanation given in the documentation is: "Contains, as a percentage of
>>>total system memory, the number of pages at which a process which is
>>>generating disk writes will itself start writing out dirty data.". It
>>>defaults to 40, but some of the results below show what happens when it
>>>is set to 1. There is also a dirty_background_ratio file, which
>>>controls when pdflush decides to write out data in the background. That
>>>would seem to be the more desirable tweak, but it didn't have the effect
>>>that dirty_ratio did for some reason.
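(Side note: these knobs can also be poked from a program instead of echoing into /proc. A minimal sketch with made-up helper names; writing the real file requires root and is equivalent to `echo 1 > /proc/sys/vm/dirty_ratio`:)

```c
#include <stdio.h>

/* Made-up helpers: read or set an integer sysctl knob such as
 * /proc/sys/vm/dirty_ratio.  Both return 0 on success, -1 on error. */
int sysctl_read_int(const char *path, int *val)
{
    FILE *f = fopen(path, "r");
    int rc;

    if (f == NULL)
        return -1;
    rc = (fscanf(f, "%d", val) == 1) ? 0 : -1;
    fclose(f);
    return rc;
}

int sysctl_write_int(const char *path, int val)
{
    FILE *f = fopen(path, "w");
    int rc;

    if (f == NULL)
        return -1;
    rc = (fprintf(f, "%d\n", val) > 0) ? 0 : -1;
    fclose(f);
    return rc;
}
```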
>>>To test C:
>>>Reboot the machine with different I/O schedulers specified. CFQ
>>>scheduler is the default, but we set it to the AS (anticipatory)
scheduler using "elevator=as" on the kernel command line. The other
>>>scheduler options (deadline, noop) didn't change much. The schedulers
>>>also have tunable parameters in /sys/block/<DEVICE>/queue/iosched/*,
>>>but they didn't seem to impact much either. The schedulers are somewhat
>>>documented in the Documentation/block subdirectory of the linux kernel
>>>source.
>>>The results are listed below. The benchmarks show 3 things: The maximum
>>>time that any individual write (during the course of the entire test
>>>run) took, the average individual write time, and then the total
>>>benchmark time. Everything is shown in seconds.
>>>The maximum single write time is what would have shown up as a long
>>>"stall" in the PVFS2 I/O realm, so that is the most interesting value
>>>in terms of our original problem.
>>>A few things to point out:
>>>- the choice of aio/pwrite didn't really matter a whole lot. Individual
>>>aio operations take longer than pwrite, but they are overlapped and end
>>>up giving basically the same overall throughput.
>>>- the io scheduler and buffer cache settings can have a big impact
>>>- this wasn't the point of the test, but in this particular setup the
>>>san is actually a little slower than local disk for writes (this is an
>>>old san setup)
>>>local disk results:
>>>- using the AS scheduler reduced the maximum stall time
>>>significantly and improved total benchmark run time
>>>- setting the dirty ratio to 1 further reduced the maximum stall time,
>>>but also seemed to increase the total benchmark run time a little (maybe
>>>there is a sweet spot between 40 and 1 for this value that doesn't
>>>penalize the throughput as much?)
>>>san results:
>>>- the AS scheduler didn't really help
>>>- setting the dirty ratio to 1 reduced the maximum stall time significantly
>>>Maximum single write time
>>> default AS AS,dirty_ratio=1
>>>aio local 30.874424 2.040070 0.907068
>>>pwrite local 28.146439 4.423536 1.052867
>>>aio san 46.486595 46.813606 6.161530
>>>pwrite san 17.991354 10.994622 6.119389
>>>Average single write time
>>> default AS AS,dirty_ratio=1
>>>aio local 0.061520 0.057819 0.064450
>>>pwrite local 0.003711 0.003567 0.004022
>>>aio san 0.095062 0.096853 0.095410
>>>pwrite san 0.005551 0.005713 0.005619
>>>Total benchmark time
>>> default AS AS,dirty_ratio=1
>>>aio local 252.018623 236.855234 264.018140
>>>pwrite local 243.552892 234.140043 263.995362
>>>aio san 389.380213 396.724146 390.813488
>>>pwrite san 364.203958 374.827604 368.691822
>>>These results aren't super scientific; in all cases it is just one test
>>>run per data point and no averaging. We also didn't exhaustively try
>>>many parameter combinations. This is also a write-only test; no telling
>>>what these parameters do to other workloads.
>>>We don't really have time to follow through with this any further, but
>>>it does show that these VM and iosched settings might be interesting to
>>>tune in some cases.
>>>If anyone has any similar experiences to share, we would love to hear
>>>them.
>>>Pvfs2-developers mailing list
>>>Pvfs2-developers at beowulf-underground.org