[Pvfs2-developers] tuning the 2.6 kernels for write performance
Avery Ching
aching at ece.northwestern.edu
Fri Mar 24 12:02:09 EST 2006
Phil, I've done some tests for noncontiguous I/O comparing
lio_listio, aio_read/aio_write, and normal read/write. In cases where
there are a lot of noncontiguous regions, lio_listio and aio tend to
really fall behind: they are at least one order of magnitude slower
than normal read/write.
Avery
On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:
> Nice Phil. I saw this exact same sort of stalling eight years ago on
> grendel at Clemson! But we didn't have alternative schedulers and the
> like to play with at the time.
>
> It might be worth our time to explore the dirty_ratio value a little
> more in the context of both I/O and metadata tests. Perhaps once the
> DBPF changes are merged in we can spend some time on this?
>
> Rob
>
> Phil Carns wrote:
> > Background:
> >
> > This whole issue started off while trying to debug the PVFS2
> > stall/timeout problem that ended up being caused by the ext3 reservation
> > bug... but we found some interesting things along the way.
> >
> > One of the things we noticed while looking at the problem is that
> > occasionally a Trove write operation would take much longer than
> > expected; essentially stalling all I/O for a while. So we wrote some
> > small benchmark programs to look at the issue outside of PVFS2. These
> > benchmarks (in the cases shown here) write 8 G of data, 256K at a time.
> > They show the stall also. We ended up changing some PVFS2 timeouts to
> > avoid the problem (see earlier email).
> >
> > We then started trying to figure out why the writes stall sometimes,
> > because that seemed like a bad thing regardless of whether the timeouts
> > could handle it or if the kernel bug was fixed :)
> >
> > These tests look at three possibilities:
> >
> > A. Is the AIO interface causing delays?
> > B. Is the linux kernel waiting too long to start writing out its
> > buffer cache?
> > C. Is the linux kernel disk scheduler appropriate for PVFS2?
> >
> > To test A:
> >
> > The benchmark can run in 2 modes. The first uses AIO (as in PVFS2),
> > allowing a maximum of 16 concurrent writes at a time. The second doesn't
> > use AIO or threads at all, but instead does each write one at a time
> > with the pwrite() function.
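
A minimal sketch of the non-AIO mode described above, in Python rather than whatever the original benchmark was written in. The function name pwrite_benchmark and its parameters are hypothetical; only the 256K write size and the three reported numbers (maximum, average, total) come from the description. It does not call fsync, which is an assumption about the original.

```python
import os
import time


def pwrite_benchmark(path, total_bytes, block_size=256 * 1024):
    """Write total_bytes to path, block_size at a time, using os.pwrite().

    Returns (max_write_time, avg_write_time, total_time), all in seconds.
    """
    if total_bytes <= 0 or block_size <= 0:
        raise ValueError("total_bytes and block_size must be positive")

    buf = b"\0" * block_size
    times = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.perf_counter()
    try:
        offset = 0
        while offset < total_bytes:
            n = min(block_size, total_bytes - offset)
            t0 = time.perf_counter()
            # pwrite() may write fewer bytes than requested; advance by
            # the count it actually reports.
            written = os.pwrite(fd, buf[:n], offset)
            times.append(time.perf_counter() - t0)
            offset += written
    finally:
        os.close(fd)
    total = time.perf_counter() - start

    return max(times), sum(times) / len(times), total
```

Calling pwrite_benchmark("/some/path", 8 * 1024**3) would roughly match the 8 G run, assuming "8 G" means 8 GiB.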
> >
> > To test B:
> >
> > We can change this behavior by adjusting the /proc/sys/vm/dirty* files.
> > They are documented in the Documentation/filesystems/proc.txt file in
> > the linux kernel source. The only one that really ended up being
> > interesting for us (after trial and error) is the dirty_ratio file. The
> > explanation given in the documentation is: "Contains, as a percentage of
> > total system memory, the number of pages at which a process which is
> > generating disk writes will itself start writing out dirty data.". It
> > defaults to 40, but some of the results below show what happens when it
> > is set to 1. There is also a dirty_background_ratio file, which
> > controls when pdflush decides to write out data in the background. That
> > would seem to be the more desirable tweak, but it didn't have the effect
> > that dirty_ratio did for some reason.
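
For reference, a small sketch of reading and setting these files from Python. The helper names read_vm_tunable and write_vm_tunable are hypothetical, and the base argument exists only so the helpers can be pointed at a scratch directory for testing. Writing to the real /proc/sys/vm files requires root.

```python
import os


def read_vm_tunable(name, base="/proc/sys/vm"):
    """Return the integer value of a VM tunable such as dirty_ratio."""
    with open(os.path.join(base, name)) as f:
        return int(f.read().strip())


def write_vm_tunable(name, value, base="/proc/sys/vm"):
    """Set a VM tunable (needs root when base is the real /proc/sys/vm)."""
    with open(os.path.join(base, name), "w") as f:
        f.write(f"{int(value)}\n")
```

For example, write_vm_tunable("dirty_ratio", 1) would apply the setting used in the results, and read_vm_tunable("dirty_ratio") would confirm it.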
> >
> > To test C:
> >
> > Reboot the machine with different I/O schedulers specified. The CFQ
> > scheduler is the default, but we switched to the AS (anticipatory)
> > scheduler using "elevator=as" on the kernel command line. The other
> > scheduler options (deadline, noop) didn't change much. The schedulers
> > also have tunable parameters in /sys/block/<DEVICE>/queue/iosched/*,
> > but they didn't seem to impact much either. The schedulers are somewhat
> > documented in the Documentation/block subdirectory in the linux kernel
> > source.
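
To check which scheduler a device ended up with after a reboot, 2.6 kernels also expose /sys/block/<DEVICE>/queue/scheduler, which lists the available schedulers with the active one in square brackets. That file is not mentioned in the description above, so treat this as a side note. The function name parse_scheduler is hypothetical.

```python
def parse_scheduler(text):
    """Parse the contents of /sys/block/<DEVICE>/queue/scheduler.

    The active scheduler is the name wrapped in square brackets.
    Returns (active, available); active is None if nothing is bracketed.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available
```

The function takes the file contents as a string, so the caller decides which device to read.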
> >
> > The results are listed below. The benchmarks show 3 things: The maximum
> > time that any individual write (during the course of the entire test
> > run) took, the average individual write time, and then the total
> > benchmark time. Everything is shown in seconds.
> >
> > The maximum single write time is what would have shown up as a long
> > "stall" in the PVFS2 I/O realm, so that is the most interesting value
> > in terms of our original problem.
> >
> > A few things to point out:
> >
> > - the choice of aio/pwrite didn't really matter a whole lot. Individual
> > aio operations take longer than pwrite, but they are overlapped and end
> > up giving basically the same overall throughput.
> > - the io scheduler and buffer cache settings can have a big impact on
> > the maximum single write time (much less on the average and total times)
> > - this wasn't the point of the test, but in this particular setup the
> > san is actually a little slower than local disk for writes (this is an
> > old san setup)
> >
> > local disk results:
> > - using the AS scheduler reduced the maximum stall time
> > significantly and improved total benchmark run time
> > - setting the dirty ratio to 1 further reduced the maximum stall time,
> > but also seemed to increase the total benchmark run time a little (maybe
> > there is a sweet spot between 40 and 1 for this value that doesn't
> > penalize the throughput as much?)
> >
> > san results:
> > - the AS scheduler didn't really help
> > - setting the dirty ratio to 1 reduced the maximum stall time significantly
> >
> > Maximum single write time (seconds)
> > -----------------------------------
> >                 default         AS              AS,dirty_ratio=1
> > aio local       30.874424       2.040070        0.907068
> > pwrite local    28.146439       4.423536        1.052867
> >
> > aio san         46.486595       46.813606       6.161530
> > pwrite san      17.991354       10.994622       6.119389
> >
> > Average single write time (seconds)
> > -----------------------------------
> >                 default         AS              AS,dirty_ratio=1
> > aio local       0.061520        0.057819        0.064450
> > pwrite local    0.003711        0.003567        0.004022
> >
> > aio san         0.095062        0.096853        0.095410
> > pwrite san      0.005551        0.005713        0.005619
> >
> > Total benchmark time (seconds)
> > ------------------------------
> >                 default         AS              AS,dirty_ratio=1
> > aio local       252.018623      236.855234      264.018140
> > pwrite local    243.552892      234.140043      263.995362
> >
> > aio san         389.380213      396.724146      390.813488
> > pwrite san      364.203958      374.827604      368.691822
> >
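
To put rough numbers on "reduced the maximum stall time significantly", the sketch below divides the default maximum single write time by the time under each tuned configuration. The values are copied from the maximum single write time results; the names max_write and reduction_factors are hypothetical. Since each data point is a single run, the factors are only indicative.

```python
# Maximum single write time in seconds, per configuration:
# (default, AS, AS with dirty_ratio=1)
max_write = {
    "aio local": (30.874424, 2.040070, 0.907068),
    "pwrite local": (28.146439, 4.423536, 1.052867),
    "aio san": (46.486595, 46.813606, 6.161530),
    "pwrite san": (17.991354, 10.994622, 6.119389),
}


def reduction_factors(row):
    """Return (default/AS, default/AS+dirty_ratio=1) for one table row.

    A factor above 1 means the worst-case write got shorter.
    """
    default, as_only, as_dirty = row
    return default / as_only, default / as_dirty


if __name__ == "__main__":
    for name, row in max_write.items():
        as_factor, dirty_factor = reduction_factors(row)
        print(f"{name:14s} AS: {as_factor:6.2f}x  "
              f"AS,dirty_ratio=1: {dirty_factor:6.2f}x")
```

A factor below 1 in the AS column, as for "aio san", means the AS scheduler alone did not shorten the worst-case write in that run.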
> > These results aren't super scientific; in all cases it is just one test
> > run per data point and no averaging. We also didn't exhaustively try
> > many parameter combinations. This is also a write-only test; no telling
> > what these parameters do to other workloads.
> >
> > We don't really have time to follow through with this any further, but
> > it does show that these VM and iosched settings might be interesting to
> > tune in some cases.
> >
> > If anyone has any similar experiences to share we would love to hear
> > about it.
> >
> > -Phil
> >
> >
> >
> > _______________________________________________
> > Pvfs2-developers mailing list
> > Pvfs2-developers at beowulf-underground.org
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
> >