[Pvfs2-developers] patch: alternate AIO implementation
Sam Lang
slang at mcs.anl.gov
Fri Aug 11 15:25:57 EDT 2006
Hi Phil,
I went ahead and committed this patch to trunk. The changes are
relatively small and you've demonstrated good performance improvements
from them! Longer term, I'm also going to try to merge Julian's
threaded implementation with O_DIRECT support into trunk, so that we
can still have some control over grouping and scheduling operations.
-sam
On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
> Background:
>
> We have been a little suspicious of the POSIX AIO performance on some
> of our servers. After digging in the glibc code a little, we found a
> possible problem. Glibc's AIO will spawn up to 16 threads by default,
> but will never assign more than a single thread to a given fd. That
> thread will then service all operations on that fd sequentially using
> a FIFO queue. This means that if several clients are performing I/O to
> the same datafile, then all of their I/O requests get pushed to the
> disk sequentially (and probably not in order by offset).
>
> Patch:
>
> This patch replaces the lio_listio() calls with a macro called
> LIO_LISTIO(). You can then toggle what this macro does by using a
> config file option "TroveAltAIOMode yes|no". If the option is not
> specified (or is set to no) then the normal code path is taken. If the
> option is enabled, then the macro inspects its arguments. If the
> operation is a single-buffer read or write, then it immediately spawns
> a new detached thread, services the operation using p{read/write},
> triggers a callback function, and exits. More complex operations are
> sent down the usual lio_listio() route.
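
The dispatch rule described above can be sketched as a small predicate. This is only an illustration under assumptions: the function name alt_path_applies and its alt_mode_enabled parameter are hypothetical, and the patch itself performs the equivalent check inline in LIO_LISTIO().

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the patch): returns 1 when a request
 * qualifies for the alternate path, i.e. the option is enabled, the call
 * is non-blocking, there is exactly one buffer, and the operation is a
 * plain read or write. Returns 0 otherwise. */
int alt_path_applies(int alt_mode_enabled, int mode,
                     struct aiocb *const list[], int nent)
{
    if (!alt_mode_enabled || mode != LIO_NOWAIT || nent != 1)
        return 0;
    if (list == NULL || list[0] == NULL)
        return 0;
    return list[0]->aio_lio_opcode == LIO_READ ||
           list[0]->aio_lio_opcode == LIO_WRITE;
}
```

Anything the predicate rejects (multiple buffers, LIO_WAIT, LIO_NOP) would fall through to the stock lio_listio() call.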
>
> The idea is basically to get the requests off to the kernel as
> quickly as possible, without queueing, so that the kernel can sort out
> how best to service them. Trove doesn't care about ordering at that
> level.
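
A minimal sketch of the thread-per-request idea, assuming a read-only case: each request gets its own detached thread that issues one blocking pread() and then runs a callback. The struct io_req type and the io_req_* names are hypothetical and are not Trove's; the pipe-based callback is just one convenient way for a caller to wait in a demo.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request record; field names are illustrative only. */
struct io_req {
    int fd;
    void *buf;
    size_t len;
    off_t off;
    ssize_t result;                 /* bytes read, or -1 on failure */
    int err;                        /* errno value when result is -1 */
    void (*done)(struct io_req *);  /* completion callback */
    void *done_arg;                 /* opaque data for the callback */
};

/* Thread body: one blocking pread(), then the callback, then exit. */
void *io_req_thread(void *arg)
{
    struct io_req *req = arg;

    req->result = pread(req->fd, req->buf, req->len, req->off);
    req->err = (req->result < 0) ? errno : 0;
    req->done(req);
    return NULL;
}

/* Start a detached thread for req; returns 0 or a pthread error code. */
int io_req_submit(struct io_req *req)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ret;

    ret = pthread_attr_init(&attr);
    if (ret != 0)
        return ret;
    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (ret == 0)
        ret = pthread_create(&tid, &attr, io_req_thread, req);
    pthread_attr_destroy(&attr);
    return ret;
}

/* Example callback: wake a waiter by writing one byte to a pipe whose
 * write end is passed through done_arg. */
void io_req_notify_pipe(struct io_req *req)
{
    char byte = 1;
    ssize_t n = write(*(int *)req->done_arg, &byte, 1);
    (void)n;
}
```

No queue or lock sits between submission and the system call, so concurrent requests on the same fd reach the kernel independently. The request record must stay valid until the callback has run.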
>
> Drawbacks:
>
> - This option/implementation is only reasonable for systems with NPTL,
> because of the low thread spawning overhead. Non-NPTL systems will
> probably find the cost to be higher. As a side note, we tried an
> implementation that kept a pool of threads and sent operations to
> those threads, but we found that the overhead of synchronization and
> signaling in this approach was (surprisingly) much higher than the
> cost of just creating a brand new thread for every operation, which
> requires no synchronization.
> - This implementation only helps contiguous reads or writes as they
> appear to Trove. You could extend it to work for other patterns by
> just doing a series of preads and pwrites to work down the list of
> buffers, but we did not handle this case.
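
The unhandled case in the second drawback could look roughly like the sketch below, which walks a list of control blocks and services each one with a blocking pread() or pwrite(). The function name service_list_sequentially is hypothetical, and the patch does not include anything like it. Short transfers are counted as they are and not retried, which a real implementation would need to decide on.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: service every entry of list in order. Returns the
 * total number of bytes transferred, or -1 with errno set on the first
 * failure. NULL entries and LIO_NOP entries are skipped. */
ssize_t service_list_sequentially(struct aiocb *const list[], int nent)
{
    ssize_t total = 0;

    for (int i = 0; i < nent; i++) {
        struct aiocb *cb = list[i];
        ssize_t n;

        if (cb == NULL || cb->aio_lio_opcode == LIO_NOP)
            continue;

        if (cb->aio_lio_opcode == LIO_READ) {
            n = pread(cb->aio_fildes, (void *)cb->aio_buf,
                      cb->aio_nbytes, cb->aio_offset);
        } else if (cb->aio_lio_opcode == LIO_WRITE) {
            n = pwrite(cb->aio_fildes, (const void *)cb->aio_buf,
                       cb->aio_nbytes, cb->aio_offset);
        } else {
            errno = EINVAL;
            return -1;
        }

        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```

Run from a detached thread, a loop like this would keep the whole list off the glibc per-fd queue, at the cost of serializing the buffers within one request.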
>
> Results:
>
> We didn't see a big gain from this approach at first, but since then
> we have taken care of some other bottlenecks that make the improvement
> more obvious. It also seems that the performance boost varies quite a
> bit depending on the type of system you run it on. We have some new
> servers (results shown below) that benefitted greatly from this
> optimization.
>
> The numbers below show the results from a setup with 16 servers and a
> variable number of clients and number of processes per client. The
> benchmark performs a read-only access pattern with 100 MB buffers.
> All clients access the same 40 GB file (we rotate among several to
> avoid caching). The file is divided into contiguous regions, one per
> process. We are using local hardware RAID at each server, and gigabit
> Ethernet for communication.
>
> Before optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
>
> 1 x 1 - 97.8
> 1 x 2 - 110.4
> 1 x 5 - 111.1
> 12 x 1 - 195.8
> 12 x 2 - 138.8
> 25 x 1 - 160.4
> 25 x 2 - 178.0
>
> After optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
> 1 x 1 - 93.4
> 1 x 2 - 109.2
> 1 x 5 - 108.9
> 12 x 1 - 443.1
> 12 x 2 - 502.6
> 25 x 1 - 496.7
> 25 x 2 - 550.7
>
> To confirm the cause of the problem, we performed a variation on the
> test where each client read an independent file, rather than the
> clients all hitting the same file. Running this benchmark with 12
> client nodes (one process per node) resulted in a consistent 430 MB/s
> of aggregate throughput regardless of whether the new AIO path was
> used or not. This seems to confirm that the problem is a result of the
> sequential queueing that the normal AIO implementation does when
> multiple requests hit the same file.
>
> For these particular machines we were able to double or triple the
> read throughput for a parallel application that shared one large file.
> I am fairly sure that not all of our machines demonstrate this problem
> to such a drastic degree, but we will probably be testing some other
> setups later to get a better idea.
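
As a quick check of the "double or triple" figure, the ratios can be computed directly from the two tables above. The sketch below just encodes the posted numbers; the type and function names are hypothetical. By this arithmetic the multi-client rows improve by roughly 2.3x to 3.6x, while the single-client rows are slightly below 1x.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the posted results: client nodes, processes per node, and
 * aggregate MB/s before and after the optimization. */
struct result_row {
    int nodes;
    int procs;
    double before_mbs;
    double after_mbs;
};

/* Values copied from the two tables in the message. */
const struct result_row shared_file_results[] = {
    {  1, 1,  97.8,  93.4 },
    {  1, 2, 110.4, 109.2 },
    {  1, 5, 111.1, 108.9 },
    { 12, 1, 195.8, 443.1 },
    { 12, 2, 138.8, 502.6 },
    { 25, 1, 160.4, 496.7 },
    { 25, 2, 178.0, 550.7 },
};

const size_t shared_file_result_count =
    sizeof(shared_file_results) / sizeof(shared_file_results[0]);

/* Ratio of throughput after the change to throughput before it. */
double speedup(const struct result_row *row)
{
    return row->after_mbs / row->before_mbs;
}
```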
>
> -Phil
> diff -Naur pvfs2/src/common/misc/server-config.c pvfs2-new/src/common/misc/server-config.c
> --- pvfs2/src/common/misc/server-config.c 2006-08-02 17:13:00.000000000 +0200
> +++ pvfs2-new/src/common/misc/server-config.c 2006-08-03 21:57:35.000000000 +0200
> @@ -71,6 +71,7 @@
> static DOTCONF_CB(get_attr_cache_size);
> static DOTCONF_CB(get_attr_cache_max_num_elems);
> static DOTCONF_CB(get_trove_sync_meta);
> +static DOTCONF_CB(get_trove_alt_aio);
> static DOTCONF_CB(get_trove_sync_data);
> static DOTCONF_CB(get_db_cache_size_bytes);
> static DOTCONF_CB(get_db_cache_type);
> @@ -656,6 +657,12 @@
> {"DBCacheType", ARG_STR, get_db_cache_type, NULL,
> CTX_STORAGEHINTS, "sys"},
>
> + /* enable alternate AIO implementation for certain types of I/O
> + * operations (experimental)
> + */
> + {"TroveAltAIOMode",ARG_STR, get_trove_alt_aio, NULL,
> + CTX_DEFAULTS|CTX_GLOBAL,"no"},
> +
> /* Specifies the format of the date/timestamp that events will have
> * in the event log. Possible values are:
> *
> @@ -1478,6 +1485,28 @@
> return NULL;
> }
>
> +DOTCONF_CB(get_trove_alt_aio)
> +{
> + struct server_configuration_s *config_s =
> + (struct server_configuration_s *)cmd->context;
> +
> + if(strcasecmp(cmd->data.str, "yes") == 0)
> + {
> + config_s->trove_alt_aio_mode = 1;
> + }
> + else if(strcasecmp(cmd->data.str, "no") == 0)
> + {
> + config_s->trove_alt_aio_mode = 0;
> + }
> + else
> + {
> + return("TroveAltAIOMode value must be 'yes' or 'no'.\n");
> + }
> +
> + return NULL;
> +}
> +
> +
> DOTCONF_CB(get_trove_sync_meta)
> {
> struct filesystem_configuration_s *fs_conf = NULL;
> diff -Naur pvfs2/src/common/misc/server-config.h pvfs2-new/src/common/misc/server-config.h
> --- pvfs2/src/common/misc/server-config.h 2006-07-13 07:11:40.000000000 +0200
> +++ pvfs2-new/src/common/misc/server-config.h 2006-08-03 21:58:25.000000000 +0200
> @@ -146,7 +146,10 @@
> int db_cache_size_bytes; /* cache size to use in berkeley db
> if zero, use defaults */
> char * db_cache_type;
> -
> + int trove_alt_aio_mode; /* enables experimental alternative AIO
> + * implementation for some types of
> + * operations
> + */
> } server_configuration_s;
>
> int PINT_parse_config(
> diff -Naur pvfs2/src/io/trove/trove-dbpf/dbpf-bstream.c pvfs2-new/src/io/trove/trove-dbpf/dbpf-bstream.c
> --- pvfs2/src/io/trove/trove-dbpf/dbpf-bstream.c 2006-06-23 22:59:29.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove-dbpf/dbpf-bstream.c 2006-08-03 21:55:22.000000000 +0200
> @@ -73,6 +73,41 @@
> static int dbpf_bstream_flush_op_svc(struct dbpf_op *op_p);
> static int dbpf_bstream_resize_op_svc(struct dbpf_op *op_p);
>
> +struct alt_aio_item
> +{
> + struct aiocb *cb_p;
> + struct sigevent *sig;
> + struct qlist_head list_link;
> +};
> +static int alt_lio_listio(int mode, struct aiocb *list[],
> + int nent, struct sigevent *sig);
> +static void* alt_lio_thread(void*);
> +extern int TROVE_alt_aio_mode;
> +
> +
> +#ifdef __PVFS2_TROVE_AIO_THREADED__
> +/* allow bypassing default lio_listio implementation if user requests it and
> + * some conditions are met
> + */
> +static inline int LIO_LISTIO(int mode, struct aiocb *list[],
> + int nent, struct sigevent *sig)
> +{
> + if((TROVE_alt_aio_mode) && (nent == 1) &&
> + (((list[0])->aio_lio_opcode == LIO_READ) ||
> + ((list[0])->aio_lio_opcode == LIO_WRITE)) &&
> + (mode == LIO_NOWAIT))
> + {
> + return(alt_lio_listio(mode, list, nent, sig));
> + }
> + else
> + {
> + return(lio_listio(mode, list, nent, sig));
> + }
> +}
> +#else
> +#define LIO_LISTIO lio_listio
> +#endif
> +
> #ifdef __PVFS2_TROVE_AIO_THREADED__
> #include "dbpf-thread.h"
> #include "pvfs2-internal.h"
> @@ -321,7 +356,7 @@
> }
> }
>
> - ret = lio_listio(LIO_NOWAIT, aiocb_ptr_array, aiocb_inuse_count,
> + ret = LIO_LISTIO(LIO_NOWAIT, aiocb_ptr_array, aiocb_inuse_count,
> &cur_op->op.u.b_rw_list.sigev);
>
> if (ret != 0)
> @@ -423,7 +458,7 @@
> }
> }
>
> - ret = lio_listio(LIO_NOWAIT, aiocb_ptr_array,
> + ret = LIO_LISTIO(LIO_NOWAIT, aiocb_ptr_array,
> aiocb_inuse_count, sig);
> if (ret != 0)
> {
> @@ -1337,6 +1372,108 @@
> dbpf_bstream_flush
> };
>
> +int alt_lio_listio(int mode, struct aiocb *list[],
> + int nent, struct sigevent *sig)
> +{
> + struct alt_aio_item* tmp_item;
> + int ret;
> + pthread_t tid;
> + pthread_attr_t attr;
> +
> + /* alt_lio only supports a subset of the full lio functionality */
> + /* NOTE: an earlier check is supposed to make sure that we don't invoke
> + * this function for unsupported cases
> + */
> + assert(mode == LIO_NOWAIT);
> + assert(nent == 1);
> + assert((list[0]->aio_lio_opcode == LIO_READ) ||
> + (list[0]->aio_lio_opcode == LIO_WRITE));
> +
> + tmp_item = (struct alt_aio_item*)malloc(sizeof(struct alt_aio_item));
> + if(!tmp_item)
> + {
> + /* preserve errno */
> + return(-1);
> + }
> + tmp_item->cb_p = list[0];
> + tmp_item->sig = sig;
> +
> + /* set detached state */
> + ret = pthread_attr_init(&attr);
> + if(ret != 0)
> + {
> + free(tmp_item);
> + errno = ret;
> + return(-1);
> + }
> + ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> + if(ret != 0)
> + {
> + free(tmp_item);
> + errno = ret;
> + return(-1);
> + }
> +
> + /* create thread to perform I/O and trigger callback */
> + ret = pthread_create(&tid, &attr, alt_lio_thread, tmp_item);
> + if(ret != 0)
> + {
> + free(tmp_item);
> + errno = ret;
> + return(-1);
> + }
> +
> + return(0);
> +}
> +
> +/* prototypes for pread and pwrite; _XOPEN_SOURCE causes db.h problems */
> +ssize_t pread(int fd, void *buf, size_t count, off_t offset);
> +ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
> +static void* alt_lio_thread(void* foo)
> +{
> + struct alt_aio_item* tmp_item = (struct alt_aio_item*)foo;
> + int ret = 0;
> +
> + if(tmp_item->cb_p->aio_lio_opcode == LIO_READ)
> + {
> + ret = pread(tmp_item->cb_p->aio_fildes,
> + (void*)tmp_item->cb_p->aio_buf,
> + tmp_item->cb_p->aio_nbytes,
> + tmp_item->cb_p->aio_offset);
> + }
> + else if(tmp_item->cb_p->aio_lio_opcode == LIO_WRITE)
> + {
> + ret = pwrite(tmp_item->cb_p->aio_fildes,
> + (const void*)tmp_item->cb_p->aio_buf,
> + tmp_item->cb_p->aio_nbytes,
> + tmp_item->cb_p->aio_offset);
> + }
> + else
> + {
> + /* this should have been caught already */
> + assert(0);
> + }
> +
> + /* store error and return codes */
> + if(ret < 0)
> + {
> + tmp_item->cb_p->__error_code = errno;
> + }
> + else
> + {
> + tmp_item->cb_p->__error_code = 0;
> + tmp_item->cb_p->__return_value = ret;
> + }
> +
> + /* run callback fn */
> + tmp_item->sig->sigev_notify_function(
> + tmp_item->sig->sigev_value);
> +
> + free(tmp_item);
> +
> + return(NULL);
> +}
> +
> /*
> * Local variables:
> * c-indent-level: 4
> diff -Naur pvfs2/src/io/trove/trove.c pvfs2-new/src/io/trove/trove.c
> --- pvfs2/src/io/trove/trove.c 2006-06-16 23:01:13.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove.c 2006-08-03 21:55:56.000000000 +0200
> @@ -30,6 +30,7 @@
> struct PINT_perf_counter* PINT_server_pc = NULL;
>
> int TROVE_db_cache_size_bytes = 0;
> +int TROVE_alt_aio_mode = 0;
> int TROVE_shm_key_hint = 0;
>
> /** Initiate reading from a contiguous region in a bstream into a
> @@ -964,6 +965,11 @@
> TROVE_shm_key_hint = *((int*)parameter);
> return(0);
> }
> + if(option == TROVE_ALT_AIO_MODE)
> + {
> + TROVE_alt_aio_mode = *((int*)parameter);
> + return(0);
> + }
>
> method_id = map_coll_id_to_method(coll_id);
> if (method_id < 0) {
> diff -Naur pvfs2/src/io/trove/trove.h pvfs2-new/src/io/trove/trove.h
> --- pvfs2/src/io/trove/trove.h 2006-07-13 07:11:41.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove.h 2006-08-03 21:56:23.000000000 +0200
> @@ -72,6 +72,7 @@
> TROVE_COLLECTION_ATTR_CACHE_MAX_NUM_ELEMS,
> TROVE_COLLECTION_ATTR_CACHE_INITIALIZE,
> TROVE_DB_CACHE_SIZE_BYTES,
> + TROVE_ALT_AIO_MODE,
> TROVE_COLLECTION_COALESCING_HIGH_WATERMARK,
> TROVE_COLLECTION_COALESCING_LOW_WATERMARK,
> TROVE_COLLECTION_META_SYNC_MODE,
> diff -Naur pvfs2/src/server/pvfs2-server.c pvfs2-new/src/server/pvfs2-server.c
> --- pvfs2/src/server/pvfs2-server.c 2006-07-13 07:11:42.000000000 +0200
> +++ pvfs2-new/src/server/pvfs2-server.c 2006-08-03 21:54:02.000000000 +0200
> @@ -950,6 +950,10 @@
>
> &server_config.db_cache_size_bytes);
> /* this should never fail */
> assert(ret == 0);
> + ret = trove_collection_setinfo(0, 0, TROVE_ALT_AIO_MODE,
> + &server_config.trove_alt_aio_mode);
> + /* this should never fail */
> + assert(ret == 0);
>
> /* parse port number and allow trove to use it to help differentiate
> * shmem regions if needed