[Pvfs2-developers] patch: alternate AIO implementation
Sam Lang
slang at mcs.anl.gov
Thu Aug 10 14:30:49 EDT 2006
Hi Phil,
Did you have a chance to look at the pwrite/pread threaded
implementation Julian wrote? He did a thread pooling implementation
that sounds similar to what you guys did, as well as a basic
implementation for O_DIRECT. I haven't looked at your patch much
yet, but I wonder if it makes sense to try to combine the best of
both implementations.
-sam
On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:
> Background:
>
> We have been a little suspicious of the posix aio performance on some
> of our servers. After digging in the glibc code a little, we found a
> possible problem. Glibc's aio will spawn up to 16 threads by default,
> but will never assign more than a single thread to a given fd. That
> thread will then service all operations on that fd sequentially using
> a FIFO queue. This means that if several clients are performing I/O to
> the same datafile, then all of their I/O requests get pushed to the
> disk sequentially (and probably not in order by offset).
>
> Patch:
>
> This patch replaces the lio_listio() calls with a macro called
> LIO_LISTIO(). You can then toggle what this macro does by using a
> config file option "TroveAltAIOMode yes|no". If the option is not
> specified (or is set to no) then the normal code path is taken. If the
> option is enabled, then it looks at the arguments. If the operation is
> a single buffer read or write, then it immediately spawns a new
> detached thread, services the operation using p{read/write}, triggers
> a callback function, and exits. More complex operations are sent to
> the usual lio_listio() route.
>
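The thread-per-operation path described above can be sketched in isolation as follows. This is a minimal illustration, not the patch itself: struct io_req, io_submit_detached(), io_thread(), and the count_completion() callback are hypothetical names invented for this example, and the atomic counter exists only so that a caller can tell when the callback has run.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request record for illustration; not a Trove/PVFS2 type. */
struct io_req {
    int fd;
    void *buf;
    size_t len;
    off_t off;
    int is_write;                    /* 0 = pread, nonzero = pwrite */
    ssize_t result;                  /* bytes transferred, or -1 on error */
    int err;                         /* errno from a failed call, else 0 */
    void (*done)(struct io_req *);   /* completion callback */
};

/* Demo-only completion counter so a caller can tell when callbacks ran. */
static atomic_int completed;

static void count_completion(struct io_req *r)
{
    (void)r;
    atomic_fetch_add(&completed, 1);
}

/* Thread body: one positional I/O call, record the outcome, notify, exit. */
static void *io_thread(void *arg)
{
    struct io_req *r = arg;

    if (r->is_write)
        r->result = pwrite(r->fd, r->buf, r->len, r->off);
    else
        r->result = pread(r->fd, r->buf, r->len, r->off);
    r->err = (r->result < 0) ? errno : 0;

    r->done(r);
    return NULL;
}

/* Hand one request to a fresh detached thread.
 * Returns 0 on success or a pthread error number. */
static int io_submit_detached(struct io_req *r)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ret = pthread_attr_init(&attr);

    if (ret != 0)
        return ret;
    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (ret == 0)
        ret = pthread_create(&tid, &attr, io_thread, r);
    pthread_attr_destroy(&attr);
    return ret;
}
```

Because the thread is detached and touches nothing shared except the request it was given, the submit path needs no locks or condition variables. The request must stay valid until the callback has fired.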
> The idea is basically to get the requests off to the kernel as
> quickly as possible without queueing so that the kernel can sort out
> how to best service them. Trove doesn't care about ordering at that
> level.
>
> Drawbacks:
>
> - This option/implementation is only reasonable for systems with NPTL,
> because of the low thread spawning overhead. Non-NPTL systems will
> probably find the cost to be higher. As a side note, we tried an
> implementation that kept a pool of threads and sent operations to
> those threads, but we found that the overhead of synchronization and
> signaling in this approach was (surprisingly) much higher than the
> cost of just creating brand new threads on every operation that did
> not require synchronization.
> - This implementation only helps contiguous reads or writes as they
> appear to Trove. You could extend it to work for other patterns by
> just doing a series of preads and pwrites to work down the list of
> buffers, but we did not handle this case.
>
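The extension mentioned in the second drawback (working down a list of buffers with a series of preads) might look roughly like the sketch below. It is not part of the patch: struct io_seg and pread_segments() are hypothetical names, and the write direction would follow the same pattern with pwrite().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one memory/file region pair. */
struct io_seg {
    void *buf;     /* destination buffer */
    size_t len;    /* bytes wanted */
    off_t off;     /* file offset to read from */
};

/* Read each segment in order with pread().
 * Returns the total number of bytes read (short if EOF is reached),
 * or -1 with errno set if a pread() call fails. */
static ssize_t pread_segments(int fd, const struct io_seg *segs, int nseg)
{
    ssize_t total = 0;

    for (int i = 0; i < nseg; i++) {
        size_t got = 0;

        while (got < segs[i].len) {
            ssize_t n = pread(fd, (char *)segs[i].buf + got,
                              segs[i].len - got,
                              segs[i].off + (off_t)got);
            if (n < 0) {
                if (errno == EINTR)
                    continue;        /* interrupted, retry same range */
                return -1;
            }
            if (n == 0)
                return total;        /* end of file */
            got += (size_t)n;
            total += n;
        }
    }
    return total;
}
```

A function like this could run inside the per-operation thread in place of the single pread() call, at the cost of issuing the segments one after another rather than in parallel.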
> Results:
>
> We didn't see a big gain from this approach at first, but since then
> we have taken care of some other bottlenecks that make the improvement
> more obvious. It also seems that the performance boost varies quite a
> bit depending on the type of system you run it on. We have some new
> servers (results shown below) that benefitted greatly from this
> optimization.
>
> The numbers below show the results from a setup with 16 servers and a
> variable number of clients and number of processes per client. The
> benchmark performs a read-only access pattern with 100 MB buffers.
> All clients access the same 40 GB file (we rotate among several to
> avoid caching). The file is divided into contiguous regions, one per
> process. We are using local hardware RAID at each server, and gigabit
> ethernet for communication.
>
> Before optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
>
> 1 x 1 - 97.8
> 1 x 2 - 110.4
> 1 x 5 - 111.1
> 12 x 1 - 195.8
> 12 x 2 - 138.8
> 25 x 1 - 160.4
> 25 x 2 - 178.0
>
> After optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
> 1 x 1 - 93.4
> 1 x 2 - 109.2
> 1 x 5 - 108.9
> 12 x 1 - 443.1
> 12 x 2 - 502.6
> 25 x 1 - 496.7
> 25 x 2 - 550.7
>
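For a row-by-row comparison of the two tables, the ratio of "after" to "before" throughput can be computed with a few lines of C. The figures are copied from the tables above; the struct and function names are made up for this sketch. The single-client rows come out slightly below 1.0 (roughly 0.96 to 0.99), while the 12 and 25 node rows land between about 2.3x and 3.6x.

```c
#include <assert.h>
#include <stdio.h>

/* One benchmark configuration with throughput in MB/s. */
struct result_row {
    int nodes;           /* client nodes */
    int procs;           /* processes per node */
    double before_mbs;   /* stock lio_listio() path */
    double after_mbs;    /* alternate AIO path */
};

/* Values taken directly from the two tables in the message. */
static const struct result_row rows[] = {
    {  1, 1,  97.8,  93.4 },
    {  1, 2, 110.4, 109.2 },
    {  1, 5, 111.1, 108.9 },
    { 12, 1, 195.8, 443.1 },
    { 12, 2, 138.8, 502.6 },
    { 25, 1, 160.4, 496.7 },
    { 25, 2, 178.0, 550.7 },
};
static const int nrows = sizeof(rows) / sizeof(rows[0]);

/* Ratio of throughput after the change to throughput before it. */
static double speedup(const struct result_row *r)
{
    return r->after_mbs / r->before_mbs;
}

/* Print one "nodes x procs: ratio" line per configuration. */
static void print_speedups(void)
{
    for (int i = 0; i < nrows; i++)
        printf("%2d x %d: %.2fx\n",
               rows[i].nodes, rows[i].procs, speedup(&rows[i]));
}
```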
> To confirm the cause of the problem, we performed a variation on the
> test where each client read an independent file, rather than the
> clients all hitting the same file. Running this benchmark with 12
> client nodes (one process per node) resulted in a consistent 430 MB/s
> of aggregate throughput regardless of whether the new AIO path was
> used or not. This seems to confirm that the problem is a result of the
> sequential queueing that the normal AIO implementation does when
> multiple requests hit the same file.
>
> For these particular machines we were able to double or triple the
> read throughput for a parallel application that shared one large file.
> I am fairly sure that not all of our machines demonstrate this problem
> to such a drastic degree, but we will probably be testing some other
> setups later to get a better idea.
>
> -Phil
> diff -Naur pvfs2/src/common/misc/server-config.c pvfs2-new/src/common/misc/server-config.c
> --- pvfs2/src/common/misc/server-config.c 2006-08-02 17:13:00.000000000 +0200
> +++ pvfs2-new/src/common/misc/server-config.c 2006-08-03 21:57:35.000000000 +0200
> @@ -71,6 +71,7 @@
> static DOTCONF_CB(get_attr_cache_size);
> static DOTCONF_CB(get_attr_cache_max_num_elems);
> static DOTCONF_CB(get_trove_sync_meta);
> +static DOTCONF_CB(get_trove_alt_aio);
> static DOTCONF_CB(get_trove_sync_data);
> static DOTCONF_CB(get_db_cache_size_bytes);
> static DOTCONF_CB(get_db_cache_type);
> @@ -656,6 +657,12 @@
> {"DBCacheType", ARG_STR, get_db_cache_type, NULL,
> CTX_STORAGEHINTS, "sys"},
>
> + /* enable alternate AIO implementation for certain types of I/O
> + * operations (experimental)
> + */
> + {"TroveAltAIOMode",ARG_STR, get_trove_alt_aio, NULL,
> + CTX_DEFAULTS|CTX_GLOBAL,"no"},
> +
> /* Specifies the format of the date/timestamp that events will have
> * in the event log. Possible values are:
> *
> @@ -1478,6 +1485,28 @@
> return NULL;
> }
>
> +DOTCONF_CB(get_trove_alt_aio)
> +{
> + struct server_configuration_s *config_s =
> + (struct server_configuration_s *)cmd->context;
> +
> + if(strcasecmp(cmd->data.str, "yes") == 0)
> + {
> + config_s->trove_alt_aio_mode = 1;
> + }
> + else if(strcasecmp(cmd->data.str, "no") == 0)
> + {
> + config_s->trove_alt_aio_mode = 0;
> + }
> + else
> + {
> + return("TroveAltAIOMode value must be 'yes' or 'no'.\n");
> + }
> +
> + return NULL;
> +}
> +
> +
> DOTCONF_CB(get_trove_sync_meta)
> {
> struct filesystem_configuration_s *fs_conf = NULL;
> diff -Naur pvfs2/src/common/misc/server-config.h pvfs2-new/src/common/misc/server-config.h
> --- pvfs2/src/common/misc/server-config.h 2006-07-13 07:11:40.000000000 +0200
> +++ pvfs2-new/src/common/misc/server-config.h 2006-08-03 21:58:25.000000000 +0200
> @@ -146,7 +146,10 @@
> int db_cache_size_bytes; /* cache size to use in berkeley db
> if zero, use defaults */
> char * db_cache_type;
> -
> + int trove_alt_aio_mode; /* enables experimental alternative AIO
> + * implementation for some types of
> + * operations
> + */
> } server_configuration_s;
>
> int PINT_parse_config(
> diff -Naur pvfs2/src/io/trove/trove-dbpf/dbpf-bstream.c pvfs2-new/src/io/trove/trove-dbpf/dbpf-bstream.c
> --- pvfs2/src/io/trove/trove-dbpf/dbpf-bstream.c 2006-06-23 22:59:29.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove-dbpf/dbpf-bstream.c 2006-08-03 21:55:22.000000000 +0200
> @@ -73,6 +73,41 @@
> static int dbpf_bstream_flush_op_svc(struct dbpf_op *op_p);
> static int dbpf_bstream_resize_op_svc(struct dbpf_op *op_p);
>
> +struct alt_aio_item
> +{
> + struct aiocb *cb_p;
> + struct sigevent *sig;
> + struct qlist_head list_link;
> +};
> +static int alt_lio_listio(int mode, struct aiocb *list[],
> + int nent, struct sigevent *sig);
> +static void* alt_lio_thread(void*);
> +extern int TROVE_alt_aio_mode;
> +
> +
> +#ifdef __PVFS2_TROVE_AIO_THREADED__
> +/* allow bypassing default lio_listio implementation if user requests it and
> + * some conditions are met
> + */
> +static inline int LIO_LISTIO(int mode, struct aiocb *list[],
> + int nent, struct sigevent *sig)
> +{
> + if((TROVE_alt_aio_mode) && (nent == 1) &&
> + (((list[0])->aio_lio_opcode == LIO_READ) ||
> + ((list[0])->aio_lio_opcode == LIO_WRITE)) &&
> + (mode == LIO_NOWAIT))
> + {
> + return(alt_lio_listio(mode, list, nent, sig));
> + }
> + else
> + {
> + return(lio_listio(mode, list, nent, sig));
> + }
> +}
> +#else
> +#define LIO_LISTIO lio_listio
> +#endif
> +
> #ifdef __PVFS2_TROVE_AIO_THREADED__
> #include "dbpf-thread.h"
> #include "pvfs2-internal.h"
> @@ -321,7 +356,7 @@
> }
> }
>
> - ret = lio_listio(LIO_NOWAIT, aiocb_ptr_array, aiocb_inuse_count,
> + ret = LIO_LISTIO(LIO_NOWAIT, aiocb_ptr_array, aiocb_inuse_count,
> &cur_op->op.u.b_rw_list.sigev);
>
> if (ret != 0)
> @@ -423,7 +458,7 @@
> }
> }
>
> - ret = lio_listio(LIO_NOWAIT, aiocb_ptr_array,
> + ret = LIO_LISTIO(LIO_NOWAIT, aiocb_ptr_array,
> aiocb_inuse_count, sig);
> if (ret != 0)
> {
> @@ -1337,6 +1372,108 @@
> dbpf_bstream_flush
> };
>
> +int alt_lio_listio(int mode, struct aiocb *list[],
> + int nent, struct sigevent *sig)
> +{
> + struct alt_aio_item* tmp_item;
> + int ret;
> + pthread_t tid;
> + pthread_attr_t attr;
> +
> + /* alt_lio only supports a subset of the full lio functionality */
> + /* NOTE: an earlier check is supposed to make sure that we don't invoke
> + * this function for unsupported cases
> + */
> + assert(mode == LIO_NOWAIT);
> + assert(nent == 1);
> + assert((list[0]->aio_lio_opcode == LIO_READ) ||
> + (list[0]->aio_lio_opcode == LIO_WRITE));
> +
> + tmp_item = (struct alt_aio_item*)malloc(sizeof(struct alt_aio_item));
> + if(!tmp_item)
> + {
> + /* preserve errno */
> + return(-1);
> + }
> + tmp_item->cb_p = list[0];
> + tmp_item->sig = sig;
> +
> + /* set detached state */
> + ret = pthread_attr_init(&attr);
> + if(ret != 0)
> + {
> + free(tmp_item);
> + errno = ret;
> + return(-1);
> + }
> + ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> + if(ret != 0)
> + {
> + free(tmp_item);
> + errno = ret;
> + return(-1);
> + }
> +
> + /* create thread to perform I/O and trigger callback */
> + ret = pthread_create(&tid, &attr, alt_lio_thread, tmp_item);
> + if(ret != 0)
> + {
> + free(tmp_item);
> + errno = ret;
> + return(-1);
> + }
> +
> + return(0);
> +}
> +
> +/* prototypes for pread and pwrite; _XOPEN_SOURCE causes db.h problems */
> +ssize_t pread(int fd, void *buf, size_t count, off_t offset);
> +ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
> +static void* alt_lio_thread(void* foo)
> +{
> + struct alt_aio_item* tmp_item = (struct alt_aio_item*)foo;
> + int ret = 0;
> +
> + if(tmp_item->cb_p->aio_lio_opcode == LIO_READ)
> + {
> + ret = pread(tmp_item->cb_p->aio_fildes,
> + (void*)tmp_item->cb_p->aio_buf,
> + tmp_item->cb_p->aio_nbytes,
> + tmp_item->cb_p->aio_offset);
> + }
> + else if(tmp_item->cb_p->aio_lio_opcode == LIO_WRITE)
> + {
> + ret = pwrite(tmp_item->cb_p->aio_fildes,
> + (const void*)tmp_item->cb_p->aio_buf,
> + tmp_item->cb_p->aio_nbytes,
> + tmp_item->cb_p->aio_offset);
> + }
> + else
> + {
> + /* this should have been caught already */
> + assert(0);
> + }
> +
> + /* store error and return codes */
> + if(ret < 0)
> + {
> + tmp_item->cb_p->__error_code = errno;
> + }
> + else
> + {
> + tmp_item->cb_p->__error_code = 0;
> + tmp_item->cb_p->__return_value = ret;
> + }
> +
> + /* run callback fn */
> + tmp_item->sig->sigev_notify_function(
> + tmp_item->sig->sigev_value);
> +
> + free(tmp_item);
> +
> + return(NULL);
> +}
> +
> /*
> * Local variables:
> * c-indent-level: 4
> diff -Naur pvfs2/src/io/trove/trove.c pvfs2-new/src/io/trove/trove.c
> --- pvfs2/src/io/trove/trove.c 2006-06-16 23:01:13.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove.c 2006-08-03 21:55:56.000000000 +0200
> @@ -30,6 +30,7 @@
> struct PINT_perf_counter* PINT_server_pc = NULL;
>
> int TROVE_db_cache_size_bytes = 0;
> +int TROVE_alt_aio_mode = 0;
> int TROVE_shm_key_hint = 0;
>
> /** Initiate reading from a contiguous region in a bstream into a
> @@ -964,6 +965,11 @@
> TROVE_shm_key_hint = *((int*)parameter);
> return(0);
> }
> + if(option == TROVE_ALT_AIO_MODE)
> + {
> + TROVE_alt_aio_mode = *((int*)parameter);
> + return(0);
> + }
>
> method_id = map_coll_id_to_method(coll_id);
> if (method_id < 0) {
> diff -Naur pvfs2/src/io/trove/trove.h pvfs2-new/src/io/trove/trove.h
> --- pvfs2/src/io/trove/trove.h 2006-07-13 07:11:41.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove.h 2006-08-03 21:56:23.000000000 +0200
> @@ -72,6 +72,7 @@
> TROVE_COLLECTION_ATTR_CACHE_MAX_NUM_ELEMS,
> TROVE_COLLECTION_ATTR_CACHE_INITIALIZE,
> TROVE_DB_CACHE_SIZE_BYTES,
> + TROVE_ALT_AIO_MODE,
> TROVE_COLLECTION_COALESCING_HIGH_WATERMARK,
> TROVE_COLLECTION_COALESCING_LOW_WATERMARK,
> TROVE_COLLECTION_META_SYNC_MODE,
> diff -Naur pvfs2/src/server/pvfs2-server.c pvfs2-new/src/server/pvfs2-server.c
> --- pvfs2/src/server/pvfs2-server.c 2006-07-13 07:11:42.000000000 +0200
> +++ pvfs2-new/src/server/pvfs2-server.c 2006-08-03 21:54:02.000000000 +0200
> @@ -950,6 +950,10 @@
> &server_config.db_cache_size_bytes);
> /* this should never fail */
> assert(ret == 0);
> + ret = trove_collection_setinfo(0, 0, TROVE_ALT_AIO_MODE,
> + &server_config.trove_alt_aio_mode);
> + /* this should never fail */
> + assert(ret == 0);
>
> /* parse port number and allow trove to use it to help differentiate
> * shmem regions if needed
>