[Pvfs2-developers] patch: alternate AIO implementation

Sam Lang slang at mcs.anl.gov
Thu Aug 10 14:52:44 EDT 2006


This differs quite a bit from what Julian did, I guess.  You've
essentially rewritten the lio_listio call?  Does it make sense to
factor that out so that it could be used as a separate implementation
by others (LD_PRELOAD=libfastaio.so :-))?  I guess the only advantage
pvfs would gain from that would be performance comparisons of the two
lio_listio implementations using all the aio tests already out there.
A disadvantage might be that you can't switch between one and the
other dynamically at runtime.

Julian's implementation changes the actual dbpf_bstream_rw_list call,
so it's one level up from yours.  With the performance gains you see
from spawning a thread on demand instead of pooling, maybe it's time
to get rid of the dbpf queue altogether and start spawning a thread
per trove operation?  The advantage we get from queuing is that we
can do our own scheduling, although your numbers are impressive, so
maybe that's not such a good idea.

-sam

On Aug 10, 2006, at 3:37 PM, Phil Carns wrote:

> Background:
>
> We have been a little suspicious of the posix aio performance on some
> of our servers. After digging in the glibc code a little, we found a
> possible problem. Glibc's aio will spawn up to 16 threads by default,
> but will never assign more than a single thread to a given fd. That
> thread will then service all operations on that fd sequentially using
> a FIFO queue. This means that if several clients are performing I/O to
> the same datafile, then all of their I/O requests get pushed to the
> disk sequentially (and probably not in order by offset).
>
> Patch:
>
> This patch replaces the lio_listio() calls with a macro called
> LIO_LISTIO(). You can then toggle what this macro does by using a
> config file option "TroveAltAIOMode yes|no". If the option is not
> specified (or is set to no) then the normal code path is taken. If the
> option is enabled, then it looks at the arguments. If the operation is
> a single buffer read or write, then it immediately spawns a new
> detached thread, services the operation using p{read/write}, triggers
> a callback function, and exits. More complex operations are sent to
> the usual lio_listio() route.
>
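
For reference, enabling the option might look like the fragment below.
This is an assumption drawn from the patch itself: TroveAltAIOMode is
registered with CTX_DEFAULTS|CTX_GLOBAL, so it presumably belongs in the
Defaults section of the server config file. Other settings in that
section are omitted here.

```
<Defaults>
    TroveAltAIOMode yes
</Defaults>
```
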
> The idea is basically to get the requests off to the kernel as quickly
> as possible without queueing so that the kernel can sort out how to
> best service them. Trove doesn't care about ordering at that level.
>
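
A minimal standalone sketch of the thread-per-operation idea described
above, not the patch itself. The names io_req, submit_io, run_and_wait
and demo_roundtrip are hypothetical; io_req stands in for the aiocb and
the done callback stands in for sigev_notify_function. The waiter is
only there so the demo can block until the detached thread finishes.

```c
#define _XOPEN_SOURCE 700   /* pread, pwrite, mkstemp declarations */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request descriptor (stand-in for struct aiocb). */
struct io_req {
    int fd;
    void *buf;
    size_t len;
    off_t off;
    int is_write;
    ssize_t result;                 /* bytes transferred, or -1 */
    int error;                      /* errno value, or 0 */
    void (*done)(struct io_req *);  /* completion callback */
    void *user;                     /* callback context */
};

/* Thread body: one blocking pread/pwrite, then the callback. */
static void *io_thread(void *arg)
{
    struct io_req *r = arg;
    r->result = r->is_write ? pwrite(r->fd, r->buf, r->len, r->off)
                            : pread(r->fd, r->buf, r->len, r->off);
    r->error = (r->result < 0) ? errno : 0;
    r->done(r);
    return NULL;
}

/* Spawn a detached thread for a single request; 0 on success. */
static int submit_io(struct io_req *r)
{
    pthread_t tid;
    pthread_attr_t attr;
    int ret = pthread_attr_init(&attr);
    if (ret != 0) { errno = ret; return -1; }
    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (ret == 0)
        ret = pthread_create(&tid, &attr, io_thread, r);
    pthread_attr_destroy(&attr);
    if (ret != 0) { errno = ret; return -1; }
    return 0;
}

/* Demo scaffolding: lets the caller block until the callback fires. */
struct waiter { pthread_mutex_t mu; pthread_cond_t cv; int fired; };

static void wake(struct io_req *r)
{
    struct waiter *w = r->user;
    pthread_mutex_lock(&w->mu);
    w->fired = 1;
    pthread_cond_signal(&w->cv);
    pthread_mutex_unlock(&w->mu);
}

static int run_and_wait(struct io_req *r)
{
    struct waiter w = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
    r->done = wake;
    r->user = &w;
    if (submit_io(r) != 0) return -1;
    pthread_mutex_lock(&w.mu);
    while (!w.fired) pthread_cond_wait(&w.cv, &w.mu);
    pthread_mutex_unlock(&w.mu);
    return r->error ? -1 : 0;
}

/* Write a short string to a temp file and read it back. */
static int demo_roundtrip(void)
{
    char path[] = "/tmp/altaio_XXXXXX";
    char out[] = "hello trove";
    char in[sizeof out] = {0};
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);
    struct io_req w = { .fd = fd, .buf = out, .len = sizeof out, .is_write = 1 };
    struct io_req r = { .fd = fd, .buf = in, .len = sizeof in, .is_write = 0 };
    int ok = run_and_wait(&w) == 0 && w.result == (ssize_t)sizeof out
          && run_and_wait(&r) == 0 && r.result == (ssize_t)sizeof in
          && memcmp(in, out, sizeof out) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

Calling demo_roundtrip() from main() exercises one write and one read.
Older toolchains may need -pthread at compile and link time. A real
caller would not block in run_and_wait; the callback would hand the
completed request back to whatever is tracking it.
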
> Drawbacks:
>
> - This option/implementation is only reasonable for systems with NPTL,
> because of the low thread spawning overhead. Non-NPTL systems will
> probably find the cost to be higher. As a side note, we tried an
> implementation that kept a pool of threads and sent operations to
> those threads, but we found that the overhead of synchronization and
> signaling in this approach was (surprisingly) much higher than the
> cost of just creating a brand new thread, which requires no
> synchronization, for every operation.
> - This implementation only helps contiguous reads or writes as they
> appear to Trove. You could extend it to work for other patterns by
> just doing a series of preads and pwrites to work down the list of
> buffers, but we did not handle this case.
>
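
The extension mentioned in the second drawback could look roughly like
the sketch below for the read side: walk the list and issue one pread
per buffer. The io_seg struct and pread_segments are hypothetical names
for illustration, and a pwrite variant would mirror it. In the
alternate AIO path this loop would run inside the spawned thread.

```c
#define _XOPEN_SOURCE 700   /* pread, mkstemp declarations */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one contiguous piece of a list I/O. */
struct io_seg {
    void *buf;    /* destination buffer */
    size_t len;   /* bytes wanted */
    off_t off;    /* file offset */
};

/* Read every segment in order. Returns the total number of bytes read,
 * which is short if EOF is hit, or -1 with errno set on error. */
static ssize_t pread_segments(int fd, const struct io_seg *segs, int nseg)
{
    ssize_t total = 0;
    for (int i = 0; i < nseg; i++) {
        size_t done = 0;
        while (done < segs[i].len) {
            ssize_t n = pread(fd, (char *)segs[i].buf + done,
                              segs[i].len - done,
                              segs[i].off + (off_t)done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, retry */
                return -1;
            }
            if (n == 0)
                return total;       /* EOF */
            done += (size_t)n;
            total += n;
        }
    }
    return total;
}
```

The requests still reach the kernel one at a time within a single list
operation, so this only keeps the queueing out of the picture across
operations, not within one.
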
> Results:
>
> We didn't see a big gain from this approach at first, but since then
> we have taken care of some other bottlenecks that make the improvement
> more obvious. It also seems that the performance boost varies quite a
> bit depending on the type of system you run it on. We have some new
> servers (results shown below) that benefited greatly from this
> optimization.
>
> The numbers below show the results from a setup with 16 servers and a
> variable number of clients and number of processes per client. The
> benchmark performs a read-only access pattern with 100 MB buffers.
> All clients access the same 40 GB file (we rotate among several files
> to avoid caching). The file is divided into contiguous regions, one
> per process. We are using local hardware raid at each server, and
> gigabit ethernet for communication.
>
> Before optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
>
> 1 x 1 - 97.8
> 1 x 2 - 110.4
> 1 x 5 - 111.1
> 12 x 1 - 195.8
> 12 x 2 - 138.8
> 25 x 1 - 160.4
> 25 x 2 - 178.0
>
> After optimization:
> client nodes x processes per node - MB/s aggregate throughput
> --------------------------------------------------------------
> 1 x 1 - 93.4
> 1 x 2 - 109.2
> 1 x 5 - 108.9
> 12 x 1 - 443.1
> 12 x 2 - 502.6
> 25 x 1 - 496.7
> 25 x 2 - 550.7
>
> To confirm the cause of the problem, we performed a variation on the
> test where each client read an independent file, rather than the
> clients all hitting the same file. Running this benchmark with 12
> client nodes (one process per node) resulted in a consistent 430 MB/s
> of aggregate throughput regardless of whether the new AIO path was
> used or not. This seems to confirm that the problem is a result of the
> sequential queueing that the normal AIO implementation does when
> multiple requests hit the same file.
>
> For these particular machines we were able to double or triple the
> read throughput for a parallel application that shared one large file.
> I am fairly sure that not all of our machines demonstrate this problem
> to such a drastic degree, but we will probably be testing some other
> setups later to get a better idea.
>
> -Phil
> diff -Naur pvfs2/src/common/misc/server-config.c pvfs2-new/src/common/misc/server-config.c
> --- pvfs2/src/common/misc/server-config.c	2006-08-02 17:13:00.000000000 +0200
> +++ pvfs2-new/src/common/misc/server-config.c	2006-08-03 21:57:35.000000000 +0200
> @@ -71,6 +71,7 @@
>  static DOTCONF_CB(get_attr_cache_size);
>  static DOTCONF_CB(get_attr_cache_max_num_elems);
>  static DOTCONF_CB(get_trove_sync_meta);
> +static DOTCONF_CB(get_trove_alt_aio);
>  static DOTCONF_CB(get_trove_sync_data);
>  static DOTCONF_CB(get_db_cache_size_bytes);
>  static DOTCONF_CB(get_db_cache_type);
> @@ -656,6 +657,12 @@
>      {"DBCacheType", ARG_STR, get_db_cache_type, NULL,
>          CTX_STORAGEHINTS, "sys"},
>
> +    /* enable alternate AIO implementation for certain types of I/O
> +     * operations (experimental)
> +     */
> +    {"TroveAltAIOMode",ARG_STR, get_trove_alt_aio, NULL,
> +        CTX_DEFAULTS|CTX_GLOBAL,"no"},
> +
>      /* Specifies the format of the date/timestamp that events will have
>       * in the event log.  Possible values are:
>       *
> @@ -1478,6 +1485,28 @@
>      return NULL;
>  }
>
> +DOTCONF_CB(get_trove_alt_aio)
> +{
> +    struct server_configuration_s *config_s =
> +        (struct server_configuration_s *)cmd->context;
> +
> +    if(strcasecmp(cmd->data.str, "yes") == 0)
> +    {
> +        config_s->trove_alt_aio_mode = 1;
> +    }
> +    else if(strcasecmp(cmd->data.str, "no") == 0)
> +    {
> +        config_s->trove_alt_aio_mode = 0;
> +    }
> +    else
> +    {
> +        return("TroveAltAIOMode value must be 'yes' or 'no'.\n");
> +    }
> +
> +    return NULL;
> +}
> +
> +
>  DOTCONF_CB(get_trove_sync_meta)
>  {
>      struct filesystem_configuration_s *fs_conf = NULL;
> diff -Naur pvfs2/src/common/misc/server-config.h pvfs2-new/src/common/misc/server-config.h
> --- pvfs2/src/common/misc/server-config.h	2006-07-13 07:11:40.000000000 +0200
> +++ pvfs2-new/src/common/misc/server-config.h	2006-08-03 21:58:25.000000000 +0200
> @@ -146,7 +146,10 @@
>      int db_cache_size_bytes;        /* cache size to use in berkeley db
>                                         if zero, use defaults */
>      char * db_cache_type;
> -
> +    int trove_alt_aio_mode;         /* enables experimental alternative AIO
> +                                     * implementation for some types of
> +                                     * operations
> +                                     */
>  } server_configuration_s;
>
>  int PINT_parse_config(
> diff -Naur pvfs2/src/io/trove/trove-dbpf/dbpf-bstream.c pvfs2-new/src/io/trove/trove-dbpf/dbpf-bstream.c
> --- pvfs2/src/io/trove/trove-dbpf/dbpf-bstream.c	2006-06-23 22:59:29.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove-dbpf/dbpf-bstream.c	2006-08-03 21:55:22.000000000 +0200
> @@ -73,6 +73,41 @@
>  static int dbpf_bstream_flush_op_svc(struct dbpf_op *op_p);
>  static int dbpf_bstream_resize_op_svc(struct dbpf_op *op_p);
>
> +struct alt_aio_item
> +{
> +    struct aiocb *cb_p;
> +    struct sigevent *sig;
> +    struct qlist_head list_link;
> +};
> +static int alt_lio_listio(int mode, struct aiocb *list[],
> +    int nent, struct sigevent *sig);
> +static void* alt_lio_thread(void*);
> +extern int TROVE_alt_aio_mode;
> +
> +
> +#ifdef __PVFS2_TROVE_AIO_THREADED__
> +/* allow bypassing default lio_listio implementation if user requests it and
> + * some conditions are met
> + */
> +static inline int LIO_LISTIO(int mode, struct aiocb *list[],
> +    int nent, struct sigevent *sig)
> +{
> +    if((TROVE_alt_aio_mode) && (nent == 1) &&
> +        (((list[0])->aio_lio_opcode == LIO_READ) ||
> +        ((list[0])->aio_lio_opcode == LIO_WRITE)) &&
> +        (mode == LIO_NOWAIT))
> +    {
> +        return(alt_lio_listio(mode, list, nent, sig));
> +    }
> +    else
> +    {
> +        return(lio_listio(mode, list, nent, sig));
> +    }
> +}
> +#else
> +#define LIO_LISTIO lio_listio
> +#endif
> +
>  #ifdef __PVFS2_TROVE_AIO_THREADED__
>  #include "dbpf-thread.h"
>  #include "pvfs2-internal.h"
> @@ -321,7 +356,7 @@
>              }
>          }
>
> -        ret = lio_listio(LIO_NOWAIT, aiocb_ptr_array, aiocb_inuse_count,
> +        ret = LIO_LISTIO(LIO_NOWAIT, aiocb_ptr_array, aiocb_inuse_count,
>                           &cur_op->op.u.b_rw_list.sigev);
>
>          if (ret != 0)
> @@ -423,7 +458,7 @@
>              }
>          }
>
> -        ret = lio_listio(LIO_NOWAIT, aiocb_ptr_array,
> +        ret = LIO_LISTIO(LIO_NOWAIT, aiocb_ptr_array,
>                           aiocb_inuse_count, sig);
>          if (ret != 0)
>          {
> @@ -1337,6 +1372,108 @@
>      dbpf_bstream_flush
>  };
>
> +int alt_lio_listio(int mode, struct aiocb *list[],
> +    int nent, struct sigevent *sig)
> +{
> +    struct alt_aio_item* tmp_item;
> +    int ret;
> +    pthread_t tid;
> +    pthread_attr_t attr;
> +
> +    /* alt_lio only supports a subset of the full lio functionality */
> +    /* NOTE: an earlier check is supposed to make sure that we don't invoke
> +     * this function for unsupported cases
> +     */
> +    assert(mode == LIO_NOWAIT);
> +    assert(nent == 1);
> +    assert((list[0]->aio_lio_opcode == LIO_READ) ||
> +        (list[0]->aio_lio_opcode == LIO_WRITE));
> +
> +    tmp_item = (struct alt_aio_item*)malloc(sizeof(struct alt_aio_item));
> +    if(!tmp_item)
> +    {
> +        /* preserve errno */
> +        return(-1);
> +    }
> +    tmp_item->cb_p = list[0];
> +    tmp_item->sig = sig;
> +
> +    /* set detached state */
> +    ret = pthread_attr_init(&attr);
> +    if(ret != 0)
> +    {
> +        free(tmp_item);
> +        errno = ret;
> +        return(-1);
> +    }
> +    ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
> +    if(ret != 0)
> +    {
> +        free(tmp_item);
> +        errno = ret;
> +        return(-1);
> +    }
> +
> +    /* create thread to perform I/O and trigger callback */
> +    ret = pthread_create(&tid, &attr, alt_lio_thread, tmp_item);
> +    if(ret != 0)
> +    {
> +        free(tmp_item);
> +        errno = ret;
> +        return(-1);
> +    }
> +
> +    return(0);
> +}
> +
> +/* prototypes for pread and pwrite; _XOPEN_SOURCE causes db.h problems */
> +ssize_t pread(int fd, void *buf, size_t count, off_t offset);
> +ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
> +static void* alt_lio_thread(void* foo)
> +{
> +    struct alt_aio_item* tmp_item = (struct alt_aio_item*)foo;
> +    ssize_t ret = 0;
> +
> +    if(tmp_item->cb_p->aio_lio_opcode == LIO_READ)
> +    {
> +        ret = pread(tmp_item->cb_p->aio_fildes,
> +            (void*)tmp_item->cb_p->aio_buf,
> +            tmp_item->cb_p->aio_nbytes,
> +            tmp_item->cb_p->aio_offset);
> +    }
> +    else if(tmp_item->cb_p->aio_lio_opcode == LIO_WRITE)
> +    {
> +        ret = pwrite(tmp_item->cb_p->aio_fildes,
> +            (const void*)tmp_item->cb_p->aio_buf,
> +            tmp_item->cb_p->aio_nbytes,
> +            tmp_item->cb_p->aio_offset);
> +    }
> +    else
> +    {
> +        /* this should have been caught already */
> +        assert(0);
> +    }
> +
> +    /* store error and return codes */
> +    if(ret < 0)
> +    {
> +        tmp_item->cb_p->__error_code = errno;
> +    }
> +    else
> +    {
> +        tmp_item->cb_p->__error_code = 0;
> +        tmp_item->cb_p->__return_value = ret;
> +    }
> +
> +    /* run callback fn */
> +    tmp_item->sig->sigev_notify_function(
> +        tmp_item->sig->sigev_value);
> +
> +    free(tmp_item);
> +
> +    return(NULL);
> +}
> +
>  /*
>   * Local variables:
>   *  c-indent-level: 4
> diff -Naur pvfs2/src/io/trove/trove.c pvfs2-new/src/io/trove/trove.c
> --- pvfs2/src/io/trove/trove.c	2006-06-16 23:01:13.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove.c	2006-08-03 21:55:56.000000000 +0200
> @@ -30,6 +30,7 @@
>  struct PINT_perf_counter* PINT_server_pc = NULL;
>
>  int TROVE_db_cache_size_bytes = 0;
> +int TROVE_alt_aio_mode = 0;
>  int TROVE_shm_key_hint = 0;
>
>  /** Initiate reading from a contiguous region in a bstream into a
> @@ -964,6 +965,11 @@
>          TROVE_shm_key_hint = *((int*)parameter);
>  	return(0);
>      }
> +    if(option == TROVE_ALT_AIO_MODE)
> +    {
> +        TROVE_alt_aio_mode = *((int*)parameter);
> +	return(0);
> +    }
>
>      method_id = map_coll_id_to_method(coll_id);
>      if (method_id < 0) {
> diff -Naur pvfs2/src/io/trove/trove.h pvfs2-new/src/io/trove/trove.h
> --- pvfs2/src/io/trove/trove.h	2006-07-13 07:11:41.000000000 +0200
> +++ pvfs2-new/src/io/trove/trove.h	2006-08-03 21:56:23.000000000 +0200
> @@ -72,6 +72,7 @@
>      TROVE_COLLECTION_ATTR_CACHE_MAX_NUM_ELEMS,
>      TROVE_COLLECTION_ATTR_CACHE_INITIALIZE,
>      TROVE_DB_CACHE_SIZE_BYTES,
> +    TROVE_ALT_AIO_MODE,
>      TROVE_COLLECTION_COALESCING_HIGH_WATERMARK,
>      TROVE_COLLECTION_COALESCING_LOW_WATERMARK,
>      TROVE_COLLECTION_META_SYNC_MODE,
> diff -Naur pvfs2/src/server/pvfs2-server.c pvfs2-new/src/server/pvfs2-server.c
> --- pvfs2/src/server/pvfs2-server.c	2006-07-13 07:11:42.000000000 +0200
> +++ pvfs2-new/src/server/pvfs2-server.c	2006-08-03 21:54:02.000000000 +0200
> @@ -950,6 +950,10 @@
>                                      &server_config.db_cache_size_bytes);
>      /* this should never fail */
>      assert(ret == 0);
> +    ret = trove_collection_setinfo(0, 0, TROVE_ALT_AIO_MODE,
> +        &server_config.trove_alt_aio_mode);
> +    /* this should never fail */
> +    assert(ret == 0);
>
>      /* parse port number and allow trove to use it to help differentiate
>       * shmem regions if needed
>


