[Pvfs2-developers] patches: misc. bug fixes

Sam Lang slang at mcs.anl.gov
Fri Aug 11 15:29:11 EDT 2006


Hi Phil,

Thanks for catching these.  Its nice to be able to just apply and  
commit them to trunk.  Related to the flow cancel codepath, we've  
been seeing a bug where the flow gets cancelled after a while, which  
seems like it might be due to the default timeout values being set to  
30 seconds.  Have you guys seen anything like this in your tests?

Thanks,

-sam

On Aug 10, 2006, at 3:55 PM, Phil Carns wrote:

> bmi-socket-close.patch:
> -----------------------
> This fixes a bug in the new BMI_set_info(...  
> BMI_TCP_CLOSE_SOCKET ...) mechanism, which is used to reconnect the  
> socket to the initial configuration server if new socket buffer  
> sizes are specified in the config file.  I didn't follow the code  
> path find the exact problem, but at a high level it wasn't being  
> thorough enough in cleaning out the old socket.  This showed up  
> when using epoll and specifying socket buffer sizes in the server  
> configuration- in this case the client will often fail to mount  
> with a cryptic "not a directory" error and leave some epoll()  
> errors in the pvfs2-client.log file.  I think a stale (or possibly  
> reused) file descriptor was being left in the epoll fd set.  At any  
> rate, the fix is to use a different set of functions for tearing  
> down the entire address etc. so that it is reconnected on the next  
> BMI addr lookup.  This path is already used by the server to  
> discard old BMI addresses after critical errors on addresses that  
> cannot be reconnected.  It is triggered from bmi.c without entering  
> the bmi_tcp module, so this patch also adds a check to make sure we  
> don't bother for non-tcp methods.
>
> bmi-test-overflow.patch:
> ------------------------
> One of the bmi bandwidth test programs was using types that might  
> overflow if testing large enough transfers.  The fix is to convert  
> to doubles and drop in several type casts to be cautious when  
> performing the computation that was causing trouble.
>
> cancel-bugs.patch:
> -------------------------
> The biggest fix here is a change to the job timer code.  It was  
> performing some pointer operations in the wrong order, which could  
> lead to job timers failing to trigger in some cases. This would  
> prevent some operations from ever timing out.  A secondary fix is a  
> minor cleanup in BMI to catch potential race conditions in  
> cancellation where a lock wasn't being held while checking to see  
> if the target operation is complete.
>
> flow-post-error.patch:
> -------------------------
> This patch adds checks in the client side I/O state machine to test  
> for failure at post time for flow operations.  This type of error  
> is uncommon unless the flow parameters are faulty, but it should  
> have checked anyway to be safe.
>
> ndfile-config-check.patch:
> --------------------------
> This is a safety test.  The problem here is that there was no  
> bounds checking for the DefaultNumDFiles option in the config  
> file.  This made it possible to select -1 (which in PVFS1 meant  
> "use the default number").  In PVFS2 this number gets passed  
> verbatim to the client and would cause malloc failures and various  
> other odd results when used. The patch just checks at parse time to  
> make sure the value isn't negative.
>
> -Phil
> Index: pvfs2_src/src/io/bmi/bmi.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi.c	(revision 2347)
> +++ pvfs2_src/src/io/bmi/bmi.c	(revision 2348)
> @@ -1179,15 +1179,22 @@
>  	return(0);
>      }
>
> -    /*
> -     * Pass the TCP address structure as the parameter for this  
> operation,
> -     * holding the lock.
> -     */
> -    if (option == BMI_TCP_CLOSE_SOCKET) {
> -        inout_parameter = tmp_ref->method_addr;
> -        ret = tmp_ref->interface->BMI_meth_set_info(option,  
> inout_parameter);
> +    /* if the caller requests a TCP specific close socket action */
> +    if (option == BMI_TCP_CLOSE_SOCKET)
> +    {
> +        /* check to see if the address is in fact a tcp address */
> +        if(strcmp(tmp_ref->interface->method_name, "bmi_tcp") == 0)
> +        {
> +            /* take the same action as in the BMI_DEC_ADDR_REF  
> case to clean
> +             * out the entire address structure and anything  
> linked to it so
> +             * that the next addr_lookup starts from scratch
> +             */
> +	    gossip_debug(GOSSIP_BMI_DEBUG_CONTROL, "Closing bmi_tcp  
> connection at caller's request.\n");
> +            ref_list_rem(cur_ref_list, addr);
> +            dealloc_ref_st(tmp_ref);
> +        }
>          gen_mutex_unlock(&ref_mutex);
> -        return ret;
> +        return 0;
>      }
>
>      gen_mutex_unlock(&ref_mutex);
> Index: pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c	(revision 2347)
> +++ pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c	(revision 2348)
> @@ -677,19 +677,10 @@
>              tcp_method_params.listen_addr->method_data)->socket);
>  #endif
>         break;
> -    /*
> -     * Used after changing buffer sizes to force the connection to
> -     * the server used for the config to close so it can be reopened
> -     * with the new buffer sizes as the rest of the future server
> -     * connections will be.
> -     */
> -    case BMI_TCP_CLOSE_SOCKET: {
> -        struct method_addr *map = inout_parameter;
> -        if (tcp_socket_collection_p)
> -            BMI_socket_collection_remove(tcp_socket_collection_p,  
> map);
> -        ret = tcp_shutdown_addr(map);
> +    case BMI_TCP_CLOSE_SOCKET:
> +        /* this should no longer make it to the bmi_tcp method;  
> see bmi.c */
> +        ret = 0;
>          break;
> -    }
>      case BMI_FORCEFUL_CANCEL_MODE:
>  	forceful_cancel_mode = 1;
>  	ret = 0;
> Index: pvfs2_src/test/io/bmi/driver-bw-multi.c
> ===================================================================
> --- pvfs2_src/test/io/bmi/driver-bw-multi.c	(revision 2290)
> +++ pvfs2_src/test/io/bmi/driver-bw-multi.c	(revision 2291)
> @@ -75,7 +75,7 @@
>      double stddev_bmi_time, stddev_mpi_time;
>      double agg_bmi_bw, agg_mpi_bw;
>      double ave_bmi_time, ave_mpi_time;
> -    int total_data_xfer = 0;
> +    double total_data_xfer = 0;
>      bmi_context_id context = -1;
>
>      /* start up benchmark environment */
> @@ -315,8 +315,8 @@
>       */
>      if (world_rank == 0)
>      {
> -	total_data_xfer = opts.num_servers * num_clients * num_messages *
> -	    opts.message_len;
> +	total_data_xfer = (double)opts.num_servers * (double)num_clients *
> +            (double)num_messages * (double)opts.message_len;
>  	if (opts.num_servers > 1)
>  	{
>  	    var_bmi_time = sumsq_bmi_time -
> @@ -352,8 +352,8 @@
>
>      if (world_rank == opts.num_servers)
>      {
> -	total_data_xfer = opts.num_servers * num_clients * num_messages *
> -	    opts.message_len;
> +	total_data_xfer = (double)opts.num_servers * (double)num_clients *
> +            (double)num_messages * (double)opts.message_len;
>
>  	if (num_clients > 1)
>  	{
> Index: pvfs2_src/src/io/bmi/bmi.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi.c	(revision 2089)
> +++ pvfs2_src/src/io/bmi/bmi.c	(revision 2090)
> @@ -1616,6 +1616,14 @@
>                   "%s: cancel id %llu\n", __func__, llu(id));
>
>      target_op = id_gen_safe_lookup(id);
> +    if(target_op == NULL)
> +    {
> +        /* if we can't find the operation, then assume it has already
> +         * completed naturally.
> +         */
> +        return(0);
> +    }
> +
>      assert(target_op->op_id == id);
>
>      if(active_method_table[target_op->addr->method_type]- 
> >BMI_meth_cancel)
> Index: pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c	(revision 2089)
> +++ pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c	(revision 2090)
> @@ -1451,12 +1451,20 @@
>   */
>  int BMI_tcp_cancel(bmi_op_id_t id, bmi_context_id context_id)
>  {
> -    method_op_p query_op = (method_op_p)id_gen_safe_lookup(id);
> +    method_op_p query_op = NULL;
> +
> +    gen_mutex_lock(&interface_mutex);
>
> -    assert(query_op);
> +    query_op = (method_op_p)id_gen_safe_lookup(id);
> +    if(!query_op)
> +    {
> +        /* if we can't find the operattion, then assume that it  
> has already
> +         * completed naturally
> +         */
> +        gen_mutex_unlock(&interface_mutex);
> +        return(0);
> +    }
>
> -    gen_mutex_lock(&interface_mutex);
> -
>      /* easy case: is the operation already completed? */
>      if(((struct tcp_op*)(query_op->method_data))->tcp_op_state ==
>  	BMI_TCP_COMPLETE)
> Index: pvfs2_src/src/io/job/job-time-mgr.c
> ===================================================================
> --- pvfs2_src/src/io/job/job-time-mgr.c	(revision 2089)
> +++ pvfs2_src/src/io/job/job-time-mgr.c	(revision 2090)
> @@ -118,8 +118,8 @@
>              break;
>          }
>
> +        prev_bucket = tmp_bucket;
>          tmp_bucket = NULL;
> -        prev_bucket = tmp_bucket;
>      }
>
>      if(!tmp_bucket || tmp_bucket->expire_time_sec != expire_time_sec)
> @@ -257,6 +257,7 @@
>  	    case JOB_BMI:
>  		gossip_err("Job time out: cancelling bmi operation, job_id: %llu. 
> \n", llu(jd->job_id));
>  		ret = job_bmi_cancel(jd->job_id, jd->context_id);
> +	        jd->time_bucket = NULL;
>  		break;
>  	    case JOB_FLOW:
>  		/* have we made any progress since last time we checked? */
> @@ -274,22 +275,23 @@
>  		    /* otherwise kill the flow */
>  		    gossip_err("Job time out: cancelling flow operation, job_id:  
> %llu.\n", llu(jd->job_id));
>  		    ret = job_flow_cancel(jd->job_id, jd->context_id);
> +	            jd->time_bucket = NULL;
>  		}
>  		break;
>  	    case JOB_TROVE:
>  		gossip_err("Job time out: cancelling trove operation, job_id: % 
> llu.\n", llu(jd->job_id));
>  		ret = job_trove_dspace_cancel(
>                      jd->u.trove.fsid, jd->job_id, jd->context_id);
> +	        jd->time_bucket = NULL;
>                  break;
>  	    default:
>  		ret = 0;
> +	        jd->time_bucket = NULL;
>  		break;
>  	    }
>
>  	    /* FIXME: error handling */
>  	    assert(ret == 0);
> -
> -	    jd->time_bucket = NULL;
>  	}
>  	qlist_del(&tmp_bucket->bucket_link);
>          INIT_QLIST_HEAD(&tmp_bucket->jd_queue);
> Index: pvfs2_src/src/client/sysint/sys-io.sm
> ===================================================================
> --- pvfs2_src/src/client/sysint/sys-io.sm	(revision 2067)
> +++ pvfs2_src/src/client/sysint/sys-io.sm	(revision 2068)
> @@ -1048,12 +1048,30 @@
>          sm_p->u.io.flow_completion_count--;
>          assert(sm_p->u.io.flow_completion_count > -1);
>
> -        /* if error, restart; but if this is a write, let write  
> ack catch */
> +        /* look for flow error when no write ack is in progress  
> (usually a
> +         * read case)
> +         */
>          if (js_p->error_code < 0 && !cur_ctx- 
> >write_ack_in_progress) {
> -            gossip_debug(GOSSIP_IO_DEBUG,
> -              "%s: flow failed, retrying from msgpair\n", __func__);
> -            cur_ctx->msg_send_has_been_posted = 0;
> -            cur_ctx->msg_recv_has_been_posted = 0;
> +            if ((PVFS_ERROR_CLASS(-js_p->error_code) ==  
> PVFS_ERROR_BMI) ||
> +                 (PVFS_ERROR_CLASS(-js_p->error_code) ==  
> PVFS_ERROR_FLOW) ||
> +                 (js_p->error_code == -ECONNRESET) ||
> +                 (js_p->error_code == -PVFS_EPROTO))
> +            {
> +                /* if this is a an error that we can retry */
> +                gossip_debug(GOSSIP_IO_DEBUG,
> +                   "%s: flow failed, retrying from msgpair\n",  
> __func__);
> +                cur_ctx->msg_send_has_been_posted = 0;
> +                cur_ctx->msg_recv_has_been_posted = 0;
> +            }
> +            else
> +            {
> +                /* do not retry on remaining error codes */
> +                gossip_debug(GOSSIP_IO_DEBUG,
> +                   "%s: flow failed, not retrying\n", __func__);
> +
> +                /* forcing the count high insures that the sm  
> won't restart */
> +                sm_p->u.io.retry_count = sm_p- 
> >msgarray_params.retry_limit;
> +            }
>          }
>      }
>      else if (STATUS_USER_TAG_TYPE(status_user_tag,  
> IO_SM_PHASE_FINAL_ACK))
> @@ -1741,27 +1759,47 @@
>          server_config->client_job_flow_timeout);
>      PINT_put_server_config_struct(server_config);
>
> -    if (ret < 0)
> +    /* if the flow fails immediately, then we have to do some special
> +     * handling.  This function is not equiped to handle the failure
> +     * directly, so we instead post a null job that will propigate  
> the error
> +     * to the normal state where we interpret flow errors
> +     */
> +    if((ret < 0) || (ret == 1 && cur_ctx->flow_status.error_code ! 
> = 0))
>      {
> -        gossip_err("job_flow failed\n");
> -        return ret;
> +        /* make sure the error code is stored in the flow  
> descriptor */
> +        if(ret == 1)
> +        {
> +            cur_ctx->flow_desc.error_code = cur_ctx- 
> >flow_status.error_code;
> +        }
> +        else
> +        {
> +            cur_ctx->flow_desc.error_code = ret;
> +        }
> +
> +        gossip_debug(GOSSIP_IO_DEBUG, "  immediate flow failure for "
> +                     "context %p, error code: %d\n", cur_ctx,
> +                     cur_ctx->flow_desc.error_code);
> +        gossip_debug(GOSSIP_IO_DEBUG, "  posting job_null() to  
> propigate.\n");
> +
> +        /* post a fake job to propigate the flow failure to a  
> later state */
> +        ret = job_null(cur_ctx->flow_desc.error_code, sm_p,
> +            status_user_tag, &cur_ctx->flow_status,
> +            &cur_ctx->flow_job_id, pint_client_sm_context);
> +        if(ret !=0)
> +        {
> +            return(ret);
> +        }
>      }
> -    else if (ret == 1)
> -    {
> -        gossip_debug(GOSSIP_IO_DEBUG, "  flow for context %p "
> -                     "completed immediately\n", cur_ctx);
> -        assert(cur_ctx->flow_status.error_code == 0);
> -    }
>      else
>      {
>          gossip_debug(GOSSIP_IO_DEBUG, "  posted flow for "
>                       "context %p\n", cur_ctx);
> -
> -        cur_ctx->flow_has_been_posted = 1;
> -        cur_ctx->flow_in_progress = 1;
> -        sm_p->u.io.flow_completion_count++;
>      }
>
> +    cur_ctx->flow_has_been_posted = 1;
> +    cur_ctx->flow_in_progress = 1;
> +    sm_p->u.io.flow_completion_count++;
> +
>      return 0;
>  }
>
> Index: pvfs2_src/src/common/misc/server-config.c
> ===================================================================
> --- pvfs2_src/src/common/misc/server-config.c	(revision 2273)
> +++ pvfs2_src/src/common/misc/server-config.c	(revision 2274)
> @@ -1871,6 +1871,10 @@
>          PINT_llist_head(config_s->file_systems);
>
>      fs_conf->default_num_dfiles = (int)cmd->data.value;
> +    if(fs_conf->default_num_dfiles < 0)
> +    {
> +        return("Error DefaultNumDFiles must be positive.\n");
> +    }
>      return NULL;
>  }
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



More information about the Pvfs2-developers mailing list