[Pvfs2-developers] patches: misc. bug fixes
Sam Lang
slang at mcs.anl.gov
Fri Aug 11 15:29:11 EDT 2006
Hi Phil,
Thanks for catching these. Its nice to be able to just apply and
commit them to trunk. Related to the flow cancel codepath, we've
been seeing a bug where the flow gets cancelled after a while, which
seems like it might be due to the default timeout values being set to
30 seconds. Have you guys seen anything like this in your tests?
Thanks,
-sam
On Aug 10, 2006, at 3:55 PM, Phil Carns wrote:
> bmi-socket-close.patch:
> -----------------------
> This fixes a bug in the new BMI_set_info(...
> BMI_TCP_CLOSE_SOCKET ...) mechanism, which is used to reconnect the
> socket to the initial configuration server if new socket buffer
> sizes are specified in the config file. I didn't follow the code
> path find the exact problem, but at a high level it wasn't being
> thorough enough in cleaning out the old socket. This showed up
> when using epoll and specifying socket buffer sizes in the server
> configuration- in this case the client will often fail to mount
> with a cryptic "not a directory" error and leave some epoll()
> errors in the pvfs2-client.log file. I think a stale (or possibly
> reused) file descriptor was being left in the epoll fd set. At any
> rate, the fix is to use a different set of functions for tearing
> down the entire address etc. so that it is reconnected on the next
> BMI addr lookup. This path is already used by the server to
> discard old BMI addresses after critical errors on addresses that
> cannot be reconnected. It is triggered from bmi.c without entering
> the bmi_tcp module, so this patch also adds a check to make sure we
> don't bother for non-tcp methods.
>
> bmi-test-overflow.patch:
> ------------------------
> One of the bmi bandwidth test programs was using types that might
> overflow if testing large enough transfers. The fix is to convert
> to doubles and drop in several type casts to be cautious when
> performing the computation that was causing trouble.
>
> cancel-bugs.patch:
> -------------------------
> The biggest fix here is a change to the job timer code. It was
> performing some pointer operations in the wrong order, which could
> lead to job timers failing to trigger in some cases. This would
> prevent some operations from ever timing out. A secondary fix is a
> minor cleanup in BMI to catch potential race conditions in
> cancellation where a lock wasn't being held while checking to see
> if the target operation is complete.
>
> flow-post-error.patch:
> -------------------------
> This patch adds checks in the client side I/O state machine to test
> for failure at post time for flow operations. This type of error
> is uncommon unless the flow parameters are faulty, but it should
> have checked anyway to be safe.
>
> ndfile-config-check.patch:
> --------------------------
> This is a safety test. The problem here is that there was no
> bounds checking for the DefaultNumDFiles option in the config
> file. This made it possible to select -1 (which in PVFS1 meant
> "use the default number"). In PVFS2 this number gets passed
> verbatim to the client and would cause malloc failures and various
> other odd results when used. The patch just checks at parse time to
> make sure the value isn't negative.
>
> -Phil
> Index: pvfs2_src/src/io/bmi/bmi.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi.c (revision 2347)
> +++ pvfs2_src/src/io/bmi/bmi.c (revision 2348)
> @@ -1179,15 +1179,22 @@
> return(0);
> }
>
> - /*
> - * Pass the TCP address structure as the parameter for this
> operation,
> - * holding the lock.
> - */
> - if (option == BMI_TCP_CLOSE_SOCKET) {
> - inout_parameter = tmp_ref->method_addr;
> - ret = tmp_ref->interface->BMI_meth_set_info(option,
> inout_parameter);
> + /* if the caller requests a TCP specific close socket action */
> + if (option == BMI_TCP_CLOSE_SOCKET)
> + {
> + /* check to see if the address is in fact a tcp address */
> + if(strcmp(tmp_ref->interface->method_name, "bmi_tcp") == 0)
> + {
> + /* take the same action as in the BMI_DEC_ADDR_REF
> case to clean
> + * out the entire address structure and anything
> linked to it so
> + * that the next addr_lookup starts from scratch
> + */
> + gossip_debug(GOSSIP_BMI_DEBUG_CONTROL, "Closing bmi_tcp
> connection at caller's request.\n");
> + ref_list_rem(cur_ref_list, addr);
> + dealloc_ref_st(tmp_ref);
> + }
> gen_mutex_unlock(&ref_mutex);
> - return ret;
> + return 0;
> }
>
> gen_mutex_unlock(&ref_mutex);
> Index: pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c (revision 2347)
> +++ pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c (revision 2348)
> @@ -677,19 +677,10 @@
> tcp_method_params.listen_addr->method_data)->socket);
> #endif
> break;
> - /*
> - * Used after changing buffer sizes to force the connection to
> - * the server used for the config to close so it can be reopened
> - * with the new buffer sizes as the rest of the future server
> - * connections will be.
> - */
> - case BMI_TCP_CLOSE_SOCKET: {
> - struct method_addr *map = inout_parameter;
> - if (tcp_socket_collection_p)
> - BMI_socket_collection_remove(tcp_socket_collection_p,
> map);
> - ret = tcp_shutdown_addr(map);
> + case BMI_TCP_CLOSE_SOCKET:
> + /* this should no longer make it to the bmi_tcp method;
> see bmi.c */
> + ret = 0;
> break;
> - }
> case BMI_FORCEFUL_CANCEL_MODE:
> forceful_cancel_mode = 1;
> ret = 0;
> Index: pvfs2_src/test/io/bmi/driver-bw-multi.c
> ===================================================================
> --- pvfs2_src/test/io/bmi/driver-bw-multi.c (revision 2290)
> +++ pvfs2_src/test/io/bmi/driver-bw-multi.c (revision 2291)
> @@ -75,7 +75,7 @@
> double stddev_bmi_time, stddev_mpi_time;
> double agg_bmi_bw, agg_mpi_bw;
> double ave_bmi_time, ave_mpi_time;
> - int total_data_xfer = 0;
> + double total_data_xfer = 0;
> bmi_context_id context = -1;
>
> /* start up benchmark environment */
> @@ -315,8 +315,8 @@
> */
> if (world_rank == 0)
> {
> - total_data_xfer = opts.num_servers * num_clients * num_messages *
> - opts.message_len;
> + total_data_xfer = (double)opts.num_servers * (double)num_clients *
> + (double)num_messages * (double)opts.message_len;
> if (opts.num_servers > 1)
> {
> var_bmi_time = sumsq_bmi_time -
> @@ -352,8 +352,8 @@
>
> if (world_rank == opts.num_servers)
> {
> - total_data_xfer = opts.num_servers * num_clients * num_messages *
> - opts.message_len;
> + total_data_xfer = (double)opts.num_servers * (double)num_clients *
> + (double)num_messages * (double)opts.message_len;
>
> if (num_clients > 1)
> {
> Index: pvfs2_src/src/io/bmi/bmi.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi.c (revision 2089)
> +++ pvfs2_src/src/io/bmi/bmi.c (revision 2090)
> @@ -1616,6 +1616,14 @@
> "%s: cancel id %llu\n", __func__, llu(id));
>
> target_op = id_gen_safe_lookup(id);
> + if(target_op == NULL)
> + {
> + /* if we can't find the operation, then assume it has already
> + * completed naturally.
> + */
> + return(0);
> + }
> +
> assert(target_op->op_id == id);
>
> if(active_method_table[target_op->addr->method_type]-
> >BMI_meth_cancel)
> Index: pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c
> ===================================================================
> --- pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c (revision 2089)
> +++ pvfs2_src/src/io/bmi/bmi_tcp/bmi-tcp.c (revision 2090)
> @@ -1451,12 +1451,20 @@
> */
> int BMI_tcp_cancel(bmi_op_id_t id, bmi_context_id context_id)
> {
> - method_op_p query_op = (method_op_p)id_gen_safe_lookup(id);
> + method_op_p query_op = NULL;
> +
> + gen_mutex_lock(&interface_mutex);
>
> - assert(query_op);
> + query_op = (method_op_p)id_gen_safe_lookup(id);
> + if(!query_op)
> + {
> + /* if we can't find the operattion, then assume that it
> has already
> + * completed naturally
> + */
> + gen_mutex_unlock(&interface_mutex);
> + return(0);
> + }
>
> - gen_mutex_lock(&interface_mutex);
> -
> /* easy case: is the operation already completed? */
> if(((struct tcp_op*)(query_op->method_data))->tcp_op_state ==
> BMI_TCP_COMPLETE)
> Index: pvfs2_src/src/io/job/job-time-mgr.c
> ===================================================================
> --- pvfs2_src/src/io/job/job-time-mgr.c (revision 2089)
> +++ pvfs2_src/src/io/job/job-time-mgr.c (revision 2090)
> @@ -118,8 +118,8 @@
> break;
> }
>
> + prev_bucket = tmp_bucket;
> tmp_bucket = NULL;
> - prev_bucket = tmp_bucket;
> }
>
> if(!tmp_bucket || tmp_bucket->expire_time_sec != expire_time_sec)
> @@ -257,6 +257,7 @@
> case JOB_BMI:
> gossip_err("Job time out: cancelling bmi operation, job_id: %llu.
> \n", llu(jd->job_id));
> ret = job_bmi_cancel(jd->job_id, jd->context_id);
> + jd->time_bucket = NULL;
> break;
> case JOB_FLOW:
> /* have we made any progress since last time we checked? */
> @@ -274,22 +275,23 @@
> /* otherwise kill the flow */
> gossip_err("Job time out: cancelling flow operation, job_id:
> %llu.\n", llu(jd->job_id));
> ret = job_flow_cancel(jd->job_id, jd->context_id);
> + jd->time_bucket = NULL;
> }
> break;
> case JOB_TROVE:
> gossip_err("Job time out: cancelling trove operation, job_id: %
> llu.\n", llu(jd->job_id));
> ret = job_trove_dspace_cancel(
> jd->u.trove.fsid, jd->job_id, jd->context_id);
> + jd->time_bucket = NULL;
> break;
> default:
> ret = 0;
> + jd->time_bucket = NULL;
> break;
> }
>
> /* FIXME: error handling */
> assert(ret == 0);
> -
> - jd->time_bucket = NULL;
> }
> qlist_del(&tmp_bucket->bucket_link);
> INIT_QLIST_HEAD(&tmp_bucket->jd_queue);
> Index: pvfs2_src/src/client/sysint/sys-io.sm
> ===================================================================
> --- pvfs2_src/src/client/sysint/sys-io.sm (revision 2067)
> +++ pvfs2_src/src/client/sysint/sys-io.sm (revision 2068)
> @@ -1048,12 +1048,30 @@
> sm_p->u.io.flow_completion_count--;
> assert(sm_p->u.io.flow_completion_count > -1);
>
> - /* if error, restart; but if this is a write, let write
> ack catch */
> + /* look for flow error when no write ack is in progress
> (usually a
> + * read case)
> + */
> if (js_p->error_code < 0 && !cur_ctx-
> >write_ack_in_progress) {
> - gossip_debug(GOSSIP_IO_DEBUG,
> - "%s: flow failed, retrying from msgpair\n", __func__);
> - cur_ctx->msg_send_has_been_posted = 0;
> - cur_ctx->msg_recv_has_been_posted = 0;
> + if ((PVFS_ERROR_CLASS(-js_p->error_code) ==
> PVFS_ERROR_BMI) ||
> + (PVFS_ERROR_CLASS(-js_p->error_code) ==
> PVFS_ERROR_FLOW) ||
> + (js_p->error_code == -ECONNRESET) ||
> + (js_p->error_code == -PVFS_EPROTO))
> + {
> + /* if this is a an error that we can retry */
> + gossip_debug(GOSSIP_IO_DEBUG,
> + "%s: flow failed, retrying from msgpair\n",
> __func__);
> + cur_ctx->msg_send_has_been_posted = 0;
> + cur_ctx->msg_recv_has_been_posted = 0;
> + }
> + else
> + {
> + /* do not retry on remaining error codes */
> + gossip_debug(GOSSIP_IO_DEBUG,
> + "%s: flow failed, not retrying\n", __func__);
> +
> + /* forcing the count high insures that the sm
> won't restart */
> + sm_p->u.io.retry_count = sm_p-
> >msgarray_params.retry_limit;
> + }
> }
> }
> else if (STATUS_USER_TAG_TYPE(status_user_tag,
> IO_SM_PHASE_FINAL_ACK))
> @@ -1741,27 +1759,47 @@
> server_config->client_job_flow_timeout);
> PINT_put_server_config_struct(server_config);
>
> - if (ret < 0)
> + /* if the flow fails immediately, then we have to do some special
> + * handling. This function is not equiped to handle the failure
> + * directly, so we instead post a null job that will propigate
> the error
> + * to the normal state where we interpret flow errors
> + */
> + if((ret < 0) || (ret == 1 && cur_ctx->flow_status.error_code !
> = 0))
> {
> - gossip_err("job_flow failed\n");
> - return ret;
> + /* make sure the error code is stored in the flow
> descriptor */
> + if(ret == 1)
> + {
> + cur_ctx->flow_desc.error_code = cur_ctx-
> >flow_status.error_code;
> + }
> + else
> + {
> + cur_ctx->flow_desc.error_code = ret;
> + }
> +
> + gossip_debug(GOSSIP_IO_DEBUG, " immediate flow failure for "
> + "context %p, error code: %d\n", cur_ctx,
> + cur_ctx->flow_desc.error_code);
> + gossip_debug(GOSSIP_IO_DEBUG, " posting job_null() to
> propigate.\n");
> +
> + /* post a fake job to propigate the flow failure to a
> later state */
> + ret = job_null(cur_ctx->flow_desc.error_code, sm_p,
> + status_user_tag, &cur_ctx->flow_status,
> + &cur_ctx->flow_job_id, pint_client_sm_context);
> + if(ret !=0)
> + {
> + return(ret);
> + }
> }
> - else if (ret == 1)
> - {
> - gossip_debug(GOSSIP_IO_DEBUG, " flow for context %p "
> - "completed immediately\n", cur_ctx);
> - assert(cur_ctx->flow_status.error_code == 0);
> - }
> else
> {
> gossip_debug(GOSSIP_IO_DEBUG, " posted flow for "
> "context %p\n", cur_ctx);
> -
> - cur_ctx->flow_has_been_posted = 1;
> - cur_ctx->flow_in_progress = 1;
> - sm_p->u.io.flow_completion_count++;
> }
>
> + cur_ctx->flow_has_been_posted = 1;
> + cur_ctx->flow_in_progress = 1;
> + sm_p->u.io.flow_completion_count++;
> +
> return 0;
> }
>
> Index: pvfs2_src/src/common/misc/server-config.c
> ===================================================================
> --- pvfs2_src/src/common/misc/server-config.c (revision 2273)
> +++ pvfs2_src/src/common/misc/server-config.c (revision 2274)
> @@ -1871,6 +1871,10 @@
> PINT_llist_head(config_s->file_systems);
>
> fs_conf->default_num_dfiles = (int)cmd->data.value;
> + if(fs_conf->default_num_dfiles < 0)
> + {
> + return("Error DefaultNumDFiles must be positive.\n");
> + }
> return NULL;
> }
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
More information about the Pvfs2-developers
mailing list