[Pvfs2-developers] Re: pvfs-client segfault
Troy Benjegerdes
troy at scl.ameslab.gov
Wed Feb 13 17:41:35 EST 2008
On PPC, I would get segfaults whenever any of the admin apps exited, due to
what looked like a double free. Commenting out the two frees in
BMI_ib_set_info *appeared* to fix this, but I suspect something else is
going on.
/*
 * Used to set some optional parameters and random functions, like ioctl.
 */
static int BMI_ib_set_info(int option, void *param __unused)
{
    switch (option) {
    case BMI_DROP_ADDR: {
        struct bmi_method_addr *map = param;
        ib_method_addr_t *ibmap = map->method_data;
        //free(ibmap->hostname);
        //free(map);
        break;
    }
    case BMI_OPTIMISTIC_BUFFER_REG: {
        /* not guaranteed to work */
        const struct bmi_optimistic_buffer_info *binfo = param;
        memcache_preregister(ib_device->memcache, binfo->buffer,
                             binfo->len, binfo->rw);
        break;
    }
    default:
        /* Should return -ENOSYS, but return 0 for caller ease. */
        break;
    }
    return 0;
}
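If the crash at exit really is a double free, one common way to make a drop path tolerant of being called twice is to clear each pointer right after freeing it, since free(NULL) is a no-op. What follows is only a minimal sketch of that idiom, not the actual fix: `fake_ib_addr`, `fake_method_addr`, `make_addr` and `drop_addr` are hypothetical stand-ins, not the real BMI structures, and unlike the code above the sketch leaves freeing the map itself to the caller. It also does nothing about a second owner freeing the same memory elsewhere.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for ib_method_addr_t and struct bmi_method_addr. */
struct fake_ib_addr {
    char *hostname;
};

struct fake_method_addr {
    struct fake_ib_addr *method_data;
};

/*
 * Release the hostname and the method data, clearing each pointer afterwards.
 * A second call on the same map finds NULL and does nothing.
 * Returns the number of allocations released by this call.
 */
static int drop_addr(struct fake_method_addr *map)
{
    int released = 0;

    if (map == NULL || map->method_data == NULL)
        return 0;

    if (map->method_data->hostname != NULL) {
        free(map->method_data->hostname);
        map->method_data->hostname = NULL;
        released++;
    }
    free(map->method_data);
    map->method_data = NULL;
    released++;
    return released;
}

/* Build a map that owns one method-data block and one hostname string. */
static struct fake_method_addr *make_addr(const char *host)
{
    struct fake_method_addr *map;

    assert(host != NULL);
    map = malloc(sizeof(*map));
    if (map == NULL)
        return NULL;
    map->method_data = malloc(sizeof(*map->method_data));
    if (map->method_data == NULL) {
        free(map);
        return NULL;
    }
    map->method_data->hostname = malloc(strlen(host) + 1);
    if (map->method_data->hostname != NULL)
        strcpy(map->method_data->hostname, host);
    return map;
}
```

Calling `drop_addr` twice on the same map is then harmless. The caller still has to free the map exactly once, which is where a real double free could still hide.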
Pete Wyckoff wrote:
> troy at scl.ameslab.gov wrote on Tue, 12 Feb 2008 17:14 -0600:
>
>> I'm getting a sig11 with the power5 client. Here's a bunch of debugging
>> info. Where do I go next?
>>
>> [D 16:57:30.033896] BMI_post_sendunexpected_list: addr: 269231512, count: 1, total_size: 52, tag: 15
>> [D 16:57:30.033926] element 0: offset: 0x1013d390, size: 52
>> [D 16:57:30.033955] post_send: sq 0x100c7f60 len 52 peer da13:3345.
>> [D 16:57:30.033984] encourage_send_waiting_buffer: sq 0x100c7f60 sent EAGER len 52.
>> [D 16:57:30.034019] ib_check_cq: send to da13:3345 completed locally: sq 0x100c7f60 -> SQ_WAITING_USER_TEST.
>> [D 16:57:30.034047] test_sq: sq 0x100c7f60 completed 52 to da13:3345.
>> [D 16:57:30.034162] ib_check_cq: recv from da13:3345 len 104 type MSG_EAGER_SEND credit 1.
>> [D 16:57:30.034191] encourage_recv_incoming: recv eager len 104.
>> [D 16:57:30.034216] encourage_recv_incoming: matched rq 0x100d8790 now RQ_EAGER_WAITING_USER_TEST.
>> [D 16:57:30.034246] encourage_recv_incoming: early registration not needed, dereg after eager.
>> [D 16:57:30.034276] memcache_deregister: dec refcount [0] 0x10146930 len 8224 (via 0x10146930 len 8224) refcnt now 1.
>> [D 16:57:30.034307] test_rq: rq 0x100d8790 completed 88 from da13:3345.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread -134410208 (LWP 6302)]
>> completion_list_retrieve_completed (op_id_array=0xfff7d710,
>> user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
>> out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
>> 141 op_id_array[i] = sm_p->sys_op_id;
>> (gdb)
>> (gdb)
>> (gdb)
>> (gdb)
>> (gdb) bt
>> #0 completion_list_retrieve_completed (op_id_array=0xfff7d710,
>> user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
>> out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
>> #1 0x100441b4 in PINT_client_state_machine_testsome
>> (op_id_array=0xfff7d710,
>> op_count=0xfff7d2f0, user_ptr_array=0xfff7d310,
>> error_code_array=0xfff7d410, timeout_ms=10)
>> at ../src/client/sysint/client-state-machine.c:694
>> #2 0x10010c00 in process_vfs_requests ()
>> at ../src/apps/kernel/linux/pvfs2-client-core.c:2943
>> #3 0x100120f4 in main (argc=<value optimized out>, argv=0xfff7dc74)
>> at ../src/apps/kernel/linux/pvfs2-client-core.c:3379
>> (gdb) print sm_p
>> $1 = (PINT_client_sm *) 0x0
>> (gdb)
>> $2 = (PINT_client_sm *) 0x0
>> (gdb) list
>> 136 assert(smcb);
>> 137
>> 138 if (i < limit)
>> 139 {
>> 140 sm_p = PINT_sm_frame(smcb, PINT_FRAME_CURRENT);
>> 141 op_id_array[i] = sm_p->sys_op_id;
>> 142 error_code_array[i] = sm_p->error_code;
>> 143
>> 144 if (user_ptr_array)
>> 145 {
>> (gdb) print smcb
>> No symbol "smcb" in current context.
>> (gdb) list -
>> 126
>> 127 gen_mutex_lock(&s_completion_list_mutex);
>> 128 for(i = 0; i < s_completion_list_index; i++)
>> 129 {
>> 130 if (s_completion_list[i] == NULL)
>> 131 {
>> 132 continue;
>> 133 }
>> 134
>> 135 smcb = s_completion_list[i];
>> (gdb) print s_completion_list[0]
>> $3 = (PINT_smcb *) 0x100da450
>> (gdb) print *s_completion_list[0]
>> $4 = {stackptr = 0, current_state = 0x100b0068, state_stack = {0x100aff90,
>> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, frames = {next = 0x100da478,
>> prev = 0x100da478}, frame_count = 1,
>> op_get_state_machine = 0x10043b80 <client_op_state_get_machine>, op = 5,
>> op_id = 0, parent_smcb = 0x0, op_terminate = 1, op_cancelled = 0,
>> children_running = 0, op_completed = 1, context = 0,
>> terminate_fn = 0x100452a0 <client_state_machine_terminate>, user_ptr =
>> 0x0}
>>
>
> All I get from this is that the frames qlist has a single entry,
> state_stack[4]. Not sure how it got so deep into there. Likely
> some sort of memory corruption, or we have a fairly major
> undiscovered SM bug on our hands.
>
> If you can repeat this at will, doing a -g build and running with
> all debugging would be especially nice. Maybe the debug log would
> show something curious.
>
> The other approach is to run under valgrind and cross fingers it
> finds something interesting.
>
> -- Pete
>
>
>> (gdb) info locals
>> i = 0
>> new_list_index = 0
>> tmp_completion_list = {0x0 <repeats 256 times>}
>> sm_p = (PINT_client_sm *) 0x0
>> __PRETTY_FUNCTION__ = "completion_list_retrieve_completed"
>> (gdb) print op_id_array
>> $5 = (PVFS_sys_op_id *) 0xfff7d710
>> (gdb) print op_id_array[0]
>> $7 = 34
>>
>>
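A note on reading the `frames` field in the `*s_completion_list[0]` dump above: `next` and `prev` hold the same address. In a circular doubly linked list with a sentinel head, that is true both for a list with exactly one entry and for an empty list whose head points at itself. Comparing the address against the head's own address tells the two apart. The sketch below illustrates the distinction with a hypothetical `struct list_head` and helpers written for this example. It is not the PVFS qlist code, and it assumes qlist follows the usual Linux-style sentinel convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel-style list node, written for illustration only. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* An empty list is a head whose next and prev both point back at itself. */
static void list_init(struct list_head *head)
{
    assert(head != NULL);
    head->next = head;
    head->prev = head;
}

/* Link entry in just before the head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Empty: next points at the head itself. */
static int list_is_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Exactly one entry: next == prev, but neither is the head. */
static int list_is_singleton(const struct list_head *head)
{
    return head->next != head && head->next == head->prev;
}

/* First entry, or NULL when the list is empty; callers must check for NULL. */
static struct list_head *list_first_or_null(struct list_head *head)
{
    return list_is_empty(head) ? NULL : head->next;
}
```

In both states `head.next == head.prev` holds, so that equality alone cannot distinguish them. In the gdb session, `print &s_completion_list[0]->frames` would show whether 0x100da478 is the head itself (empty list, which would be consistent with a NULL `sm_p`) or a separate entry.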