[Pvfs2-developers] Re: pvfs-client segfault

Troy Benjegerdes troy at scl.ameslab.gov
Wed Feb 13 18:12:22 EST 2008


Here's another one:

http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log-a5-n5-abort

When I run pvfs2-client-core with no arguments, it seems to work fine.

Could you also take a look at:

http://www.scl.ameslab.gov/~troy/pvfs/hangs/

These are all instances of pvfs2-client-core hanging on PPC64 while 
I'm running dd on a large file.

Pete Wyckoff wrote:
> troy at scl.ameslab.gov wrote on Tue, 12 Feb 2008 17:14 -0600:
>   
>> I'm getting a sig11 with the power5 client. Here's a bunch of debugging 
>> info. Now where do I go next?
>>
>> [D 16:57:30.033896] BMI_post_sendunexpected_list: addr: 269231512, count: 1, total_size: 52, tag: 15
>> [D 16:57:30.033926]    element 0: offset: 0x1013d390, size: 52
>> [D 16:57:30.033955] post_send: sq 0x100c7f60 len 52 peer da13:3345.
>> [D 16:57:30.033984] encourage_send_waiting_buffer: sq 0x100c7f60 sent EAGER len 52.
>> [D 16:57:30.034019] ib_check_cq: send to da13:3345 completed locally: sq 0x100c7f60 -> SQ_WAITING_USER_TEST.
>> [D 16:57:30.034047] test_sq: sq 0x100c7f60 completed 52 to da13:3345.
>> [D 16:57:30.034162] ib_check_cq: recv from da13:3345 len 104 type MSG_EAGER_SEND credit 1.
>> [D 16:57:30.034191] encourage_recv_incoming: recv eager len 104.
>> [D 16:57:30.034216] encourage_recv_incoming: matched rq 0x100d8790 now RQ_EAGER_WAITING_USER_TEST.
>> [D 16:57:30.034246] encourage_recv_incoming: early registration not needed, dereg after eager.
>> [D 16:57:30.034276] memcache_deregister: dec refcount [0] 0x10146930 len 8224 (via 0x10146930 len 8224) refcnt now 1.
>> [D 16:57:30.034307] test_rq: rq 0x100d8790 completed 88 from da13:3345.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread -134410208 (LWP 6302)]
>> completion_list_retrieve_completed (op_id_array=0xfff7d710,
>>    user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
>>    out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
>> 141                 op_id_array[i] = sm_p->sys_op_id;
>> (gdb) bt
>> #0  completion_list_retrieve_completed (op_id_array=0xfff7d710,
>>    user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
>>    out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
>> #1  0x100441b4 in PINT_client_state_machine_testsome (op_id_array=0xfff7d710,
>>    op_count=0xfff7d2f0, user_ptr_array=0xfff7d310,
>>    error_code_array=0xfff7d410, timeout_ms=10)
>>    at ../src/client/sysint/client-state-machine.c:694
>> #2  0x10010c00 in process_vfs_requests ()
>>    at ../src/apps/kernel/linux/pvfs2-client-core.c:2943
>> #3  0x100120f4 in main (argc=<value optimized out>, argv=0xfff7dc74)
>>    at ../src/apps/kernel/linux/pvfs2-client-core.c:3379
>> (gdb) print sm_p
>> $1 = (PINT_client_sm *) 0x0
>> (gdb)
>> $2 = (PINT_client_sm *) 0x0
>> (gdb) list
>> 136             assert(smcb);
>> 137
>> 138             if (i < limit)
>> 139             {
>> 140                 sm_p = PINT_sm_frame(smcb, PINT_FRAME_CURRENT);
>> 141                 op_id_array[i] = sm_p->sys_op_id;
>> 142                 error_code_array[i] = sm_p->error_code;
>> 143
>> 144                 if (user_ptr_array)
>> 145                 {
>> (gdb) print smcb
>> No symbol "smcb" in current context.
>> (gdb) list -
>> 126
>> 127         gen_mutex_lock(&s_completion_list_mutex);
>> 128         for(i = 0; i < s_completion_list_index; i++)
>> 129         {
>> 130             if (s_completion_list[i] == NULL)
>> 131             {
>> 132                 continue;
>> 133             }
>> 134
>> 135             smcb = s_completion_list[i];
>> (gdb) print s_completion_list[0]
>> $3 = (PINT_smcb *) 0x100da450
>> (gdb) print *s_completion_list[0]
>> $4 = {stackptr = 0, current_state = 0x100b0068, state_stack = {0x100aff90,
>>    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, frames = {next = 0x100da478,
>>    prev = 0x100da478}, frame_count = 1,
>>  op_get_state_machine = 0x10043b80 <client_op_state_get_machine>, op = 5,
>>  op_id = 0, parent_smcb = 0x0, op_terminate = 1, op_cancelled = 0,
>>  children_running = 0, op_completed = 1, context = 0,
>>  terminate_fn = 0x100452a0 <client_state_machine_terminate>, user_ptr = 0x0}
>>     
>
> All I get from this is that the frames qlist has a single entry,
> state_stack[4].  Not sure how it got so deep into there.  Likely
> some sort of memory corruption, or we have a fairly major
> undiscovered SM bug on our hands.
>
> If you can repeat this at will, doing a -g build and running with
> all debugging would be especially nice.  Maybe the debug log would
> show something curious.
>
> The other approach is to run under valgrind and cross fingers it
> finds something interesting.
>
> 		-- Pete
>
>   
>> (gdb) info locals
>> i = 0
>> new_list_index = 0
>> tmp_completion_list = {0x0 <repeats 256 times>}
>> sm_p = (PINT_client_sm *) 0x0
>> __PRETTY_FUNCTION__ = "completion_list_retrieve_completed"
>> (gdb) print op_id_array
>> $5 = (PVFS_sys_op_id *) 0xfff7d710
>> (gdb) print op_id_array[0]
>> $7 = 34
>>
>>     
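
The question of what the frames qlist holds can be checked by comparing the address of the list head with its next/prev values. In a Linux-style circular list, a head whose next and prev both point back at the head is empty, while a head whose next and prev point at the same other node has exactly one entry. The following is a minimal sketch of that check, not code from the PVFS2 tree. It assumes a 32-bit build (the pointer values in the trace fit in 32 bits) and the field order gdb printed (stackptr, current_state, state_stack[8], frames); the struct layout, the helper names, and the offset arithmetic are all assumptions made for illustration.

```c
#include <stdint.h>

/* Image of a list head as read from a 32-bit process: two link addresses. */
struct head32 {
    uint32_t next;
    uint32_t prev;
};

enum head_state {
    HEAD_EMPTY,            /* next and prev both point back at the head */
    HEAD_SINGLE_ENTRY,     /* next and prev point at the same other node */
    HEAD_MULTIPLE_ENTRIES, /* next and prev point at different nodes */
    HEAD_INCONSISTENT      /* only one of the two links points at the head */
};

/*
 * ASSUMPTION: 4-byte int, 4-byte pointers, and the field order shown in the
 * gdb dump (stackptr, current_state, state_stack[8], frames). This is not
 * taken from the PVFS2 headers and may not match the real struct.
 */
uint32_t assumed_frames_offset(void)
{
    const uint32_t int_size = 4, ptr_size = 4, stack_depth = 8;
    return int_size + ptr_size + stack_depth * ptr_size;
}

/* Classify a circular doubly-linked list head given its own address. */
enum head_state classify_head(uint32_t head_addr, struct head32 h)
{
    int next_is_head = (h.next == head_addr);
    int prev_is_head = (h.prev == head_addr);

    if (next_is_head && prev_is_head)
        return HEAD_EMPTY;
    if (next_is_head != prev_is_head)
        return HEAD_INCONSISTENT;
    if (h.next == h.prev)
        return HEAD_SINGLE_ENTRY;
    return HEAD_MULTIPLE_ENTRIES;
}

/* Apply the check to the values from the dump of s_completion_list[0]. */
enum head_state classify_dumped_frames(void)
{
    const uint32_t smcb_addr = 0x100da450u;                    /* $3 */
    const struct head32 frames = { 0x100da478u, 0x100da478u }; /* $4 */

    return classify_head(smcb_addr + assumed_frames_offset(), frames);
}
```

Under those assumptions the offset is 0x28, and 0x100da450 + 0x28 is 0x100da478, so the sketch would classify the head as empty even though frame_count is 1. That only holds if the assumed layout is right; running `print &s_completion_list[0]->frames` in gdb would give the real head address to compare against.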


