[Pvfs2-developers] Re: pvfs-client segfault

Sam Lang slang at mcs.anl.gov
Wed Feb 13 22:25:02 EST 2008


What happens when you restart the client daemon?  Does the segfault  
occur with bmi_tcp?
-sam

On Feb 13, 2008, at 9:17 PM, Troy Benjegerdes wrote:

> An earlier instance of this got me this backtrace:
>
> (gdb) bt
> #0  completion_list_retrieve_completed (op_id_array=0xfff7d710,
>  user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
>  out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
> #1  0x100441b4 in PINT_client_state_machine_testsome (op_id_array=0xfff7d710,
>  op_count=0xfff7d2f0, user_ptr_array=0xfff7d310,
>  error_code_array=0xfff7d410, timeout_ms=10)
>  at ../src/client/sysint/client-state-machine.c:694
> #2  0x10010c00 in process_vfs_requests ()
>  at ../src/apps/kernel/linux/pvfs2-client-core.c:2943
> #3  0x100120f4 in main (argc=<value optimized out>, argv=0xfff7dc74)
>  at ../src/apps/kernel/linux/pvfs2-client-core.c:3379
> (gdb) print sm_p
>
> Sam Lang wrote:
>>
>> Troy,
>>
>> Could you also send the stack trace from gdb where the segfault
>> occurs?  That's going to be the most useful info for us.
>>
>> Thanks,
>> -sam
>>
>> On Feb 13, 2008, at 4:24 PM, Troy Benjegerdes wrote:
>>
>>> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client-log-crash
>>> or
>>> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client-log-crash.gz
>>>
>>> This looks pretty bad:
>>>
>>> [D 16:10:47.741903] [SM Entering]: (0x10103cf0) sysdev_unexp_sm:post (status: 0)
>>> [D 16:10:47.741929] [SM Exiting]: (0x10103cf0) sysdev_unexp_sm:post (error code: 0), (action: DEFERRED)
>>> [D 16:10:47.741953] Posted PVFS_DEV_UNEXPECTED (375) (waiting for test)(-1073741839)
>>> [D 16:10:47.741977] [-] reposted unexp req [0x10117fc0] due to inlined completion
>>> [D 16:10:47.742001] FRAME GET smcb 0x10119bd8 index 0 -> frame: NULL
>>>
>>> Maybe we should have an assert in this code??
>>>
>>> .. more info from a gdb trace..
>>>
>>> (gdb) print smcb
>>> $1 = (PINT_smcb *) 0x10119bd8
>>> (gdb) print *smcb
>>> $2 = {stackptr = 0, current_state = 0x100ea3e4, state_stack =  
>>> {0x100ea3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>>>  0x0}, frames = {next = 0x10119c00, prev = 0x10119c00},  
>>> frame_count = 1,
>>> op_get_state_machine = 0x1005cce0 <client_op_state_get_machine>,  
>>> op = 5, op_id = 0, parent_smcb = 0x0,
>>> op_terminate = 1, op_cancelled = 0, children_running = 0,  
>>> op_completed = 1, context = 0,
>>> terminate_fn = 0x1005ce74 <client_state_machine_terminate>,  
>>> user_ptr = 0x0}
>>> (gdb) print limit
>>> $3 = 64
>>> (gdb) print i
>>> $4 = 0
>>> (gdb) list PINT_sm_frame
>>> 586      * Params: pointer to smcb, stack index
>>> 587      * Returns: pointer to frame
>>> 588      * Synopsis: returns a frame off of the frame stack
>>> 589      */
>>> 590     void *PINT_sm_frame(struct PINT_smcb *smcb, int index)
>>> 591     {
>>> 592         struct PINT_frame_s *frame_entry;
>>> 593         struct qlist_head *next;
>>> 594
>>> 595         if(qlist_empty(&smcb->frames))
>>> (gdb)
>>> 596         {
>>> 597             gossip_debug(GOSSIP_STATE_MACHINE_DEBUG,
>>> 598                          "FRAME GET smcb %p index %d -> frame: NULL\n",
>>> 599                          smcb, index);
>>> 600             return NULL;
>>> 601         }
>>> 602         else
>>> 603         {
>>> 604             int i = 0;
>>> 605
>>> (gdb)
>>> 606             next = smcb->frames.next;
>>> 607             while(i < index)
>>> 608             {
>>> 609                 next = next->next;
>>> 610             }
>>> 611             frame_entry = qlist_entry(next, struct PINT_frame_s, link);
>>> 612             return frame_entry->frame;
>>> 613         }
>>> 614     }
>>> 615
>>> (gdb) print smcb->frames
>>> $5 = {next = 0x10119c00, prev = 0x10119c00}
>>>
>>>
>>>>
>>>> All I get from this is that the frames qlist has a single entry,
>>>> state_stack[4].  Not sure how it got so deep into there.  Likely
>>>> some sort of memory corruption, or we have a fairly major
>>>> undiscovered SM bug on our hands.
>>>>
>>>> If you can repeat this at will, doing a -g build and running with
>>>> all debugging would be especially nice.  Maybe the debug log would
>>>> show something curious.
>>>>
>>>> The other approach is to run under valgrind and cross fingers it
>>>> finds something interesting.
>>>>
>>>>        -- Pete
>>>>
>>>>> (gdb) info locals
>>>>> i = 0
>>>>> new_list_index = 0
>>>>> tmp_completion_list = {0x0 <repeats 256 times>}
>>>>> sm_p = (PINT_client_sm *) 0x0
>>>>> __PRETTY_FUNCTION__ = "completion_list_retrieve_completed"
>>>>> (gdb) print op_id_array
>>>>> $5 = (PVFS_sys_op_id *) 0xfff7d710
>>>>> (gdb) print op_id_array[0]
>>>>> $7 = 34
>>>>>
>>>
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>
>>
>
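
For reference, the guard suggested in the thread ("Maybe we should have an assert in this code??") could look roughly like the sketch below. It is not the PVFS2 code. Every demo_* name is a hypothetical stand-in for the real smcb, qlist and frame types. It assumes the crash comes from a caller using the pointer returned by the frame lookup without checking it; the "frame: NULL" log line is what the quoted PINT_sm_frame listing prints when the frames list is empty. The sketch also bounds the walk by the list length instead of trusting the index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PVFS2 types; not the real definitions. */
struct demo_list_head {
    struct demo_list_head *next, *prev;
};

struct demo_frame {
    struct demo_list_head link;  /* kept first so the cast in the lookup is valid */
    void *frame;
};

struct demo_smcb {
    struct demo_list_head frames;  /* circular list; empty when it points at itself */
};

static void demo_list_init(struct demo_list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void demo_list_add_tail(struct demo_list_head *entry,
                               struct demo_list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Return the frame at position index, or NULL if the list is shorter. */
static void *demo_sm_frame(struct demo_smcb *smcb, int index)
{
    struct demo_list_head *pos;
    int i = 0;

    assert(smcb != NULL);
    if (index < 0)
        return NULL;
    for (pos = smcb->frames.next; pos != &smcb->frames; pos = pos->next) {
        if (i == index)
            return ((struct demo_frame *)pos)->frame;
        i++;
    }
    return NULL;  /* empty list or index out of range */
}

/* Caller-side guard: report failure instead of dereferencing a NULL frame. */
static int demo_get_error_code(struct demo_smcb *smcb, int *error_code)
{
    int *frame = demo_sm_frame(smcb, 0);

    if (frame == NULL)
        return -1;
    *error_code = *frame;
    return 0;
}
```

With a guard like this, an smcb that reaches the completion list without a frame would surface as an error return rather than a segfault, which may make the underlying state-machine problem easier to trace.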


