[Pvfs2-developers] Does this work for IB and TCP? *** SOLVED ***

Sam Lang slang at mcs.anl.gov
Thu Jan 11 14:47:48 EST 2007


On Jan 11, 2007, at 1:47 PM, Scott Atchley wrote:

> Hi all,
>
> It was a bug in my BMI_mx_mem[alloc|free]() code. I was taking a  
> lock, and in one rare case, I did not release it. The next one to  
> try to take it deadlocked the app.
>
> The multiple copies now complete and the server can exit cleanly.
>
> Sorry for the distraction.
>
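
The bug described above (a lock taken in the alloc/free path and not released on one rare branch) is commonly avoided by routing every exit through a single unlock. The following is only a minimal sketch under assumptions: `struct buf_pool`, `pool_alloc()` and `pool_free()` are hypothetical names, and it uses a POSIX mutex, which may not be the locking primitive bmi_mx uses. It is not the actual BMI_mx_memalloc()/BMI_mx_memfree() code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer pool; not the real bmi_mx data structure. */
struct buf_pool {
    pthread_mutex_t lock;
    size_t          in_use;  /* buffers currently handed out */
    size_t          limit;   /* maximum buffers allowed at once */
};

/* Returns a buffer of len bytes, or NULL on failure.
 * Every path taken after the lock is acquired leaves through `out`,
 * so an early exit cannot skip the unlock. */
void *pool_alloc(struct buf_pool *pool, size_t len)
{
    void *buf = NULL;

    assert(pool != NULL);
    pthread_mutex_lock(&pool->lock);

    if (len == 0 || pool->in_use >= pool->limit)
        goto out;            /* rare path: still unlocks below */

    buf = malloc(len);
    if (buf == NULL)
        goto out;

    pool->in_use++;
out:
    pthread_mutex_unlock(&pool->lock);
    return buf;
}

/* Returns a buffer to the pool. */
void pool_free(struct buf_pool *pool, void *buf)
{
    assert(pool != NULL);
    if (buf == NULL)
        return;              /* lock not taken yet, so returning is safe */

    pthread_mutex_lock(&pool->lock);
    pool->in_use--;
    pthread_mutex_unlock(&pool->lock);
    free(buf);
}
```

With a default (non-recursive) mutex, a leaked lock shows up exactly as reported in this thread: the next caller of `pool_alloc()` or `pool_free()` blocks forever. Keeping a single unlock site makes that harder to reintroduce when new early-exit branches are added.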

Cool.  The server tries to cancel outstanding operations on a Ctrl-C.
It doesn't actually exit until all jobs are cancelled, which is why
you saw it hang like that.

-sam
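
The shutdown behaviour described above (cancel outstanding operations on Ctrl-C, exit only once every job is cancelled) can be sketched roughly as follows. This is an illustration under assumptions, not the pvfs2-server implementation: `struct job`, `cancel_fn`, `cancel_pass()` and `drain_jobs()` are hypothetical names, and the `max_passes` bound is added here only so the sketch terminates. According to the explanation above, the real server keeps waiting.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set from the signal handler; sig_atomic_t is safe to write there. */
static volatile sig_atomic_t shutdown_requested = 0;

static void on_sigint(int signo)
{
    (void)signo;
    shutdown_requested = 1;  /* defer the real work to the main loop */
}

/* Installs the SIGINT handler; returns 0 on success, -1 on failure. */
int install_shutdown_handler(void)
{
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/* Non-zero once Ctrl-C (SIGINT) has been received. */
int shutdown_pending(void)
{
    return shutdown_requested != 0;
}

/* Hypothetical job record and cancel hook (1 = cancelled, 0 = stuck). */
struct job { int id; int cancelled; };
typedef int (*cancel_fn)(struct job *);

/* Try to cancel every outstanding job once; return how many remain. */
size_t cancel_pass(struct job *jobs, size_t n, cancel_fn try_cancel)
{
    size_t remaining = 0;

    assert(jobs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].cancelled)
            jobs[i].cancelled = try_cancel(&jobs[i]);
        if (!jobs[i].cancelled)
            remaining++;
    }
    return remaining;
}

/* Repeat cancel passes until nothing is outstanding.
 * Returns 0 when drained, -1 if jobs remain after max_passes
 * (an assumed bound; without it a stuck job means an endless wait). */
int drain_jobs(struct job *jobs, size_t n, cancel_fn try_cancel,
               int max_passes)
{
    for (int pass = 0; pass < max_passes; pass++) {
        if (cancel_pass(jobs, n, try_cancel) == 0)
            return 0;
    }
    return -1;
}

/* Demo hooks: one that always succeeds, one where job 2 never cancels. */
int cancel_ok(struct job *j)    { (void)j; return 1; }
int cancel_stuck(struct job *j) { return j->id != 2; }
```

In this sketch a single job whose cancel never completes, for instance because it is blocked on a leaked lock, keeps `drain_jobs()` from reporting success, which matches the hang seen after ^C in the quoted report below.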

> Scott
>
> On Jan 11, 2007, at 2:09 PM, Scott Atchley wrote:
>
>> I do not know if this is related or not, but when I try to ^C the  
>> server, I do not see the normal shutdown messages such as:
>>
>> PVFS2 server got signal 2 (server_status_flag: 262143)
>> [D 01/11 13:40] *** server shutdown in progress ***
>> [D 01/11 13:40] [+] halting state machine processor   [   ...   ]
>> [D 01/11 13:40] [-]         state machine processor   [ stopped ]
>> <snip>
>>
>> Instead, I only get:
>>
>> PVFS2 server got signal 2 (server_status_flag: 262143)
>>
>> and nothing else. I waited more than 5 minutes before suspending the
>> process with ^Z and killing it. Also, before I send ^C, the server is
>> unresponsive to new operations such as pvfs2-ls.
>>
>> Scott
>>
>> On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:
>>
>>> Well, I still can't use the core file, but the crash is not
>>> happening in my BMI_mx_cancel() function. I added print statements
>>> at several locations, including the end of the function, and all of
>>> them print.
>>>
>>> [E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling  
>>> bmi operation, job_id: 71.
>>> [D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel  
>>> opid: 72, ptr: 0x810b924.
>>> [D 13:36:40.025577] BMI_cancel: cancel id 72
>>> * BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
>>> BMI_mx_cancel calling mx_cancel()
>>> BMI_mx_cancel mx_cancel() succeeded
>>> BMI_mx_cancel bmx_deq_pending_ctx()
>>> BMI_mx_cancel bmx_q_canceled_ctx()
>>> BMI_mx_cancel done
>>> [3]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>
>>> [E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling  
>>> bmi operation, job_id: 77.
>>> [D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel  
>>> opid: 78, ptr: 0x810b924.
>>> [D 13:36:40.988945] BMI_cancel: cancel id 78
>>> * BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
>>> BMI_mx_cancel calling mx_cancel()
>>> BMI_mx_cancel  mx_cancel() succeeded
>>> BMI_mx_cancel bmx_deq_pending_ctx()
>>> BMI_mx_cancel bmx_q_canceled_ctx()
>>> BMI_mx_cancel done
>>> [2]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>
>>> I noticed that bmi_ib does not actually cancel an operation, it  
>>> simply closes the connection. MX can cancel receives, and I only  
>>> close the connection if I need to cancel a send. I do not know if  
>>> this is relevant or not.
>>>
>>> Scott
>>>
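
The cancel policy described above (bmi_ib closes the connection; MX cancels receives directly and closes the connection only for sends) amounts to a small decision table. The sketch below restates it in code purely for illustration. The enum names and `choose_cancel_action()` are hypothetical and do not come from the PVFS2 or MX sources.

```c
#include <assert.h>

/* Hypothetical labels for the two network methods discussed here. */
enum net_method    { METHOD_IB, METHOD_MX };
enum op_kind       { OP_SEND, OP_RECV };
enum cancel_action { CANCEL_OP_ONLY, CLOSE_CONNECTION };

/* Maps (method, operation kind) to the cancel behaviour described
 * in the message above. */
enum cancel_action choose_cancel_action(enum net_method m, enum op_kind k)
{
    assert(k == OP_SEND || k == OP_RECV);

    /* MX can cancel a posted receive without touching the connection. */
    if (m == METHOD_MX && k == OP_RECV)
        return CANCEL_OP_ONLY;

    /* MX sends, and every bmi_ib operation, close the connection. */
    return CLOSE_CONNECTION;
}
```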
>>>
>>> On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:
>>>
>>>> Hi Sam,
>>>>
>>>> I am using a 256 MB file on a machine with only 1 GB of memory.
>>>> The server sees the timeouts after ~60 seconds; the client takes
>>>> much longer (possibly 5 minutes). I will time it on my next run.
>>>>
>>>> Scott
>>>>
>>>> On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:
>>>>
>>>>>
>>>>> Hi Scott,
>>>>>
>>>>> How big is the test file you're copying?  tcp doesn't hang with
>>>>> two pvfs2-cp processes on a 40 MB file, but I should probably try
>>>>> something larger.  :-)  The flow timeouts on the server are set to
>>>>> 5 minutes.  Are you waiting that long before seeing those
>>>>> messages in the log?
>>>>>
>>>>> -sam
>>>>>
>>>>> On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Here is a little more detail. On the server, after the stall I  
>>>>>> only see:
>>>>>>
>>>>>> [E 01/11 11:28] job_time_mgr_expire: job time out: cancelling  
>>>>>> flow operation, job_id: 362.
>>>>>> [E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called  
>>>>>> on 0x8197828
>>>>>> [D 01/11 11:28] fp_multiqueue_cancel: called on already  
>>>>>> completed flow; doing nothing.
>>>>>>
>>>>>> There are no timeouts in BMI.
>>>>>>
>>>>>> On the client, I eventually see:
>>>>>>
>>>>>> [E 11:32:21.004481] job_time_mgr_expire: job time out:  
>>>>>> cancelling bmi operation, job_id: 41.
>>>>>> [D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to  
>>>>>> cancel opid: 42, ptr: 0x810c5ac.
>>>>>> [D 11:32:21.004719] BMI_cancel: cancel id 42
>>>>>> * BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
>>>>>> /* This is a recv that is pending (mxc_state 4) and the peer  
>>>>>> is READY (peer state 2) */
>>>>>>
>>>>>> [2]  - segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>>>> [E 11:32:21.993383] job_time_mgr_expire: job time out:  
>>>>>> cancelling bmi operation, job_id: 53.
>>>>>> [D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to  
>>>>>> cancel opid: 54, ptr: 0x810c5ac.
>>>>>> [D 11:32:21.993439] BMI_cancel: cancel id 54
>>>>>> * BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
>>>>>> /* This is a recv that is pending (mxc_state 4) and the peer  
>>>>>> is READY (peer state 2) */
>>>>>>
>>>>>> [3]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>>>>
>>>>>> Since the receives are pending (posted, not completed), the  
>>>>>> cancel should succeed. It is probably a bug in my code that  
>>>>>> causes the segfaults. Unfortunately, the core files are not  
>>>>>> usable:
>>>>>>
>>>>>> % gdb pvfs2-cp core.28788
>>>>>> "core.28788" is not a core dump: File format not recognized
>>>>>>
>>>>>> The file command disagrees:
>>>>>>
>>>>>> % file core.28788
>>>>>> core.28788: ELF 32-bit LSB core file Intel 80386, version 1  
>>>>>> (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
>>>>>>
>>>>>> If the other BMI methods do not hang, then I need to keep  
>>>>>> digging.
>>>>>>
>>>>>> Scott
>>>>>>
>>>>>> On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am simply running two pvfs2-cp processes at the same time  
>>>>>>> to see how everything works. For some reason, the copies  
>>>>>>> start but do not finish. Eventually, I see timeouts in BMI  
>>>>>>> but not in bmi_mx. Before I spend too much time on this, can  
>>>>>>> other methods run two copies at the same time?
>>>>>>>
>>>>>>> $ for I in 1 2 ; do
>>>>>>> pvfs2-cp test /mnt/pvfs2/test-${I} &
>>>>>>> done
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Scott
>>>>>>> _______________________________________________
>>>>>>> Pvfs2-developers mailing list
>>>>>>> Pvfs2-developers at beowulf-underground.org
>>>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


