[Pvfs2-developers] Does this work for IB and TCP?

Scott Atchley atchley at myri.com
Thu Jan 11 14:09:28 EST 2007


I do not know if this is related or not, but when I try to ^C the  
server, I do not see the normal shutdown messages such as:

PVFS2 server got signal 2 (server_status_flag: 262143)
[D 01/11 13:40] *** server shutdown in progress ***
[D 01/11 13:40] [+] halting state machine processor   [   ...   ]
[D 01/11 13:40] [-]         state machine processor   [ stopped ]
<snip>

Instead, I only get:

PVFS2 server got signal 2 (server_status_flag: 262143)

and nothing else. I waited more than 5 minutes before suspending the  
process with ^Z and killing it. Also, even before the ^C, the server is  
unresponsive to new operations such as pvfs2-ls.
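
For anyone reproducing this, the wait-then-kill step can be scripted so
the wait is bounded. The following is only a minimal sketch in plain
POSIX sh: the function name stop_with_timeout, its arguments, and the
60-second default are assumptions made for illustration, and nothing in
it is specific to PVFS2.

```shell
#!/bin/sh
# Hypothetical helper (not part of PVFS2): ask a process to exit with a
# signal, poll until it is gone, and fall back to SIGKILL at a deadline.
# Usage: stop_with_timeout PID [TIMEOUT_SECONDS] [SIGNAL]
# Returns 0 if the process exited on its own, 1 if SIGKILL was needed.
# Note: POSIX sh has no "local", so these variables are global.
stop_with_timeout() {
    swt_pid=$1
    swt_timeout=${2:-60}     # assumed default: 60 seconds
    swt_signal=${3:-INT}     # INT is what ^C delivers from a terminal

    # If the signal cannot be delivered, treat the process as already gone.
    kill -"$swt_signal" "$swt_pid" 2>/dev/null || return 0

    swt_waited=0
    # kill -0 delivers nothing; it only tests whether the pid still exists.
    while kill -0 "$swt_pid" 2>/dev/null; do
        if [ "$swt_waited" -ge "$swt_timeout" ]; then
            echo "pid $swt_pid still alive after ${swt_timeout}s; sending SIGKILL" >&2
            kill -KILL "$swt_pid" 2>/dev/null
            return 1
        fi
        sleep 1
        swt_waited=$((swt_waited + 1))
    done
    return 0
}
```

With the server's pid in a variable (SERVER_PID here is a placeholder),
"stop_with_timeout "$SERVER_PID" 300" would mirror the five-minute wait
described above. A return value of 1 means the process never finished
shutting down by itself, which is the symptom being reported.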

Scott

On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:

> Well, I still can't use the core file, but the segfault is not  
> happening in my BMI_mx_cancel() function. I added print statements  
> at several locations, including the end of the function, and all of  
> them print.
>
> [E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling  
> bmi operation, job_id: 71.
> [D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel  
> opid: 72, ptr: 0x810b924.
> [D 13:36:40.025577] BMI_cancel: cancel id 72
> * BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
> BMI_mx_cancel calling mx_cancel()
> BMI_mx_cancel mx_cancel() succeeded
> BMI_mx_cancel bmx_deq_pending_ctx()
> BMI_mx_cancel bmx_q_canceled_ctx()
> BMI_mx_cancel done
> [3]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>
> [E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling  
> bmi operation, job_id: 77.
> [D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel  
> opid: 78, ptr: 0x810b924.
> [D 13:36:40.988945] BMI_cancel: cancel id 78
> * BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
> BMI_mx_cancel calling mx_cancel()
> BMI_mx_cancel  mx_cancel() succeeded
> BMI_mx_cancel bmx_deq_pending_ctx()
> BMI_mx_cancel bmx_q_canceled_ctx()
> BMI_mx_cancel done
> [2]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>
> I noticed that bmi_ib does not actually cancel an operation; it  
> simply closes the connection. MX can cancel receives, and I only  
> close the connection if I need to cancel a send. I do not know if  
> this is relevant or not.
>
> Scott
>
>
> On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:
>
>> Hi Sam,
>>
>> I am using a 256 MB file on a machine with only 1 GB of memory.  
>> The server sees the timeouts after ~60 seconds; the client takes  
>> much longer (possibly 5 minutes). I will time it on my next run.
>>
>> Scott
>>
>> On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:
>>
>>>
>>> Hi Scott,
>>>
>>> How big is the test file you're copying?  TCP doesn't hang with  
>>> two pvfs2-cp processes on a 40 MB file, but I should probably try  
>>> something larger.  :-)  The flow timeouts on the server are set to 5  
>>> minutes.  Are you waiting that long before seeing those messages  
>>> in the log?
>>>
>>> -sam
>>>
>>> On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
>>>
>>>> Hi all,
>>>>
>>>> Here is a little more detail. On the server, after the stall I  
>>>> only see:
>>>>
>>>> [E 01/11 11:28] job_time_mgr_expire: job time out: cancelling  
>>>> flow operation, job_id: 362.
>>>> [E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called  
>>>> on 0x8197828
>>>> [D 01/11 11:28] fp_multiqueue_cancel: called on already  
>>>> completed flow; doing nothing.
>>>>
>>>> There are no timeouts in BMI.
>>>>
>>>> On the client, I eventually see:
>>>>
>>>> [E 11:32:21.004481] job_time_mgr_expire: job time out:  
>>>> cancelling bmi operation, job_id: 41.
>>>> [D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel  
>>>> opid: 42, ptr: 0x810c5ac.
>>>> [D 11:32:21.004719] BMI_cancel: cancel id 42
>>>> * BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
>>>> /* This is a recv that is pending (mxc_state 4) and the peer is  
>>>> READY (peer state 2) */
>>>>
>>>> [2]  - segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>> [E 11:32:21.993383] job_time_mgr_expire: job time out:  
>>>> cancelling bmi operation, job_id: 53.
>>>> [D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel  
>>>> opid: 54, ptr: 0x810c5ac.
>>>> [D 11:32:21.993439] BMI_cancel: cancel id 54
>>>> * BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
>>>> /* This is a recv that is pending (mxc_state 4) and the peer is  
>>>> READY (peer state 2) */
>>>>
>>>> [3]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>>
>>>> Since the receives are pending (posted, not completed), the  
>>>> cancel should succeed. It is probably a bug in my code that  
>>>> causes the segfaults. Unfortunately, the core files are not usable:
>>>>
>>>> % gdb pvfs2-cp core.28788
>>>> "core.28788" is not a core dump: File format not recognized
>>>>
>>>> The file command disagrees:
>>>>
>>>> % file core.28788
>>>> core.28788: ELF 32-bit LSB core file Intel 80386, version 1  
>>>> (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
>>>>
>>>> If the other BMI methods do not hang, then I need to keep digging.
>>>>
>>>> Scott
>>>>
>>>> On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am simply running two pvfs2-cp processes at the same time to  
>>>>> see how everything works. For some reason, the copies start but  
>>>>> do not finish. Eventually, I see timeouts in BMI but not in  
>>>>> bmi_mx. Before I spend too much time on this, can other methods  
>>>>> run two copies at the same time?
>>>>>
>>>>> $ for I in 1 2 ; do
>>>>> pvfs2-cp test /mnt/pvfs2/test-${I} &
>>>>> done
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Scott
>>>>> _______________________________________________
>>>>> Pvfs2-developers mailing list
>>>>> Pvfs2-developers at beowulf-underground.org
>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>
>>>>
>>>
>>
>
