[Pvfs2-developers] Does this work for IB and TCP? *** SOLVED ***
Scott Atchley
atchley at myri.com
Thu Jan 11 14:47:24 EST 2007
Hi all,
It was a bug in my BMI_mx_mem[alloc|free]() code. I was taking a
lock and, in one rare case, did not release it. The next caller to try
to take it deadlocked the app.
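For anyone who hits the same class of bug, here is a minimal sketch of the pattern and one common way to avoid it. This is not the actual bmi_mx code: every name below (demo_lock, demo_memalloc, DEMO_POOL_SIZE, and so on) is hypothetical, and the "pool exhausted" case is only an assumed stand-in for the rare path. It uses plain pthread mutexes for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration; not the actual bmi_mx code. */
#define DEMO_POOL_SIZE 4

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_in_use = 0;            /* buffers currently handed out */

/* Buggy shape: the early return on the rare "pool exhausted" path
 * leaves demo_lock held, so the next caller blocks forever. */
void *demo_memalloc_buggy(size_t len)
{
    pthread_mutex_lock(&demo_lock);
    if (demo_in_use == DEMO_POOL_SIZE)
        return NULL;                   /* BUG: lock is never released */
    demo_in_use++;
    pthread_mutex_unlock(&demo_lock);
    return malloc(len);
}

/* Fixed shape: every path falls through to a single unlock. */
void *demo_memalloc(size_t len)
{
    void *buf = NULL;

    pthread_mutex_lock(&demo_lock);
    if (demo_in_use < DEMO_POOL_SIZE) {
        buf = malloc(len);
        if (buf != NULL)
            demo_in_use++;
    }
    pthread_mutex_unlock(&demo_lock);  /* reached on every path */
    return buf;
}

void demo_memfree(void *buf)
{
    if (buf == NULL)
        return;                        /* nothing to do; lock not taken */
    pthread_mutex_lock(&demo_lock);
    assert(demo_in_use > 0);
    demo_in_use--;
    pthread_mutex_unlock(&demo_lock);
    free(buf);
}

/* Returns 1 if demo_lock is currently free, 0 if it is still held. */
int demo_lock_is_free(void)
{
    if (pthread_mutex_trylock(&demo_lock) != 0)
        return 0;
    pthread_mutex_unlock(&demo_lock);
    return 1;
}
```

Calling demo_memalloc() until the pool is exhausted and then checking demo_lock_is_free() exercises the rare path. With the buggy variant, the same check would report the lock as still held.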
The multiple copies now complete and the server can exit cleanly.
Sorry for the distraction.
Scott
On Jan 11, 2007, at 2:09 PM, Scott Atchley wrote:
> I do not know if this is related or not, but when I try to ^C the
> server, I do not see the normal shutdown messages such as:
>
> PVFS2 server got signal 2 (server_status_flag: 262143)
> [D 01/11 13:40] *** server shutdown in progress ***
> [D 01/11 13:40] [+] halting state machine processor [ ... ]
> [D 01/11 13:40] [-] state machine processor [ stopped ]
> <snip>
>
> Instead, I only get:
>
> PVFS2 server got signal 2 (server_status_flag: 262143)
>
> and nothing else. I waited more than 5 minutes before suspending the
> process with ^Z and killing it. Also, before I send ^C, the server is
> unresponsive to new operations such as pvfs2-ls.
>
> Scott
>
> On Jan 11, 2007, at 1:48 PM, Scott Atchley wrote:
>
>> Well, I still can't use the core file, but the segfault is not
>> happening in my BMI_mx_cancel() function. I added print statements at
>> several locations, including the end of the function, and they all print.
>>
>> [E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling
>> bmi operation, job_id: 71.
>> [D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel
>> opid: 72, ptr: 0x810b924.
>> [D 13:36:40.025577] BMI_cancel: cancel id 72
>> * BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
>> BMI_mx_cancel calling mx_cancel()
>> BMI_mx_cancel mx_cancel() succeeded
>> BMI_mx_cancel bmx_deq_pending_ctx()
>> BMI_mx_cancel bmx_q_canceled_ctx()
>> BMI_mx_cancel done
>> [3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>
>> [E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling
>> bmi operation, job_id: 77.
>> [D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel
>> opid: 78, ptr: 0x810b924.
>> [D 13:36:40.988945] BMI_cancel: cancel id 78
>> * BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
>> BMI_mx_cancel calling mx_cancel()
>> BMI_mx_cancel mx_cancel() succeeded
>> BMI_mx_cancel bmx_deq_pending_ctx()
>> BMI_mx_cancel bmx_q_canceled_ctx()
>> BMI_mx_cancel done
>> [2] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>
>> I noticed that bmi_ib does not actually cancel an operation, it
>> simply closes the connection. MX can cancel receives, and I only
>> close the connection if I need to cancel a send. I do not know if
>> this is relevant or not.
>>
>> Scott
>>
>>
>> On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:
>>
>>> Hi Sam,
>>>
>>> I am using a 256 MB file on a machine with only 1 GB of memory.
>>> The server sees the timeouts after ~60 seconds; the client takes
>>> much longer (possibly 5 minutes). I will time it on my next run.
>>>
>>> Scott
>>>
>>> On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:
>>>
>>>>
>>>> Hi Scott,
>>>>
>>>> How big is the test file you're copying? tcp doesn't hang with
>>>> two pvfs2-cp processes on a 40 MB file, but I should probably try something
>>>> larger. :-) The flow timeouts on the server are set to 5
>>>> minutes. Are you waiting that long before seeing those messages
>>>> in the log?
>>>>
>>>> -sam
>>>>
>>>> On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Here is a little more detail. On the server, after the stall I
>>>>> only see:
>>>>>
>>>>> [E 01/11 11:28] job_time_mgr_expire: job time out: cancelling
>>>>> flow operation, job_id: 362.
>>>>> [E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called
>>>>> on 0x8197828
>>>>> [D 01/11 11:28] fp_multiqueue_cancel: called on already
>>>>> completed flow; doing nothing.
>>>>>
>>>>> There are no timeouts in BMI.
>>>>>
>>>>> On the client, I eventually see:
>>>>>
>>>>> [E 11:32:21.004481] job_time_mgr_expire: job time out:
>>>>> cancelling bmi operation, job_id: 41.
>>>>> [D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to
>>>>> cancel opid: 42, ptr: 0x810c5ac.
>>>>> [D 11:32:21.004719] BMI_cancel: cancel id 42
>>>>> * BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
>>>>> /* This is a recv that is pending (mxc_state 4) and the peer is
>>>>> READY (peer state 2) */
>>>>>
>>>>> [2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>>> [E 11:32:21.993383] job_time_mgr_expire: job time out:
>>>>> cancelling bmi operation, job_id: 53.
>>>>> [D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to
>>>>> cancel opid: 54, ptr: 0x810c5ac.
>>>>> [D 11:32:21.993439] BMI_cancel: cancel id 54
>>>>> * BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
>>>>> /* This is a recv that is pending (mxc_state 4) and the peer is
>>>>> READY (peer state 2) */
>>>>>
>>>>> [3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>>>
>>>>> Since the receives are pending (posted, not completed), the
>>>>> cancel should succeed. It is probably a bug in my code that
>>>>> causes the segfaults. Unfortunately, the core files are not
>>>>> usable:
>>>>>
>>>>> % gdb pvfs2-cp core.28788
>>>>> "core.28788" is not a core dump: File format not recognized
>>>>>
>>>>> The file command disagrees:
>>>>>
>>>>> % file core.28788
>>>>> core.28788: ELF 32-bit LSB core file Intel 80386, version 1
>>>>> (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
>>>>>
>>>>> If the other BMI methods do not hang, then I need to keep digging.
>>>>>
>>>>> Scott
>>>>>
>>>>> On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am simply running two pvfs2-cp processes at the same time to
>>>>>> see how everything works. For some reason, the copies start
>>>>>> but do not finish. Eventually, I see timeouts in BMI but not
>>>>>> in bmi_mx. Before I spend too much time on this, can other
>>>>>> methods run two copies at the same time?
>>>>>>
>>>>>> $ for I in 1 2 ; do
>>>>>> pvfs2-cp test /mnt/pvfs2/test-${I} &
>>>>>> done
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Scott
>>>>>> _______________________________________________
>>>>>> Pvfs2-developers mailing list
>>>>>> Pvfs2-developers at beowulf-underground.org
>>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>>
>>>>
>>>
>>
>