[Pvfs2-developers] Does this work for IB and TCP?
Scott Atchley
atchley at myri.com
Thu Jan 11 13:48:32 EST 2007
Well, I still can't use the core file, but the segfault is not happening
in my BMI_mx_cancel() function. I added print statements at several
locations, including the end of the function, and all of them print.
[E 13:36:40.025329] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 71.
[D 13:36:40.025508] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 72, ptr: 0x810b924.
[D 13:36:40.025577] BMI_cancel: cancel id 72
* BMI_mx_cancel RX op_id 72 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
[E 13:36:40.988876] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 77.
[D 13:36:40.988926] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 78, ptr: 0x810b924.
[D 13:36:40.988945] BMI_cancel: cancel id 78
* BMI_mx_cancel RX op_id 78 mxc_state 4 peer state 2
BMI_mx_cancel calling mx_cancel()
BMI_mx_cancel mx_cancel() succeeded
BMI_mx_cancel bmx_deq_pending_ctx()
BMI_mx_cancel bmx_q_canceled_ctx()
BMI_mx_cancel done
[2] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
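
Each job line above reports a core dump, yet the core files have not been
usable in gdb. One thing that may be worth checking is whether file(1)
reports the same ELF class for the binary and for the core. Here is a
minimal sketch, assuming a POSIX-style shell; elf_class and check_core are
hypothetical helper names, and this is only a sanity check, not a
diagnosis of why gdb rejects the core.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the ELF class ("32" or "64") from one line
# of file(1) output. Prints "unknown" if neither marker is present.
elf_class() {
    case "$1" in
        *"ELF 32-bit"*) echo 32 ;;
        *"ELF 64-bit"*) echo 64 ;;
        *)              echo unknown ;;
    esac
}

# Hypothetical helper: print what file(1) says about a binary and a core,
# then return 0 only if both have a known and identical ELF class.
check_core() {
    local bin_desc core_desc bin_class core_class
    bin_desc=$(file -L "$1")    # -L follows symlinks to the real binary
    core_desc=$(file "$2")
    echo "binary: $bin_desc"
    echo "core:   $core_desc"
    bin_class=$(elf_class "$bin_desc")
    core_class=$(elf_class "$core_desc")
    [ "$bin_class" != unknown ] && [ "$bin_class" = "$core_class" ]
}

# Example (the core file name is a placeholder):
#   check_core "$(command -v pvfs2-cp)" core.28788 || echo "ELF class differs or is unknown"
```

If the classes match, the problem presumably lies elsewhere, for example
in how the debugger itself was built.
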
I noticed that bmi_ib does not actually cancel an operation; it
simply closes the connection. MX can cancel receives, and I only
close the connection if I need to cancel a send. I do not know
whether this is relevant.
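
For timing the client side on the next run, here is a minimal sketch that
reports the elapsed wall-clock time and the decoded exit status of each
copy. It assumes a POSIX-style shell with `date +%s` available;
describe_status and timed_run are hypothetical helper names, not part of
PVFS2.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a shell exit status into a short description.
# By shell convention, a status above 128 means the command was killed by
# signal (status - 128); a segfault (SIGSEGV, signal 11) shows up as 139.
describe_status() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

# Hypothetical helper: run one command, then print its elapsed wall-clock
# seconds and its decoded exit status on a single line.
timed_run() {
    local start end status=0
    start=$(date +%s)
    "$@" || status=$?
    end=$(date +%s)
    echo "$*: $((end - start))s, $(describe_status "$status")"
}

# Example use, mirroring the two-copy test from this thread:
#   for I in 1 2; do
#       timed_run pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I} &
#   done
#   wait
```

Running each copy through the wrapper in the background keeps the two
transfers concurrent, and the final `wait` blocks until both have reported.
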
Scott
On Jan 11, 2007, at 11:58 AM, Scott Atchley wrote:
> Hi Sam,
>
> I am using a 256 MB file on a machine with only 1 GB of memory. The
> server sees the timeouts after ~60 seconds; the client takes much
> longer (possibly 5 minutes). I will time it on my next run.
>
> Scott
>
> On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:
>
>>
>> Hi Scott,
>>
>> How big is the test file you're copying? TCP doesn't hang with
>> two pvfs2-cp processes on a 40 MB file, but I should probably try
>> something larger. :-) The flow timeouts on the server are set to 5
>> minutes. Are you waiting that long before seeing those messages
>> in the log?
>>
>> -sam
>>
>> On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
>>
>>> Hi all,
>>>
>>> Here is a little more detail. On the server, after the stall I
>>> only see:
>>>
>>> [E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 362.
>>> [E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on 0x8197828
>>> [D 01/11 11:28] fp_multiqueue_cancel: called on already completed flow; doing nothing.
>>>
>>> There are no timeouts in BMI.
>>>
>>> On the client, I eventually see:
>>>
>>> [E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 41.
>>> [D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 42, ptr: 0x810c5ac.
>>> [D 11:32:21.004719] BMI_cancel: cancel id 42
>>> * BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
>>> /* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */
>>>
>>> [2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>> [E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 53.
>>> [D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 54, ptr: 0x810c5ac.
>>> [D 11:32:21.993439] BMI_cancel: cancel id 54
>>> * BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
>>> /* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */
>>>
>>> [3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>>
>>> Since the receives are pending (posted, not completed), the
>>> cancel should succeed. It is probably a bug in my code that
>>> causes the segfaults. Unfortunately, the core files are not usable:
>>>
>>> % gdb pvfs2-cp core.28788
>>> "core.28788" is not a core dump: File format not recognized
>>>
>>> The file command disagrees:
>>>
>>> % file core.28788
>>> core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
>>>
>>> If the other BMI methods do not hang, then I need to keep digging.
>>>
>>> Scott
>>>
>>> On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am simply running two pvfs2-cp processes at the same time to
>>>> see how everything works. For some reason, the copies start but
>>>> do not finish. Eventually, I see timeouts in BMI but not in
>>>> bmi_mx. Before I spend too much time on this, can other methods
>>>> run two copies at the same time?
>>>>
>>>> $ for I in 1 2 ; do
>>>> pvfs2-cp test /mnt/pvfs2/test-${I} &
>>>> done
>>>>
>>>> Thanks,
>>>>
>>>> Scott
>>>> _______________________________________________
>>>> Pvfs2-developers mailing list
>>>> Pvfs2-developers at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>
>>
>