[Pvfs2-developers] Does this work for IB and TCP?
Sam Lang
slang at mcs.anl.gov
Thu Jan 11 11:48:09 EST 2007
Hi Scott,
How big is the test file you're copying? TCP doesn't hang with two
concurrent pvfs2-cp processes on a 40MB file, but I should probably try
something larger. :-)
The flow timeouts on the server are set to 5 minutes. Are you
waiting that long before seeing those messages in the log?
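
A rough sketch of how the two-copy test could be repeated with a larger file. The helper names make_test_file and run_copies are made up for illustration, and the file size is an arbitrary choice, not something either of us has tried. It only assumes that pvfs2-cp takes a source and a destination, as in your loop:

```shell
#!/bin/sh
# Sketch only: make_test_file and run_copies are hypothetical helpers.

# make_test_file PATH SIZE_MB: write SIZE_MB megabytes of zeros to PATH.
make_test_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# run_copies CMD SRC DEST_PREFIX N: start N copies of SRC in the
# background (to DEST_PREFIX-1 .. DEST_PREFIX-N), wait for each one,
# and return the number of copies that exited with an error.
run_copies() {
    cmd=$1
    src=$2
    prefix=$3
    n=$4
    pids=""
    i=1
    while [ "$i" -le "$n" ]; do
        "$cmd" "$src" "${prefix}-${i}" &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}
```

With your paths this would be something like `make_test_file /scratch/atchley/test 400 && run_copies pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test 2`. Because it waits on each copy, a stalled transfer shows up as the script not returning rather than as a background job that is silently left behind.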
-sam
On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
> Hi all,
>
> Here is a little more detail. On the server, after the stall I only
> see:
>
> [E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow
> operation, job_id: 362.
> [E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on
> 0x8197828
> [D 01/11 11:28] fp_multiqueue_cancel: called on already completed
> flow; doing nothing.
>
> There are no timeouts in BMI.
>
> On the client, I eventually see:
>
> [E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling
> bmi operation, job_id: 41.
> [D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel
> opid: 42, ptr: 0x810c5ac.
> [D 11:32:21.004719] BMI_cancel: cancel id 42
> * BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
> /* This is a recv that is pending (mxc_state 4) and the peer is
> READY (peer state 2) */
>
> [2] - segmentation fault (core dumped) pvfs2-cp /scratch/atchley/
> test /mnt/pvfs2/test-${I}
> [E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling
> bmi operation, job_id: 53.
> [D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel
> opid: 54, ptr: 0x810c5ac.
> [D 11:32:21.993439] BMI_cancel: cancel id 54
> * BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
> /* This is a recv that is pending (mxc_state 4) and the peer is
> READY (peer state 2) */
>
> [3] + segmentation fault (core dumped) pvfs2-cp /scratch/atchley/
> test /mnt/pvfs2/test-${I}
>
> Since the receives are pending (posted, not completed), the cancel
> should succeed. It is probably a bug in my code that causes the
> segfaults. Unfortunately, the core files are not usable:
>
> % gdb pvfs2-cp core.28788
> "core.28788" is not a core dump: File format not recognized
>
> The file command disagrees:
>
> % file core.28788
> core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV),
> SVR4-style, SVR4-style, from 'pvfs2-cp'
>
> If the other BMI methods do not hang, then I need to keep digging.
>
> Scott
>
> On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
>
>> Hi all,
>>
>> I am simply running two pvfs2-cp processes at the same time to see
>> how everything works. For some reason, the copies start but do not
>> finish. Eventually, I see timeouts in BMI but not in bmi_mx.
>> Before I spend too much time on this, can other methods run two
>> copies at the same time?
>>
>> $ for I in 1 2 ; do
>> pvfs2-cp test /mnt/pvfs2/test-${I} &
>> done
>>
>> Thanks,
>>
>> Scott
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers