[Pvfs2-developers] Does this work for IB and TCP?

Scott Atchley atchley at myri.com
Thu Jan 11 11:41:53 EST 2007


Hi all,

Here is a little more detail. On the server, after the stall I only see:

[E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 362.
[E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on 0x8197828
[D 01/11 11:28] fp_multiqueue_cancel: called on already completed flow; doing nothing.

There are no timeouts in BMI.

On the client, I eventually see:

[E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 41.
[D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 42, ptr: 0x810c5ac.
[D 11:32:21.004719] BMI_cancel: cancel id 42
* BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */

[2]  - segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/ 
test /mnt/pvfs2/test-${I}
[E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 53.
[D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel opid: 54, ptr: 0x810c5ac.
[D 11:32:21.993439] BMI_cancel: cancel id 54
* BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
/* This is a recv that is pending (mxc_state 4) and the peer is READY (peer state 2) */

[3]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/ 
test /mnt/pvfs2/test-${I}

Since the receives are pending (posted, not completed), the cancel  
should succeed. It is probably a bug in my code that causes the  
segfaults. Unfortunately, the core files are not usable:

% gdb pvfs2-cp core.28788
"core.28788" is not a core dump: File format not recognized

The file utility disagrees and identifies it as a core file:

% file core.28788
core.28788: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'

If the other BMI methods (IB and TCP) do not hang when two copies run at
the same time, then the problem is on my side and I need to keep digging.
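
For anyone who wants to try the same test with another BMI method, here
is a rough sketch of a harness that starts the copies in the background,
waits for each one, and prints its exit status. The function name
run_copies and the COPY_CMD override are made up for illustration and
are not part of PVFS2; only the pvfs2-cp invocation comes from my
original test. On Linux shells a process killed by SIGSEGV normally
shows up as exit status 139 (128 + 11), so a segfault should be easy to
spot.

```shell
# Hypothetical helper (not part of PVFS2): run COUNT concurrent copies of
# SRC to DEST_PREFIX-1, DEST_PREFIX-2, ... and report each exit status.
# COPY_CMD defaults to pvfs2-cp; set it to e.g. "cp" to try the harness
# on a machine without PVFS2.
run_copies() {
    src=$1
    dest_prefix=$2
    count=$3

    # Start all copies in the background and remember their PIDs.
    pids=""
    i=1
    while [ "$i" -le "$count" ]; do
        ${COPY_CMD:-pvfs2-cp} "$src" "${dest_prefix}-${i}" &
        pids="$pids $!"
        i=$((i + 1))
    done

    # Wait for each copy in turn and collect its exit status.
    failed=0
    i=1
    for pid in $pids; do
        wait "$pid"
        status=$?
        if [ "$status" -eq 0 ]; then
            echo "copy $i: ok"
        else
            echo "copy $i: exit status $status"
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done

    # Return the number of failed copies (0 means all succeeded).
    return "$failed"
}
```

With that defined, the test from my first mail would be roughly
"run_copies test /mnt/pvfs2/test 2". Note that a copy that stalls will
still block the wait until the job timeout fires, so this only makes the
outcome easier to read; it does not shorten the hang.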

Scott

On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:

> Hi all,
>
> I am simply running two pvfs2-cp processes at the same time to see  
> how everything works. For some reason, the copies start but do not  
> finish. Eventually, I see timeouts in BMI but not in bmi_mx. Before  
> I spend too much time on this, can other methods run two copies at  
> the same time?
>
> $ for I in 1 2 ; do
> pvfs2-cp test /mnt/pvfs2/test-${I} &
> done
>
> Thanks,
>
> Scott
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


