[Pvfs2-developers] Does this work for IB and TCP?

Scott Atchley atchley at myri.com
Thu Jan 11 11:58:42 EST 2007


Hi Sam,

I am using a 256 MB file on a machine with only 1 GB of memory. The  
server sees the timeouts after ~60 seconds; the client takes much  
longer (possibly 5 minutes). I will time it on my next run.
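
For the timing, a minimal sketch of what I have in mind is below. It assumes a POSIX shell and a `date` that supports `+%s`. `time_copies` is a hypothetical helper name, not something that ships with PVFS2. It starts the copies in parallel, as in the loop from my first mail, and prints each copy's exit status and elapsed seconds:

```shell
# Run <count> copies of one source file concurrently and report how long
# each one takes. time_copies is a hypothetical helper, not part of PVFS2.
# Usage: time_copies <copy-command> <source> <dest-prefix> <count>
time_copies() {
    cp_cmd=$1; src=$2; dest_prefix=$3; count=$4
    pids=""
    i=1
    while [ "$i" -le "$count" ]; do
        (
            # Each copy runs in its own background subshell so that the
            # copies overlap, which is the case that stalls.
            start=$(date +%s)
            "$cp_cmd" "$src" "${dest_prefix}-${i}"
            status=$?
            end=$(date +%s)
            echo "copy ${i}: exit status ${status}, $((end - start)) s"
            exit "$status"
        ) &
        pids="$pids $!"
        i=$((i + 1))
    done

    # Wait for every copy; return non-zero if any copy failed or crashed.
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

With that defined, `time_copies pvfs2-cp test /mnt/pvfs2/test 2` would write test-1 and test-2 as before. A copy that dies on a signal shows up as an exit status above 128.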

Scott

On Jan 11, 2007, at 11:48 AM, Sam Lang wrote:

>
> Hi Scott,
>
> How big is the test file you're copying?  tcp doesn't hang with two  
> pvfs2-cp processes on a 40 MB file, but I should probably try something  
> larger.  :-)  The flow timeouts on the server are set to 5  
> minutes.  Are you waiting that long before seeing those messages in  
> the log?
>
> -sam
>
> On Jan 11, 2007, at 10:41 AM, Scott Atchley wrote:
>
>> Hi all,
>>
>> Here is a little more detail. On the server, after the stall I  
>> only see:
>>
>> [E 01/11 11:28] job_time_mgr_expire: job time out: cancelling flow  
>> operation, job_id: 362.
>> [E 01/11 11:28] fp_multiqueue_cancel: flow proto cancel called on  
>> 0x8197828
>> [D 01/11 11:28] fp_multiqueue_cancel: called on already completed  
>> flow; doing nothing.
>>
>> There are no timeouts in BMI.
>>
>> On the client, I eventually see:
>>
>> [E 11:32:21.004481] job_time_mgr_expire: job time out: cancelling  
>> bmi operation, job_id: 41.
>> [D 11:32:21.004649] PINT_thread_mgr_bmi_cancel: trying to cancel  
>> opid: 42, ptr: 0x810c5ac.
>> [D 11:32:21.004719] BMI_cancel: cancel id 42
>> * BMI_mx_cancel RX op_id 42 mxc_state 4 peer state 2
>> /* This is a recv that is pending (mxc_state 4) and the peer is  
>> READY (peer state 2) */
>>
>> [2]  - segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>> [E 11:32:21.993383] job_time_mgr_expire: job time out: cancelling  
>> bmi operation, job_id: 53.
>> [D 11:32:21.993421] PINT_thread_mgr_bmi_cancel: trying to cancel  
>> opid: 54, ptr: 0x810c5ac.
>> [D 11:32:21.993439] BMI_cancel: cancel id 54
>> * BMI_mx_cancel RX op_id 54 mxc_state 4 peer state 2
>> /* This is a recv that is pending (mxc_state 4) and the peer is  
>> READY (peer state 2) */
>>
>> [3]  + segmentation fault (core dumped)  pvfs2-cp /scratch/atchley/test /mnt/pvfs2/test-${I}
>>
>> Since the receives are pending (posted, not completed), the cancel  
>> should succeed. It is probably a bug in my code that causes the  
>> segfaults. Unfortunately, the core files are not usable:
>>
>> % gdb pvfs2-cp core.28788
>> "core.28788" is not a core dump: File format not recognized
>>
>> File disagrees:
>>
>> % file core.28788
>> core.28788: ELF 32-bit LSB core file Intel 80386, version 1  
>> (SYSV), SVR4-style, SVR4-style, from 'pvfs2-cp'
>>
>> If the other BMI methods do not hang, then I need to keep digging.
>>
>> Scott
>>
>> On Jan 11, 2007, at 11:23 AM, Scott Atchley wrote:
>>
>>> Hi all,
>>>
>>> I am simply running two pvfs2-cp processes at the same time to  
>>> see how everything works. For some reason, the copies start but  
>>> do not finish. Eventually, I see timeouts in BMI but not in  
>>> bmi_mx. Before I spend too much time on this, can other methods  
>>> run two copies at the same time?
>>>
>>> $ for I in 1 2 ; do
>>> pvfs2-cp test /mnt/pvfs2/test-${I} &
>>> done
>>>
>>> Thanks,
>>>
>>> Scott
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>
>


