[Pvfs2-developers] Re: memory allocations in BMI

Kyle Schochenmaier kschoche at scl.ameslab.gov
Fri Dec 29 13:57:44 EST 2006


Just a sanity check for everyone: I rebuilt release 2.5.1, and
everything works correctly.
I looked back at the 2.5.1 code and noticed that we were just
hardcoding the max_wr value to 512, which is much more than the 70 I'm
seeing requested here in 2.6.1.
I also verified that we get a max_wr value of 32768 from
ibv_query_device() in 2.5.1.
However, 2.6.1 is reporting 0?
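
A minimal sketch of the kind of guard that would catch this, assuming the
limit comes from the max_qp_wr field that ibv_query_device() fills in.
The dev_limits struct and pick_num_wr() are hypothetical names for
illustration, not PVFS2 code, and the struct is only a stand-in so the
sketch builds without libibverbs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the one field of struct ibv_device_attr
 * used here (max_qp_wr), so the sketch builds without libibverbs. */
struct dev_limits {
    int max_qp_wr;   /* max outstanding work requests per QP */
};

/*
 * Choose how many WRs to request on a new QP.  Returns the count to
 * ask for, or -1 if the device reports a non-positive limit, which
 * suggests the device query itself went wrong.
 */
int pick_num_wr(const struct dev_limits *lim, int wanted)
{
    assert(lim != NULL);
    if (lim->max_qp_wr <= 0) {
        fprintf(stderr, "device reports max_qp_wr = %d, refusing\n",
                lim->max_qp_wr);
        return -1;
    }
    /* Never ask for more than the device says it can provide. */
    return wanted < lim->max_qp_wr ? wanted : lim->max_qp_wr;
}
```

With the numbers above, a limit of 32768 and a request of 70 gives 70,
while a reported limit of 0 is rejected before any QP is created.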

This looks to me now like we're tripping up the driver somehow, if the 0 
is accurate.
Was there any change in order of init between the two releases?

I'm puzzled.

Kyle Schochenmaier wrote:
> Pete Wyckoff wrote:
>> kschoche at scl.ameslab.gov wrote on Wed, 27 Dec 2006 15:06 -0600:
>>  
>>> Excellent, thanks Pete!  For some reason I thought that patch was
>>> only for caching.
>>> I built the latest pvfs-CVS, and am having problems with openib.  I
>>> originally thought it was a problem with having my client using CVS
>>> and the server using the latest release, but rebuilt the server to
>>> be on CVS head, and got this on the client:
>>>
>>> p5l8:~# pvfs2-ls
>>> [E 15:02:10.201683] Error: openib_new_connection: asked for 70 send WRs on QP, got 0.
>>>     
>>
>>   
> So this is what I'm getting on the server now:
>
> [D 11:41:10.340612] PVFS2 Server version 2.6.1pre1-2006-12-27-205836 
> starting.
> [E 11:41:39.405117] Warning: exchange_data: partial read, 1/8 bytes.
> [E 11:41:39.408171] SIGSEGV: skipping cleanup; exit now!
>
> Something is weird here: I wouldn't expect a connection
> failure/resource issue on the client side to segfault the server
> but not the client.  More debugging to come.
>
> With network debugging on, I get this on server, which leads me to 
> believe I have a configuration problem somewhere as well:
> [D 11:46:01.594426] PVFS2 Server version 2.6.1pre1-2006-12-27-205836 
> starting.
> [D 11:46:01.595332] BMI_ib_initialize: init.
> [D 11:46:01.595404] openib_ib_initialize: init.
> [D 11:46:01.596383] openib_ib_initialize: max 65408 completion queue 
> entries.
> [D 11:46:01.596664] BMI_ib_initialize: done.
>
> 'pvfs2-ls on client'
> <pages of this>
> [D 11:46:15.711040] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:46:15.723029] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:46:15.735042] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:46:15.747048] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:46:15.759037] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [E 11:46:15.761497] Warning: exchange_data: partial read, 1/8 bytes.
> [D 11:46:15.761542] ib_close_connection: closing connection to 
> 10.1.4.57:60889.
>
> The interesting part is that I have timeouts effectively turned off in
> the filesystem config (the timeouts in the pvfs2-fs config file) for
> other testing purposes.  With them set back to the defaults, I get:
>
> [D 11:54:11.946168] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:11.958164] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:11.970166] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:11.982167] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:11.994166] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:12.006190] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:12.018165] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [D 11:54:12.030165] BMI_ib_testcontext: last activity too long ago, 
> blocking.
> [E 11:54:12.041386] Warning: exchange_data: partial read, 1/8 bytes.
> [D 11:54:12.041441] ib_close_connection: closing connection to 
> 10.1.4.57:34756.
> [E 11:54:12.042161] SIGSEGV: skipping cleanup; exit now!
>
> I realize this may also be a hardware issue, but I'd like to see the
> server not crash when clients fail to connect.
> I also tried commenting out those checks; yes, it's a hardware problem
> on the client side for now.  Oddly enough, though, netpipe over IB
> works fine, and so do all of my standard IB tests.
>
>> That's just hardware.  It's comparing the returned values in
>> ibv_qp_init_attr to what was asked for.  You could try commenting
>> out those checks, just to see if the IB library did not set the
>> return values properly.  Or you could shrink the request num_wr and
>> see if that helps.  Either way, next stop is your IB card vendor.
>>
>>         -- Pete
>>
>>
>>   
>
>
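
A rough sketch of the comparison Pete describes above, under the
assumption that the library writes the granted capacities back into the
same structure that carried the request.  The qp_cap_sketch struct
mirrors only the two WR fields of struct ibv_qp_cap, and check_granted()
is a hypothetical helper, not the actual openib_new_connection code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the two WR fields of struct ibv_qp_cap. */
struct qp_cap_sketch {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
};

/*
 * Compare what was requested with what was written back after QP
 * creation.  Returns 0 if the QP has at least the requested capacity,
 * -1 otherwise.
 */
int check_granted(const struct qp_cap_sketch *asked,
                  const struct qp_cap_sketch *got)
{
    assert(asked != NULL && got != NULL);
    if (got->max_send_wr < asked->max_send_wr) {
        fprintf(stderr, "asked for %u send WRs on QP, got %u\n",
                (unsigned) asked->max_send_wr,
                (unsigned) got->max_send_wr);
        return -1;
    }
    if (got->max_recv_wr < asked->max_recv_wr) {
        fprintf(stderr, "asked for %u recv WRs on QP, got %u\n",
                (unsigned) asked->max_recv_wr,
                (unsigned) got->max_recv_wr);
        return -1;
    }
    return 0;
}
```

Returning an error code here, rather than aborting, leaves the caller
free to close just the one connection when the check fails.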


-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 


