[Pvfs2-developers] Re: memory allocations in BMI

Kyle Schochenmaier kschoche at scl.ameslab.gov
Fri Dec 29 12:58:13 EST 2006


Pete Wyckoff wrote:
> kschoche at scl.ameslab.gov wrote on Wed, 27 Dec 2006 15:06 -0600:
>   
>> Excellent, thanks Pete!  For some reason i thought that patch was  
>> only for caching.
>> I built the latest pvfs-CVS, and am having problems with openib.  I  
>> originally thought it was a problem with having my client using CVS  
>> and the server using the latest release, but rebuilt the server to be  
>> on CVS head, and got this on the client:
>>
>> p5l8:~# pvfs2-ls
>> [E 15:02:10.201683] Error: openib_new_connection: asked for 70 send  
>> WRs on QP, got 0.
>>     
>
>   
So this is what I'm getting on the server now:

[D 11:41:10.340612] PVFS2 Server version 2.6.1pre1-2006-12-27-205836 starting.
[E 11:41:39.405117] Warning: exchange_data: partial read, 1/8 bytes.
[E 11:41:39.408171] SIGSEGV: skipping cleanup; exit now!

Something is weird here. I wouldn't expect a connection failure or
resource issue on the client side to segfault the server but not the
client.  More debugging to come.

With network debugging on, I get this on the server, which leads me to
believe I have a configuration problem somewhere as well:
[D 11:46:01.594426] PVFS2 Server version 2.6.1pre1-2006-12-27-205836 starting.
[D 11:46:01.595332] BMI_ib_initialize: init.
[D 11:46:01.595404] openib_ib_initialize: init.
[D 11:46:01.596383] openib_ib_initialize: max 65408 completion queue entries.
[D 11:46:01.596664] BMI_ib_initialize: done.

'pvfs2-ls on client'
<pages of this>
[D 11:46:15.711040] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:46:15.723029] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:46:15.735042] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:46:15.747048] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:46:15.759037] BMI_ib_testcontext: last activity too long ago, blocking.
[E 11:46:15.761497] Warning: exchange_data: partial read, 1/8 bytes.
[D 11:46:15.761542] ib_close_connection: closing connection to 10.1.4.57:60889.
(END)
                                                    
The interesting part is that I had timeouts effectively turned off in
the filesystem config for other testing purposes.  With them set back to
the defaults (the timeouts in the pvfs2-fs config file), I get this
instead:

[D 11:54:11.946168] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:11.958164] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:11.970166] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:11.982167] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:11.994166] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:12.006190] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:12.018165] BMI_ib_testcontext: last activity too long ago, blocking.
[D 11:54:12.030165] BMI_ib_testcontext: last activity too long ago, blocking.
[E 11:54:12.041386] Warning: exchange_data: partial read, 1/8 bytes.
[D 11:54:12.041441] ib_close_connection: closing connection to 10.1.4.57:34756.
[E 11:54:12.042161] SIGSEGV: skipping cleanup; exit now!

I realize this may also be a hardware issue, but I'd like to see the
server not barf when clients fail to connect.
I also tried commenting out those checks, and yes, it's a hardware
problem on the client side for now.  Oddly enough, though, NetPIPE over
IB works fine, and so do all of my standard IB tests.
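
To illustrate the behaviour I'd like on a short read, here is a minimal
sketch.  It is not the actual exchange_data code in BMI; the names
read_full and exchange_read_sketch are hypothetical, and it assumes the
handshake bytes arrive over an ordinary file descriptor.  The idea is
only that a partial read (such as 1 of 8 bytes) becomes an error return
the caller can act on by closing the connection, rather than going on to
use a half-filled buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read up to len bytes, retrying on EINTR.
 * Returns the number of bytes actually read (short if the peer closed
 * early), or -1 on a read error. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n == 0)
            break;              /* peer closed before sending everything */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, try again */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Hypothetical wrapper: returns 0 only if all len bytes arrived.
 * On -1 the caller should tear down this one connection and must not
 * interpret the contents of buf. */
static int exchange_read_sketch(int fd, void *buf, size_t len)
{
    ssize_t n = read_full(fd, buf, len);

    if (n < 0 || (size_t)n != len)
        return -1;
    return 0;
}
```

With something along these lines, a client that dies mid-handshake costs
the server one connection and nothing else.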

> That's just hardware.  It's comparing the returned values in
> ibv_qp_init_attr to what was asked for.  You could try commenting
> out those checks, just to see if the IB library did not set the
> return values properly.  Or you could shrink the request num_wr and
> see if that helps.  Either way, next stop is your IB card vendor.
>
> 		-- Pete
>
>
>   
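
For anyone following along, the check Pete describes can be sketched
roughly as below.  This is an illustration, not the openib code in BMI:
the struct is a local stand-in for the max_send_wr and max_recv_wr fields
of struct ibv_qp_cap so that it builds without the verbs headers, and
check_qp_caps is a hypothetical name.  As far as I understand the
libibverbs documentation, ibv_create_qp() writes the capacities actually
granted back into the cap field of ibv_qp_init_attr, and they should be
at least what was requested.

```c
#include <stdint.h>
#include <stdio.h>

/* Local stand-in for the relevant fields of struct ibv_qp_cap. */
struct qp_cap_sketch {
    uint32_t max_send_wr;   /* send work requests */
    uint32_t max_recv_wr;   /* receive work requests */
};

/* Hypothetical helper: compare what was asked for with what the library
 * reported back.  Returns 0 if the granted values cover the request,
 * -1 otherwise. */
static int check_qp_caps(const struct qp_cap_sketch *asked,
                         const struct qp_cap_sketch *got)
{
    int ret = 0;

    if (got->max_send_wr < asked->max_send_wr) {
        fprintf(stderr, "asked for %u send WRs on QP, got %u\n",
                (unsigned)asked->max_send_wr, (unsigned)got->max_send_wr);
        ret = -1;
    }
    if (got->max_recv_wr < asked->max_recv_wr) {
        fprintf(stderr, "asked for %u recv WRs on QP, got %u\n",
                (unsigned)asked->max_recv_wr, (unsigned)got->max_recv_wr);
        ret = -1;
    }
    return ret;
}
```

A request of 70 send WRs that comes back as 0 fails this check, which
matches the client error quoted above.  Shrinking the requested count,
as Pete suggests, only helps if the library then reports a value that
covers the smaller request.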


-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 


