[Pvfs2-developers] Re: memory allocations in BMI
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Fri Dec 29 13:57:44 EST 2006
Just a sanity check for everyone, I rebuilt release 2.5.1, and
everything works correctly.
I looked back at the 2.5.1 code, and noticed that we were just
hardcoding the max_wr value to 512, which is much more than the 70 I'm
seeing requested here in 2.6.1.
I also verified that we get a max_wr value of 32768 from
ibv_query_device() under 2.5.1.
However, 2.6.1 is reporting 0?
This looks to me now like we're tripping up the driver somehow, if the 0
is accurate.
Was there any change in order of init between the two releases?
I'm puzzled.
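
For reference, here is a minimal sketch of the kind of check under
discussion: clamp the requested work-request counts to the limit the
device reports, then compare what the provider hands back with what was
asked for. It is not the PVFS2 code. The struct and function names are
hypothetical stand-ins so the example compiles without libibverbs:
device_max_wr plays the role of max_qp_wr in struct ibv_device_attr, and
struct qp_caps mirrors the max_send_wr/max_recv_wr members of
ibv_qp_init_attr.cap. Treating a reported limit of 0 as an error is also
an assumption, made here because a limit of 0 cannot satisfy any request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the WR fields of ibv_qp_init_attr.cap. */
struct qp_caps {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
};

/*
 * Clamp a request to the limit reported by the device.
 * Returns 0 on success, -1 if the reported limit is 0 (assumed bogus).
 */
int clamp_wr_request(struct qp_caps *req, uint32_t device_max_wr)
{
    assert(req != NULL);
    if (device_max_wr == 0) {
        fprintf(stderr, "device reports max_wr 0, refusing to continue\n");
        return -1;
    }
    if (req->max_send_wr > device_max_wr)
        req->max_send_wr = device_max_wr;
    if (req->max_recv_wr > device_max_wr)
        req->max_recv_wr = device_max_wr;
    return 0;
}

/*
 * Compare the values written back after QP creation with the request.
 * Returns 0 if the request was satisfied, -1 otherwise.
 */
int check_granted(const struct qp_caps *asked, const struct qp_caps *got)
{
    assert(asked != NULL && got != NULL);
    if (got->max_send_wr < asked->max_send_wr) {
        fprintf(stderr, "asked for %u send WRs on QP, got %u\n",
                (unsigned)asked->max_send_wr, (unsigned)got->max_send_wr);
        return -1;
    }
    if (got->max_recv_wr < asked->max_recv_wr) {
        fprintf(stderr, "asked for %u recv WRs on QP, got %u\n",
                (unsigned)asked->max_recv_wr, (unsigned)got->max_recv_wr);
        return -1;
    }
    return 0;
}
```

With a device limit of 32768, a request for 70 passes through unchanged.
With a limit of 0, clamp_wr_request() fails before any QP is created,
which would make it easier to tell a bad query result apart from a bad
value returned at QP creation.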
Kyle Schochenmaier wrote:
> Pete Wyckoff wrote:
>> kschoche at scl.ameslab.gov wrote on Wed, 27 Dec 2006 15:06 -0600:
>>
>>> Excellent, thanks Pete! For some reason I thought that patch was
>>> only for caching.
>>> I built the latest pvfs-CVS, and am having problems with openib. I
>>> originally thought it was a problem with having my client using CVS
>>> and the server using the latest release, but rebuilt the server to
>>> be on CVS head, and got this on the client:
>>>
>>> p5l8:~# pvfs2-ls
>>> [E 15:02:10.201683] Error: openib_new_connection: asked for 70 send
>>> WRs on QP, got 0.
>>>
>>
>>
> So this is what I'm getting on the server now:
>
> [D 11:41:10.340612] PVFS2 Server version 2.6.1pre1-2006-12-27-205836
> starting.
> [E 11:41:39.405117] Warning: exchange_data: partial read, 1/8 bytes.
> [E 11:41:39.408171] SIGSEGV: skipping cleanup; exit now!
>
> Something is weird here: I wouldn't expect a connection failure or
> resource issue on the client side to segfault the server but not the
> client. More debugging to come.
>
> With network debugging on, I get this on server, which leads me to
> believe I have a configuration problem somewhere as well:
> [D 11:46:01.594426] PVFS2 Server version 2.6.1pre1-2006-12-27-205836
> starting.
> [D 11:46:01.595332] BMI_ib_initialize: init.
> [D 11:46:01.595404] openib_ib_initialize: init.
> [D 11:46:01.596383] openib_ib_initialize: max 65408 completion queue
> entries.
> [D 11:46:01.596664] BMI_ib_initialize: done.
>
> 'pvfs2-ls on client'
> <pages of this>
> [D 11:46:15.711040] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:46:15.723029] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:46:15.735042] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:46:15.747048] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:46:15.759037] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [E 11:46:15.761497] Warning: exchange_data: partial read, 1/8 bytes.
> [D 11:46:15.761542] ib_close_connection: closing connection to
> 10.1.4.57:60889.
> (END)
> The interesting part is that I have timeouts effectively turned off
> in the filesystem config for other testing purposes. When I set them
> back to the defaults (the timeouts in the pvfs2-fs config file), I
> get this:
>
> [D 11:54:11.946168] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:11.958164] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:11.970166] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:11.982167] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:11.994166] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:12.006190] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:12.018165] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [D 11:54:12.030165] BMI_ib_testcontext: last activity too long ago,
> blocking.
> [E 11:54:12.041386] Warning: exchange_data: partial read, 1/8 bytes.
> [D 11:54:12.041441] ib_close_connection: closing connection to
> 10.1.4.57:34756.
> [E 11:54:12.042161] SIGSEGV: skipping cleanup; exit now!
>
> I realize this may also be a hardware issue, but I'd like to see the
> server not barf when clients fail to connect.
> I also tried commenting out those checks, and yes, it's a hardware
> problem on the client side for now. Oddly enough, though, netpipe over
> IB works fine, and so do all of my standard IB tests.
>
>> That's just hardware. It's comparing the returned values in
>> ibv_qp_init_attr to what was asked for. You could try commenting
>> out those checks, just to see if the IB library did not set the
>> return values properly. Or you could shrink the request num_wr and
>> see if that helps. Either way, next stop is your IB card vendor.
>>
>> -- Pete
>>
>>
>>
>
>
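
On the "partial read, 1/8 bytes" warning quoted above: the sketch below
shows one way a server could treat a short read during the initial
exchange as a failure of that one connection and carry on. This is an
illustration only, not the PVFS2 exchange_data code, and it makes no
claim about where the SIGSEGV actually comes from. The names read_full
and exchange_fixed are hypothetical. Only POSIX read() is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes, retrying on short reads and EINTR.
 * Returns the number of bytes actually read, which is less than len
 * if the peer closed the connection early, or -1 on a read error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n == 0)
            break;                  /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, try again */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/*
 * Hypothetical caller: returns 0 if the full message arrived, -1 if
 * not. On -1 the caller would close this one connection and keep
 * serving the others.
 */
int exchange_fixed(int fd, void *buf, size_t len)
{
    ssize_t n = read_full(fd, buf, len);

    if (n < 0 || (size_t)n != len)
        return -1;
    return 0;
}
```

A peer that sends 1 byte of an 8-byte message and then disconnects makes
read_full() return 1 and exchange_fixed() return -1, so the caller gets
an explicit error to act on instead of a half-filled buffer.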
--
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory