[Pvfs2-developers] Re: memory allocations in BMI
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Fri Dec 29 12:58:13 EST 2006
Pete Wyckoff wrote:
> kschoche at scl.ameslab.gov wrote on Wed, 27 Dec 2006 15:06 -0600:
>
>> Excellent, thanks Pete! For some reason I thought that patch was
>> only for caching.
>> I built the latest pvfs-CVS, and am having problems with openib. I
>> originally thought it was a problem with having my client using CVS
>> and the server using the latest release, but rebuilt the server to be
>> on CVS head, and got this on the client:
>>
>> p5l8:~# pvfs2-ls
>> [E 15:02:10.201683] Error: openib_new_connection: asked for 70 send
>> WRs on QP, got 0.
>>
>
>
So this is what I'm getting on the server now:
[D 11:41:10.340612] PVFS2 Server version 2.6.1pre1-2006-12-27-205836
starting.
[E 11:41:39.405117] Warning: exchange_data: partial read, 1/8 bytes.
[E 11:41:39.408171] SIGSEGV: skipping cleanup; exit now!
Something is weird here: I wouldn't expect a connection failure or resource
issue on the client side to segfault the server but not the client. More
debugging to come.
With network debugging on, I get this on the server, which leads me to
believe I have a configuration problem somewhere as well:
[D 11:46:01.594426] PVFS2 Server version 2.6.1pre1-2006-12-27-205836
starting.
[D 11:46:01.595332] BMI_ib_initialize: init.
[D 11:46:01.595404] openib_ib_initialize: init.
[D 11:46:01.596383] openib_ib_initialize: max 65408 completion queue
entries.
[D 11:46:01.596664] BMI_ib_initialize: done.
'pvfs2-ls on client'
<pages of this>
[D 11:46:15.711040] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:46:15.723029] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:46:15.735042] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:46:15.747048] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:46:15.759037] BMI_ib_testcontext: last activity too long ago,
blocking.
[E 11:46:15.761497] Warning: exchange_data: partial read, 1/8 bytes.
[D 11:46:15.761542] ib_close_connection: closing connection to
10.1.4.57:60889.
(END)
The interesting part is that I have timeouts effectively turned off in the
filesystem config for other testing purposes. When they are set back to the
defaults (the timeouts in the pvfs2-fs config file), I get this:
[D 11:54:11.946168] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:11.958164] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:11.970166] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:11.982167] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:11.994166] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:12.006190] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:12.018165] BMI_ib_testcontext: last activity too long ago,
blocking.
[D 11:54:12.030165] BMI_ib_testcontext: last activity too long ago,
blocking.
[E 11:54:12.041386] Warning: exchange_data: partial read, 1/8 bytes.
[D 11:54:12.041441] ib_close_connection: closing connection to
10.1.4.57:34756.
[E 11:54:12.042161] SIGSEGV: skipping cleanup; exit now!
I realize this may also be a hardware issue, but I'd like to see the
server not barf when clients fail to connect.
I also tried commenting out those checks, and yes, it's a hardware problem on
the client side for now. Oddly enough, though, NetPIPE over IB works
fine, and so do all of my standard IB tests.
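To illustrate the kind of behavior I mean, here is a minimal sketch of a fixed-size read that reports a short read as an ordinary error the caller can act on (close the connection, keep serving). It assumes the handshake is a plain read of a known number of bytes from a socket. read_full and its return convention are hypothetical names for illustration, not the actual exchange_data code in BMI:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read exactly len bytes from fd into buf.
 * Returns 0 on success, -1 if the peer closed early (partial read)
 * or on a read error. The caller decides how to clean up.
 */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);
        if (n == 0)
            return -1;          /* peer went away before sending everything */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;          /* real error */
        }
        got += (size_t) n;
    }
    return 0;
}
```

With something like this, a client that dies after sending 1 of 8 bytes yields -1 on the server side, and nothing downstream touches a half-filled buffer.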
> That's just hardware. It's comparing the returned values in
> ibv_qp_init_attr to what was asked for. You could try commenting
> out those checks, just to see if the IB library did not set the
> return values properly. Or you could shrink the request num_wr and
> see if that helps. Either way, next stop is your IB card vendor.
>
> -- Pete
>
>
>
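The check Pete describes can be sketched roughly as follows. This is an illustration only, not the openib_new_connection code: struct qp_cap is a stand-in that mirrors two fields of struct ibv_qp_cap from libibverbs so that the sketch compiles without the verbs headers, and check_qp_caps is a hypothetical name. In real code the "got" values would be whatever ibv_create_qp wrote back into the cap member of ibv_qp_init_attr.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the relevant part of struct ibv_qp_cap (assumption:
 * only the work-request limits matter for this check). */
struct qp_cap {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
};

/*
 * Hypothetical check: compare what was requested with what the
 * library reported back. Returns 0 if the granted limits cover the
 * request, -1 otherwise.
 */
static int check_qp_caps(const struct qp_cap *asked, const struct qp_cap *got)
{
    int ret = 0;

    if (got->max_send_wr < asked->max_send_wr) {
        fprintf(stderr, "asked for %u send WRs on QP, got %u\n",
                (unsigned) asked->max_send_wr, (unsigned) got->max_send_wr);
        ret = -1;
    }
    if (got->max_recv_wr < asked->max_recv_wr) {
        fprintf(stderr, "asked for %u recv WRs on QP, got %u\n",
                (unsigned) asked->max_recv_wr, (unsigned) got->max_recv_wr);
        ret = -1;
    }
    return ret;
}
```

A request for 70 send WRs that comes back as 0 fails this check. Whether the 0 means the hardware really granted nothing or the library simply did not fill in the return values is exactly what commenting out the check would tell you.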
--
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory