[Pvfs2-developers] Re: bmi_ib resource constraints with older hardware

Troy Benjegerdes troy at scl.ameslab.gov
Wed Mar 12 12:55:31 EST 2008


Pete Wyckoff wrote:
> kschoche at gmail.com wrote on Mon, 10 Mar 2008 13:51 -0500:
>   
>> I am trying to hack together a test case to implement what we had
>> talked about in the previous emails with a wr_credit...
>> I'm trying to keep track of it in the openib_device (od) structure
>> inside openib.c and would like to keep the necessary changes inside
>> openib.c if at all possible.  The problem I'm running into, is that
>> I'm going to need to call check_cq() from inside the send_rdma writes
>> function, which lies in openib.c, not ib.c.  openib.c has a
>> function for this but it's really intended to work *with* ib.c's
>> check_cq() functionality...
>> In order to get around this I needed to make ib_check_cq() visible to
>> openib.c  (got rid of the static and added a declaration to ib.h)..
>> but I'm getting weird things when I'm linking..
>>
>> Any ideas how to get around this?
>>
>> lib/libpvfs2-server.a(bmi-server.o):(.rodata+0x780): undefined
>> reference to `bmi_ib_ops'
>> collect2: ld returned 1 exit status
>> make: *** [src/server/pvfs2-server] Error 1
>>
>> (I've attached a very rudimentary patch that sort of gets at what I'm
>> trying to do, not sure if it's correct yet, still trying to compile)
>>     
>
> Just hack up anything you like to get it to work.  If it fixes the
> situation, we'll go back and clean up the code later.
>
> It is optimistic, what you're trying to do, but I'm not sure if it
> will be sufficient.  If there are no credits to get back from
> checking the CQ, you'll just deadlock.  I'm also nervous about
> locking implications, as you're checking the CQ in the thread that
> is trying to do the send.  Not sure if we have done this before.
>
> A simpler way would be to just fail whatever operation got us
> into this RDMA, by abandoning it, with another state that says we're
> waiting on credits.  An easier first step is just to add lots of
> printfs to track the credits and see if you can correlate a credit
> overflow with the rdma failures.  If that works, a check at the top
> of "post rdma" can say whether we should even bother and we won't
> need your fixup step of looking at the CQ from the send.
>
> 		-- Pete
>   
I added debug code that increments od->nic_wr_credit for every
ibv_post_send and decrements it for every RDMA completion in
openib_poll_cq. This is what I got:

We ended up posting about 70 RDMAs with ibv_post_send, and 3 with
signals, and then ran out of resources. I suspect that if I just had a
loop that called 'ib_block_for_activity', then 'ib_poll_cq', and then
retried the post_send, it would work fine.  I'll probably try that next.
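
Roughly, the retry loop I have in mind looks like the sketch below. It is
only a model, not the bmi_ib code: struct mock_nic, mock_post_send,
mock_poll_cq, post_with_retry and post_many are all made-up names standing
in for the od structure, ibv_post_send and the block-then-poll step, and
the queue depth and reap batch size are assumed values. It also shows the
case Pete is worried about: if polling gets nothing back, the post gives up
instead of spinning.

```c
#include <assert.h>

/* Hypothetical stand-in for the NIC send queue state (not the real od). */
struct mock_nic {
    int max_wr;     /* assumed send-queue depth */
    int wr_credit;  /* work requests posted and not yet reaped */
};

/* Stand-in for ibv_post_send: fails once the queue is full. */
int mock_post_send(struct mock_nic *nic)
{
    if (nic->wr_credit >= nic->max_wr)
        return -1;
    nic->wr_credit++;
    return 0;
}

/* Stand-in for "block for activity, then poll the CQ": reaps up to
 * 'batch' outstanding work requests and returns how many it reaped. */
int mock_poll_cq(struct mock_nic *nic, int batch)
{
    int n = nic->wr_credit < batch ? nic->wr_credit : batch;
    nic->wr_credit -= n;
    return n;
}

/* Post one RDMA write; on failure reap completions and retry once more.
 * Returns 0 on success, -1 if no credits could be recovered. */
int post_with_retry(struct mock_nic *nic, int batch)
{
    if (mock_post_send(nic) == 0)
        return 0;
    if (mock_poll_cq(nic, batch) == 0)
        return -1;              /* nothing came back: give up, do not spin */
    return mock_post_send(nic); /* at least one slot is free now */
}

/* Post 'total' writes; returns how many were actually posted. */
int post_many(struct mock_nic *nic, int total, int batch)
{
    int posted = 0;

    assert(nic->max_wr >= 0);
    while (posted < total && post_with_retry(nic, batch) == 0)
        posted++;
    return posted;
}
```

With a depth of 70 and 32 completions reaped per poll, post_many() gets
all 76 writes out; with nothing to reap it stops at 70, which is about
where the real run fails in the log below.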

[D 12:32:35.013945] openib_post_sr: 10.1.4.240:46814 bh 17 len 32 wr 6/70.
[D 12:32:35.014139] BMI_post_send_list: addr: 6663808, count: 1, 
total_size: 24, tag: 4294
[D 12:32:35.014203]    element 0: offset: 0x6155b0, size: 24
[D 12:32:35.014225] openib_post_sr: 10.1.4.240:46814 bh 18 len 40 wr 7/70.
[D 12:32:35.014293] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f43dc000 rkey 3a0004b nic_wr_credit 19011.
[D 12:32:35.014321] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f43fc000 rkey 3a0004b nic_wr_credit 19012.
[D 12:32:35.014340] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f441c000 rkey 3a0004b nic_wr_credit 19013.
[D 12:32:35.014362] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f443c000 rkey 3a0004b nic_wr_credit 19014.
[D 12:32:35.014381] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f445c000 rkey 3a0004b nic_wr_credit 19015.
[D 12:32:35.014399] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f447c000 rkey 3a0004b nic_wr_credit 19016.
[D 12:32:35.014418] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f449c000 rkey 3a0004b nic_wr_credit 19017.
[D 12:32:35.014436] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f44bc000 rkey 3a0004b nic_wr_credit 19018.
[D 12:32:35.014455] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f44dc000 rkey 3a0004b nic_wr_credit 19019.
[D 12:32:35.014474] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f44fc000 rkey 3a0004b nic_wr_credit 19020.
[D 12:32:35.014493] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f451c000 rkey 3a0004b nic_wr_credit 19021.
[D 12:32:35.014512] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f453c000 rkey 3a0004b nic_wr_credit 19022.
[D 12:32:35.014540] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f455c000 rkey 3a0004b nic_wr_credit 19023.
[D 12:32:35.014558] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f457c000 rkey 3a0004b nic_wr_credit 19024.
[D 12:32:35.014577] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f459c000 rkey 3a0004b nic_wr_credit 19025.
[D 12:32:35.014596] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f45bc000 rkey 3a0004b nic_wr_credit 19026.
[D 12:32:35.014615] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f45dc000 rkey 3a0004b nic_wr_credit 19027.
[D 12:32:35.014631] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f45fc000 rkey 3a0004b nic_wr_credit 19028.
[D 12:32:35.014649] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f461c000 rkey 3a0004b nic_wr_credit 19029.
[D 12:32:35.014668] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f463c000 rkey 3a0004b nic_wr_credit 19030.
[D 12:32:35.014688] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f465c000 rkey 3a0004b nic_wr_credit 19031.
[D 12:32:35.014707] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f467c000 rkey 3a0004b nic_wr_credit 19032.
[D 12:32:35.014726] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f469c000 rkey 3a0004b nic_wr_credit 19033.
[D 12:32:35.014745] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f46bc000 rkey 3a0004b nic_wr_credit 19034.
[D 12:32:35.014764] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f46dc000 rkey 3a0004b nic_wr_credit 19035.
[D 12:32:35.014783] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f46fc000 rkey 3a0004b nic_wr_credit 19036.
[D 12:32:35.014802] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f471c000 rkey 3a0004b nic_wr_credit 19037.
[D 12:32:35.014821] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f473c000 rkey 3a0004b nic_wr_credit 19038.
[D 12:32:35.014841] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f475c000 rkey 3a0004b nic_wr_credit 19039.
[D 12:32:35.014856] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f477c000 rkey 3a0004b nic_wr_credit 19040.
[D 12:32:35.014870] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f479c000 rkey 3a0004b nic_wr_credit 19041.
[D 12:32:35.014883] openib_post_sr_rdmaw: ibv_post_send wr_id 693390 to 
10.1.4.240:46814 remote addr f47bc000 rkey 3a0004b nic_wr_credit 19042.
[D 12:32:35.014899] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f47ec000 rkey 3a0004c nic_wr_credit 19043.
[D 12:32:35.014913] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f480c000 rkey 3a0004c nic_wr_credit 19044.
[D 12:32:35.014927] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f482c000 rkey 3a0004c nic_wr_credit 19045.
[D 12:32:35.014941] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f484c000 rkey 3a0004c nic_wr_credit 19046.
[D 12:32:35.014955] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f486c000 rkey 3a0004c nic_wr_credit 19047.
[D 12:32:35.014969] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f488c000 rkey 3a0004c nic_wr_credit 19048.
[D 12:32:35.014983] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f48ac000 rkey 3a0004c nic_wr_credit 19049.
[D 12:32:35.014996] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f48cc000 rkey 3a0004c nic_wr_credit 19050.
[D 12:32:35.015010] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f48ec000 rkey 3a0004c nic_wr_credit 19051.
[D 12:32:35.015024] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f490c000 rkey 3a0004c nic_wr_credit 19052.
[D 12:32:35.015038] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f492c000 rkey 3a0004c nic_wr_credit 19053.
[D 12:32:35.015052] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f494c000 rkey 3a0004c nic_wr_credit 19054.
[D 12:32:35.015066] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f496c000 rkey 3a0004c nic_wr_credit 19055.
[D 12:32:35.015079] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f498c000 rkey 3a0004c nic_wr_credit 19056.
[D 12:32:35.015093] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f49ac000 rkey 3a0004c nic_wr_credit 19057.
[D 12:32:35.015107] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f49cc000 rkey 3a0004c nic_wr_credit 19058.
[D 12:32:35.015121] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f49ec000 rkey 3a0004c nic_wr_credit 19059.
[D 12:32:35.015135] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4a0c000 rkey 3a0004c nic_wr_credit 19060.
[D 12:32:35.015148] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4a2c000 rkey 3a0004c nic_wr_credit 19061.
[D 12:32:35.015162] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4a4c000 rkey 3a0004c nic_wr_credit 19062.
[D 12:32:35.015176] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4a6c000 rkey 3a0004c nic_wr_credit 19063.
[D 12:32:35.015190] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4a8c000 rkey 3a0004c nic_wr_credit 19064.
[D 12:32:35.015204] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4aac000 rkey 3a0004c nic_wr_credit 19065.
[D 12:32:35.015218] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4acc000 rkey 3a0004c nic_wr_credit 19066.
[D 12:32:35.015232] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4aec000 rkey 3a0004c nic_wr_credit 19067.
[D 12:32:35.015245] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4b0c000 rkey 3a0004c nic_wr_credit 19068.
[D 12:32:35.015261] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4b2c000 rkey 3a0004c nic_wr_credit 19069.
[D 12:32:35.015274] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4b4c000 rkey 3a0004c nic_wr_credit 19070.
[D 12:32:35.015288] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4b6c000 rkey 3a0004c nic_wr_credit 19071.
[D 12:32:35.015302] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4b8c000 rkey 3a0004c nic_wr_credit 19072.
[D 12:32:35.015316] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4bac000 rkey 3a0004c nic_wr_credit 19073.
[D 12:32:35.015329] openib_post_sr_rdmaw: ibv_post_send wr_id 64eab0 to 
10.1.4.240:46814 remote addr f4bcc000 rkey 3a0004c nic_wr_credit 19074.
[D 12:32:35.015345] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4bec000 rkey 3a0004f nic_wr_credit 19075.
[D 12:32:35.015359] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4c0c000 rkey 3a0004f nic_wr_credit 19076.
[D 12:32:35.015373] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4c2c000 rkey 3a0004f nic_wr_credit 19077.
[D 12:32:35.015387] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4c4c000 rkey 3a0004f nic_wr_credit 19078.
[D 12:32:35.015401] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4c6c000 rkey 3a0004f nic_wr_credit 19079.
[D 12:32:35.015414] openib_post_sr_rdmaw: ibv_post_send wr_id 0 to 
10.1.4.240:46814 remote addr f4c8c000 rkey 3a0004f nic_wr_credit 19080.
[E 12:32:35.015429] openib_post_sr_rdmaw: ibv_post_send failed ret: 
-1001 errno: 0
[E 12:32:35.015442]  wr_id: 0x0 next: (nil) sg_list 0x65a970 num_sge 1
[E 12:32:35.015456]  opcode: 0x0 send_flags: 0x0 imm_data: 0x0
[E 12:32:35.015468]  sr.wr.rdma.remote_addr: 0xf4c8c000 rkey 0x3a0004f
[E 12:32:35.015480]  od->nic_wr_credit 19081 od->nic_max_wr 65535
[E 12:32:35.015913] openib_post_sr_rdmaw: QP_request sge: 1
[E 12:32:35.015961] Error: openib_post_sr_rdmaw: QP_sge: 28
: Unknown error 18446744073709550615.


