[Pvfs2-users] OpenIB/kernel interface: null pointer dereference in put_back_slot

Kyle Schochenmaier kschoche at scl.ameslab.gov
Tue Mar 20 12:12:45 EST 2007


According to the log, you're getting IBV_WC_WR_FLUSH_ERR returned by the 
check_cq function, which does all the polling for openIB.
The IB spec says this about the error:
"Work Request Flushed Error - A Work Request was in process or 
outstanding when the QP transitioned into the Error State."
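
To make the flush semantics concrete, here is a minimal sketch (not the
PVFS check_cq code) of how a batch of completions could be sorted into
successes, flushes, and other errors. The assumption is that
IBV_WC_WR_FLUSH_ERR entries are only a symptom of the QP already being in
the Error state, so a completion with any other error status, if one shows
up, is the more likely pointer to the cause. The names wc_summary and
summarize_completions are hypothetical, and the fallback definitions exist
only so the sketch compiles on a machine without <infiniband/verbs.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use the real libibverbs types when the header is installed. */
#if defined(__has_include)
#  if __has_include(<infiniband/verbs.h>)
#    include <infiniband/verbs.h>
#    define HAVE_VERBS_H 1
#  endif
#endif

#ifndef HAVE_VERBS_H
/* Fallback subset (assumption: for compilation only; the numeric values
 * and the field list are not meant to match the real header). */
enum ibv_wc_status {
    IBV_WC_SUCCESS = 0,
    IBV_WC_WR_FLUSH_ERR,
    IBV_WC_RETRY_EXC_ERR
};
struct ibv_wc {
    uint64_t wr_id;
    enum ibv_wc_status status;
};
#endif

/* Hypothetical summary type; not part of PVFS or libibverbs. */
struct wc_summary {
    int ok;                           /* IBV_WC_SUCCESS completions      */
    int flushed;                      /* IBV_WC_WR_FLUSH_ERR completions */
    int other_errors;                 /* every other error status        */
    const struct ibv_wc *first_cause; /* first non-flush error, or NULL  */
};

/* Walk n completions in the order they were polled and count them by kind. */
static struct wc_summary summarize_completions(const struct ibv_wc *wc, int n)
{
    struct wc_summary s = { 0, 0, 0, NULL };

    assert(wc != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        switch (wc[i].status) {
        case IBV_WC_SUCCESS:
            s.ok++;
            break;
        case IBV_WC_WR_FLUSH_ERR:
            s.flushed++;
            break;
        default:
            s.other_errors++;
            if (s.first_cause == NULL)
                s.first_cause = &wc[i];
            break;
        }
    }
    return s;
}
```

In a real polling loop the array would be filled by ibv_poll_cq(cq, n, wc).
If first_cause stays NULL and only flushes are seen, as with the RECV
entries in the log, the triggering error may have been reported elsewhere,
for example on the send side or on the peer.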

It doesn't go any further into the details of this error, but generally 
whenever the QP is sent into an error state, 
it is considered to be a fatal error by most of the IB community 
(correct me if I'm wrong, please).
This leads me to believe that you may still have underlying network 
problems.
Have you been able to successfully run the various openIB test programs 
like ibv_rc_pingpong, or possibly tried the latest NetPIPE release, 
which has openIB support? (It may not give a pretty answer other than 
crashing if you have network problems, though :-/ )

If the network ends up not being the problem, we've got a serious 
problem here in the code, as we should never be putting the QP into 
erroneous states.

Also, Pete, the spec doesn't say anything about async errors being 
flagged for an error like this. Is this a case where we might be able to 
get useful information about the QP before or as it goes into an error 
state via async events?
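
As a rough sketch of what watching for such events could involve:
libibverbs does define QP-affiliated events (IBV_EVENT_QP_FATAL,
IBV_EVENT_QP_REQ_ERR, IBV_EVENT_QP_ACCESS_ERR) alongside port and device
events. Whether any of them fires for this failure is exactly the open
question, so the code below only shows how incoming event types could be
bucketed. The names event_class and classify_async_event are hypothetical,
and the fallback enum is there only so the sketch compiles without
<infiniband/verbs.h>.

```c
/* Use the real libibverbs event types when the header is installed. */
#if defined(__has_include)
#  if __has_include(<infiniband/verbs.h>)
#    include <infiniband/verbs.h>
#    define HAVE_VERBS_H 1
#  endif
#endif

#ifndef HAVE_VERBS_H
/* Fallback subset (assumption: for compilation only; the numeric values
 * are not meant to match the real header). */
enum ibv_event_type {
    IBV_EVENT_CQ_ERR,
    IBV_EVENT_QP_FATAL,
    IBV_EVENT_QP_REQ_ERR,
    IBV_EVENT_QP_ACCESS_ERR,
    IBV_EVENT_COMM_EST,
    IBV_EVENT_DEVICE_FATAL,
    IBV_EVENT_PORT_ACTIVE,
    IBV_EVENT_PORT_ERR
};
#endif

/* Hypothetical buckets; not part of PVFS or libibverbs. */
enum event_class {
    EV_QP_ERROR,        /* a QP is heading into the Error state    */
    EV_LINK_OR_DEVICE,  /* port or HCA trouble, i.e. network-level */
    EV_OTHER            /* everything else                         */
};

/* Map an async event type to one of the buckets above. */
static enum event_class classify_async_event(enum ibv_event_type type)
{
    switch (type) {
    case IBV_EVENT_QP_FATAL:
    case IBV_EVENT_QP_REQ_ERR:
    case IBV_EVENT_QP_ACCESS_ERR:
        return EV_QP_ERROR;
    case IBV_EVENT_PORT_ERR:
    case IBV_EVENT_DEVICE_FATAL:
        return EV_LINK_OR_DEVICE;
    default:
        return EV_OTHER;
    }
}
```

Events would come from ibv_get_async_event(context, &event), with
event.event_type passed to the classifier and each event released with
ibv_ack_async_event(&event). A port or device event arriving close to the
flush errors would support the network-problem theory.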

Kyle

Kyle Schochenmaier wrote:
> This is actually an error propagating up from openIB, not pvfs.  I've 
> never seen the error before, and I'm not sure if it is a fatal error 
> or something that we can handle inside pvfs. I'll have to look at the 
> IB spec and see if we can generate a patch for this.
>
> [E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV
> error IBV_WC_WR_FLUSH_ERR.
>
> Kyle
>
>
> Tad Kollar wrote:
>> Pete Wyckoff wrote:
>>  
>>> Have you been able to use, say, pvfs2-cp to put files into PVFS over
>>> IB?  That will help us know if it's a kernel problem or an IB
>>> problem, perhaps.
>>>       
>> After getting your reply I set up a test that used pvfs2-cp to copy a
>> 2.5G file back and forth a total of 30 times. During that process,
>> pvfs2-cp generated these three errors, always during the read back from
>> the pvfs2 fs:
>>
>> [E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV
>> error IBV_WC_WR_FLUSH_ERR.
>> [E 15:44:43.924115]     [bt] pvfs2-cp(error+0xca) [0x44a1ca]
>> [E 15:44:43.924161]     [bt] pvfs2-cp [0x448dc3]
>> [E 15:44:43.924171]     [bt] pvfs2-cp [0x4492c6]
>> [E 15:44:43.924179]     [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
>> [E 15:44:43.924187]     [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
>> [0x43c054]
>> [E 15:44:43.924195]     [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
>> [E 15:44:43.924204]     [bt]
>> pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
>> [E 15:44:43.924211]     [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
>> [E 15:44:43.924220]     [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
>> [E 15:44:43.924228]     [bt] pvfs2-cp(main+0x372) [0x40d792]
>> [E 15:44:43.924236]     [bt] /lib/libc.so.6(__libc_start_main+0xda)
>> [0x2aaaab0784ca]
>>
>> [E 09:06:20.511281] Error: ib_check_cq: entry id 0x5e83f0 opcode RECV
>> error IBV_WC_WR_FLUSH_ERR.
>> [E 09:06:21.104063]     [bt] pvfs2-cp(error+0xca) [0x44a1ca]
>> [E 09:06:21.104112]     [bt] pvfs2-cp [0x448dc3]
>> [E 09:06:21.104120]     [bt] pvfs2-cp [0x4492c6]
>> [E 09:06:21.104128]     [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
>> [E 09:06:21.104136]     [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
>> [0x43c054]
>> [E 09:06:21.104143]     [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
>> [E 09:06:21.104151]     [bt]
>> pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
>> [E 09:06:21.104158]     [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
>> [E 09:06:21.104165]     [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
>> [E 09:06:21.104173]     [bt] pvfs2-cp(main+0x372) [0x40d792]
>> [E 09:06:21.104181]     [bt] /lib/libc.so.6(__libc_start_main+0xda)
>> [0x2aaaab0784ca]
>>
>> [E 09:09:46.596001] Error: ib_check_cq: entry id 0x5c4cc0 opcode RECV
>> error IBV_WC_WR_FLUSH_ERR.
>> [E 09:09:47.109736]     [bt] pvfs2-cp(error+0xca) [0x44a1ca]
>> [E 09:09:47.109790]     [bt] pvfs2-cp [0x448dc3]
>> [E 09:09:47.109799]     [bt] pvfs2-cp [0x4492c6]
>> [E 09:09:47.109807]     [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
>> [E 09:09:47.109816]     [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
>> [0x43c054]
>> [E 09:09:47.109823]     [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
>> [E 09:09:47.109831]     [bt]
>> pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
>> [E 09:09:47.109840]     [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
>> [E 09:09:47.109847]     [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
>> [E 09:09:47.109856]     [bt] pvfs2-cp(main+0x372) [0x40d792]
>> [E 09:09:47.109863]     [bt] /lib/libc.so.6(__libc_start_main+0xda)
>> [0x2aaaab0784ca]
>>  
>>> The other interesting thing to know is if you can reconfigure PVFS to
>>> use only TCP, then run your bonnie test and get the same error.
>>>       
>> Except for IB testing, I've had TCP specified in the pvfs2tab and mount
>> options and haven't been able to disrupt it; is that sufficient or
>> should I remove all references to IB? I repeated the pvfs2-cp using TCP
>> and didn't receive any errors.
>>
>> Tad
>> _______________________________________________
>> Pvfs2-users mailing list
>> Pvfs2-users at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>
>>
>>
>>   
>


