[Pvfs2-users] OpenIB/kernel interface: null pointer dereference in put_back_slot
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Tue Mar 20 12:12:45 EST 2007
According to the log, you're getting IBV_WC_WR_FLUSH_ERR returned by the
check_cq function, which does all the polling for OpenIB.
The IB spec says this about the error:
"Work Request Flushed Error - A Work Request was in process or
outstanding when the QP transitioned into the Error State."
It doesn't go any further into the details of this error, but generally,
whenever the QP goes into an error state, most of the IB community
considers it a fatal error.
(Correct me if I'm wrong, please.)
This leads me to believe that you may still have underlying network
problems.
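One way to check whether the flush errors are isolated or keep recurring is to summarize the client log. The following is a minimal sketch, not anything from the PVFS tree: the function name summarize_cq_errors is hypothetical, and the line format is assumed to match the ib_check_cq error lines quoted further down in this thread (raw log text, without the mail quote prefixes).

```python
import re
from collections import Counter

# Matches the error line format seen in this thread, for example:
# [E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV error IBV_WC_WR_FLUSH_ERR.
# \s+ before "error" tolerates a line wrap between the opcode and the status.
CQ_ERROR = re.compile(
    r"\[E (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\] Error: ib_check_cq: "
    r"entry id (?P<entry>0x[0-9a-fA-F]+) opcode (?P<opcode>\w+)\s+"
    r"error (?P<status>\w+)\."
)


def summarize_cq_errors(log_text):
    """Return (events, counts) for every ib_check_cq error in log_text.

    events is a list of dicts with keys time, entry, opcode, status.
    counts is a Counter keyed by (opcode, status).
    Backtrace lines ([bt] ...) do not match and are ignored.
    """
    events = [m.groupdict() for m in CQ_ERROR.finditer(log_text)]
    counts = Counter((e["opcode"], e["status"]) for e in events)
    return events, counts
```

Passing the contents of a client log to summarize_cq_errors gives one entry per completion error, so it is easy to see whether every failure is a RECV flush and how far apart in time they occur.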
Have you been able to successfully run the various OpenIB test programs
like ibv_rc_pingpong, or possibly tried the latest NetPIPE release,
which has OpenIB support? (It may not give a pretty answer other than
crashing if you have network problems, though :-/ )
If the network ends up not being the problem, we've got a serious
problem here in the code, as we should never be putting the QP into
an error state.
Also, Pete, the spec doesn't say anything about async errors being
flagged for an error like this. Is this a case where we might be able to
get useful information about the QP before or as it goes into an error
state via async events?
Kyle
Kyle Schochenmaier wrote:
> This is actually an error propagating up from openIB, not pvfs. I've
> never seen the error before, and I'm not sure if it is a fatal error
> or something that we can handle inside pvfs, I'll have to look at the
> IB spec and see if we can generate a patch for this.
>
> [E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV
> error IBV_WC_WR_FLUSH_ERR.
>
> Kyle
>
>
> Tad Kollar wrote:
>> Pete Wyckoff wrote:
>>
>>> Have you been able to use, say, pvfs2-cp to put files into PVFS over
>>> IB? That will help us know if it's a kernel problem or an IB
>>> problem, perhaps.
>>>
>> After getting your reply I set up a test that used pvfs2-cp to copy a
>> 2.5G file back and forth a total of 30 times. During that process,
>> pvfs2-cp generated these three errors, always during the read back from
>> the pvfs2 fs:
>>
>> [E 15:44:43.719270] Error: ib_check_cq: entry id 0x5c4e70 opcode RECV
>> error IBV_WC_WR_FLUSH_ERR.
>> [E 15:44:43.924115] [bt] pvfs2-cp(error+0xca) [0x44a1ca]
>> [E 15:44:43.924161] [bt] pvfs2-cp [0x448dc3]
>> [E 15:44:43.924171] [bt] pvfs2-cp [0x4492c6]
>> [E 15:44:43.924179] [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
>> [E 15:44:43.924187] [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
>> [0x43c054]
>> [E 15:44:43.924195] [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
>> [E 15:44:43.924204] [bt]
>> pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
>> [E 15:44:43.924211] [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
>> [E 15:44:43.924220] [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
>> [E 15:44:43.924228] [bt] pvfs2-cp(main+0x372) [0x40d792]
>> [E 15:44:43.924236] [bt] /lib/libc.so.6(__libc_start_main+0xda)
>> [0x2aaaab0784ca]
>>
>> [E 09:06:20.511281] Error: ib_check_cq: entry id 0x5e83f0 opcode RECV
>> error IBV_WC_WR_FLUSH_ERR.
>> [E 09:06:21.104063] [bt] pvfs2-cp(error+0xca) [0x44a1ca]
>> [E 09:06:21.104112] [bt] pvfs2-cp [0x448dc3]
>> [E 09:06:21.104120] [bt] pvfs2-cp [0x4492c6]
>> [E 09:06:21.104128] [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
>> [E 09:06:21.104136] [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
>> [0x43c054]
>> [E 09:06:21.104143] [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
>> [E 09:06:21.104151] [bt]
>> pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
>> [E 09:06:21.104158] [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
>> [E 09:06:21.104165] [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
>> [E 09:06:21.104173] [bt] pvfs2-cp(main+0x372) [0x40d792]
>> [E 09:06:21.104181] [bt] /lib/libc.so.6(__libc_start_main+0xda)
>> [0x2aaaab0784ca]
>>
>> [E 09:09:46.596001] Error: ib_check_cq: entry id 0x5c4cc0 opcode RECV
>> error IBV_WC_WR_FLUSH_ERR.
>> [E 09:09:47.109736] [bt] pvfs2-cp(error+0xca) [0x44a1ca]
>> [E 09:09:47.109790] [bt] pvfs2-cp [0x448dc3]
>> [E 09:09:47.109799] [bt] pvfs2-cp [0x4492c6]
>> [E 09:09:47.109807] [bt] pvfs2-cp(BMI_testcontext+0x151) [0x433371]
>> [E 09:09:47.109816] [bt] pvfs2-cp(PINT_thread_mgr_bmi_push+0x144)
>> [0x43c054]
>> [E 09:09:47.109823] [bt] pvfs2-cp(job_testcontext+0x15a) [0x43b87a]
>> [E 09:09:47.109831] [bt]
>> pvfs2-cp(PINT_client_state_machine_test+0x98) [0x40ff88]
>> [E 09:09:47.109840] [bt] pvfs2-cp(PVFS_sys_wait+0x63) [0x4103b3]
>> [E 09:09:47.109847] [bt] pvfs2-cp(PVFS_sys_io+0x6b) [0x41635b]
>> [E 09:09:47.109856] [bt] pvfs2-cp(main+0x372) [0x40d792]
>> [E 09:09:47.109863] [bt] /lib/libc.so.6(__libc_start_main+0xda)
>> [0x2aaaab0784ca]
>>
>>> The other interesting thing to know is if you can reconfigure PVFS to
>>> use only TCP, then run your bonnie test and get the same error.
>>>
>> Except for IB testing, I've had TCP specified in the pvfs2tab and mount
>> options and haven't been able to disrupt it; is that sufficient or
>> should I remove all references to IB? I repeated the pvfs2-cp using TCP
>> and didn't receive any errors.
>>
>> Tad