[Pvfs2-developers] Re: segfault in openib code

Kyle Schochenmaier kschoche at scl.ameslab.gov
Mon Sep 18 09:08:40 EDT 2006


Pete Wyckoff wrote:
> kschoche at scl.ameslab.gov wrote on Fri, 15 Sep 2006 09:58 -0500:
>   
>> I've been trying to debug some issues with my MD server going down, or 
>> rather timing out and closing the connections for some reason, and 
>> canceling bmi jobs.  While doing so, I ran into a segfaulting issue in 
>> openib_close_connection:
>>     
>
> This job cancel scenario is tricky.  I plan to look at this soon, to
> figure out all the possible cases where things can hang up, but it
> won't be trivial.
>
>   
>> static void openib_close_connection(ib_connection_t *c)
>> {
>>    int ret;
>>    struct openib_connection_priv *oc = c->priv;
>>
>>    /* destroy the queue pairs */
>>
>> <snip>
>>
>>    free(oc);
>> }
>>
>> Since my gdb backtrace doesnt go into any ibv_* functions, I'm assuming 
>> this free() call is the culprit.
>> I'm not sure why this free() could be getting into a segfault, but I'm 
>> thinking it may be a good idea for now until we can work out why it's 
>> closing the connections, to put a check in there to make sure oc is 
>> still valid.
>>
>> Has anyone run into this or other issues with servers going down in openib?
>>     
>
> Something stomped on the malloc arena.  oc should be valid as
> mediated by the c->refcnt higher up.  I'm not sure how to test
> for validity.  There should be some interesting log messages
> though, like
>
>     debug(2, "%s: closing connection to %s", __func__, c->peername);
>
> or 
>
>     debug(1, "%s: refcnt non-zero %d, delaying free", __func__, c->refcnt);
>
> that may help us to figure this out if you turned on network
> logging.  (Please, more logs!)
>
> 		-- Pete
>
>
>   
Doh, I thought I had a log included here, I guess not, here's the log, 
note that the second close connection message?


[D 10:33:03.279050] memcache_register: hit [0] 0x2aaaaadc3010 len 262144 
(via 0x
2aaaaadc3010 len 262144) refcnt now 2.
[D 10:33:03.279060] BMI_post_send: addr: 5115, offset: 0x2aaaaae04010, 
size: 262
144, tag: 14
[D 10:33:03.279071] memcache_register: hit [0] 0x2aaaaae04010 len 262144 
(via 0x
2aaaaae04010 len 262144) refcnt now 2.
[D 10:33:03.279081] BMI_post_send: addr: 5115, offset: 0x2aaaaae45010, 
size: 131
072, tag: 14
[D 10:33:03.279092] memcache_register: hit [0] 0x2aaaaae45010 len 131072 
(via 0x
2aaaaae45010 len 262144) refcnt now 2.
[E 10:33:33.839001] Job time out: cancelling flow operation, job_id: 5152.
[E 10:33:33.839028] Flow proto cancel called on 0x2aaaaac00920
[E 10:33:33.839041] Flow proto error cleanup started on 0x2aaaaac00920, 
error_co
de: -1610612737
[D 10:33:33.839067] BMI_cancel: cancel id 5154
[D 10:33:33.839127] memcache_deregister: dec refcount [0] 0x2aaaaab91010 
len 262
144 count now 1.
[D 10:33:33.839146] memcache_deregister: dec refcount [0] 0x2aaaaad41010 
len 262
144 count now 1.
[D 10:33:33.839154] memcache_deregister: dec refcount [0] 0x2aaaaad82010 
len 262
144 count now 1.
[D 10:33:33.839162] memcache_deregister: dec refcount [0] 0x2aaaaadc3010 
len 262
144 count now 1.
[D 10:33:33.839170] memcache_deregister: dec refcount [0] 0x2aaaaae04010 
len 262
144 count now 1.
[D 10:33:33.839178] memcache_deregister: dec refcount [0] 0x2aaaaae45010 
len 131
072 count now 1.
[D 10:33:33.839189] BMI_cancel: cancel id 5155
[D 10:33:33.839200] BMI_cancel: cancel id 5158
[D 10:33:33.839210] BMI_cancel: cancel id 5159
[D 10:33:33.839221] BMI_cancel: cancel id 5157
[D 10:33:33.839232] BMI_cancel: cancel id 5156
[D 10:33:33.839242] BMI_cancel: cancel id 5153
[E 10:33:33.839253] Flow proto 0x2aaaaac00920 canceling a total of 7 BMI 
or Trov
e operations
[D 10:33:33.839260] test_sq: sq 0x603e60 cancelled.
[D 10:33:33.839282] test_sq: sq 0x603ef0 cancelled.
[D 10:33:33.839291] test_sq: sq 0x604100 cancelled.
[D 10:33:33.839299] test_sq: sq 0x6044c0 cancelled.
[D 10:33:33.839308] test_sq: sq 0x60bb50 cancelled.
[D 10:33:33.839319] BMI_testcontext completing: 5153
[D 10:33:33.839327] BMI_testcontext completing: 5154
[D 10:33:33.839334] BMI_testcontext completing: 5155
[D 10:33:33.839342] BMI_testcontext completing: 5156
[D 10:33:33.839349] BMI_testcontext completing: 5157
[D 10:33:33.839365] test_sq: sq 0x60bcc0 cancelled.
[D 10:33:33.839374] test_sq: sq 0x60be30 cancelled.
[D 10:33:33.839382] ib_close_connection: closing connection to 
10.1.5.218:59985.
[D 10:33:33.839963] ib_close_connection: closing connection to 0>`.

   
I'm not sure why we're getting two 'closing connection to' messages -- 
indeed this may be our problem now that I think of it, someone may be 
trying to close it twice -- but I can tell you that I always see two of 
these messages??

I'm also still seeing weird error codes, and am not sure if we've 
addressed this yet, but doubt thats our real problem.

hope this helps,

    -- Kyle

-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 



More information about the Pvfs2-developers mailing list