[Pvfs2-developers] Re: halloween bug (possible patch update)
pcarns at wastedcycles.org
Wed Oct 17 13:46:05 EDT 2007
Sam Lang wrote:
> Hi Phil,
> Thanks for the patch and the quick turnaround. I've made some comments
> inline. You're much more familiar with the design of the code (having
> written it :-), so maybe I'm missing the big picture, but I wonder if
> this is more workaround then we need right now.
> On Oct 17, 2007, at 10:33 AM, Phil Carns wrote:
>>> Actually, one more follow up on this specific patch to wrap up. The
>>> problem that I suspected actually does not occur, although it may
>>> have been by luck :) There is a tcp_addr_data->bmi_addr field that
>>> gets used as an argument to the forget_callback() function. That
>>> bmi_addr field is set to zero unless it is filled in by the
>>> addr_reg_callback() function. That means that on the client side,
>>> the forget_callback() function actually just fails because it
>>> searches for a BMI_addr_t of value 0 in the reference list.
>>> I just tested it out, and pvfs2-client-core was able to recover from
>>> a network error just fine with the patch in place.
>> I still agree that it would probably be better in the long run to
>> eliminate the bmi ref list, but I kept looking at the current
>> situation to get a fix in the meantime. The attached patch builds on
>> what is currently in trunk with a few changes:
>> - removes additional ref list lock that was added to
>> addr_list_callback() with one of the recent commits. I think this
>> cleanup was not directly related to the halloween bug. It appears
>> reasonable, but it introduced a lock ordering problem between the ref
>> list mutex and bmi_tcp's interface mutex.
>> - The new forget_callback() function had three possible issues: the
>> same lock ordering issue as in the above bullet (although the recent
>> commit to release the interface mutex may have also fixed it),
>> the possibility of deleting an address with a non-zero reference count,
> There's no way to send the response to the client once that connection
> is lost, so why keep the address reference around? The forget_callback
> was only getting called if the connection had closed. If a request was
> in progress (the only way the reference count could be nonzero), then
> an error is going to occur either way -- when the decrement is
> attempted, or the response send is attempted.
Before the id registration in BMI was done using the "safe" functions,
this was done to keep the server from attempting to communicate on an
address that doesn't exist. Nowadays it probbly doesn't matter- you
will get a graceful error either way. One minor issue, though, is that
if you keep the address around until communication is finished you will
get the correct error message in the logs for each attempted
communication (like connection reset, etc.). If the address is
destroyed, then the remaining attempts to communicate on it will always
get EPROTO. I don't know if that matters to anyone, though.
>> and also I think in bmi_tcp it didn't take care of freeing bmi_tcp's
>> internal method address.
> forget_callback is getting called before the method address is cleaned
> up and deallocated locally. Do you mean the forget callback doesn't
> call back into the method to do the cleanup?
I don't think the method address is being deallocated locally, if you
are talking about the tcp_forget_addr() function. It only does a
deallocation if the "dealloc_flag" is set. That flag is not set on any
error cases internal to bmi_tcp. On clients, this is done so that it
can just reuse the same address structure and re-open the socket when
communication is retried. On servers, this doesn't serve much purpose
except for stashing the error code for any pending communication
attempts (see above).
>> I changed the forget_callback implementation to just mark the
>> target addresses in a queue for the bmi control layer to look at
>> later rather than trying to immediately free the address. I think
>> this fixes all three problems by having the control layer be the
>> component that drives the address release from the top down (using
>> the same mechanism already in place for when the reference count
>> reaches zero).
> I have to be honest, adding another queue to manage addresses seems
> like heading in the wrong direction. It pushes the reference remove to
> some point in the future, when a separate request is going to get hit
> with needing to remove other references (the reference decrement blocks
> the state action). Depending on the number of operations in play, I
> could imagine random operations getting hit with a noticeable latency
> having to remove a bunch of other references from the list. I hear
> what your saying about a top-down approach, but that's not consistent
> with the register_callback triggering (from the bmi-tcp layer) the
> addition of a reference on a new connection event.
> I guess in general I would argue that the methods are the event
> producers, and they choose to either queue events for the job layer to
> manage or handle them internally. The middle bmi wrapper layer
> shouldn't also be managing things there. I know you're not arguing
> against that, but as a temporary fix, I'd rather see events get handled
> directly rather than queued by the bmi wrappers for later.
Yeah, I understand where you are coming from. I think it all goes back
to getting rid of the reference list to get rid of some of these issues
about who controls it. Maybe that is something that we can start
figuring out soon. I will be happy to help- I just was leary of
tackling that right now. As it is architected stands at the moment I
don't see a particularly elegant way to have both components (control
layer and the method) coordinate destroying those addresses.
>> - moved the code in the set_info() function for dropping a bmi
>> address into a subroutine so that the same code path is used for both
>> the forget_callback() and the set_info(BMI_DROP_ADDR) case.
>> Basically I think we already had a code path that did the right thing
>> (in terms of bmi_tcp) in place to get triggered when a bmi addr
>> reference count hit zero. The problem in this slowdown situation,
>> though, was that the socket was being closed _after_ the ref count
>> hit zero.
>> There was never any activity on the addr after that to trigger the
>> cleanup. The forget_callback() essentially is now giving methods a
>> way to tell the control layer to consider dropping the address again
>> independent of what the reference count has done.
> But what's to consider if the connection is closed?
It considers two things- is the reference count zero, and does the
method in question think the address should be destroyed or kept for
future use (this is probed using the get_info(BMI_DROP_ADDR_QUERY) call
into the method). All of the methods except for bmi_tcp ignore the
BMI_DROP_ADDR_QUERY because they don't have this particular problem
about how to dispose of addresses.
>> I'm still testing on this some, but I wanted to go ahead and throw it
>> out there for discussion.
> Again, thanks for the patch. If it turns out to fix problems and pass
> tests, I'm cool with adding it -- I guess I just like to argue. :-)
This is all good discussion- I think its great information to have for
doing a more thorough reworking of how the bmi addresses are handled.
More information about the Pvfs2-developers