[Pvfs2-developers] Re: the halloween bug fixed
Phil Carns
pcarns at wastedcycles.org
Thu Oct 11 13:28:19 EDT 2007
>> Thanks for tracking this down! I've been out of the office for a few
>> days so I am a little behind on the conversation. I have a comment
>> on the patch, though.
>>
>> I think that for the most part, it is calling
>> bmi_method_addr_forget_callback() on pretty much any tcp network
>> error, since it is being invoked from the tcp_forget_addr() function
>> which is a general purpose function.
>>
>> This is fine on the server side, but I suspect (haven't been able to
>> confirm yet) that it would cause problems on the client side.
>> Clients resolve a tcp://servername:3334 address into a
>> PVFS_BMI_addr_t once and then hang onto it forever. If a network
>> error occurs, then the state machines do not re-resolve it; they will
>> just retry communication on the same PVFS_BMI_addr_t, which will have
>> been invalidated by the forget_callback() function.
>>
>> Would it be better to only call bmi_method_addr_forget_callback() on
>> addresses within bmi_tcp that were registered using
>> bmi_method_addr_reg_callback()? I haven't looked yet, but that may
>> require an extra flag somewhere to record which addresses this
>> applies to. That way, the only callers invalidating these addresses
>> on errors will be servers, which have anonymous addresses that need to
>> be cleaned out. Clients with long-lived address resolutions would not be
>> affected. It also makes a little more sense from an API point of
>> view if these two functions are companions that are called in the
>> same scenario.
>>
> Hi Phil,
>
> I can add a server flag to the tcp addr struct, and only call
> forget_addr in that case, but it seems like a bit of a hack.
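
For reference, the flag idea discussed above could look roughly like the sketch below. This is only an illustration under assumptions: the struct, the flag name, and the flag_-prefixed functions are all made up here and are not the real bmi_tcp definitions. The point is just that the general-purpose forget path would invoke the forget callback only for addresses that went through the registration path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the tcp addr struct; not the real bmi_tcp type. */
struct flag_tcp_addr {
    bool registered;   /* assumed flag: set only when the reg callback was used */
    int forget_calls;  /* counts callback invocations, for illustration only */
};

/* Stand-in for the forget callback. */
void flag_forget_callback(struct flag_tcp_addr *a)
{
    a->forget_calls++;
}

/* Stand-in for the registration path (server side, anonymous addresses). */
void flag_register(struct flag_tcp_addr *a)
{
    a->registered = true;
}

/* Stand-in for the general-purpose forget function that runs on errors. */
void flag_tcp_forget_addr(struct flag_tcp_addr *a)
{
    /* generic cleanup for the address would happen here */
    if (a->registered) {
        /* only addresses that were registered get invalidated upstream */
        flag_forget_callback(a);
        a->registered = false;
    }
}
```

With something like this, a client address that was resolved normally would never have the flag set, so a network error would leave its upper-level handle alone.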

Actually, one more follow-up on this specific patch to wrap up. The
problem that I suspected actually does not occur, although it may have
been by luck :) There is a tcp_addr_data->bmi_addr field that gets used
as an argument to the forget_callback() function. That bmi_addr field
is set to zero unless it is filled in by the addr_reg_callback()
function. That means that on the client side, the forget_callback()
function actually just fails because it searches for a BMI_addr_t of
value 0 in the reference list.
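
To make the reasoning above concrete, here is a small self-contained model of it. Everything in it is a hypothetical stand-in (the ref_ names, the fixed-size list, and the assumption that valid handles start at 1), not the actual BMI reference list code. It only shows why a forget request for a handle of 0 finds nothing and leaves the client's resolved address intact.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real PVFS2 definitions. */
typedef int64_t ref_addr_t;               /* plays the role of BMI_addr_t */

struct ref_tcp_addr_data {
    ref_addr_t bmi_addr;                  /* stays 0 unless the reg path sets it */
};

struct ref_entry {
    ref_addr_t addr;
    int in_use;
};

#define REF_MAX 8
static struct ref_entry ref_list[REF_MAX];
static ref_addr_t ref_next = 1;           /* assumption: handles start at 1 */

/* Add a new entry to the reference list and return its handle (0 if full). */
ref_addr_t ref_add(void)
{
    for (size_t i = 0; i < REF_MAX; i++) {
        if (!ref_list[i].in_use) {
            ref_list[i].addr = ref_next++;
            ref_list[i].in_use = 1;
            return ref_list[i].addr;
        }
    }
    return 0;
}

/* Client-style resolve: the caller keeps the handle; d->bmi_addr stays 0. */
ref_addr_t ref_client_resolve(struct ref_tcp_addr_data *d)
{
    (void)d;
    return ref_add();
}

/* Server-style registration: the method-level struct records the handle. */
ref_addr_t ref_addr_reg(struct ref_tcp_addr_data *d)
{
    d->bmi_addr = ref_add();
    return d->bmi_addr;
}

/* Returns 1 if the handle is still in the reference list, else 0. */
int ref_is_known(ref_addr_t addr)
{
    for (size_t i = 0; i < REF_MAX; i++) {
        if (ref_list[i].in_use && ref_list[i].addr == addr)
            return 1;
    }
    return 0;
}

/* Forget a handle: returns 0 on success, -1 if it is not in the list. */
int ref_forget(ref_addr_t addr)
{
    for (size_t i = 0; i < REF_MAX; i++) {
        if (ref_list[i].in_use && ref_list[i].addr == addr) {
            ref_list[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

In this model, calling ref_forget() with a client's tcp-level bmi_addr (still 0) returns -1 and the handle the client is holding remains valid, while the same call for a registered server-side address removes it.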

I just tested it out, and pvfs2-client-core was able to recover from a
network error just fine with the patch in place.
-Phil