[Pvfs2-developers] Re: the halloween bug fixed

Phil Carns pcarns at wastedcycles.org
Thu Oct 11 13:28:19 EDT 2007


>> Thanks for tracking this down!  I've been out of the office for a few
>> days so I am a little behind on the conversation.  I have a comment
>> on the patch, though.
>>
>> I think that for the most part, it is calling
>> bmi_method_addr_forget_callback() on pretty much any tcp network
>> error, since it is being invoked from the tcp_forget_addr() function
>> which is a general purpose function.
>>
>> This is fine on the server side, but I suspect (haven't been able to
>> confirm yet) that it would cause problems on the client side.
>> Clients resolve a tcp://servername:3334 address into a
>> PVFS_BMI_addr_t once and then hang onto it forever.  If a network
>> error occurs, then the state machines do not re-resolve it; they will
>> just retry communication on the same PVFS_BMI_addr_t, which will have
>> been invalidated by the forget_callback() function.
>>
>> Would it be better to only call bmi_method_addr_forget_callback() on
>> addresses within bmi_tcp that were registered using
>> bmi_method_addr_reg_callback()?  I haven't looked yet, but that may
>> require an extra flag somewhere to record which addresses this
>> applies to.  That way, the only person invalidating these things on
>> errors will be servers which have anonymous addresses that need to be
>> cleaned out.  Clients with long lived address resolutions would not be
>> affected.  It also makes a little more sense from an API point of
>> view if these two functions are companions that are called in the
>> same scenario.
>>
> Hi Phil,
> 
> I can add a server flag to the tcp addr struct, and only call
> forget_addr in that case, but it seems like a bit of a hack.

Actually, one more follow up on this specific patch to wrap up.  The 
problem that I suspected actually does not occur, although it may have 
been by luck :)  There is a tcp_addr_data->bmi_addr field that gets used 
as an argument to the forget_callback() function.  That bmi_addr field 
is set to zero unless it is filled in by the addr_reg_callback() 
function.  That means that on the client side, the forget_callback() 
function actually just fails because it searches for a BMI_addr_t of 
value 0 in the reference list.
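
To make the reasoning above concrete, here is a minimal sketch in C of
why the zero-valued handle protects the client.  It is not the actual
bmi_tcp code: every name prefixed with sketch_ is hypothetical, the
reference list is reduced to a small array, and the assumption that
registered handles are always nonzero is mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BMI_addr_t; 0 means "never registered". */
typedef int64_t sketch_addr_t;

#define SKETCH_MAX_REFS 8

/* Toy reference list: a fixed array of live address handles. */
struct sketch_ref_list {
    sketch_addr_t entries[SKETCH_MAX_REFS];
    size_t count;
};

/* Toy per-address tcp data; bmi_addr stays 0 unless registration
 * fills it in (assumed to mirror tcp_addr_data->bmi_addr). */
struct sketch_tcp_addr_data {
    sketch_addr_t bmi_addr;
};

/* Handles start at 1, so 0 can never appear in the list (assumption). */
static sketch_addr_t sketch_next_handle = 1;

/* Models the registration path: assign a nonzero handle and record it.
 * Returns 0 on success, -1 if the list is full. */
static int sketch_addr_reg(struct sketch_ref_list *list,
                           struct sketch_tcp_addr_data *d)
{
    if (list->count >= SKETCH_MAX_REFS)
        return -1;
    d->bmi_addr = sketch_next_handle++;
    list->entries[list->count++] = d->bmi_addr;
    return 0;
}

/* Models the forget path: returns 0 if the handle was found and
 * removed, -1 if it is not in the list (the client-side case). */
static int sketch_addr_forget(struct sketch_ref_list *list,
                              sketch_addr_t addr)
{
    for (size_t i = 0; i < list->count; i++) {
        if (list->entries[i] == addr) {
            /* Swap the last entry into the freed slot. */
            list->entries[i] = list->entries[--list->count];
            return 0;
        }
    }
    return -1;
}
```

In this sketch a server-side address that went through
sketch_addr_reg() is removed by sketch_addr_forget(), while a
client-side address whose bmi_addr is still 0 makes the lookup fail
and leaves the list untouched.  An explicit flag would make that
intent clearer than relying on the zero value.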

I just tested it out, and pvfs2-client-core was able to recover from a 
network error just fine with the patch in place.

-Phil

