[Pvfs2-developers] Re: the halloween bug fixed

Sam Lang slang at mcs.anl.gov
Tue Oct 9 09:31:01 EDT 2007


On Oct 9, 2007, at 8:02 AM, Scott Atchley wrote:

> On Oct 8, 2007, at 3:56 PM, Sam Lang wrote:
>
>> On Oct 8, 2007, at 1:34 PM, Pete Wyckoff wrote:
>>
>>> Tell me which of these things are true/false, and perhaps add more
>>> so we Scott and I can understand.
>>>
>>> 1. Address types
>>>
>>> - core PVFS refers to peers using a 64-bit opaque BMI_addr_t
>>
>> True.
>>
>>> - BMI has a list that maps BMI_addr_to to a method (e.g. mx or tcp)
>>>     and a struct method_addr *
>>
>> True.  I've been referring to this part of the code as the bmi  
>> control layer (there's references to that name in the code).  It  
>> includes the 'address reference list', which holds the ref_st_p  
>> instances.
>>
>>> - BMI methods do not see BMI_addr_t.  They see struct method_addr *.
>>>     That struct has a void * that is filled with method-specific
>>>     items, like a connection structure.
>>
>> True, although my patch changes that, because the address  
>> reference list is accessed based on the PVFS_BMI_addr_t.  The  
>> bmi_method_addr_reg_callback function returns a PVFS_BMI_addr_t,  
>> which the method is meant to store, and when it comes time to call  
>> bmi_method_addr_forget_callback, it passes that PVFS_BMI_addr_t it  
>> stored.
>>
>>> 2. Address creation
>>>
>>> - apps ask for peers by name:  tcp://myhost:1223/pvfs
>>>
>>> - these names are parsed by the appropriate method to return
>>>     an existing or new struct method_addr *.  Methods are expected
>>>     to keep a list of known peers?  IB does.
>>
>> I think they all do, but I'm not sure that they have to.  The  
>> reference list stores the method_addr* as well, so the particular  
>> method could just throw everything in there before calling  
>> bmi_method_addr_reg_callback.  If the method wants to iterate  
>> through its known addresses though, it needs to manage them itself  
>> as well.
>>
>>> - servers can also receive new connections, in which case they
>>>     must build a struct method_addr, and register it with core
>>>     BMI, through bmi_method_addr_reg_callback()
>>
>> True.
>>
>>> 3. Address destruction
>>>
>>> - currently never happens.  In theory, core BMI can request
>>>     that a method drop an address, perhaps when out of resources.
>>
>> Within the particular method (tcp being the one I'm looking at),  
>> the address is destroyed (tcp_forget_addr is called).  But the  
>> address reference is never being removed from the reference list.   
>> In the case of tcp, this is a big problem (the bug in question),  
>> because tcp calls bmi_method_addr_reg_callback for each new  
>> connection, not just each new peer that connects.
>
> Why not have tcp simply not register on each connection, but on  
> each new client. The handle_new_connection() function is only used  
> on the server.

I'm not sure that helps solve the problem.  Even with only a few  
hundred clients, with MPI-IO, a new process on each node is going to  
connect to the server.  Over time, a different client port number  
could be used, so the count of "peers" could still grow into the  
hundreds of thousands.

Also, TCP doesn't keep track of the peers or connections it has --  
its using the bmi control layer for that.

>
>> The patch implements a complementary callback function:  
>> bmi_method_addr_forget_callback, to be called by the individual  
>> methods when the connection is closed or the address is removed  
>> from the method's own management structures.  In the case of tcp,  
>> a connection is closed when poll or epoll returns an IN event on a  
>> socket (through testcontext) and the following read returns zero  
>> (EOF).  Its at this point the method now (with my patch) calls  
>> bmi_method_addr_forget_callback to remove the address reference  
>> from the list.
>
> I think it is stretching a bit for the server's BMI method to make  
> an assumption about the fate of a client due to a loss of  
> communication. The loss could be permanent or a re-connect attempt  
> may already be initiated.

Its not a loss of communication.  When the client closes its socket,  
a 0 byte message is sent to the server.  Reading that zero byte  
message is how the server knows the client is requesting the  
connection gets closed.

>
> Is the determination about when to give up on a client (e.g. after  
> some time period, via some userspace utility, etc.) better handled  
> at the PVFS2 layer and then use BMI_get_info(DROP_ADDR) to handle it?

For connections there's no timeout.  poll and epoll can return an  
error (ERESET or something), but if they don't the server assumes the  
connection is valid and never tries to clean it up.

-sam

>
>>> - BMI_addr_t and struct method_addr and method-internal state are
>>>     kept synchronized how?  Through which calls?  Who frees which
>>>     things when?
>>
>> On the server, PVFS_BMI_addr_t gets added to the reference list by  
>> bmi_method_addr_reg_callback.  It gets removed from the list (with  
>> my patch) by bmi_method_addr_forget_callback, before the method  
>> frees up the address structures it was managing internally.
>>
>>> I'm obviously wrong and confused about #3.  And maybe 2b makes
>>> a bad assumption.
>>
>> What you've described sounds right.
>>
>>> Then how does your new forget callback fit in with all this?
>>> Somehow I thought the endless list growth was purely a TCP problem,
>>> not something that needed core plumbing.
>>
>> I think its only a tcp problem because the tcp method calls  
>> bmi_method_addr_reg_callback for each new connection.  It looks  
>> like the other methods only call it for each new peer?  The list  
>> just doesn't get larger over time, so unless we plan to allow for  
>> millions of clients, its not an issue for other methods.  And in  
>> that case, they don't have to worry about calling the  
>> forget_callback.  I was assuming that they should in the interest  
>> of completeness.
>>
>> -sam
>



More information about the Pvfs2-developers mailing list