[Pvfs2-developers] Re: the halloween bug fixed
Sam Lang
slang at mcs.anl.gov
Mon Oct 8 12:13:54 EDT 2007
The attached patch is the proposed fix for this problem. When the
tcp method receives a disconnect from a peer, it invokes a callback
(bmi_method_addr_forget_callback) into the bmi control layer to
remove the address reference from the list. Maybe I should also add
a counter and a limit on how big the list can get, although that would
potentially force long-lived connections to reconnect
periodically, and all methods would have to implement BMI_set_info
(DROP_ADDR).
With tcp, new connections are registered, even if they are from the
same host/port on the peer, whereas the other methods seem to only
register new host/port endpoints that haven't been seen before. So
it's not completely clear to me when the other methods need to call
this callback, if at all. There needs to be a matching
bmi_method_addr_forget_callback for each
bmi_method_addr_reg_callback, but if the method only registers a
single address per client, the list won't keep growing, unless we
ever plan to support millions of clients.
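
To make the pairing concrete, here is a minimal sketch of a reference
list with a register call and its matching forget call. It is not the
code in the patch: addr_list, addr_list_reg and addr_list_forget are
hypothetical names, and the integer id stands in for whatever handle
the control layer really keeps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one control-layer address reference. */
struct addr_ref {
    int id;                   /* illustrative address handle */
    struct addr_ref *next;
};

/* Hypothetical stand-in for the control layer's reference list. */
struct addr_list {
    struct addr_ref *head;
    size_t count;             /* lets us watch how big the list gets */
};

/* Analogue of the register callback: add one reference.
 * Returns 0 on success, -1 if allocation fails. */
int addr_list_reg(struct addr_list *list, int id)
{
    assert(list != NULL);
    struct addr_ref *ref = malloc(sizeof(*ref));
    if (ref == NULL)
        return -1;
    ref->id = id;
    ref->next = list->head;
    list->head = ref;
    list->count++;
    return 0;
}

/* Analogue of the forget callback: drop the first reference with
 * this id. Returns 0 if one was removed, -1 if none was found. */
int addr_list_forget(struct addr_list *list, int id)
{
    assert(list != NULL);
    struct addr_ref **pp = &list->head;
    while (*pp != NULL) {
        if ((*pp)->id == id) {
            struct addr_ref *dead = *pp;
            *pp = dead->next;   /* unlink before freeing */
            free(dead);
            list->count--;
            return 0;
        }
        pp = &(*pp)->next;
    }
    return -1;
}
```

With one forget per reg the count returns to zero. A method that
registers on every connect but never forgets only ever grows the
list, which is the tcp behavior described above.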
With gm, the address is registered with the control layer, and
managed internally as well (gm_addr_add). The address is never
removed from the internal list though (gm_addr_del is never called).
Again, only new host/port pairs that haven't been seen are added to
the list, and registered with the control layer. gm doesn't
implement the BMI_set_info(DROP_ADDR) call, so addresses are not
expunged even if requested explicitly. My guess is we should
probably add a gm_addr_del for DROP_ADDR?
With ib, it looks like the server receives new connections and
registers them with the control layer, but the connections never get
closed, or the ib layer doesn't handle them? The only place I can
find where connections are dropped is on an explicit BMI_set_info
(DROP_ADDR), which doesn't get called from the server.
With mx, it looks like there's a limit on the number of connections
from a peer (BMX_PEER_RX_NUM == 20). As new connections are received
the idle connections are closed? Should
bmi_method_addr_forget_callback be called from there?
Thanks,
-sam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bmi-addr-fix.patch
Type: application/octet-stream
Size: 12429 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20071008/2479421d/bmi-addr-fix.obj
-------------- next part --------------
On Oct 5, 2007, at 1:10 PM, Sam Lang wrote:
>
> On Oct 5, 2007, at 12:14 PM, Sam Lang wrote:
>
>>
>> On Oct 5, 2007, at 10:49 AM, Sam Lang wrote:
>>
>>>
>>> The obvious and easy fix is to have bmi-tcp return true from
>>> DROP_ADDR_QUERY for all address references. As far as I can
>>> tell, the only thing we save by keeping them around is a little
>>> memory allocation (the socket gets closed either way).
>>
>> This suggested fix isn't right. The DEC_ADDR_REF, which decrements
>> the refcount to zero, is invoked after sending the final response,
>> but that's usually before the client (in the case of the admin
>> tools) closes the connection. It looks like it's the
>> tcp_forget_addr in the bmi method that needs to call back out to
>> the bmi wrapper layer to remove the reference from the list. I
>> can call BMI_set_info(addr, BMI_TCP_CLOSE_SOCKET) from
>> tcp_forget_addr, but that seems a bit backwards...
>
> Actually it looks like we just need a companion function for
> bmi_method_addr_reg_callback.
> -sam
>
>>
>> -sam
>>
>>>
>>> In the changes I've been working on to get multiple address
>>> support in BMI, I've already replaced the linked list with a
>>> hashtable, which wouldn't have made the problem go away, but the
>>> degradation wouldn't have been quite as bad (may have made it
>>> harder to find, actually). Maybe it's time to add some profiling
>>> info (perf stats?) to our basic list, queue and hash structures
>>> that would tell us how big they're getting.
>>>
>>> Anyway, thanks to all for contributing to the debugging process
>>> for this one.
>>>
>>> -sam
>>>
>>> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
>>>
>>>>
>>>> Hi All,
>>>>
>>>> I've been trying to debug a problem with PVFS, where performance
>>>> degrades slowly with a long-lived (weeks and months) PVFS
>>>> volume. The degradation is significant -- simple metadata
>>>> operations are an order of magnitude slower after a month or
>>>> so. The behavior turns out to only occur with the VFS and pvfs2-
>>>> client daemon: performance of the admin tools (pvfs2-touch,
>>>> pvfs2-rm, etc.) to the same set of servers remains good.
>>>> Restarting the client daemon also fixes the problem, suggesting
>>>> that the long-lived open sockets are somehow the cause. The
>>>> slowness also appears to be at the servers not the clients: the
>>>> same kernel module and client daemon to a different filesystem
>>>> and set of servers doesn't exhibit the performance degradation.
>>>>
>>>> Also, I should mention that the system config is a little
>>>> different than usual. We have IO nodes mounting and unmounting
>>>> the PVFS volume (and stopping the client daemon) with each
>>>> user's job, which is fairly frequent, while on the login nodes,
>>>> the volume remains mounted for a long time (and where the
>>>> performance degrades).
>>>>
>>>> Our hunch here is that epoll or our use of epoll on the servers
>>>> is somehow to blame. Maybe the file descriptors opened on the
>>>> server for pvfs2-client-core are getting pushed down further and
>>>> further into the epoll set, which for some reason is growing
>>>> with new connections coming and going. This might be the case
>>>> if we were failing to remove sockets from the set on disconnect,
>>>> for example. It doesn't look like that's happening though, at
>>>> least for normal disconnects.
>>>>
>>>> It's a PITA to debug, because the servers have to remain running
>>>> for a long time (and the clients have to remain mounted) for the
>>>> problem to be visible. Rob suggested I use strace on the
>>>> servers to see what epoll was doing, and that showed some
>>>> interesting results. Basically, it looks like epoll_wait takes
>>>> significantly longer when clients are doing operations over the
>>>> VFS, rather than with the pvfs2 admin tools. Also, strace
>>>> reported epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a
>>>> few times, even for the VFS ops, and in those cases it's
>>>> returning EEXIST.
>>>>
>>>> I noticed that we add a socket to the epoll set whenever we get
>>>> a new connection, or a read or write is posted
>>>> (enqueue_operation), but we only remove the socket from the
>>>> epoll set on errors or disconnects. So why are we adding it for
>>>> reads and writes? Any connected socket should already be in the
>>>> set, no? I think this may be why I'm seeing EEXIST with strace.
>>>>
>>>> Also, is it safe to check the error from epoll_ctl in
>>>> BMI_socket_collection_[add|remove]?
>>>>
>>>> And finally, assuming PVFS is actually using epoll calls
>>>> properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel
>>>> that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do
>>>> what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning
>>>> up anything...
>>>>
>>>> Thanks,
>>>> -sam
>>>>
>>>
>>
>
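
On the question in the quoted thread about checking the epoll_ctl
return value, here is a rough sketch of what tolerant add/remove
wrappers could look like. It assumes Linux epoll. sc_add and
sc_remove are hypothetical names, not the real BMI_socket_collection
code, and treating EEXIST and ENOENT as benign is a design assumption
rather than something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>

/* Add fd to the epoll set with the given interest mask.
 * Returns 0 on success (including "already present"), -errno otherwise. */
int sc_add(int epfd, int fd, unsigned int events)
{
    assert(epfd >= 0);   /* caller bug if the epoll descriptor is invalid */

    struct epoll_event ev = { 0 };
    ev.events = events;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;

    if (errno == EEXIST) {
        /* Already registered: just refresh the interest mask. */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == 0)
            return 0;
    }
    return -errno;
}

/* Remove fd from the epoll set.
 * Returns 0 on success (including "was not present"), -errno otherwise. */
int sc_remove(int epfd, int fd)
{
    assert(epfd >= 0);

    /* Kernels before 2.6.9 require a non-NULL event pointer for
     * EPOLL_CTL_DEL, even though the contents are ignored. */
    struct epoll_event ev = { 0 };

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)
        return 0;        /* fd was not in the set: nothing to do */
    return -errno;
}
```

Returning the remaining errors to the caller, instead of dropping
them, would at least make a failed EPOLL_CTL_DEL visible in the logs.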