[Pvfs2-developers] Re: the halloween bug fixed
Sam Lang
slang at mcs.anl.gov
Fri Oct 5 14:10:35 EDT 2007
On Oct 5, 2007, at 12:14 PM, Sam Lang wrote:
>
> On Oct 5, 2007, at 10:49 AM, Sam Lang wrote:
>
>>
>> The obvious and easy fix is to have bmi-tcp return true from
>> DROP_ADDR_QUERY for all address references. As far as I can tell,
>> the only thing we save by keeping them around is a little memory
>> allocation (the socket gets closed either way).
>
> This suggested fix isn't right. The DEC_ADDR_REF that decrements
> the refcount to zero is invoked after sending the final response,
> but that's usually before the client (in the case of the admin
> tools) closes the connection. It looks like it's the
> tcp_forget_addr in the bmi method that needs to call back out to
> the bmi wrapper layer to remove the reference from the list. I can
> call BMI_set_info(addr, BMI_TCP_CLOSE_SOCKET) from tcp_forget_addr,
> but that seems a bit backwards...
Actually it looks like we just need a companion function for
bmi_method_addr_reg_callback.
-sam
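
For illustration, here is a rough sketch of what a register/forget callback pair could look like. Every name in it (example_addr_reg_callback, example_addr_forget_callback, struct example_ref) is hypothetical and the reference list is a plain singly linked list; none of this is the real BMI code or its signatures. The idea is only that the method calls the companion from its forget path so the wrapper's reference list shrinks again.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper-side reference; not the real BMI structures. */
struct example_ref {
    void *method_addr;           /* address object owned by the method */
    struct example_ref *next;
};

static struct example_ref *example_refs = NULL;
static size_t example_ref_count = 0;

/* Called by the method when it creates an address (e.g. on accept).
 * Returns 0 on success, -1 if allocation fails. */
int example_addr_reg_callback(void *method_addr)
{
    struct example_ref *r = malloc(sizeof(*r));
    if (r == NULL)
        return -1;
    r->method_addr = method_addr;
    r->next = example_refs;
    example_refs = r;
    example_ref_count++;
    return 0;
}

/* Companion: called by the method when it forgets an address.
 * Returns 0 if a reference was removed, -1 if none was found. */
int example_addr_forget_callback(void *method_addr)
{
    struct example_ref **pp = &example_refs;
    while (*pp != NULL) {
        if ((*pp)->method_addr == method_addr) {
            struct example_ref *dead = *pp;
            *pp = dead->next;
            free(dead);
            example_ref_count--;
            return 0;
        }
        pp = &(*pp)->next;
    }
    return -1;
}

/* Number of references the wrapper currently holds. */
size_t example_ref_list_size(void)
{
    return example_ref_count;
}
```

In this sketch the method's forget path would call example_addr_forget_callback(addr) just before freeing its own address object, which keeps the dependency pointing from method to wrapper in the same direction as the existing registration callback.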
>
> -sam
>
>>
>> In the changes I've been working on to get multiple address
>> support in BMI, I've already replaced the linked list with a
>> hashtable, which wouldn't have made the problem go away, but the
>> degradation wouldn't have been quite as bad (may have made it
>> harder to find, actually). Maybe it's time to add some profiling
>> info (perf stats?) to our basic list, queue and hash structures
>> that would tell us how big they're getting.
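
A minimal sketch of the kind of size tracking suggested above, assuming a simple intrusive singly linked list. The names (struct example_list, example_list_push, example_list_pop) are made up for illustration and are not the PVFS list, queue, or hash implementations. Each list carries its current length and a high-water mark, so a slowly growing list shows up in the numbers long before it shows up as latency.

```c
#include <stddef.h>

/* Hypothetical intrusive list node and list header. */
struct example_node {
    struct example_node *next;
};

struct example_list {
    struct example_node *head;
    size_t len;       /* current number of entries */
    size_t max_len;   /* largest len ever observed (high-water mark) */
};

/* Push a node on the front and update the size statistics. */
void example_list_push(struct example_list *l, struct example_node *n)
{
    n->next = l->head;
    l->head = n;
    l->len++;
    if (l->len > l->max_len)
        l->max_len = l->len;
}

/* Pop the front node, or return NULL if the list is empty. */
struct example_node *example_list_pop(struct example_list *l)
{
    struct example_node *n = l->head;
    if (n == NULL)
        return NULL;
    l->head = n->next;
    n->next = NULL;
    l->len--;
    return n;
}
```

The two counters cost one increment and one compare per insert, so they could plausibly stay enabled in production and be reported through whatever stats interface is already available.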
>>
>> Anyway, thanks to all for contributing to the debugging process
>> for this one.
>>
>> -sam
>>
>> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
>>
>>>
>>> Hi All,
>>>
>>> I've been trying to debug a problem with PVFS, where performance
>>> degrades slowly with a long-lived (weeks and months) PVFS
>>> volume. The degradation is significant -- simple metadata
>>> operations are an order of magnitude slower after a month or so.
>>> The behavior turns out to only occur with the VFS and
>>> pvfs2-client daemon: performance of the admin tools (pvfs2-touch,
>>> pvfs2-rm, etc.) to the same set of servers remains good.
>>> Restarting the client daemon also fixes the problem, suggesting
>>> that the long-lived open sockets are somehow the cause. The
>>> slowness also appears to be at the servers, not the clients: the
>>> same kernel module and client daemon to a different filesystem
>>> and set of servers doesn't exhibit the performance degradation.
>>>
>>> Also, I should mention that the system config is a little
>>> different than usual. We have IO nodes mounting and unmounting
>>> the PVFS volume (and stopping the client daemon) with each
>>> user's job, which is fairly frequent, while on the login nodes,
>>> the volume remains mounted for a long time (and where the
>>> performance degrades).
>>>
>>> Our hunch here is that epoll or our use of epoll on the servers
>>> is somehow to blame. Maybe the file descriptors opened on the
>>> server for pvfs2-client-core are getting pushed down further and
>>> further into the epoll set, which for some reason is growing with
>>> new connections coming and going. This might be the case if we
>>> were failing to remove sockets from the set on disconnect, for
>>> example. It doesn't look like that's happening though, at least
>>> for normal disconnects.
>>>
>>> It's a PITA to debug, because the servers have to remain running
>>> for a long time (and the clients have to remain mounted) for the
>>> problem to be visible. Rob suggested I use strace on the servers
>>> to see what epoll was doing, and that showed some interesting
>>> results. Basically, it looks like epoll_wait takes significantly
>>> longer when clients are doing operations over the VFS, rather
>>> than with the pvfs2 admin tools. Also, strace reported
>>> epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even
>>> for the VFS ops, and in those cases it's returning EEXIST.
>>>
>>> I noticed that we add a socket to the epoll set whenever we get a
>>> new connection, or a read or write is posted (enqueue_operation),
>>> but we only remove the socket from the epoll set on errors or
>>> disconnects. So why are we adding it for reads and writes? Any
>>> connected socket should already be in the set, no? I think this
>>> may be why I'm seeing EEXIST with strace.
>>>
>>> Also, is it safe to check the error from epoll_ctl in
>>> BMI_socket_collection_[add|remove]?
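
As a sketch of what checking those return values could look like: the two helpers below wrap epoll_ctl and treat EEXIST on add and ENOENT on remove as "nothing to do" instead of as failures. The names example_set_add and example_set_remove are hypothetical and are not the BMI_socket_collection functions; the only real API used is epoll_ctl itself. The remove path passes a non-NULL event pointer because the epoll_ctl man page notes that kernels before 2.6.9 required one for EPOLL_CTL_DEL.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/* Add fd to the epoll set for read readiness.
 * Returns 0 if added, 1 if fd was already in the set (EEXIST),
 * or a negative errno value on any other failure. */
int example_set_add(int epfd, int fd)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return 1;
    return -errno;
}

/* Remove fd from the epoll set.
 * Returns 0 if removed, 1 if fd was not in the set (ENOENT),
 * or a negative errno value on any other failure. */
int example_set_remove(int epfd, int fd)
{
    struct epoll_event ev;   /* ignored by the kernel, kept non-NULL */
    memset(&ev, 0, sizeof(ev));
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)
        return 1;
    return -errno;
}
```

With wrappers like these, a repeated add from a read or write post is harmless, while any other error (EBADF, ENOMEM, and so on) is no longer silently dropped and can be logged by the caller.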
>>>
>>> And finally, assuming PVFS is actually using epoll calls
>>> properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel
>>> that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do
>>> what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning
>>> up anything...
>>>
>>> Thanks,
>>> -sam
>>>
>>
>