[Pvfs2-developers] Re: the halloween bug fixed

Sam Lang slang at mcs.anl.gov
Fri Oct 5 14:10:35 EDT 2007


On Oct 5, 2007, at 12:14 PM, Sam Lang wrote:

>
> On Oct 5, 2007, at 10:49 AM, Sam Lang wrote:
>
>>
>> The obvious and easy fix is to have bmi-tcp return true from  
>> DROP_ADDR_QUERY for all address references.  As far as I can tell,  
>> the only thing we save by keeping them around is a little memory  
>> allocation (the socket gets closed either way).
>
> This suggested fix isn't right.  The DEC_ADDR_REF that decrements  
> the refcount to zero is invoked after sending the final response,  
> but that's usually before the client (in the case of the admin  
> tools) closes the connection.  It looks like it's  
> tcp_forget_addr in the bmi method that needs to call back out to  
> the bmi wrapper layer to remove the reference from the list.  I can  
> call BMI_set_info(addr, BMI_TCP_CLOSE_SOCKET) from tcp_forget_addr,  
> but that seems a bit backwards...

Actually it looks like we just need a companion function for  
bmi_method_addr_reg_callback.
-sam

>
> -sam
>
>>
>> In the changes I've been working on to get multiple address  
>> support in BMI, I've already replaced the linked list with a  
>> hashtable, which wouldn't have made the problem go away, but the  
>> degradation wouldn't have been quite as bad (may have made it  
>> harder to find, actually).  Maybe it's time to add some profiling  
>> info (perf stats?) to our basic list, queue and hash structures  
>> that would tell us how big they're getting.
>>
>> Anyway, thanks to all for contributing to the debugging process  
>> for this one.
>>
>> -sam
>>
>> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
>>
>>>
>>> Hi All,
>>>
>>> I've been trying to debug a problem with PVFS, where performance  
>>> degrades slowly with a long-lived (weeks and months) PVFS  
>>> volume.  The degradation is significant -- simple metadata  
>>> operations are an order of magnitude slower after a month or so.   
>>> The behavior turns out to only occur with the VFS and pvfs2- 
>>> client daemon:  performance of the admin tools (pvfs2-touch,  
>>> pvfs2-rm, etc.) to the same set of servers remains good.   
>>> Restarting the client daemon also fixes the problem, suggesting  
>>> that the long-lived open sockets are somehow the cause.  The  
>>> slowness also appears to be at the servers, not the clients: the  
>>> same kernel module and client daemon to a different filesystem  
>>> and set of servers doesn't exhibit the performance degradation.
>>>
>>> Also, I should mention that the system config is a little  
>>> different than usual.  We have IO nodes mounting and unmounting  
>>> the PVFS volume  (and stopping the client daemon) with each  
>>> user's job, which is fairly frequent, while on the login nodes,  
>>> the volume remains mounted for a long time (and where the  
>>> performance degrades).
>>>
>>> Our hunch here is that epoll or our use of epoll on the servers  
>>> is somehow to blame.  Maybe the file descriptors opened on the  
>>> server for pvfs2-client-core are getting pushed down further and  
>>> further into the epoll set, which for some reason is growing with  
>>> new connections coming and going.  This might be the case if we  
>>> were failing to remove sockets from the set on disconnect, for  
>>> example.  It doesn't look like that's happening though, at least  
>>> for normal disconnects.
>>>
>>> It's a PITA to debug, because the servers have to remain running  
>>> for a long time (and the clients have to remain mounted) for the  
>>> problem to be visible.  Rob suggested I use strace on the servers  
>>> to see what epoll was doing, and that showed some interesting  
>>> results.  Basically, it looks like epoll_wait takes significantly  
>>> longer when clients are doing operations over the VFS, rather  
>>> than with the pvfs2 admin tools.  Also, strace reported  
>>> epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even  
>>> for the VFS ops, and in those cases it's returning EEXIST.
>>>
>>> I noticed that we add a socket to the epoll set whenever we get a  
>>> new connection, or a read or write is posted (enqueue_operation),  
>>> but we only remove the socket from the epoll set on errors or  
>>> disconnects.  So why are we adding it for reads and writes?  Any  
>>> connected socket should already be in the set, no?  I think this  
>>> may be why I'm seeing EEXIST with strace.
>>>
>>> Also, is it safe to check the error from epoll_ctl in  
>>> BMI_socket_collection_[add|remove]?
>>>
>>> And finally, assuming PVFS is actually using epoll calls  
>>> properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel  
>>> that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do  
>>> what it's meant to?  Googling epoll and SUSE 2.6.5 isn't turning  
>>> up anything...
>>>
>>> Thanks,
>>> -sam
>>>
>>
>
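
On the question in the original message about checking the error from
epoll_ctl: below is a minimal sketch of what that could look like. It is not
the BMI_socket_collection code; sc_add and sc_remove are hypothetical names
made up for illustration. The idea is to treat EEXIST on add and ENOENT on
remove as "nothing to do" and report anything else to the caller. The
non-NULL event pointer on EPOLL_CTL_DEL matters on kernels older than 2.6.9,
which would include the 2.6.5 kernel mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not the PVFS/BMI implementation).
 * Registers fd for read events.
 * Returns 0 if added, 1 if fd was already in the set (EEXIST),
 * or a negative errno value on any other failure. */
int sc_add(int epfd, int fd)
{
    struct epoll_event ev;

    assert(fd >= 0);
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return 1;
    return -errno;
}

/* Hypothetical helper (not the PVFS/BMI implementation).
 * Returns 0 if removed, 1 if fd was not in the set (ENOENT),
 * or a negative errno value on any other failure. */
int sc_remove(int epfd, int fd)
{
    struct epoll_event ev; /* ignored by DEL, but must be non-NULL before 2.6.9 */

    assert(fd >= 0);
    memset(&ev, 0, sizeof(ev));

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)
        return 1;
    return -errno;
}
```

Logging the "1" return values (rather than silently ignoring them) would
show how often a socket is re-added on a posted read or write, and whether a
remove ever targets a socket that is not in the set.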