[Pvfs2-developers] the halloween bug fixed

Rob Ross rross at mcs.anl.gov
Fri Oct 5 12:04:49 EDT 2007


Well done Sam; thanks for tracking this one down. -- Rob

Sam Lang wrote:
> 
> The halloween bug (what I'm calling it -- it's been haunting us for a 
> while now) is that we're adding address references to the bmi address 
> list, and never removing them.  In the prelude state machine, we make a 
> BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through all 
> the addresses in the reference list.  It's this step that is causing the 
> slowdown.  As new connections are made addr refs get added to the list 
> and never removed, so the pvfs2-client-core addr ref ends up at the 
> bottom of a very long list.
> 
> The addr refs aren't getting removed because in BMI_set_info(addr, 
> BMI_DEC_ADDR_REF) -- called from final_response -- the code asks the 
> bmi_tcp method whether the address should be removed, via 
> BMI_tcp_get_info(BMI_DROP_ADDR_QUERY).  This function always returns 
> false (don't drop) unless there was a bmi error somewhere (ECANCEL is 
> probably the only one that happens in practice -- due to a timeout).
> 
> Since our state actions block the main server thread, this caused 
> degradation for all requests received during processing of requests from 
> a long-lived socket.  New connections hitting the server at different 
> times would have been fine though, which is what I was seeing with my 
> tests.
> 
> The obvious and easy fix is to have bmi-tcp return true from 
> DROP_ADDR_QUERY for all address references.  As far as I can tell, the 
> only thing we save by keeping them around is a little memory allocation 
> (the socket gets closed either way).
> 
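To make the mechanism above concrete, here is a minimal sketch of a reference list that only grows unless the drop query says yes. It is not the BMI code: struct addr_ref, ref_add, ref_find, ref_inc, ref_dec, drop_never, drop_always and simulate are all hypothetical names made up for illustration, and inserting new refs at the head is an assumption chosen so the long-lived ref ends up at the bottom of the list as described.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a BMI address reference (not the real struct). */
struct addr_ref {
    int id;
    int ref_count;
    struct addr_ref *next;
};

/* Hypothetical stand-in for the per-method "should this ref be dropped?" query. */
typedef bool (*drop_query_fn)(const struct addr_ref *ref);

/* Behaviour described above when no error occurred: never drop. */
bool drop_never(const struct addr_ref *ref) { (void)ref; return false; }

/* Proposed fix: always allow the drop once the count reaches zero. */
bool drop_always(const struct addr_ref *ref) { (void)ref; return true; }

/* New refs go on the head, so the oldest ref sinks to the bottom. */
struct addr_ref *ref_add(struct addr_ref **head, int id)
{
    struct addr_ref *ref = malloc(sizeof(*ref));
    assert(ref != NULL);
    ref->id = id;
    ref->ref_count = 0;
    ref->next = *head;
    *head = ref;
    return ref;
}

/* Linear lookup; *visited receives the number of nodes examined. */
struct addr_ref *ref_find(struct addr_ref *head, int id, size_t *visited)
{
    size_t n = 0;
    struct addr_ref *ref;

    for (ref = head; ref != NULL; ref = ref->next) {
        n++;
        if (ref->id == id)
            break;
    }
    if (visited != NULL)
        *visited = n;
    return ref;
}

void ref_inc(struct addr_ref *head, int id)
{
    struct addr_ref *ref = ref_find(head, id, NULL);
    assert(ref != NULL);
    ref->ref_count++;
}

/* Decrement; unlink and free only if the count hits zero AND the query agrees. */
void ref_dec(struct addr_ref **head, int id, drop_query_fn should_drop)
{
    struct addr_ref **link;

    for (link = head; *link != NULL; link = &(*link)->next) {
        if ((*link)->id == id) {
            struct addr_ref *ref = *link;
            if (--ref->ref_count == 0 && should_drop(ref)) {
                *link = ref->next;
                free(ref);
            }
            return;
        }
    }
}

/* One long-lived ref (id 0) plus short_lived connections that come and go.
 * Returns how many nodes a lookup of the long-lived ref has to visit. */
size_t simulate(int short_lived, drop_query_fn should_drop)
{
    struct addr_ref *head = NULL;
    size_t visited = 0;
    int i;

    ref_add(&head, 0);
    ref_inc(head, 0);
    for (i = 1; i <= short_lived; i++) {
        ref_add(&head, i);
        ref_inc(head, i);
        ref_dec(&head, i, should_drop);
    }
    ref_find(head, 0, &visited);

    while (head != NULL) {          /* release whatever is left */
        struct addr_ref *next = head->next;
        free(head);
        head = next;
    }
    return visited;
}
```

In this toy model, simulate(1000, drop_never) has to walk all 1001 nodes to reach the long-lived ref, while simulate(1000, drop_always) finds it on the first node. Every request on the long-lived connection pays that walk, which matches the slow degradation reported here.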
> In the changes I've been working on to get multiple address support in 
> BMI, I've already replaced the linked list with a hashtable, which 
> wouldn't have made the problem go away, but the degradation wouldn't 
> have been quite as bad (may have made it harder to find, actually).  
> Maybe it's time to add some profiling info (perf stats?) to our basic 
> list, queue and hash structures that would tell us how big they're getting.
> 
> Anyway, thanks to all for contributing to the debugging process for this 
> one.
> 
> -sam
> 
> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
> 
>>
>> Hi All,
>>
>> I've been trying to debug a problem with PVFS, where performance 
>> degrades slowly with a long-lived (weeks and months) PVFS volume.  The 
>> degradation is significant -- simple metadata operations are an order 
>> of magnitude slower after a month or so.  The behavior turns out to 
>> only occur with the VFS and pvfs2-client daemon:  performance of the 
>> admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of servers 
>> remains good.  Restarting the client daemon also fixes the problem, 
>> suggesting that the long-lived open sockets are somehow the cause.  
>> The slowness also appears to be at the servers, not the clients: the 
>> same kernel module and client daemon to a different filesystem and set 
>> of servers doesn't exhibit the performance degradation.
>>
>> Also, I should mention that the system config is a little different 
>> than usual.  We have IO nodes mounting and unmounting the PVFS volume  
>> (and stopping the client daemon) with each user's job, which is fairly 
>> frequent, while on the login nodes, the volume remains mounted for a 
>> long time (and where the performance degrades).
>>
>> Our hunch here is that epoll or our use of epoll on the servers is 
>> somehow to blame.  Maybe the file descriptors opened on the server for 
>> pvfs2-client-core are getting pushed down further and further into the 
>> epoll set, which for some reason is growing with new connections 
>> coming and going.  This might be the case if we were failing to remove 
>> sockets from the set on disconnect, for example.  It doesn't look like 
>> that's happening though, at least for normal disconnects.
>>
>> It's a PITA to debug, because the servers have to remain running for a 
>> long time (and the clients have to remain mounted) for the problem to 
>> be visible.  Rob suggested I use strace on the servers to see what 
>> epoll was doing, and that showed some interesting results.  Basically, 
>> it looks like epoll_wait takes significantly longer when clients are 
>> doing operations over the VFS, rather than with the pvfs2 admin 
>> tools.  Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...) 
>> getting called a few times, even for the VFS ops, and in those cases 
>> it's returning EEXIST.
>>
>> I noticed that we add a socket to the epoll set whenever we get a new 
>> connection, or a read or write is posted (enqueue_operation), but we 
>> only remove the socket from the epoll set on errors or disconnects.  
>> So why are we adding it for reads and writes?  Any connected socket 
>> should already be in the set, no?  I think this may be why I'm seeing 
>> EEXIST with strace.
>>
>> Also, is it safe to check the error from epoll_ctl in 
>> BMI_socket_collection_[add|remove]?
>>
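On the question of checking the epoll_ctl return value: a minimal Linux-only sketch, assuming the socket collection wants "already registered" and "already gone" treated as harmless. collection_add, collection_remove, second_add_errno and tolerant_round_trip are hypothetical helpers written for this illustration, not the BMI_socket_collection code, and the pipe is just a convenient stand-in for a connected socket.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register fd for reads.
 * Returns 0 on success or if fd was already in the set, -errno otherwise. */
int collection_add(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)        /* fd is already registered: benign */
        return 0;
    return -errno;
}

/* Hypothetical helper: unregister fd.
 * Returns 0 on success or if fd was not in the set, -errno otherwise. */
int collection_remove(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer for
     * EPOLL_CTL_DEL, so pass a dummy one even though it is ignored. */
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)        /* fd was not registered: benign */
        return 0;
    return -errno;
}

/* Adds the same fd twice with raw epoll_ctl and returns the errno of the
 * second call (0 if it unexpectedly succeeded, -1 if setup failed). */
int second_add_errno(void)
{
    int fds[2];
    int epfd, err = 0;
    struct epoll_event ev;

    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    if (pipe(fds) != 0) {
        close(epfd);
        return -1;
    }
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0)
        err = -1;
    else if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0)
        err = errno;
    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return err;
}

/* add, add, remove, remove through the tolerant helpers; 0 if all succeed. */
int tolerant_round_trip(void)
{
    int fds[2];
    int epfd, rc = 0;

    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    if (pipe(fds) != 0) {
        close(epfd);
        return -1;
    }
    rc |= collection_add(epfd, fds[0]);
    rc |= collection_add(epfd, fds[0]);     /* EEXIST swallowed */
    rc |= collection_remove(epfd, fds[0]);
    rc |= collection_remove(epfd, fds[0]);  /* ENOENT swallowed */
    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return rc;
}
```

The raw double add fails the second time with EEXIST, which is consistent with what strace showed for the repeated adds from enqueue_operation. Swallowing only EEXIST and ENOENT keeps real failures such as EBADF visible to the caller.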
>> And finally, assuming PVFS is actually using epoll calls properly, 
>> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would cause 
>> epoll_ctl(..., EPOLL_CTL_DEL, ....) to not do what it's meant to?  
>> Googling epoll and SUSE 2.6.5 isn't turning up anything...
>>
>> Thanks,
>> -sam
>>
> 

