[Pvfs2-developers] epoll fun

Sam Lang slang at mcs.anl.gov
Wed Sep 26 19:00:16 EDT 2007


Hi All,

I've been trying to debug a problem with PVFS, where performance  
degrades slowly with a long-lived (weeks and months) PVFS volume.   
The degradation is significant -- simple metadata operations are an  
order of magnitude slower after a month or so.  The behavior turns  
out to only occur with the VFS and pvfs2-client daemon:  performance  
of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of  
servers remains good.  Restarting the client daemon also fixes the  
problem, suggesting that the long-lived open sockets are somehow the  
cause.  The slowness also appears to be at the servers, not the
clients: the same kernel module and client daemon talking to a
different filesystem and set of servers don't exhibit the performance
degradation.

Also, I should mention that the system config is a little different  
than usual.  We have IO nodes mounting and unmounting the PVFS  
volume  (and stopping the client daemon) with each user's job, which  
is fairly frequent, while on the login nodes, the volume remains  
mounted for a long time (and where the performance degrades).

Our hunch here is that epoll or our use of epoll on the servers is  
somehow to blame.  Maybe the file descriptors opened on the server  
for pvfs2-client-core are getting pushed down further and further  
into the epoll set, which for some reason is growing with new  
connections coming and going.  This might be the case if we were  
failing to remove sockets from the set on disconnect, for example.   
It doesn't look like that's happening though, at least for normal  
disconnects.

It's a PITA to debug, because the servers have to remain running for a
long time (and the clients have to remain mounted) for the problem to  
be visible.  Rob suggested I use strace on the servers to see what  
epoll was doing, and that showed some interesting results.   
Basically, it looks like epoll_wait takes significantly longer when  
clients are doing operations over the VFS, rather than with the pvfs2  
admin tools.  Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...)
getting called a few times, even for the VFS ops, and in those cases
it's returning EEXIST.

I noticed that we add a socket to the epoll set whenever we get a new  
connection, or a read or write is posted (enqueue_operation), but we  
only remove the socket from the epoll set on errors or disconnects.   
So why are we adding it for reads and writes?  Any connected socket  
should already be in the set, no?  I think this may be why I'm seeing
EEXIST with strace.
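
For reference, the errno involved is spelled EEXIST, and the behaviour is
easy to reproduce outside PVFS.  The sketch below is a standalone
illustration, not PVFS code: the function name second_add_errno is made up
for this example, and it uses a pipe in place of a client socket.  It adds
the same descriptor to an epoll set twice and reports the errno from the
second add.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, not part of PVFS/BMI.
 * Adds the read end of a pipe to an epoll set twice.
 * Returns the errno of the second EPOLL_CTL_ADD (0 if it
 * unexpectedly succeeded), or -1 if the setup itself failed. */
int second_add_errno(void)
{
    int fds[2];
    int epfd;
    int err = -1;
    struct epoll_event ev = { 0 };

    if (pipe(fds) < 0)
        return -1;

    epfd = epoll_create(1);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = fds[0];

    /* First add registers the descriptor. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
        /* Second add of the same descriptor is expected to fail. */
        int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
        err = (rc < 0) ? errno : 0;   /* capture before close() */
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A caller could check the result with assert(second_add_errno() == EEXIST).
If that holds on the server kernel too, the failed adds seen in strace are
probably redundant rather than a sign that the set is growing.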

Also, is it safe to check the error from epoll_ctl in  
BMI_socket_collection_[add|remove]?
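
One way the error check might look, as a rough sketch only: the bodies of
BMI_socket_collection_[add|remove] are not shown in this message, so the
names checked_add and checked_remove and their return convention are
assumptions made for illustration.  The idea is to treat EEXIST on add and
ENOENT on remove as benign (the set is already in the desired state) and to
report everything else.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical wrappers, not the actual BMI functions.
 * Return 0 on success, 1 if the set was already in the requested
 * state (benign), or a negative errno value on a real failure. */

int checked_add(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev = { 0 };

    ev.events = events;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return 1;           /* fd was already registered */
    return -errno;
}

int checked_remove(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer for
     * EPOLL_CTL_DEL even though its contents are ignored, so pass
     * a dummy rather than NULL. */
    struct epoll_event dummy = { 0 };

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &dummy) == 0)
        return 0;
    if (errno == ENOENT)
        return 1;           /* fd was not registered */
    return -errno;
}
```

Logging the negative returns (and perhaps counting the benign ones) would
show whether removals are silently failing on long-lived servers.  The
non-NULL event pointer on delete is worth double-checking on a 2.6.5
kernel, since the epoll_ctl man page lists that requirement for kernels
older than 2.6.9.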

And finally, assuming PVFS is actually using epoll calls properly,  
does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would  
cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
to?  Googling epoll and SUSE 2.6.5 isn't turning up anything...

Thanks,
-sam

