[Pvfs2-developers] epoll fun

Phil Carns pcarns at wastedcycles.org
Wed Oct 3 09:29:01 EDT 2007


We have seen something that might be related, but we aren't sure.  In 
our case, we had servers running 2.6.9 (with epoll enabled) and clients 
running an old 2.4 kernel (no epoll).

We saw that after about a week of very heavy loading, the clients would 
get extremely slow for both metadata and I/O operations.  We did not 
track it down to the level that you did, but it was clear that every 
operation was slower across the board; we could not point to any one 
particular operation causing the slowdown.

Unmounting and remounting did not help (note that pvfs2-client does not 
close sockets on an unmount).  However, the pvfs2-* command line 
utilities were still running full speed, and restarting either 
pvfs2-client or the pvfs2-server processes would clear up the problem. 
From this we assumed it must have been a problem with sockets that are 
kept open for extended periods.  We suspected network errors to be a 
contributing factor, but were never able to conclusively determine that.

We no longer have this environment, however, so we can't really offer 
much more insight into what was going on.  So far we haven't seen it pop 
up anywhere else.

-Phil

Murali Vilayannur wrote:
> Hey Sam,
> Ugh..
> First off, really nice detective work!!!
> 
> 
>>degrades slowly with a long-lived (weeks and months) PVFS volume.
>>The degradation is significant -- simple metadata operations are an
>>order of magnitude slower after a month or so.  The behavior turns
>>out to only occur with the VFS and pvfs2-client daemon:  performance
>>of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of
>>servers remains good.  Restarting the client daemon also fixes the
>>problem, suggesting that the long-lived open sockets are somehow the
>>cause.  The slowness also appears to be at the servers not the
>>clients: the same kernel module and client daemon to a different
>>filesystem and set of servers doesn't exhibit the performance
>>degradation.
>>
>>Also, I should mention that the system config is a little different
>>than usual.  We have IO nodes mounting and unmounting the PVFS
>>volume  (and stopping the client daemon) with each user's job, which
>>is fairly frequent, while on the login nodes, the volume remains
>>mounted for a long time (and where the performance degrades).
>>
>>Our hunch here is that epoll or our use of epoll on the servers is
>>somehow to blame.  Maybe the file descriptors opened on the server
>>for pvfs2-client-core are getting pushed down further and further
>>into the epoll set, which for some reason is growing with new
>>connections coming and going.  This might be the case if we were
>>failing to remove sockets from the set on disconnect, for example.
>>It doesn't look like that's happening though, at least for normal
>>disconnects.
> 
> 
> Just to make sure, can't we switch to a poll()-based server and see if
> we have the same problem?
> 
> 
> 
>>It's a PITA to debug, because the servers have to remain running for a
>>long time (and the clients have to remain mounted) for the problem to
>>be visible.  Rob suggested I use strace on the servers to see what
>>epoll was doing, and that showed some interesting results.
>>Basically, it looks like epoll_wait takes significantly longer when
>>clients are doing operations over the VFS, rather than with the pvfs2
>>admin tools.  Also, strace reported epoll_ctl(...,
>>EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS
>>ops, and in those cases it's returning EEXIST.
>>
>>I noticed that we add a socket to the epoll set whenever we get a new
>>connection, or a read or write is posted (enqueue_operation), but we
>>only remove the socket from the epoll set on errors or disconnects.
>>So why are we adding it for reads and writes?  Any connected socket
>>should already be in the set, no?  I think this may be why I'm seeing
>>EEXIST with strace.
> 
> 
> Yep, agreed. We shouldn't need to add it if it already exists. But
> that is not a bug as far as I can tell.
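
To illustrate the point above: a second EPOLL_CTL_ADD on a descriptor
that is already in the set should fail with EEXIST and leave the
existing registration alone, so the set does not grow. A minimal
sketch, assuming a Linux host; duplicate_add_demo is a hypothetical
helper written for this example (it is not PVFS code) and uses a pipe
in place of a client socket.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo helper (not PVFS code). Registers the read end of a
 * pipe twice, stores the errno of the second EPOLL_CTL_ADD in
 * *second_add_err (0 if it succeeded), then makes the pipe readable and
 * returns how many events epoll_wait reports, or -1 on setup failure. */
int duplicate_add_demo(int *second_add_err)
{
    int pipefd[2];
    int epfd;
    int nready = -1;
    struct epoll_event ev = { 0 };
    struct epoll_event ready[8];

    if (pipe(pipefd) < 0)
        return -1;

    /* The size argument is only a hint, but it must be positive. */
    epfd = epoll_create(16);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    /* First registration: expected to succeed. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_epfd;

    /* Second registration of the same descriptor: record how it fails. */
    *second_add_err = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        *second_add_err = errno;

    /* Make the read end readable, then poll without blocking. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out_epfd;

    nready = epoll_wait(epfd, ready, 8, 0);

out_epfd:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return nready;
}
```

If the failed add had any side effect on the set, one would expect it
to show up as extra ready events here. This says nothing about the
behavior of the 2.6.5 kernel in question; it only shows what the
documented semantics look like.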
> 
>>Also, is it safe to check the error from epoll_ctl in
>>BMI_socket_collection_[add|remove]?
> 
> 
> Yep; we should be checking the return value from these functions.
> Perhaps make _add and _remove inline functions with return values?
> 
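
The suggestion above could look roughly like the following. This is
only a sketch under stated assumptions: sc_add_checked and
sc_remove_checked are hypothetical names, not the actual
BMI_socket_collection_add/remove code, and the event mask and the
choice to treat EEXIST and ENOENT as harmless are assumptions made for
illustration.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical wrapper (not the real BMI code): add fd to the epoll set.
 * Returns 0 on success or if fd was already registered, -errno otherwise. */
static inline int sc_add_checked(int epfd, int fd, void *user_ptr)
{
    struct epoll_event ev = { 0 };

    ev.events = EPOLLIN | EPOLLERR | EPOLLHUP;  /* assumed event mask */
    ev.data.ptr = user_ptr;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0 && errno != EEXIST)
        return -errno;
    return 0;
}

/* Hypothetical wrapper: remove fd from the epoll set.
 * Returns 0 on success or if fd was not registered, -errno otherwise. */
static inline int sc_remove_checked(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer for
     * EPOLL_CTL_DEL even though the contents are ignored. */
    struct epoll_event ev = { 0 };

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0 && errno != ENOENT)
        return -errno;
    return 0;
}
```

With wrappers like these, an unexpected failure such as EBADF on
removal would be reported to the caller instead of being dropped,
which might help confirm whether sockets are really leaving the set
on disconnect.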
> 
>>And finally, assuming PVFS is actually using epoll calls properly,
>>does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
>>cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
>>to?  Googling epoll and SUSE 2.6.5 isn't turning up anything...
> 
> 
> Nope, none that I can think of.
> thanks,
> Murali
> 
>>Thanks,
>>-sam
>>_______________________________________________
>>Pvfs2-developers mailing list
>>Pvfs2-developers at beowulf-underground.org
>>http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>