[Pvfs2-developers] epoll fun

Sam Lang slang at mcs.anl.gov
Wed Oct 3 13:39:51 EDT 2007


On Oct 3, 2007, at 10:26 AM, Sam Lang wrote:

>
> Hi Phil,
>
> Thanks for your response.  It sounds like its the same problem.   
> Its nice to hear that its reproduceable (I'm not insane!), and  
> makes it that much more important to figure out.
>
> I mention in my email to Murali that we see the same problem with  
> poll too.  So we've taken the approach of trying to write a  
> synthetic test that will reproduce the problem, which would help us  
> verify that its poll/epoll related, and not something else we're  
> doing in PVFS.  It would also give us something to show kernel  
> developers that its a problem.  With the test I wrote though, I  
> haven't been able to reproduce the problem (sigh).  I'm not sure it  
> exactly matches the behavior of BMI tcp, so I'm trying to go  
> through the BMI code to verify that.
>
> I've attached some plots of the poll and epoll timings (just the  
> timings of those system calls at the server).  I haven't added  
> labels, so I'll have to just explain what they show.  I ran a test  
> where I create and then delete 100 files using the VFS, and then  
> create/delete 100 files using the pvfs2-* tools.  I do this 10  
> times, and then sleep for a while, and start over.
>
> In the first plot (Iterations 1-10), the server and client daemon  
> had just been restarted.  You can tell which set of 100 operations  
> are being timed by the epoll_ctl(ADD/DEL) calls.  Even in the first  
> 10 iterations (each creating/deleting 2000 files), the slowdown is  
> visible.  The other plots (Iterations 10-20, 20-28, and 28-32) also  
> show the degradation.  Those are the epoll plots.


Somehow I managed to forget to mention that the x-axis is seconds  
since the re-start of the server and client, and the y-axis is the  
epoll time in seconds.
-sam

>
> <Iterations 1-10.png>
> <Iterations 10-20.png>
> <Iterations 20-28.png>
> <Iterations 28-32.png>
>
>
> With poll, I just created one plot, and there aren't add/del events  
> to show admin tools vs. the VFS.  I also didn't sleep as long  
> between iterations.  The poll timings on the pvfs2-client daemon  
> socket are still getting visibly slower though, as can be seen on  
> the right of the graph, where the 100 create/delete operations  
> through the admin tools are the thin lines starting from 0.
>
> <poll.png>
>
> For a while I was starting to think that somehow sockets weren't  
> getting removed from the epoll set properly, even though we were  
> doing epoll_ctl(...EPOLL_CTL_DEL...).  The fact that it still  
> exists with poll though suggests that its something else, as the fd  
> array passed to poll isn't growing (I checked).
>
> One thing I noticed:  For an unexpected receive, we do a peek on  
> the socket (recv(...MSG_PEEK)) and see if a full bmi header is  
> there.  If it is, we set the socket back to blocking mode, do a  
> blocking recv for the header, and set then set the socket back to  
> non-blocking mode.  Rob pointed out that we probably don't need to  
> set the socket to blocking.  Is there some reason we do that,  
> instead of just do the recv in non-blocking mode since we've  
> already checked that the bytes are there?
>
> Any other ideas on what's going on here are more than welcome. :-)
>
> -sam
>
> On Oct 3, 2007, at 8:29 AM, Phil Carns wrote:
>
>> We have seen something that might be related, but we aren't sure.   
>> In our case, we had servers running 2.6.9 (with epoll enabled) and  
>> clients running an old 2.4 kernel (no epoll).
>>
>> We saw that after about a week of very heavy loading, the clients  
>> would get extremely slow for both metadata and I/O operations.  We  
>> did not track it down to the level that you did, but it was clear  
>> that every operation was generically slower, rather than being  
>> able to point out one particular operation causing the slow down.
>>
>> Unmounting and remounting did not help (note that pvfs2-client  
>> does not close sockets on an unmount).  However, the pvfs2-*  
>> command line utilities were still running full speed, and  
>> restarting either pvfs2-client or the pvfs2-server processes would  
>> clear up the problem. From this we assumed it must have been a  
>> problem with sockets that are kept open for extended periods.  We  
>> suspected network errors to be a contributing factor, but were  
>> never able to conclusively determine that.
>>
>> We no longer have this environment, however, so we can't really  
>> offer much more insight into what was going on.  So far we haven't  
>> seen it pop up anywhere else.
>>
>> -Phil
>>
>> Murali Vilayannur wrote:
>>> Hey Sam,
>>> Ugh..
>>> First off, really nice detective work!!!
>>>> degrades slowly with a long-lived (weeks and months) PVFS volume.
>>>> The degradation is significant -- simple metadata operations are an
>>>> order of magnitude slower after a month or so.  The behavior turns
>>>> out to only occur with the VFS and pvfs2-client daemon:   
>>>> performance
>>>> of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of
>>>> servers remains good.  Restarting the client daemon also fixes the
>>>> problem, suggesting that the long-lived open sockets are somehow  
>>>> the
>>>> cause.  The slowness also appears to be at the servers not the
>>>> clients: the same kernel module and client daemon to a different
>>>> filesystem and set of servers doesn't exhibit the performance
>>>> degradation.
>>>>
>>>> Also, I should mention that the system config is a little different
>>>> than usual.  We have IO nodes mounting and unmounting the PVFS
>>>> volume  (and stopping the client daemon) with each user's job,  
>>>> which
>>>> is fairly frequent, while on the login nodes, the volume remains
>>>> mounted for a long time (and where the performance degrades).
>>>>
>>>> Our hunch here is that epoll or our use of epoll on the servers is
>>>> somehow to blame.  Maybe the file descriptors opened on the server
>>>> for pvfs2-client-core are getting pushed down further and further
>>>> into the epoll set, which for some reason is growing with new
>>>> connections coming and going.  This might be the case if we were
>>>> failing to remove sockets from the set on disconnect, for example.
>>>> It doesn't look like that's happening though, at least for normal
>>>> disconnects.
>>> Just to make sure, can't we switch to a poll() based server and  
>>> see if
>>> we have the same problem..
>>>> Its a PITA to debug, because the servers have to remain running  
>>>> for a
>>>> long time (and the clients have to remain mounted) for the  
>>>> problem to
>>>> be visible.  Rob suggested I use strace on the servers to see what
>>>> epoll was doing, and that showed some interesting results.
>>>> Basically, it looks like epoll_wait takes significantly longer when
>>>> clients are doing operations over the VFS, rather than with the  
>>>> pvfs2
>>>> admin tools.  Also, strace reported epoll_ctl(...,
>>>> EPOLL_CTL_ADD, ...)) getting called a few times, even for the VFS
>>>> ops, and in those cases its returning EEXISTS.
>>>>
>>>> I noticed that we add a socket to the epoll set whenever we get  
>>>> a new
>>>> connection, or a read or write is posted (enqueue_operation),  
>>>> but we
>>>> only remove the socket from the epoll set on errors or disconnects.
>>>> So why are we adding it for reads and writes?  Any connected socket
>>>> should already be in the set, no?  I think this may be why I'm  
>>>> seeing
>>>> EEXISTS with strace.
>>> yep; Agreed; We shouldn;'t need to add it if it already exists. But
>>> that is not a bug as far
>>> as I can tell.
>>>> Also, is it safe to check the error from epoll_ctl in
>>>> BMI_socket_collection_[add|remove]?
>>> Yep; We should be checking for the return value from these  
>>> functions.
>>> Perhaps make the _add and _remove as inline functions with return  
>>> values?
>>>> And finally, assuming PVFS is actually using epoll calls properly,
>>>> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
>>>> cause epoll_ctl(..., EPOLL_CTL_DEL, ....) to not do what its meant
>>>> to?  Googling epoll and SUSE 2.6.5 isn't turning up anything...
>>> Nope. none that I can think of..
>>> thanks,
>>> Murali
>>>> Thanks,
>>>> -sam
>>>> _______________________________________________
>>>> Pvfs2-developers mailing list
>>>> Pvfs2-developers at beowulf-underground.org
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2- 
>>>> developers
>>>>
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>
>



More information about the Pvfs2-developers mailing list