[Pvfs2-developers] epoll fun
slang at mcs.anl.gov
Wed Oct 3 13:39:51 EDT 2007
On Oct 3, 2007, at 10:26 AM, Sam Lang wrote:
> Hi Phil,
> Thanks for your response. It sounds like it's the same problem.
> It's nice to hear that it's reproducible (I'm not insane!), which
> makes it that much more important to figure out.
> I mention in my email to Murali that we see the same problem with
> poll too. So we've taken the approach of trying to write a
> synthetic test that will reproduce the problem, which would help us
> verify that it's poll/epoll related, and not something else we're
> doing in PVFS. It would also give us something to show kernel
> developers that it's a problem. With the test I wrote though, I
> haven't been able to reproduce the problem (sigh). I'm not sure it
> exactly matches the behavior of BMI tcp, so I'm trying to go
> through the BMI code to verify that.
> I've attached some plots of the poll and epoll timings (just the
> timings of those system calls at the server). I haven't added
> labels, so I'll have to just explain what they show. I ran a test
> where I create and then delete 100 files using the VFS, and then
> create/delete 100 files using the pvfs2-* tools. I do this 10
> times, and then sleep for a while, and start over.
> In the first plot (Iterations 1-10), the server and client daemon
> had just been restarted. You can tell which set of 100 operations
> are being timed by the epoll_ctl(ADD/DEL) calls. Even in the first
> 10 iterations (2000 files created and deleted in total), the slowdown is
> visible. The other plots (Iterations 10-20, 20-28, and 28-32) also
> show the degradation. Those are the epoll plots.
Somehow I managed to forget to mention that the x-axis is seconds
since the restart of the server and client, and the y-axis is the
epoll time in seconds.
> <Iterations 1-10.png>
> <Iterations 10-20.png>
> <Iterations 20-28.png>
> <Iterations 28-32.png>
> With poll, I just created one plot, and there aren't add/del events
> to show admin tools vs. the VFS. I also didn't sleep as long
> between iterations. The poll timings on the pvfs2-client daemon
> socket are still getting visibly slower though, as can be seen on
> the right of the graph, where the 100 create/delete operations
> through the admin tools are the thin lines starting from 0.
> For a while I was starting to think that somehow sockets weren't
> getting removed from the epoll set properly, even though we were
> doing epoll_ctl(...EPOLL_CTL_DEL...). The fact that it still
> exists with poll though suggests that it's something else, as the fd
> array passed to poll isn't growing (I checked).
> One thing I noticed: For an unexpected receive, we do a peek on
> the socket (recv(...MSG_PEEK)) and see if a full BMI header is
> there. If it is, we set the socket back to blocking mode, do a
> blocking recv for the header, and then set the socket back to
> non-blocking mode. Rob pointed out that we probably don't need to
> set the socket to blocking. Is there some reason we do that,
> instead of just doing the recv in non-blocking mode, since we've
> already checked that the bytes are there?
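A minimal sketch of the non-blocking variant, for reference -- the
header size and function name here are placeholders, not the actual
BMI tcp code:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>

    #define BMI_HDR_SIZE 24   /* placeholder, not the real header length */

    /* Peek for a full header; if it's all there, consume it without
     * ever switching the socket out of non-blocking mode. */
    static int try_recv_header(int fd, char *hdr)
    {
        ssize_t n = recv(fd, hdr, BMI_HDR_SIZE, MSG_PEEK);
        if (n == 0)
            return -1;                     /* peer closed the connection */
        if (n < BMI_HDR_SIZE)
            return (n < 0 && errno != EAGAIN) ? -1 : 0;  /* not ready yet */

        /* The peek proved BMI_HDR_SIZE bytes are already buffered, so a
         * non-blocking recv of that size can't come back short on TCP. */
        n = recv(fd, hdr, BMI_HDR_SIZE, 0);
        return (n == BMI_HDR_SIZE) ? 1 : -1;
    }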
> Any other ideas on what's going on here are more than welcome. :-)
> On Oct 3, 2007, at 8:29 AM, Phil Carns wrote:
>> We have seen something that might be related, but we aren't sure.
>> In our case, we had servers running 2.6.9 (with epoll enabled) and
>> clients running an old 2.4 kernel (no epoll).
>> We saw that after about a week of very heavy loading, the clients
>> would get extremely slow for both metadata and I/O operations. We
>> did not track it down to the level that you did, but it was clear
>> that every operation was generically slower, rather than being
>> able to point out one particular operation causing the slowdown.
>> Unmounting and remounting did not help (note that pvfs2-client
>> does not close sockets on an unmount). However, the pvfs2-*
>> command line utilities were still running full speed, and
>> restarting either pvfs2-client or the pvfs2-server processes would
>> clear up the problem. From this we assumed it must have been a
>> problem with sockets that are kept open for extended periods. We
>> suspected network errors to be a contributing factor, but were
>> never able to conclusively determine that.
>> We no longer have this environment, however, so we can't really
>> offer much more insight into what was going on. So far we haven't
>> seen it pop up anywhere else.
>> Murali Vilayannur wrote:
>>> Hey Sam,
>>> First off, really nice detective work!!!
>>>> degrades slowly with a long-lived (weeks and months) PVFS volume.
>>>> The degradation is significant -- simple metadata operations are an
>>>> order of magnitude slower after a month or so. The behavior turns
>>>> out to only occur with the VFS and pvfs2-client daemon: the
>>>> performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the
>>>> same set of servers remains good. Restarting the client daemon also
>>>> fixes the problem, suggesting that the long-lived open sockets are
>>>> somehow the cause. The slowness also appears to be at the servers,
>>>> not the clients: the same kernel module and client daemon pointed at
>>>> a different filesystem and set of servers doesn't exhibit the
>>>> performance problem.
>>>> Also, I should mention that the system config is a little different
>>>> than usual. We have IO nodes mounting and unmounting the PVFS
>>>> volume (and stopping the client daemon) with each user's job, which
>>>> is fairly frequent, while on the login nodes the volume remains
>>>> mounted for a long time (and that is where the performance degrades).
>>>> Our hunch here is that epoll or our use of epoll on the servers is
>>>> somehow to blame. Maybe the file descriptors opened on the server
>>>> for pvfs2-client-core are getting pushed down further and further
>>>> into the epoll set, which for some reason is growing with new
>>>> connections coming and going. This might be the case if we were
>>>> failing to remove sockets from the set on disconnect, for example.
>>>> It doesn't look like that's happening though, at least for normal
>>>> disconnects.
>>> Just to make sure, can't we switch to a poll()-based server and see
>>> if we have the same problem?
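For reference, the poll()-based wait would look roughly like the
sketch below; the fd bookkeeping is illustrative, not BMI's actual
socket collection code:

    #include <poll.h>

    /* Wait up to 100ms for activity on any tracked socket and report
     * how many descriptors are ready for servicing. */
    static int wait_for_work(struct pollfd *fds, nfds_t nfds)
    {
        int ready = poll(fds, nfds, 100 /* ms */);
        if (ready <= 0)
            return ready;                /* timeout (0) or error (-1) */

        for (nfds_t i = 0; i < nfds; i++) {
            if (fds[i].revents & (POLLIN | POLLERR | POLLHUP)) {
                /* hand fds[i].fd back to the caller for servicing */
            }
        }
        return ready;
    }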
>>>> It's a PITA to debug, because the servers have to remain running
>>>> for a long time (and the clients have to remain mounted) for the
>>>> problem to be visible. Rob suggested I use strace on the servers to
>>>> see what epoll was doing, and that showed some interesting results.
>>>> Basically, it looks like epoll_wait takes significantly longer when
>>>> clients are doing operations over the VFS, rather than with the
>>>> admin tools. Also, strace reported epoll_ctl(...,
>>>> EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS
>>>> ops, and in those cases it's returning EEXIST.
>>>> I noticed that we add a socket to the epoll set whenever we get
>>>> a new
>>>> connection, or a read or write is posted (enqueue_operation),
>>>> but we
>>>> only remove the socket from the epoll set on errors or disconnects.
>>>> So why are we adding it for reads and writes? Any connected socket
>>>> should already be in the set, no? I think this may be why I'm
>>>> seeing EEXIST with strace.
>>> Yep, agreed; we shouldn't need to add it if it already exists. But
>>> that is not a bug as far as I can tell.
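An idempotent add could look something like the sketch below,
tolerating EEXIST by falling back to a MOD; the helper name is made
up, not the actual BMI tcp code:

    #include <sys/epoll.h>
    #include <stdint.h>
    #include <errno.h>

    /* Add fd to the epoll set once; if it's already registered, just
     * refresh its event mask instead of failing with EEXIST. */
    static int epoll_ensure_added(int epfd, int fd, uint32_t events)
    {
        struct epoll_event ev = { .events = events, .data.fd = fd };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
            return 0;
        if (errno == EEXIST)
            return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
        return -1;
    }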
>>>> Also, is it safe to check the error from epoll_ctl in these cases?
>>> Yep, we should be checking the return value from these calls.
>>> Perhaps make the _add and _remove helpers inline functions with
>>> return codes.
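Something like the following sketch, with the DEL caveat that matters
on older kernels; the names are illustrative, not the actual BMI
functions:

    #include <sys/epoll.h>
    #include <stdint.h>

    /* Thin inline wrappers so every epoll_ctl() result is visible to
     * the caller (0 on success, -1 with errno set on failure). */
    static inline int sc_epoll_add(int epfd, int fd, uint32_t events)
    {
        struct epoll_event ev = { .events = events, .data.fd = fd };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    static inline int sc_epoll_remove(int epfd, int fd)
    {
        /* Kernels before 2.6.9 require a non-NULL event pointer for
         * EPOLL_CTL_DEL, and the servers here run 2.6.5. */
        struct epoll_event ev = { 0 };
        return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev);
    }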
>>>> And finally, assuming PVFS is actually using epoll calls properly,
>>>> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
>>>> cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
>>>> to? Googling epoll and SUSE 2.6.5 isn't turning up anything...
>>> Nope, none that I can think of.