[Pvfs2-developers] epoll fun
Sam Lang
slang at mcs.anl.gov
Wed Oct 3 11:26:25 EDT 2007
Hi Phil,
Thanks for your response. It sounds like its the same problem. Its
nice to hear that its reproduceable (I'm not insane!), and makes it
that much more important to figure out.
I mention in my email to Murali that we see the same problem with
poll too. So we've taken the approach of trying to write a synthetic
test that will reproduce the problem, which would help us verify that
its poll/epoll related, and not something else we're doing in PVFS.
It would also give us something to show kernel developers that its a
problem. With the test I wrote though, I haven't been able to
reproduce the problem (sigh). I'm not sure it exactly matches the
behavior of BMI tcp, so I'm trying to go through the BMI code to
verify that.
I've attached some plots of the poll and epoll timings (just the
timings of those system calls at the server). I haven't added
labels, so I'll have to just explain what they show. I ran a test
where I create and then delete 100 files using the VFS, and then
create/delete 100 files using the pvfs2-* tools. I do this 10 times,
and then sleep for a while, and start over.
In the first plot (Iterations 1-10), the server and client daemon had
just been restarted. You can tell which set of 100 operations are
being timed by the epoll_ctl(ADD/DEL) calls. Even in the first 10
iterations (each creating/deleting 2000 files), the slowdown is
visible. The other plots (Iterations 10-20, 20-28, and 28-32) also
show the degradation. Those are the epoll plots.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Iterations 1-10.png
Type: image/png
Size: 8416 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20071003/c9a1a2c0/Iterations1-10-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Iterations 10-20.png
Type: image/png
Size: 8466 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20071003/c9a1a2c0/Iterations10-20-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Iterations 20-28.png
Type: image/png
Size: 8741 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20071003/c9a1a2c0/Iterations20-28-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Iterations 28-32.png
Type: image/png
Size: 9303 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20071003/c9a1a2c0/Iterations28-32-0001.png
-------------- next part --------------
With poll, I just created one plot, and there aren't add/del events
to show admin tools vs. the VFS. I also didn't sleep as long between
iterations. The poll timings on the pvfs2-client daemon socket are
still getting visibly slower though, as can be seen on the right of
the graph, where the 100 create/delete operations through the admin
tools are the thin lines starting from 0.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poll.png
Type: image/png
Size: 6091 bytes
Desc: not available
Url : http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20071003/c9a1a2c0/poll-0001.png
-------------- next part --------------
For a while I was starting to think that somehow sockets weren't
getting removed from the epoll set properly, even though we were
doing epoll_ctl(...EPOLL_CTL_DEL...). The fact that it still exists
with poll though suggests that its something else, as the fd array
passed to poll isn't growing (I checked).
One thing I noticed: For an unexpected receive, we do a peek on the
socket (recv(...MSG_PEEK)) and see if a full bmi header is there. If
it is, we set the socket back to blocking mode, do a blocking recv
for the header, and set then set the socket back to non-blocking
mode. Rob pointed out that we probably don't need to set the socket
to blocking. Is there some reason we do that, instead of just do the
recv in non-blocking mode since we've already checked that the bytes
are there?
Any other ideas on what's going on here are more than welcome. :-)
-sam
On Oct 3, 2007, at 8:29 AM, Phil Carns wrote:
> We have seen something that might be related, but we aren't sure.
> In our case, we had servers running 2.6.9 (with epoll enabled) and
> clients running an old 2.4 kernel (no epoll).
>
> We saw that after about a week of very heavy loading, the clients
> would get extremely slow for both metadata and I/O operations. We
> did not track it down to the level that you did, but it was clear
> that every operation was generically slower, rather than being able
> to point out one particular operation causing the slow down.
>
> Unmounting and remounting did not help (note that pvfs2-client does
> not close sockets on an unmount). However, the pvfs2-* command
> line utilities were still running full speed, and restarting either
> pvfs2-client or the pvfs2-server processes would clear up the
> problem. From this we assumed it must have been a problem with
> sockets that are kept open for extended periods. We suspected
> network errors to be a contributing factor, but were never able to
> conclusively determine that.
>
> We no longer have this environment, however, so we can't really
> offer much more insight into what was going on. So far we haven't
> seen it pop up anywhere else.
>
> -Phil
>
> Murali Vilayannur wrote:
>> Hey Sam,
>> Ugh..
>> First off, really nice detective work!!!
>>> degrades slowly with a long-lived (weeks and months) PVFS volume.
>>> The degradation is significant -- simple metadata operations are an
>>> order of magnitude slower after a month or so. The behavior turns
>>> out to only occur with the VFS and pvfs2-client daemon: performance
>>> of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of
>>> servers remains good. Restarting the client daemon also fixes the
>>> problem, suggesting that the long-lived open sockets are somehow the
>>> cause. The slowness also appears to be at the servers not the
>>> clients: the same kernel module and client daemon to a different
>>> filesystem and set of servers doesn't exhibit the performance
>>> degradation.
>>>
>>> Also, I should mention that the system config is a little different
>>> than usual. We have IO nodes mounting and unmounting the PVFS
>>> volume (and stopping the client daemon) with each user's job, which
>>> is fairly frequent, while on the login nodes, the volume remains
>>> mounted for a long time (and where the performance degrades).
>>>
>>> Our hunch here is that epoll or our use of epoll on the servers is
>>> somehow to blame. Maybe the file descriptors opened on the server
>>> for pvfs2-client-core are getting pushed down further and further
>>> into the epoll set, which for some reason is growing with new
>>> connections coming and going. This might be the case if we were
>>> failing to remove sockets from the set on disconnect, for example.
>>> It doesn't look like that's happening though, at least for normal
>>> disconnects.
>> Just to make sure, can't we switch to a poll() based server and
>> see if
>> we have the same problem..
>>> Its a PITA to debug, because the servers have to remain running
>>> for a
>>> long time (and the clients have to remain mounted) for the
>>> problem to
>>> be visible. Rob suggested I use strace on the servers to see what
>>> epoll was doing, and that showed some interesting results.
>>> Basically, it looks like epoll_wait takes significantly longer when
>>> clients are doing operations over the VFS, rather than with the
>>> pvfs2
>>> admin tools. Also, strace reported epoll_ctl(...,
>>> EPOLL_CTL_ADD, ...)) getting called a few times, even for the VFS
>>> ops, and in those cases its returning EEXISTS.
>>>
>>> I noticed that we add a socket to the epoll set whenever we get a
>>> new
>>> connection, or a read or write is posted (enqueue_operation), but we
>>> only remove the socket from the epoll set on errors or disconnects.
>>> So why are we adding it for reads and writes? Any connected socket
>>> should already be in the set, no? I think this may be why I'm
>>> seeing
>>> EEXISTS with strace.
>> yep; Agreed; We shouldn;'t need to add it if it already exists. But
>> that is not a bug as far
>> as I can tell.
>>> Also, is it safe to check the error from epoll_ctl in
>>> BMI_socket_collection_[add|remove]?
>> Yep; We should be checking for the return value from these functions.
>> Perhaps make the _add and _remove as inline functions with return
>> values?
>>> And finally, assuming PVFS is actually using epoll calls properly,
>>> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
>>> cause epoll_ctl(..., EPOLL_CTL_DEL, ....) to not do what its meant
>>> to? Googling epoll and SUSE 2.6.5 isn't turning up anything...
>> Nope. none that I can think of..
>> thanks,
>> Murali
>>> Thanks,
>>> -sam
>>> _______________________________________________
>>> Pvfs2-developers mailing list
>>> Pvfs2-developers at beowulf-underground.org
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
More information about the Pvfs2-developers
mailing list