[Pvfs2-developers] the halloween bug fixed
Rob Ross
rross at mcs.anl.gov
Fri Oct 5 12:04:49 EDT 2007
Well done Sam; thanks for tracking this one down. -- Rob
Sam Lang wrote:
>
> The halloween bug (what I'm calling it -- it's been haunting us for a
> while now) is that we're adding address references to the bmi address
> list, and never removing them. In the prelude state machine, we make a
> BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through all
> the addresses in the reference list. It's this step that is causing the
> slowdown. As new connections are made, addr refs get added to the list
> and never removed, so the pvfs2-client-core addr ref ends up at the
> bottom of a very long list.
>
> The addr refs aren't getting removed, because in BMI_set_info(addr,
> BMI_DEC_ADDR_REF) -- called from final_response -- the code queries the
> bmi_tcp method on whether the address should be removed, via
> BMI_tcp_get_info(BMI_DROP_ADDR_QUERY). This function always returns
> false (don't drop), unless there was a bmi error somewhere (ECANCEL is
> probably the only one that happens in practice -- due to a timeout).
>
> Since our state actions block the main server thread, this caused
> degradation for all requests received during processing of requests from
> a long-lived socket. New connections hitting the server at different
> times would have been fine though, which is what I was seeing with my
> tests.
>
> The obvious and easy fix is to have bmi-tcp return true from
> DROP_ADDR_QUERY for all address references. As far as I can tell, the
> only thing we save by keeping them around is a little memory allocation
> (the socket gets closed either way).
>
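The mechanism described above, and the effect of the proposed fix, can be sketched roughly as follows. This is not the BMI code: `addr_ref`, `ref_list`, and the `drop_when_idle` flag (standing in for the answer to the drop query) are hypothetical names made up for illustration, and new references are assumed to be pushed at the head so that the oldest one sinks to the bottom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an address reference; not the real BMI struct. */
struct addr_ref {
    int addr;                /* opaque address handle */
    int ref_count;
    struct addr_ref *next;
};

struct ref_list {
    struct addr_ref *head;
    size_t length;           /* entries currently in the list */
    size_t steps;            /* nodes visited by lookups so far */
    bool drop_when_idle;     /* models the answer to the drop query */
};

/* Linear scan: the cost grows with the number of entries ahead of addr. */
static struct addr_ref *ref_list_find(struct ref_list *l, int addr)
{
    for (struct addr_ref *r = l->head; r != NULL; r = r->next) {
        l->steps++;
        if (r->addr == addr)
            return r;
    }
    return NULL;
}

/* A new connection pushes its reference at the head of the list. */
static bool ref_list_add(struct ref_list *l, int addr)
{
    struct addr_ref *r = malloc(sizeof *r);
    if (r == NULL)
        return false;
    r->addr = addr;
    r->ref_count = 0;
    r->next = l->head;
    l->head = r;
    l->length++;
    return true;
}

static bool ref_list_inc(struct ref_list *l, int addr)
{
    struct addr_ref *r = ref_list_find(l, addr);
    if (r == NULL)
        return false;
    r->ref_count++;
    return true;
}

/* Decrement; unlink the entry only if it is idle and the policy allows it. */
static bool ref_list_dec(struct ref_list *l, int addr)
{
    struct addr_ref **link = &l->head;
    while (*link != NULL && (*link)->addr != addr) {
        l->steps++;
        link = &(*link)->next;
    }
    if (*link == NULL)
        return false;
    l->steps++;
    struct addr_ref *r = *link;
    assert(r->ref_count > 0);
    r->ref_count--;
    if (r->ref_count == 0 && l->drop_when_idle) {
        *link = r->next;
        free(r);
        l->length--;
    }
    return true;
}

static void ref_list_free(struct ref_list *l)
{
    while (l->head != NULL) {
        struct addr_ref *r = l->head;
        l->head = r->next;
        free(r);
    }
    l->length = 0;
}
```

In this model, with `drop_when_idle` set to false, a long-lived address followed by 1000 short-lived connections leaves 1001 entries, and a lookup of the long-lived address visits all of them. With the flag set to true the list stays at one entry.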
> In the changes I've been working on to get multiple address support in
> BMI, I've already replaced the linked list with a hashtable, which
> wouldn't have made the problem go away, but the degradation wouldn't
> have been quite as bad (may have made it harder to find, actually).
> Maybe it's time to add some profiling info (perf stats?) to our basic
> list, queue and hash structures that would tell us how big they're getting.
>
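One possible shape for the size statistics suggested above is a small counter struct that a list, queue, or hash table updates on every insert and remove. Everything here (`size_stats` and its functions) is a hypothetical sketch, not an existing PVFS interface.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-container counters; a container would embed one of these. */
struct size_stats {
    const char *name;        /* label used when reporting */
    size_t current;          /* entries currently held */
    size_t high_water;       /* largest size ever observed */
    unsigned long inserts;
    unsigned long removes;
};

/* Call from the container's insert path. */
static void size_stats_on_insert(struct size_stats *s)
{
    s->current++;
    s->inserts++;
    if (s->current > s->high_water)
        s->high_water = s->current;
}

/* Call from the container's remove path. */
static void size_stats_on_remove(struct size_stats *s)
{
    if (s->current > 0)
        s->current--;
    s->removes++;
}

/* Format one report line into buf; returns what snprintf returns. */
static int size_stats_format(const struct size_stats *s, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "%s: current=%zu high_water=%zu inserts=%lu removes=%lu",
                    s->name, s->current, s->high_water,
                    s->inserts, s->removes);
}
```

A container whose inserts keep climbing while removes stay near zero would show up quickly in a periodic dump of these lines.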
> Anyway, thanks to all for contributing to the debugging process for this
> one.
>
> -sam
>
> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
>
>>
>> Hi All,
>>
>> I've been trying to debug a problem with PVFS, where performance
>> degrades slowly with a long-lived (weeks and months) PVFS volume. The
>> degradation is significant -- simple metadata operations are an order
>> of magnitude slower after a month or so. The behavior turns out to
>> only occur with the VFS and pvfs2-client daemon: performance of the
>> admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of servers
>> remains good. Restarting the client daemon also fixes the problem,
>> suggesting that the long-lived open sockets are somehow the cause.
>> The slowness also appears to be at the servers, not the clients: the
>> same kernel module and client daemon to a different filesystem and set
>> of servers doesn't exhibit the performance degradation.
>>
>> Also, I should mention that the system config is a little different
>> than usual. We have IO nodes mounting and unmounting the PVFS volume
>> (and stopping the client daemon) with each user's job, which is fairly
>> frequent, while on the login nodes, the volume remains mounted for a
>> long time (and where the performance degrades).
>>
>> Our hunch here is that epoll or our use of epoll on the servers is
>> somehow to blame. Maybe the file descriptors opened on the server for
>> pvfs2-client-core are getting pushed down further and further into the
>> epoll set, which for some reason is growing with new connections
>> coming and going. This might be the case if we were failing to remove
>> sockets from the set on disconnect, for example. It doesn't look like
>> that's happening though, at least for normal disconnects.
>>
>> It's a PITA to debug, because the servers have to remain running for a
>> long time (and the clients have to remain mounted) for the problem to
>> be visible. Rob suggested I use strace on the servers to see what
>> epoll was doing, and that showed some interesting results. Basically,
>> it looks like epoll_wait takes significantly longer when clients are
>> doing operations over the VFS, rather than with the pvfs2 admin
>> tools. Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...)
>> getting called a few times, even for the VFS ops, and in those cases
>> it's returning EEXIST.
>>
>> I noticed that we add a socket to the epoll set whenever we get a new
>> connection, or a read or write is posted (enqueue_operation), but we
>> only remove the socket from the epoll set on errors or disconnects.
>> So why are we adding it for reads and writes? Any connected socket
>> should already be in the set, no? I think this may be why I'm seeing
>> EEXIST with strace.
>>
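The behavior suspected above is easy to reproduce in isolation on Linux: adding a descriptor that is already in an epoll set fails with EEXIST. A minimal sketch, using a pipe in place of a socket and the hypothetical helper name `errno_of_duplicate_add`:

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers the read end of a pipe with an epoll set twice.
 * Returns the errno of the second EPOLL_CTL_ADD, 0 if it unexpectedly
 * succeeds, or -1 if the setup itself fails.
 */
static int errno_of_duplicate_add(void)
{
    int result = -1;
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
            /* Second add of the same descriptor: capture errno right away. */
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0)
                result = 0;
            else
                result = errno;
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Note that `epoll_create1` is newer than the 2.6.5 kernel discussed here; on older kernels `epoll_create` would be the call to use.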
>> Also, is it safe to check the error from epoll_ctl in
>> BMI_socket_collection_[add|remove]?
>>
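As one way to answer the question above, here is a sketch of add/remove wrappers that do check the result of epoll_ctl but treat the two "already in that state" cases (EEXIST on add, ENOENT on delete) as success. The names `collection_add` and `collection_remove` are hypothetical and are not the BMI_socket_collection functions; whether ignoring those two errors is right for BMI is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Returns 0 on success (including "already registered"), else -errno. */
static int collection_add(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = { .events = events, .data.fd = fd };

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return 0;            /* fd was already in the set */
    return -errno;
}

/* Returns 0 on success (including "was not registered"), else -errno. */
static int collection_remove(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer for DEL. */
    struct epoll_event ev = { 0 };

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)
        return 0;            /* fd was not in the set */
    return -errno;
}
```

Any other error (EBADF, ENOMEM, and so on) is passed back to the caller, which is where a failed removal on disconnect would become visible.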
>> And finally, assuming PVFS is actually using epoll calls properly,
>> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would cause
>> epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant to?
>> Googling epoll and SUSE 2.6.5 isn't turning up anything...
>>
>> Thanks,
>> -sam
>>
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>