[Pvfs2-developers] server threads and kernel timers
Sam Lang
slang at mcs.anl.gov
Mon Dec 11 15:34:44 EST 2006
This is some really nice analysis, Pete. One thing we might consider
for reducing context switches would be to re-use the coalescing ideas
we've applied to metadata syncs. If we've got a lot of operations
completing in a particular module (trove, for example), we can signal
once for the whole batch instead of once per operation. The sync
coalescing code already does this, but you would only see any benefit
from it with a bunch of clients.
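A minimal sketch of the signal-coalescing idea, assuming a completion
queue guarded by a pthreads mutex and condition variable. All names here
(completion_queue, cq_complete_each, cq_complete_batch, signal_count) are
made up for illustration and are not the actual PVFS2 structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical completion queue; not the actual PVFS2 structures. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             pending;      /* completed ops not yet consumed */
    unsigned long   signal_count; /* instrumentation for this sketch only */
};

static void cq_init(struct completion_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->pending = 0;
    q->signal_count = 0;
}

/* Uncoalesced: one lock round trip and one signal per completed op. */
static void cq_complete_each(struct completion_queue *q, int nops)
{
    assert(nops >= 0);
    for (int i = 0; i < nops; i++) {
        pthread_mutex_lock(&q->lock);
        q->pending++;
        q->signal_count++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }
}

/* Coalesced: publish the whole batch, then wake the consumer once. */
static void cq_complete_batch(struct completion_queue *q, int nops)
{
    assert(nops >= 0);
    if (nops == 0)
        return;
    pthread_mutex_lock(&q->lock);
    q->pending += nops;
    q->signal_count++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}
```

For a batch of 8 completions the first variant signals 8 times and the
second signals once. With a single outstanding operation the two are
identical, which is why the benefit only shows up under load.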
The locking and context switches are inherent to our design of server
module separation and queueing framework. I think it would be hard
to replace the context switches without doing some serious redesign.
We might be able to eliminate the trove thread though. It doesn't do
anything but move items from the trove completion queue to the job
completion queue. Since that thread waits on a condition variable
(and gets signalled by trove), and then signals the job completion
condition variable, we're essentially doing a double context switch
when we only need one. Instead we could change the trove APIs to
take a callback and user ptr, and have the callback add the completed
job to the completion queue directly. The bits of flow that use job
callbacks with trove would have to be changed too, but I think the
flows would benefit from the bmi callback being called directly from
trove as well. Does this seem reasonable?
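Roughly what I have in mind, as a sketch only. The names below
(completion_cb, job_desc, job_trove_done, storage_op_post, job_cq_try_pop)
are hypothetical and the real trove and job interfaces look different;
the stand-in storage call completes immediately, where a real one would
complete asynchronously from a trove thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names throughout; the real trove and job interfaces differ. */
typedef void (*completion_cb)(void *user_ptr, int error_code);

struct job_desc {
    int              id;
    int              error_code;
    struct job_desc *next;
};

struct job_completion_queue {
    pthread_mutex_t  lock;
    pthread_cond_t   ready;
    struct job_desc *head;
    struct job_desc *tail;
};

static struct job_completion_queue job_cq = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, NULL
};

/* Callback handed to the storage layer. It runs in the completing thread
 * and appends the job directly, so no intermediate thread has to wake up. */
static void job_trove_done(void *user_ptr, int error_code)
{
    struct job_desc *jd = user_ptr;
    assert(jd != NULL);
    jd->error_code = error_code;
    jd->next = NULL;

    pthread_mutex_lock(&job_cq.lock);
    if (job_cq.tail)
        job_cq.tail->next = jd;
    else
        job_cq.head = jd;
    job_cq.tail = jd;
    pthread_cond_signal(&job_cq.ready); /* the only wakeup on this path */
    pthread_mutex_unlock(&job_cq.lock);
}

/* Stand-in for a storage operation that takes a callback and user pointer.
 * It completes immediately here; a real one would complete later. */
static int storage_op_post(completion_cb cb, void *user_ptr)
{
    cb(user_ptr, 0);
    return 0;
}

/* Non-blocking pop from the job completion queue. */
static struct job_desc *job_cq_try_pop(void)
{
    pthread_mutex_lock(&job_cq.lock);
    struct job_desc *jd = job_cq.head;
    if (jd) {
        job_cq.head = jd->next;
        if (!job_cq.head)
            job_cq.tail = NULL;
    }
    pthread_mutex_unlock(&job_cq.lock);
    return jd;
}
```

The point of the user pointer is that the storage layer never needs to
know about job descriptors; it just hands back whatever it was given.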
-sam
On Dec 11, 2006, at 10:26 AM, Pete Wyckoff wrote:
> I've been looking at IB latency, and made some improvements.
> Thought I'd report to the list some more general observations too.
>
> We call gettimeofday() a lot on the server. We also do lots
> of pthreads mutex and condition-wait operations. These all have a
> significant cost, and show up in system-wide profiling.
>
> There are a few options for kernel-provided time services. On a
> single-processor setup, the TSC is your best option, as it uses a
> cycle counter in the processor. But on multi-processor machines,
> this rarely works because the per-CPU counters are not synchronized,
> hence the kernel disables it for SMP. If you have an HPET, that is
> supposed to be very fast and work for SMP, but we aren't so lucky
> here on our 2-way Opterons. Finally, the old slow fallback called
> "pmtimer" uses the ACPI PM timer hardware, requiring port I/O
> operations to get the time.
>
> Test setup: 1 client, 1 MD + IO server. Disable client acache. Put
> storage on a tmpfs. Create a single file in an empty file system.
> Use PVFS_sys_getattr() to get the attributes 10k times in a loop.
> The results are very repeatable with low standard deviation.
> Round-trip time to do one operation is:
>
> 4-threaded server, 2 cpu, pmtimer: 44 us
> 1-thread server, 2 cpu, pmtimer: 35 us
> 1-thread server, 1 cpu, TSC: 29 us
>
> Note the first line is the default build. You have to edit
> Makefile.in to get a single-threaded server.
>
> Using the slow pmtimer compared to the fast TSC costs 6 us (21%).
> Nothing to do about that but avoid using gettimeofday().
>
> Using four threads on the server adds another 9 us (26%). This
> comes from mutex and condition activity in the fast path of every
> operation.
>
> Looking at create times in the same scenario, the results are almost
> exactly multiplied by four, for the four RPCs necessary to do a
> create.
>
> I looked a bit at how to reduce some of the thread overheads, but
> was afraid to change anything significant. I'm not advocating
> getting rid of the threads, as perhaps they allow overlapping of
> operations, especially when both the network and disk and state
> machines are busy. But there are a lot of little locks to grab and
> release along the way for every trove op and every bmi op, and they
> add up, and there are many context switches that have to happen to
> push an op through its path on the server. I don't have any
> thoughts about how to simplify all that.
>
> If you actually do anything to real disk, none of this overhead will
> show up. But for those with battery-backed cache or solid state
> RAM disk, these overheads will be in the way.
>
> -- Pete
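To see how much of the clock-source cost above applies on a given
machine, the per-call cost of gettimeofday() can be estimated with a
tight loop. This is only a sketch: the helper name gettimeofday_cost_us
is made up, and the result will vary with the kernel's clock source and
with how many calls the server makes per operation.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Average cost of one gettimeofday() call in microseconds, measured with
 * gettimeofday() itself over `iters` back-to-back calls. */
static double gettimeofday_cost_us(long iters)
{
    struct timeval start, end, scratch;
    assert(iters > 0);

    gettimeofday(&start, NULL);
    for (long i = 0; i < iters; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&end, NULL);

    double elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
                      + (end.tv_usec - start.tv_usec);
    return elapsed_us / iters;
}
```

Calling gettimeofday_cost_us(1000000) under each clock source and
multiplying by the number of calls per operation gives a rough check
against the measured round-trip difference.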