[Pvfs2-developers] server threads and kernel timers

Sam Lang slang at mcs.anl.gov
Mon Dec 11 15:34:44 EST 2006


This is some really nice analysis, Pete.  One thing we might consider
for reducing context switches is to reuse the coalescing ideas we've
applied to metadata syncs.  If a lot of operations are completing in a
particular module (trove, for example), we can signal once for the
whole batch instead of once per operation.  The sync coalescing code
already does this, but you would only see a benefit with a bunch of
clients, since a single client issuing one operation at a time leaves
nothing to batch.
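
To make the coalescing idea concrete, here is a minimal sketch in C.
All of the names (completion_queue, cq_post_batch, cq_wait_and_drain)
and the fixed-size array are made up for illustration; this is not the
actual trove or job code.  It assumes a single consumer that drains
the whole queue each time it wakes, so the producer only needs to
signal when the queue goes from empty to non-empty.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CQ_CAPACITY 64

/* Hypothetical completion queue, not a real PVFS2 structure. */
struct completion_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             ids[CQ_CAPACITY];  /* completed operation ids */
    size_t          count;
    unsigned long   signals_sent;      /* kept only to show the effect */
};

static void cq_init(struct completion_queue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->count = 0;
    q->signals_sent = 0;
}

/* Post n completions under one lock acquisition and wake the consumer
 * at most once.  Returns how many ids fit in the queue. */
static size_t cq_post_batch(struct completion_queue *q,
                            const int *ids, size_t n)
{
    size_t accepted = 0;

    pthread_mutex_lock(&q->lock);
    int was_empty = (q->count == 0);
    while (accepted < n && q->count < CQ_CAPACITY)
        q->ids[q->count++] = ids[accepted++];
    /* Signal only on the empty -> non-empty transition; a consumer
     * that is already awake will see the new entries when it drains. */
    if (was_empty && q->count > 0) {
        pthread_cond_signal(&q->nonempty);
        q->signals_sent++;
    }
    pthread_mutex_unlock(&q->lock);
    return accepted;
}

/* Consumer side: block until something is queued, then take it all.
 * out must have room for CQ_CAPACITY entries. */
static size_t cq_wait_and_drain(struct completion_queue *q, int *out)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    size_t n = q->count;
    for (size_t i = 0; i < n; i++)
        out[i] = q->ids[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With one operation in flight at a time this degenerates to one signal
per operation, which is why the benefit only shows up under load.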

The locking and context switches are inherent to our design: the
server's module separation and its queueing framework.  I think it
would be hard to remove the context switches without some serious
redesign.

We might be able to eliminate the trove thread, though.  It doesn't do
anything but move items from the trove completion queue to the job
completion queue.  Since that thread waits on a condition variable
(and gets signalled by trove), and then signals the job completion
condition variable, we're essentially doing a double context switch
when we only need one.  Instead, we could change the trove APIs to
take a callback and user pointer, and have the callback add the
completed job to the job completion queue directly.  The bits of flow
that use job callbacks with trove would have to be changed too, but I
think the flows would benefit from the bmi callback being called
directly from trove as well.  Does this seem reasonable?
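
A rough sketch of the callback shape I have in mind, in C.  Every
name here (completion_cb, job_queue, job_completion_cb,
storage_op_post) is hypothetical and none of it is the real trove or
job interface; the "storage" call completes immediately just so the
example is self-contained.  The point is only that the completion
path puts the finished op on the job queue itself, with no
intermediate thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define JQ_CAPACITY 64

/* Hypothetical callback type: user pointer plus the result of the op. */
typedef void (*completion_cb)(void *user_ptr, int op_id, int error_code);

/* Hypothetical stand-in for the job completion queue. */
struct job_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             ids[JQ_CAPACITY];
    int             errors[JQ_CAPACITY];
    size_t          count;
};

static void job_queue_init(struct job_queue *jq)
{
    assert(jq != NULL);
    pthread_mutex_init(&jq->lock, NULL);
    pthread_cond_init(&jq->nonempty, NULL);
    jq->count = 0;
}

/* Callback handed to the storage layer.  It appends the completed op
 * to the job queue and signals the job consumer directly, so there is
 * one wakeup per completion instead of two. */
static void job_completion_cb(void *user_ptr, int op_id, int error_code)
{
    struct job_queue *jq = user_ptr;

    pthread_mutex_lock(&jq->lock);
    if (jq->count < JQ_CAPACITY) {
        jq->ids[jq->count] = op_id;
        jq->errors[jq->count] = error_code;
        jq->count++;
        pthread_cond_signal(&jq->nonempty);
    }
    pthread_mutex_unlock(&jq->lock);
}

/* Stand-in for a storage post call that accepts a callback and user
 * pointer.  A real implementation would start asynchronous I/O and
 * invoke cb from its own completion context; here it completes at
 * once with error code 0. */
static int storage_op_post(int op_id, completion_cb cb, void *user_ptr)
{
    assert(cb != NULL);
    cb(user_ptr, op_id, 0);
    return 0;
}
```

A caller would post with something like
storage_op_post(id, job_completion_cb, &jq), and flow could pass its
own callback the same way instead of going through the job layer.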

-sam

On Dec 11, 2006, at 10:26 AM, Pete Wyckoff wrote:

> I've been looking at IB latency, and made some improvements.
> Thought I'd report to the list some more general observations too.
>
> We call gettimeofday() a lot on the server.  We also do lots
> of pthreads mutex and condition-wait operations.  These all have a
> significant cost, and show up in system-wide profiling.
>
> There are a few options for kernel-provided time services.  On a
> single-processor setup, the TSC is your best option, as it uses a
> cycle counter in the processor.  But on multi-processor machines,
> this rarely works because the per-CPU counters are not synchronized,
> hence the kernel disables it for SMP.  If you have an HPET, that is
> supposed to be very fast and work for SMP, but we aren't so lucky
> here on our 2-way Opterons.  Finally, the old slow fallback called
> "pmtimer" uses the ACPI power-management timer hardware, requiring
> port I/O operations to get the time.
>
> Test setup: 1 client, 1 MD + IO server.  Disable client acache.  Put
> storage on a tmpfs.  Create a single file in an empty file system.
> Use PVFS_sys_getattr() to get the attributes 10k times in a loop.
> The results are very repeatable with low standard deviation.
> Round-trip time to do one operation is:
>
>     4-threaded server, 2 cpu, pmtimer:   44 us
>     1-thread   server, 2 cpu, pmtimer:   35 us
>     1-thread   server, 1 cpu, TSC:       29 us
>
> Note the first line is the default build.  You have to edit
> Makefile.in to get a single-threaded server.
>
> Using the slow pmtimer compared to the fast TSC costs 6 us (21%).
> Nothing to do about that but avoid using gettimeofday().
>
> Using four threads on the server adds another 9 us (26%).  This
> comes from mutex and condition activity in the fast path of every
> operation.
>
> Looking at create times in the same scenario, the results are almost
> exactly multiplied by four, for the four RPCs necessary to do a
> create.
>
> I looked a bit at how to reduce some of the thread overheads, but
> was afraid to change anything significant.  I'm not advocating
> getting rid of the threads, as perhaps they allow overlapping of
> operations, especially when the network, disk, and state machines
> are all busy.  But there are a lot of little locks to grab and
> release along the way for every trove op and every bmi op, and they
> add up, and there are many context switches that have to happen to
> push an op through its path on the server.  I don't have any
> thoughts about how to simplify all that.
>
> If you actually do anything to real disk, none of this overhead will
> show up.  But for those with battery-backed cache or solid state
> RAM disk, these overheads will be in the way.
>
> 		-- Pete