[Pvfs2-developers] patches: tuning options

Phil Carns pcarns at wastedcycles.org
Thu Aug 10 18:52:39 EDT 2006


Sam Lang wrote:
> 
> On Aug 10, 2006, at 4:04 PM, Phil Carns wrote:
> 
>> flow-proto-tuning.patch:
>> -----------
>> This patch adds "FlowBufferSizeBytes" and "FlowBuffersPerFlow"
>> options to the configuration file format.  They allow you to specify
>> the buffer size that the default flow protocol will use as well as
>> the maximum number of buffers to use per flow.  Note that if you
>> change either of these parameters, then you need to remount any
>> active clients so that they pick up the configuration change before
>> performing any I/O.
>>
>> max-aio.patch:
>> ----------
>> This patch adds "TroveMaxConcurrentIO" to the configuration file
>> format.  It allows you to specify the maximum number of I/O
>> operations that trove will allow to proceed concurrently (currently
>> 16).  Note from the previous email regarding AIO that depending on
>> your access pattern, AIO may queue all of your operations anyway
>> regardless of this setting.  It probably doesn't have much effect
>> unless you are accessing more than one file at a time, or if you are
>> using an alternative to the stock AIO implementation.
>>
> 
> I had made the same change in Julian's branch, 

Oh, ok.  We will switch over when that stuff hits trunk.

> there are still a couple
> things that aren't clear to me about this max value though.  First, it's
> a global value for all outstanding lio_listio calls the pvfs server
> makes, but based on your previous email comments about glibc's
> one-thread-per-fd oddity, it seems like we only want that value to max
> out per datafile.  Also, after we hit the max we just queue the
> operations and post them once current ops have completed.  If librt
> just queues ops and does them in FIFO order though, it's pretty much the
> same thing.  Why not just let librt handle the queuing?  If we were to
> do ordering of the operations based on offsets, then it would make
> sense for us to queue, but we don't.  Are we better at queueing than
> librt?

I agree that if you are using librt for aio, then this max value isn't 
doing much of anything :)  librt's queueing isn't exactly the same thing 
though.  librt allows N operations in flight at a time (where N can be 
tuned using aio_init) by way of limiting the maximum number of threads 
that it will spawn.  However, since it serializes on each fd, that limit 
never kicks in unless you are accessing N different files.  Otherwise it 
is really only going to do one thing at a time.  The librt source that I
looked at happened to default to N=16, just like Trove's limit.
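
As a rough illustration of the point above, here is a toy Python model
(not librt code).  It assumes only what the paragraph describes: requests
on the same fd are serialized, and a thread cap N bounds everything else.
The names effective_concurrency and max_threads are made up for this
sketch.

```python
def effective_concurrency(op_fds, max_threads=16):
    """Upper bound on requests a per-fd-serializing AIO layer runs at once.

    op_fds:      one file descriptor per queued request (hypothetical input).
    max_threads: the worker-thread cap, N in the discussion above.
    """
    # Requests on the same fd run one at a time, so only distinct fds
    # can contribute to parallelism.
    distinct_fds = len(set(op_fds))
    # The thread cap only matters once there are more distinct fds than threads.
    return min(distinct_fds, max_threads)


# 64 requests against a single datafile: one request runs at a time.
single_file = effective_concurrency([7] * 64)

# 64 requests spread over 32 files: now the thread cap is the limit.
many_files = effective_concurrency(list(range(32)) * 2)
```

Under these assumptions single_file comes out as 1 and many_files as 16,
which is the sense in which the limit only kicks in with N different
files.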

I think the point of the aio limit in trove was to try to throttle I/O 
on the servers, but it turns out that librt was already throttling above 
and beyond, so the trove limit wasn't actually decreasing the number of
I/O operations posted to the kernel at all.
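
A small simulation may make this concrete.  It is a sketch under loud
assumptions, not Trove or librt code: every operation takes one tick, the
upper layer holds operations back in FIFO order once trove_max are
outstanding, and the lower layer runs at most one operation per fd and at
most aio_threads in total.  All names here are hypothetical.

```python
from collections import deque


def simulate(op_fds, trove_max, aio_threads=16):
    """Return (ticks_to_finish, peak_ops_running_at_once) for a toy model."""
    pending = deque(op_fds)  # held back by the Trove-style limit
    posted = []              # handed to the AIO layer, kept in FIFO order
    ticks = 0
    peak = 0
    while pending or posted:
        # Upper-layer throttle: post until trove_max operations are outstanding.
        while pending and len(posted) < trove_max:
            posted.append(pending.popleft())
        # Lower layer: scan in FIFO order, at most one op per fd,
        # and at most aio_threads ops in total.
        busy_fds = set()
        running = []
        for index, fd in enumerate(posted):
            if len(running) == aio_threads:
                break
            if fd not in busy_fds:
                busy_fds.add(fd)
                running.append(index)
        peak = max(peak, len(running))
        # Assumption: every running op completes within this tick.
        for index in reversed(running):
            del posted[index]
        ticks += 1
    return ticks, peak


one_file = [7] * 64                  # 64 ops on a single datafile
many_files = list(range(32)) * 2     # 64 ops spread over 32 files

one_file_limited = simulate(one_file, trove_max=16)
one_file_unlimited = simulate(one_file, trove_max=1000)
many_files_tight = simulate(many_files, trove_max=4)
many_files_default = simulate(many_files, trove_max=16)
```

In this model the single-file case gives the same result whether
trove_max is 16 or 1000, because the per-fd serialization below it is
already the tighter limit.  The upper limit only changes the outcome in
the many-files case.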

Maybe the throttling makes more sense when you bypass librt somehow (as 
in the previous patch) because then there is nothing to queue/throttle 
the operations besides trove?

At any rate, we decided to make this configurable before understanding 
the issues involved; it was just a hardcoded value we saw that looked
like it should have been tunable.

> I know Julian was looking at performance of aio and found results
> somewhere (I don't have a reference, sorry) that showed lio_listio did
> better in cases where multiple fds were passed to one lio_listio
> operation (right now we just do one fd with multiple segments to one
> lio_listio).  I wonder if that difference is based on the glibc queuing
> behavior that you describe.

I would guess that the queueing behavior is the reason.  I can't imagine 
that using separate files would make much difference once you get to the
system call level.

> Just a curiosity, but I wonder if the aio
> performance would change if we were to post multiple trove operations
> in the same lio_listio call, or possibly even break up the bstream into
> multiple files based on strip size...sounds crazy right? :-)

On the former question, I guess it depends on who is better at
coalescing: the kernel disk scheduler or the trove queue?  Hopefully we
can avoid splitting files up :)
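
For readers unfamiliar with the term, coalescing here means merging
requests that cover adjacent byte ranges so fewer, larger operations get
issued.  The sketch below shows the generic idea on (offset, length)
pairs; it is not taken from Trove or from any kernel scheduler, and the
function name is hypothetical.

```python
def coalesce(segments):
    """Merge (offset, length) segments that touch or overlap."""
    merged = []
    # Sorting by offset is what makes adjacent segments end up next to
    # each other; a plain FIFO queue would not give this ordering.
    for offset, length in sorted(segments):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


# Four 4 KiB requests arriving out of order; three of them are contiguous.
requests = [(8192, 4096), (0, 4096), (4096, 4096), (65536, 4096)]
merged = coalesce(requests)
```

Here the three contiguous requests collapse into a single 12 KiB
operation at offset 0, and the request at offset 65536 stays separate.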

-Phil


