[Pvfs2-developers] CoalescingLowWatermark setting

Sam Lang slang at mcs.anl.gov
Thu Sep 21 12:20:01 EDT 2006



Pete Wyckoff wrote:
> I've been debugging why the metadata server calls fdatasync() five
> times during a single create operation.  (IO server separate and
> not considered here.)
> 
> In fs.conf, I had these StorageHints settings
> 
>     TroveSyncMeta no
>     TroveSyncData no
>     CoalescingHighWatermark infinity
>     CoalescingLowWatermark 1
> 
> (defaults from pvfs2-genconfig with trovesync off).
> 
> The logic in dbpf-sync.c goes like this:
> 
>     if (!metadata_sync)
> 	++coalesce_count
> 	if (high_watermark > 0 && coalesce_count >= high_watermark)
> 	    coalesce_count = 0
> 	    sync
> 	if (num_pending_TROVE_SYNC_operations < low_watermark)
> 	    coalesce_count = 0
> 	    sync
> 
> No matter how low the low watermark, any trove operation marked as
> TROVE_SYNC will cause a full sync.  Changing
> "CoalescingLowWatermark" to 0 fixed that---no syncs.
> 
> Do I understand this correctly?  Is setting the low WM to zero what
> was intended?  Any non-zero value of low WM will always cause
> immediate sync after every TROVE_SYNC operation---was this planned?

The intended behavior with TroveSyncMeta=no is to allow trove operations 
marked as TROVE_SYNC to be completed immediately.  It does this by 
moving the operation to the completion queue immediately as the first 
statement inside the if(!metadata_sync) block.  This allows the server 
thread to push through those operations (go to the next state actions, 
return responses, etc.) without waiting for the sync.  That said, the 
code you are referring to will behave in the way you described under 
'low load' conditions.  If there are no other operations in the dbpf op 
queue marked TROVE_SYNC (or fewer than whatever LWM is set to) when that 
second check is made, we sync.  By setting the LWM to 0, you're 
essentially saying that you don't want to ever sync under low load 
conditions.
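
A minimal C sketch of the two watermark checks described above may make 
the low-load case easier to see.  This is not the actual dbpf-sync.c 
code: the names sync_ctx, should_sync and pending_sync_ops are 
hypothetical, and treating a non-positive high watermark as "disabled" 
(standing in for "infinity") is an assumption based on the pseudocode 
quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context; not the real dbpf-sync.c data structure. */
struct sync_ctx {
    int coalesce_count;  /* TROVE_SYNC ops seen since the last sync */
    int high_watermark;  /* assumption: <= 0 disables this check */
    int low_watermark;   /* 0 means never sync because of low load */
};

/*
 * Decide whether to sync after a TROVE_SYNC operation when metadata
 * sync is off.  pending_sync_ops is the number of other queued
 * operations marked TROVE_SYNC at the time of the check.
 */
bool should_sync(struct sync_ctx *ctx, int pending_sync_ops)
{
    assert(ctx != NULL && pending_sync_ops >= 0);

    ++ctx->coalesce_count;

    /* High load: enough operations have been coalesced. */
    if (ctx->high_watermark > 0 &&
        ctx->coalesce_count >= ctx->high_watermark) {
        ctx->coalesce_count = 0;
        return true;
    }

    /* Low load: too few other TROVE_SYNC operations are queued. */
    if (pending_sync_ops < ctx->low_watermark) {
        ctx->coalesce_count = 0;
        return true;
    }

    return false;
}
```

With low_watermark set to 1 and an empty queue, should_sync() returns 
true for every operation.  With low_watermark set to 0 the second check 
can never fire, because the pending count cannot be negative.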

> I'd like to have it sync every 5--10 ops, or from a timeout.  Is
> there some sort of idea that these TROVE_SYNC operations are so
> special that they must run immediately, every time?

The behavior of syncing every operation should only happen under low 
load, and other than delaying other operations that get posted during 
that sync, there shouldn't be any performance differences from not 
syncing at all.  That's the idea anyway.  Once more operations are 
queued (meaning they're not getting serviced immediately), the 
per-operation sync doesn't happen.

> 
> The five syncing MD operations in a create, for those keeping score,
> are:
> 
>     create dspace_create (sync)
>     setattr metafile distribution (sync)
>     setattr dspace_setattr (sync)
>     crdirent write_directory_entry (sync)
>     crdirent dspace_setattr (sync)
> 

If you look at this, though, it's only doing one sync per request 
per database:

request 1:     create dspace_create (dspace sync)
request 2:     setattr metafile distribution (keyval sync)
request 2:     setattr dspace_setattr (dspace sync)
request 3:     crdirent write_directory_entry (keyval sync)
request 3:     crdirent dspace_setattr (dspace sync)

> That's a lot of sync on both dspace and keyval dbs.  The total sync
> time adds 45 ms to the overall operation on a SATA disk.

I agree, but we don't at present group requests, so there's no way to 
tell the trove layer that an operation doesn't need to be synced, 
because another is coming right behind it.  We've talked about methods 
and techniques to fix this, but as I see it, there is information loss 
from client to server, and then further from server state-machines to 
trove layer.  Murali has been suggesting that we do transactions over an 
entire PVFS system interface call, which would only require two syncs 
(one for each db), but that means distributed transactions. :-) 
Julian's request-id work might be useful to us in figuring out whether 
to wait for a sync, esp. for the create case.  I'm not sure the behavior 
would be much different from what we have now, though.  The design of the 
sync coalescing code is really meant to perform well...err better (sync 
less frequently) under high-load conditions, since under low-load 
conditions it really shouldn't matter that you're syncing every time.
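
To make the sync counts concrete, here is a small C sketch that compares 
one sync per operation with one sync per database over a whole system 
interface call.  It only counts; it does not implement transactions.  The 
types and function names (md_op, syncs_per_op, syncs_per_call) are 
hypothetical, and the table is just the five create operations listed 
above.

```c
#include <assert.h>
#include <stddef.h>

enum db_kind { DB_DSPACE = 0, DB_KEYVAL = 1, DB_COUNT = 2 };

/* Hypothetical record of one syncing metadata operation. */
struct md_op {
    int request;       /* request number within the create */
    enum db_kind db;   /* database the operation syncs */
};

/* The five syncing operations of a create, as listed in this thread. */
const struct md_op create_ops[] = {
    {1, DB_DSPACE},  /* create   dspace_create         */
    {2, DB_KEYVAL},  /* setattr  metafile distribution */
    {2, DB_DSPACE},  /* setattr  dspace_setattr        */
    {3, DB_KEYVAL},  /* crdirent write_directory_entry */
    {3, DB_DSPACE},  /* crdirent dspace_setattr        */
};
const size_t create_op_count = sizeof create_ops / sizeof create_ops[0];

/* Low-load behavior today: every operation triggers its own sync. */
size_t syncs_per_op(const struct md_op *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    return n;
}

/* Grouping over the whole call: one sync per database touched. */
size_t syncs_per_call(const struct md_op *ops, size_t n)
{
    int touched[DB_COUNT] = {0};
    size_t count = 0;

    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        int db = (int)ops[i].db;
        assert(db >= 0 && db < DB_COUNT);
        if (!touched[db]) {
            touched[db] = 1;
            ++count;
        }
    }
    return count;
}
```

For the create table, syncs_per_op() gives 5 and syncs_per_call() gives 
2, which matches the "one for each db" figure.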

Just curious, you mentioned 5 calls to fdatasync() in a single create. 
That _should not_ happen, and is a bug if it does.  It's the db->sync 
call that we make 5 times (potentially, depending on parameters and 
load).  Are you seeing fdatasync() for metadata operations?  Also, have 
you seen a big drop in metadata performance?

Let me know.

Thanks,

-sam
> 
> 		-- Pete
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
> 

