[Pvfs2-developers] Data-sync mode
slang at mcs.anl.gov
Tue Jun 13 15:44:54 EDT 2006
Thanks for sending out your thoughts and ideas. Even though we've
talked about most of this offline, I'm just going to summarize what I
said in case others want to comment.
On Jun 10, 2006, at 1:57 PM, Julian Martin Kunkel wrote:
> I looked a bit around the implementation of the data sync mode.
> Currently PINT_flow_setinfo is called, which sets the sync mode for
> each write operation of a flow. That means if 100 MByte are
> transferred in blocks of 256 KByte, a sync happens for every block,
> which ends up in quite a lot of syncs.
> Maybe it would be nice if the client could specify in the IO request
> (PVFS_servreq_io) whether the data should be synced, instead of
> setting it per filesystem. Maybe the kernel interface could benefit
> from this to save sync operations, or it could be useful elsewhere?
> Of course, this value can be filled by default with the filesystem's
> TroveSyncData option.
> In MPI there is an explicit sync via MPI_File_sync; maybe we could
> rely on this for MPI apps?
This also requires an additional flag to be added to the parameters
of PVFS_sys_io. The flag would specify whether to sync or not (or
could be extended for other uses). This saves a roundtrip between
client and server because the flag can be sent along with the IO
request (as Julian proposes), instead of doing a separate flush.
When I was looking at the performance of small-io, the overall cost
of the extra roundtrip was negligible once the IO request sizes grew
larger (~32K, IIRC), so the benefit here may not be that great and
may not justify modifying the system interface.
At the same time, in the use case where clients want to specify a
data sync on a per-request basis, letting the server know at the
beginning of an IO operation that it needs to be synced may improve
the sync coalescing behavior, because it gives the server more time
to determine whether multiple IO ops can be synced together.
> Independent of these questions, Rob mentioned that the sync policy
> maybe should be changed, too: for example, to sync the data only at
> the end of the flow, and to coalesce data syncs like the metadata
> syncs.
This is a good idea. In fact, it sounds like we can just change the
'TroveSyncData on' semantics to sync at the end of the entire IO op
instead of after each trove write call that the flow makes. In other
words, we don't need to give the user a config option to sync on
every trove write.
> I think maybe the coalescing of operations should be handled by the
> trove module, because it knows which coalescing method is best for
> the implementation. Or should this be handled by an upper layer
> (e.g. job)?
I would put it in the dbpf layer. The queuing of operations is
handled there (both metadata and io), so you can do your policy stuff
most easily from there. The trove layer just acts as a wrapper for
the underlying implementation, and the job layer is used by the
server thread for testing completion. Since the request scheduler
allows write ops on the same handle to be scheduled immediately, you
should be able to manage everything in dbpf.
> If an I/O scheduler is added to the Trove layer, maybe small write
> requests can be combined like in ROMIO. Also, the policy might
> depend on the server's I/O load and pending I/O jobs.
The problem with doing this on the server is that it's hard to know
in advance that many small IO operations are being done together, unless
they're all sitting in the queue waiting to be serviced. I like the
pvfs2 stance of encouraging client-side data-sieving since in many
cases clients aren't acting independently (if that is the pvfs2
stance, perhaps I'm projecting :-)). In our discussion yesterday
RobL pointed out that the disk scheduler should be doing some amount
of read-ahead, so assuming that the disk operations are the expensive
part, doing many lio_listio calls instead of coalescing them into one
call may not actually matter.
> I will take care of the modifications and evaluate possible policies
> if nobody else is currently working on these issues.