[Pvfs2-developers] Review: experimental readcaching code
Murali Vilayannur
murali.vilayannur at gmail.com
Wed Nov 1 21:43:48 EST 2006
Hi Pete,
> I must have misunderstood then. I thought that "pages" was a list
> of all the ones needed for this particular access, while
> "issue_pages" was the (possibly empty) subset that needed to have IO
> started on them. Then if IO failed, you would just want to kill off
> the issue_pages, not the entire set. But sounds like you know
> what's going on.
pages -> array of all pages needed for the access
issue_pages -> array of only those pages for which I/O needs to be done
page_list -> linked list of pages for which I/O is needed. this is
required by the read_cache_pages() function and is drained after the
cb finishes for each page.
Hopefully, I have done the right thing there. I will double check today.
>
> >> Little stuff: no need to cast return value of kmalloc(). There's
> >> also kzalloc() if you're going to do a memset(,0,) next.
> >
> > Is kzalloc() available on older kernels or should I add a configure time
> > option for that?
> > Will change this too..
>
> True, it's somewhat recent. Ho hum. Could write your own kzalloc()
> and use the config option, or just not bother and use the older
> interfaces. I don't care. Probably either is easier than
> submitting the code to the kernel so they maintain future changes
> like this. :)
yeah.. I think I will add the configure option and kzalloc stuff.
> Or some magic per-directory opt-in xattr flag. I like that
> thinking, as it parallels the immutable flag usage nicely.
Exactly.
>
> > In order to do write buffering efficiently, we should try to avoid doing a
> > read before we modify the page and also keep track of the dirty
> > ranges, but that seems like a hard thing to do from a module without vfs
> > changes. Aggregating writebacks for efficiency is also
> > an issue, but I suspect we could look at the ->writepages callback and see
> > if that will work for us.
>
> "Avoid doing a read before we modify the page?" Wouldn't it go
> something like this? I likely misunderstand again:
>
> read:
> if IS_CTOO || IS_IMMUTABLE
> if in page cache
> return data, dirty or not
> else
> do IO
> write:
> if IS_CTOO
> if in page cache
> update data, mark dirty
> else
> do IO, then update and mark dirty
>
> and just let pagecache writeback take care of its business as normal.
>
> Dirty ranges? Nah, don't bother. Just do full page granularity.
Oh ok.. I guess if we do whole page granularity we might be ok.. not
sure. think of the following scenario,
I am mostly worried about correctness issues with caching even in the
presence of non-overlapping data.
Initial file size is 0.
Client 1 on node 1 writes to file offsets 0 to 3000 and client 2 on
node 2 from 3001 to 6000. With a page size of 4096, they both share
file block 0.. What ends up on disk for file block 0 with write
buffering? Sadly, we cannot punt and say user error since the program
was operating with nonoverlapping offsets.
Perhaps, we could ask such users/programs from not using the write
buffering algorithm..
Will MPI-IO programs be able to prevent this kind of thing from
happening magically?
>
> And efficiency may just happen with the new writepages. If it
> doesn't, I don't think we should go out of our way to make this
> usage model particularly fast.
sounds good to me.
>
> > BTW: Do not try the patch just yet. I have done zero testing on it and it
> > might crash your machine.
>
> Heh. I'm too busy crashing it myself with SCSI patches.
Awesome. I am doing the same at work here :)
> P.S. Your gmail produces ugly mail. Can you turn off HTML, and
> maybe get it to do quoting right?
I turned off html. But i have no clue how to do the latter.
Thanks as always for the insightful reviews..
Murali
More information about the Pvfs2-developers
mailing list