[Pvfs2-developers] Review: experimental readcaching code

Murali Vilayannur murali.vilayannur at gmail.com
Wed Nov 1 21:43:48 EST 2006


Hi Pete,

> I must have misunderstood then.  I thought that "pages" was a list
> of all the ones needed for this particular access, while
> "issue_pages" was the (possibly empty) subset that needed to have IO
> started on them.  Then if IO failed, you would just want to kill off
> the issue_pages, not the entire set.  But sounds like you know
> what's going on.

pages -> array of all pages needed for the access
issue_pages -> array of only those pages for which I/O needs to be done
page_list -> linked list of pages for which I/O is needed. this is
required by the read_cache_pages() function and is drained after the
cb finishes for each page.
Hopefully, I have done the right thing there. I will double check today.
>
> >> Little stuff:  no need to cast return value of kmalloc().  There's
> >> also kzalloc() if you're going to do a memset(,0,) next.
> >
> > Is kzalloc() available on older kernels or should I add a configure time
> > option for that?
> > Will change this too..
>
> True, it's somewhat recent.  Ho hum.  Could write your own kzalloc()
> and use the config option, or just not bother and use the older
> interfaces.  I don't care.  Probably either is easier than
> submitting the code to the kernel so they maintain future changes
> like this.  :)

yeah.. I think I will add the configure option and kzalloc stuff.

> Or some magic per-directory opt-in xattr flag.  I like that
> thinking, as it parallels the immutable flag usage nicely.

Exactly.
>
> > In order to do write buffering efficiently, we should try to avoid doing a
> > read before we modify the page and also keep track of the dirty
> > ranges, but that seems like a hard thing to do from a module without vfs
> > changes. Aggregating writebacks for efficiency is also
> > an issue, but I suspect we could look at the ->writepages callback and see
> > if that will work for us.
>
> "Avoid doing a read before we modify the page?"  Wouldn't it go
> something like this?  I likely misunderstand again:
>
>     read:
>         if IS_CTOO || IS_IMMUTABLE
>             if in page cache
>                 return data, dirty or not
>             else
>                 do IO
>     write:
>         if IS_CTOO
>             if in page cache
>                 update data, mark dirty
>             else
>                 do IO, then update and mark dirty
>
> and just let pagecache writeback take care of its business as normal.
>
> Dirty ranges?  Nah, don't bother.  Just do full page granularity.

Oh ok.. I guess if we do whole page granularity we might be ok.. not
sure. think of the following scenario,
I am mostly worried about correctness issues with caching even in the
presence of non-overlapping data.
Initial file size is 0.
Client 1 on node 1 writes to file offsets 0 to 3000 and client 2 on
node 2 from 3001 to 6000. With a page size of 4096, they both share
file block 0.. What ends up on disk for file block 0 with write
buffering? Sadly, we cannot punt and say user error since the program
was operating with nonoverlapping offsets.
Perhaps, we could ask such users/programs from not using the write
buffering algorithm..
Will MPI-IO programs be able to prevent this kind of thing from
happening magically?

>
> And efficiency may just happen with the new writepages.  If it
> doesn't, I don't think we should go out of our way to make this
> usage model particularly fast.

sounds good to me.

>
> > BTW: Do not try the patch just yet. I have done zero testing on it and it
> > might crash your machine.
>
> Heh.  I'm too busy crashing it myself with SCSI patches.

Awesome. I am doing the same at work here :)

> P.S.  Your gmail produces ugly mail.  Can you turn off HTML, and
> maybe get it to do quoting right?

I turned off html. But i have no clue how to do the latter.
Thanks as always for the insightful reviews..
Murali


More information about the Pvfs2-developers mailing list