[PVFS-developers] FW: Inode Caching Problem

Rainey Alan - araine Alan.Rainey@acxiom.com
Thu, 18 Sep 2003 10:07:06 -0500


I'm forwarding this note to the list in hopes that someone who is more
experienced with the PVFS kernel module can help.  I have very little
experience with this kernel module/VFS layer/etc.  Basically, we have a
program which reproduces a bug consistently and I'm trying to fix it,
although I may not know the best way to fix it.

Being new to this code, I'm pretty uncomfortable with trying to modify the
upcall or downcall.  I did find a faster solution than doing a
pvfs_revalidate_inode in the kernel every time, though.  Basically, we can
first check to see if the filename in the dentry matches the one in the
pvfs_inop(inode)->name.  This saves you from all revalidations except the
ones that are necessary, however it is still an added compare on every
read/write.  I'll attach the patch here:


diff -ur pvfs-kernel-1.6.0/file.c pvfs-kernel-1.6.0.fix/file.c
--- pvfs-kernel-1.6.0/file.c    Fri Aug  1 16:06:20 2003
+++ pvfs-kernel-1.6.0.fix/file.c        Thu Sep 18 09:25:46 2003
@@ -240,6 +240,15 @@
 
        inode = file->f_dentry->d_inode;
 
+   /* Added by Alan Rainey 9-10-2003 */
+   if(strcmp(file->f_dentry->d_iname, (char
*)(strrchr(pvfs_inop(inode)->name, '/')+1)))
+   {
+          if ((error = pvfs_revalidate_inode(file->f_dentry)) < 0) {
+             PEXIT;
+             return error;
+          }
+   }
+
        memset(buf, 0, count);
 
        PDEBUG(D_FILE, "pvfs_readpage called for %s (%ld), offset %ld, size
%ld\n",
@@ -296,6 +305,15 @@
         */
        inode = file->f_dentry->d_inode;
 
+   /* Added by Alan Rainey 9-10-2003 */
+   if(strcmp(file->f_dentry->d_iname, (char
*)(strrchr(pvfs_inop(inode)->name, '/')+1)))
+   {
+          if ((error = pvfs_revalidate_inode(file->f_dentry)) < 0) {
+             PEXIT;
+             return error;
+          }
+   }
+
        PDEBUG(D_FILE, "pvfs_file_read called for %s (%ld), offset %ld, size
%ld\n",
        pvfs_inop(inode)->name, (unsigned long) pvfs_inop(inode)->handle,
        (long) pvfs_pos, (long) count);
@@ -361,6 +379,15 @@
         */
        inode = file->f_dentry->d_inode;
 
+   /* Added by Alan Rainey 9-10-2003 */
+   if(strcmp(file->f_dentry->d_iname, (char
*)(strrchr(pvfs_inop(inode)->name, '/')+1)))
+   {
+          if ((error = pvfs_revalidate_inode(file->f_dentry)) < 0) {
+             PEXIT;
+             return error;
+          }
+   }
+
        PDEBUG(D_FILE, "pvfs_file_write called for %s (%ld), offset %ld,
size %ld\n",
        pvfs_inop(inode)->name, (unsigned long) pvfs_inop(inode)->handle,
        (long) pvfs_pos, (long) count);



To test the speed, I wrote a simple program that spawned 20 processes and
each process would read a section of a common input file and write its own
output file (a multiprocess file splitter).  This program uses standard I/O
calls.  I ran this on a gigabyte file and measured the performance from a
single PVFS client.  Here are some stats:


               100M Buffer
--- Nopatch ---     --- Patch1 ---       --- NEWPATCH ---
186 Seconds         186 Seconds          186 Seconds
187 Seconds         186 Seconds
186 Seconds         186 Seconds

               100K Buffer
--- Nopatch ---     --- Patch1 ---       --- NEWPATCH ---
239 Seconds         330 Seconds          239 Seconds
240 Seconds         332 Seconds          239 Seconds
239 Seconds         328 Seconds


Notice that with the first patch I sent, if a lot of calls were being made
to read/write functions in the kernel (using a small buffer = more I/O ops),
it did significantly impact performance.  However, with the new patch above,
the speed seems to be the same as it was without the patch.

I believe that this patch is the right solution, but as I've said I am
inexperienced with this code.  I really want/need to fix this problem, but
I'm to the point where I need some help from someone who can either validate
the above patch or possibly offer a better solution.

Any help is *greatly* appreciated!

Alan

> Alan,
> 
> I see what you are saying.  Good description.
> 
> So the problem really is that (a) we're caching file names in the kernel, 
> and (b) the ext2 partitions that we rely on for our handles are giving us 
> back the same handle numbers really quickly.  These together give us a 
> situation that is trouble.
> 
> It seems to me that one thing that we could do is pass a { name, handle } 
> pair from the kernel out to the pvfsd rather than just passing the name.  
> This could be used to validate that the name does in fact match with the 
> handle in an efficient manner (i.e. from the handle that is returned at 
> open time).
> 
> Then we would somehow need to deal with the case where the { name, handle
}
> pair is no longer valid.  Probably we can just force a revalidation and go

> on?
> 
> The reason I think that this is a better approach is that it moves the 
> issue of revalidation down to the pvfsd and can be performed less 
> frequently.
> 
> Does this make sense?  Does someone have a specific suggestion for dealing

> with performing the revalidation when the pvfsd detects that something is 
> amiss?
> 
> Thanks,
> 
> Rob
> 
> On Tue, 16 Sep 2003, Rainey Alan - araine wrote:
> 
>> Right now I don't see a way around doing the revalidate for every
read/write
>> request.  Let me explain, and maybe you can help me resolve this (and
other)
>> issues in a better way.
>> 
>> I'll try to be brief here, there is more info about this on the list.
>> 
>> Simple example:
>> 
>> Box A creates file1 and file2 in that order.
>> Below are the pertinent contents of some of the variables in the struct
file
>> * sent to a read/write operation (pvfs_file_write(), etc.).  The inode
>> handles are an example of what we see.
>> 
>> file1:
>> file->f_dentry->d_iname = "file1"
>> pvfs_inop(file->f_dentry->d_inode)->name = "file1"
>> pvfs_inop(file->f_dentry->d_inode)->handle = 1
>> 
>> file2:
>> file->f_dentry->d_iname = "file2"
>> pvfs_inop(file->f_dentry->d_inode)->name = "file2"
>> pvfs_inop(file->f_dentry->d_inode)->handle = 2
>> 
>> 
>> Now, from Box B we remove file1 and file2.  Then, these files get
recreated
>> on Box A, but possibly in a different order.  We see this:
>> 
>> file1:
>> file->f_dentry->d_iname = "file1"
>> pvfs_inop(file->f_dentry->d_inode)->name = "file2"
>> pvfs_inop(file->f_dentry->d_inode)->handle = 2
>> 
>> file2:
>> file->f_dentry->d_iname = "file2"
>> pvfs_inop(file->f_dentry->d_inode)->name = "file1"
>> pvfs_inop(file->f_dentry->d_inode)->handle = 1
>> 
>> 
>> What happened was that this time when the files were created, file1 got
>> inode 2 and file2 got inode 1.  When the kernel went and found cached
>> versions of inodes for handles 1 and 2, it assigned them, but the data in
>> file->f_dentry->d_inode->u.generic_ip was not updated with the correct
>> filename.  Where and when should this be updated?  If this caching
happens
>> in the kernel and there is no entry point into the pvfs kernel module to
>> modify this information in the inode cache before the inode gets assigned
to
>> this dentry, then I don't see another option besides revalidation of the
>> inode.
>> 
>> One option could be to simply check for this situation (if
>> strcmp(file->f_dentry->d_iname,
pvfs_inop(file->f_dentry->d_inode)->name))
>> before calling pvfs_revalidate_inode.
>> 
>> If there is a more central place where we could fix it so that this
>> situation in pvfs_file_[read|write] never occurs, that would obviously be
>> best.  We see many other cache coherency problems that *could* be caused
by
>> this same thing (seek happens in wrong file, etc.).
>> 
>> I appreciate any help you can provide.







**********************************************************************
The information contained in this communication is
confidential, is intended only for the use of the recipient
named above, and may be legally privileged.
If the reader of this message is not the intended
recipient, you are hereby notified that any dissemination, 
distribution, or copying of this communication is strictly
prohibited.
If you have received this communication in error,
please re-send this communication to the sender and
delete the original message or any copy of it from your
computer system. Thank You.