[Pvfs2-developers] duplicate entries in directory listing

Murali Vilayannur murali.vilayannur at gmail.com
Tue Oct 10 01:16:26 EDT 2006


Hi,
Phil, attached patch fixes the behavior that you described.
Basically the problem is exactly what Sam had described.
There is no point retrying a readdir of 32 entries when the actual ls 
command may have issued previous readdir's that are not retried.
That said, I am sure this will cause other problems such as deleted 
files showing up until the entire listing completes ..
thanks,
Murali
>
> On Oct 9, 2006, at 3:23 PM, Phil Carns wrote:
>
>> Phil Carns wrote:
>>>> I started thinking about some more possible ideas, but I realized 
>>>> after looking closer at the code that I don't actually see why 
>>>> duplicates would occur in the first place with the algorithm that 
>>>> is being used :)  I apologize if this has been discussed a few 
>>>> times already, but could we walk through it one more time?
>>>>
>>>> I know that the request protocol uses a token (integer based) to 
>>>> keep track of position.  However, the pcache converts this into a 
>>>> particular key based on where the last iteration left off.  This 
>>>> key contains the handle as well as the alphanumeric name of the entry.
>>>>
>>>> Trove then does a c_get on that key with the DB_SET flag, which 
>>>> should put the cursor at the proper position.  If the entry has 
>>>> been deleted (which is not happening in my case- I am only creating 
>>>> files), then it retries the c_get with the DB_SET_RANGE flag which 
>>>> should set the cursor at the next position.  "next" in this case is 
>>>> defined by the comparison function, PINT_trove_dbpf_keyval_compare().
>>>>
>>>> The keyval_comare() function sorts the keys based on handle value, 
>>>> then key length, then stncmp of the key name.
>>>>
>>>> This means that essentially we are indexing off of the name of the 
>>>> entry rather than a position in the database.
>>>>
>>>> So how could inserting a new entry between readdir requests cause a 
>>>> duplicate?  The old entry that is stored in the pcache should still 
>>>> be valid.  If the newly inserted entry comes after it (according to 
>>>> the keyval_comare() sort order), then we should see it as we 
>>>> continue iterating.  If the new entry comes before it, then it 
>>>> should not show up (we don't back up in the directory listing).  It 
>>>> doesn't seem like there should be any combination that causes it to 
>>>> show up twice.
>>>>
>>>> Is c_get() not traversing the db in the order defined by the 
>>>> keyval_comare() function?
>>>>
>>>> The only other danger that I see is that if the pcache_lookup() 
>>>> fails, the code falls back to stepping linearly through the db to 
>>>> the token position which I could imagine might have ordering 
>>>> implications. However, I am only talking to the server from a 
>>>> single client, so I don't see why it would ever miss the pcache 
>>>> lookup.
>>>>
>>>> I just want to confirm that there is actually an algorithm problem 
>>>> here rather than just a bug in the code somewhere.
>>> Oh, or is the problem in how the end of the directory is detected?  
>>> Does the client do something like issuing a readdir until it gets a 
>>> response with zero entries?  I haven't looked at how this works yet, 
>>> but I imagine that could throw a wrench into things if the directory 
>>> gets additional entries between when the server first indicates that 
>>> it has reached the end and when the client gives up on asking for more.
>>
>> I just tried repeating the test a few times, replacing the "ls" in my 
>> test script with either "pvfs2-ls" or "pvfs2-ls -al".  I cannot 
>> trigger the problem when using pvfs2-ls.
>>
>> If I switch back to "ls" or "/bin/ls" the problem shows up reliably.
>>
>> Is there anything fundamentally different between how pvfs2-ls works 
>> and how the vfs readdir path works, or is pvfs2-ls somehow getting 
>> luckier with the timing?
>
> I think I may see the problem.  There's some bits in the 
> dir.c:pvfs2_readdir kernel module code that check the directory 
> version (the mtime) and if its different than the previous value, we 
> start over from position 0, except the filldir callback (what the vfs 
> reports back to the user for each entry) has already been called for 
> the previous 32 entries, so you will see those entries twice.  The 
> number of duplicates you got confirms this theory (5024, 5344, 5120, 
> and 6048 are all multiples of 32).
>
> Rob mentioned that this has been around for a while.  Maybe the 
> changes I made to the request scheduler cause it to occur more 
> frequently, but even then, we don't queue crdirents between readdir 
> calls, so it seems like it would still have been possible.  On the 
> same machine (same kernel module), I would imagine the directory 
> version could get updated on the crdirent (file create), but then you 
> would miss the newly created files -- in fact the PVFS_sys_create 
> doesn't return the parent's mtime anyway so it can't get updated right 
> now anyway.
>
> It seems like this directory_version check is a little unnecessary to 
> be honest.  If I do an ls, I don't mind only seeing the results from 
> the start of the command, even if I don't see the entries added while 
> the command is running.  Better to miss some than report some twice, 
> but maybe I've got the semantics all wrong.
>
> An interesting side effect of having the pcache (I know I just bagged 
> on it earlier today) is that it allows us to sort of keep track of 
> readdirs that are "in progress".  For a crdirent, we might be able to 
> tack onto all the pcache entries with that handle the new entry, 
> allowing it to be returned in the next readdir.  This would avoid the 
> need for a client side version check entirely, and no entries would be 
> "missed".  It wouldn't be perfect of course.  The pcache is an 
> in-memory structure so server restarts and LRU expulsion will probably 
> cause missed entries again.
>
> This is probably obvious to everyone by now, but using the component 
> name instead of an index for the position doesn't fix this particular 
> bug/behavior.
>
> -sam
>
>
>>
>> -Phil
>>
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>

-------------- next part --------------
Index: src/kernel/linux-2.6/dir.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/kernel/linux-2.6/dir.c,v
retrieving revision 1.45
diff -u -r1.45 dir.c
--- src/kernel/linux-2.6/dir.c	20 Sep 2006 22:59:53 -0000	1.45
+++ src/kernel/linux-2.6/dir.c	10 Oct 2006 05:11:10 -0000
@@ -125,7 +125,6 @@
     pvfs2_kernel_op_t *new_op = NULL;
     pvfs2_inode_t *pvfs2_inode = PVFS2_I(dentry->d_inode);
 
-  restart_readdir:
 
     pos = (PVFS_ds_position)file->f_pos;
     /* are we done? */
@@ -133,14 +132,11 @@
     {
         gossip_debug(GOSSIP_DIR_DEBUG, "Skipping to graceful termination path since we are done\n");
         pvfs2_inode->directory_version = 0;
-        pvfs2_inode->num_readdir_retries =
-            PVFS2_NUM_READDIR_RETRIES;
         return 0;
     }
 
     gossip_debug(GOSSIP_DIR_DEBUG, "pvfs2_readdir called on %s (pos=%d, "
-                "retry=%d, v=%llu)\n", dentry->d_name.name, (int)pos,
-                (int)pvfs2_inode->num_readdir_retries,
+                "v=%llu)\n", dentry->d_name.name, (int)pos,
                 llu(pvfs2_inode->directory_version));
 
     switch (pos)
@@ -268,29 +264,11 @@
                         rhandle.readdir_response.directory_version;
             }
 
-            if (pvfs2_inode->num_readdir_retries > -1)
+            if (pvfs2_inode->directory_version !=
+                rhandle.readdir_response.directory_version)
             {
-                if (pvfs2_inode->directory_version !=
-                    rhandle.readdir_response.directory_version)
-                {
-                    gossip_debug(GOSSIP_DIR_DEBUG, "detected directory change on listing; "
-                                "starting over\n");
-
-                    file->f_pos = 0;
-                    pvfs2_inode->directory_version =
-                        rhandle.readdir_response.directory_version;
-
-                    readdir_handle_dtor(&rhandle);
-                    op_release(new_op);
-                    pvfs2_inode->num_readdir_retries--;
-                    goto restart_readdir;
-                }
-            }
-            else
-            {
-                gossip_debug(GOSSIP_DIR_DEBUG, "Giving up on readdir retries to avoid "
-                            "possible livelock (%d tries attempted)\n",
-                            PVFS2_NUM_READDIR_RETRIES);
+                pvfs2_inode->directory_version =
+                    rhandle.readdir_response.directory_version;
             }
 
             for (i = 0; i < rhandle.readdir_response.pvfs_dirent_outcount; i++)
@@ -307,8 +285,6 @@
                 {
 graceful_termination_path:
                     pvfs2_inode->directory_version = 0;
-                    pvfs2_inode->num_readdir_retries = PVFS2_NUM_READDIR_RETRIES;
-
                     ret = 0;
                     break;
                 }
@@ -521,7 +497,6 @@
         filldirplus_lite = info->u.plus_lite.filldirplus_lite;
     }
 
-restart_readdir:
 
     pos = (PVFS_ds_position)file->f_pos;
     /* are we done? */
@@ -529,13 +504,10 @@
     {
         gossip_debug(GOSSIP_DIR_DEBUG, "Skipping to graceful termination path since we are done\n");
         pvfs2_inode->directory_version = 0;
-        pvfs2_inode->num_readdir_retries =
-            PVFS2_NUM_READDIR_RETRIES;
         return 0;
     }
     gossip_debug(GOSSIP_DIR_DEBUG, "pvfs2_readdirplus called on %s (pos=%d, "
-                "retry=%d, v=%llu)\n", dentry->d_name.name, (int)pos,
-                (int)pvfs2_inode->num_readdir_retries,
+                "v=%llu)\n", dentry->d_name.name, (int)pos,
                 llu(pvfs2_inode->directory_version));
 
     switch (pos)
@@ -722,29 +694,11 @@
                         rhandle.readdirplus_response.directory_version;
                 }
 
-                if (pvfs2_inode->num_readdir_retries > -1)
-                {
-                    if (pvfs2_inode->directory_version !=
-                        rhandle.readdirplus_response.directory_version)
-                    {
-                        gossip_debug(GOSSIP_DIR_DEBUG, "detected directory change on listing; "
-                                    "starting over\n");
-
-                        file->f_pos = 0;
-                        pvfs2_inode->directory_version =
-                            rhandle.readdirplus_response.directory_version;
-
-                        readdirplus_handle_dtor(&rhandle);
-                        op_release(new_op);
-                        pvfs2_inode->num_readdir_retries--;
-                        goto restart_readdir;
-                    }
-                }
-                else
+                if (pvfs2_inode->directory_version !=
+                    rhandle.readdirplus_response.directory_version)
                 {
-                    gossip_debug(GOSSIP_DIR_DEBUG, "Giving up on readdirplus retries to avoid "
-                                "possible livelock (%d tries attempted)\n",
-                                PVFS2_NUM_READDIR_RETRIES);
+                    pvfs2_inode->directory_version =
+                        rhandle.readdirplus_response.directory_version;
                 }
 
                 for (i = 0; i < rhandle.readdirplus_response.pvfs_dirent_outcount; i++)
@@ -861,8 +815,6 @@
                     {
 graceful_termination_path:
                         pvfs2_inode->directory_version = 0;
-                        pvfs2_inode->num_readdir_retries =
-                            PVFS2_NUM_READDIR_RETRIES;
 
                         ret = 0;
                         break;
Index: src/kernel/linux-2.6/pvfs2-kernel.h
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/kernel/linux-2.6/pvfs2-kernel.h,v
retrieving revision 1.134
diff -u -r1.134 pvfs2-kernel.h
--- src/kernel/linux-2.6/pvfs2-kernel.h	29 Sep 2006 16:48:13 -0000	1.134
+++ src/kernel/linux-2.6/pvfs2-kernel.h	10 Oct 2006 05:11:11 -0000
@@ -144,7 +144,6 @@
 #define PVFS2_SEEK_END                 0x00000002
 #define PVFS2_MAX_NUM_OPTIONS          0x00000004
 #define PVFS2_MAX_MOUNT_OPT_LEN        0x00000080
-#define PVFS2_NUM_READDIR_RETRIES      0x0000000A
 #define PVFS2_MAX_FSKEY_LEN            64
 
 #define MAX_DEV_REQ_UPSIZE (2*sizeof(int32_t) +   \
@@ -358,9 +357,8 @@
 typedef struct
 {
     PVFS_object_ref refn;
-    int num_readdir_retries;
-    uint64_t directory_version;
     char *link_target;
+    uint64_t directory_version;
     /*
      * Reading/Writing Extended attributes need to acquire the appropriate
      * reader/writer semaphore on the pvfs2_inode_t structure.
Index: src/kernel/linux-2.6/pvfs2-utils.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/kernel/linux-2.6/pvfs2-utils.c,v
retrieving revision 1.141
diff -u -r1.141 pvfs2-utils.c
--- src/kernel/linux-2.6/pvfs2-utils.c	26 Sep 2006 03:44:17 -0000	1.141
+++ src/kernel/linux-2.6/pvfs2-utils.c	10 Oct 2006 05:11:12 -0000
@@ -1915,7 +1915,6 @@
         pvfs2_inode->refn.fs_id = PVFS_FS_ID_NULL;
         pvfs2_inode->last_failed_block_index_read = 0;
         pvfs2_inode->link_target = NULL;
-        pvfs2_inode->num_readdir_retries = PVFS2_NUM_READDIR_RETRIES;
         pvfs2_inode->directory_version = 0;
         pvfs2_inode->error_code = 0;
         SetInitFlag(pvfs2_inode);
@@ -1931,7 +1930,6 @@
     pvfs2_inode->refn.handle = PVFS_HANDLE_NULL;
     pvfs2_inode->refn.fs_id = PVFS_FS_ID_NULL;
     pvfs2_inode->last_failed_block_index_read = 0;
-    pvfs2_inode->num_readdir_retries = PVFS2_NUM_READDIR_RETRIES;
     pvfs2_inode->directory_version = 0;
     pvfs2_inode->error_code = 0;
 }


More information about the Pvfs2-developers mailing list