[Pvfs2-developers] duplicate entries in directory listing
Murali Vilayannur
murali.vilayannur at gmail.com
Tue Oct 10 01:16:26 EDT 2006
Hi,
Phil, attached patch fixes the behavior that you described.
Basically the problem is exactly what Sam had described.
There is no point retrying a readdir of 32 entries when the actual ls
command may have issued previous readdir's that are not retried.
That said, I am sure this will cause other problems such as deleted
files showing up until the entire listing completes ..
thanks,
Murali
>
> On Oct 9, 2006, at 3:23 PM, Phil Carns wrote:
>
>> Phil Carns wrote:
>>>> I started thinking about some more possible ideas, but I realized
>>>> after looking closer at the code that I don't actually see why
>>>> duplicates would occur in the first place with the algorithm that
>>>> is being used :) I apologize if this has been discussed a few
>>>> times already, but could we walk through it one more time?
>>>>
>>>> I know that the request protocol uses a token (integer based) to
>>>> keep track of position. However, the pcache converts this into a
>>>> particular key based on where the last iteration left off. This
>>>> key contains the handle as well as the alphanumeric name of the entry.
>>>>
>>>> Trove then does a c_get on that key with the DB_SET flag, which
>>>> should put the cursor at the proper position. If the entry has
>>>> been deleted (which is not happening in my case- I am only creating
>>>> files), then it retries the c_get with the DB_SET_RANGE flag which
>>>> should set the cursor at the next position. "next" in this case is
>>>> defined by the comparison function, PINT_trove_dbpf_keyval_compare().
>>>>
>>>> The keyval_comare() function sorts the keys based on handle value,
>>>> then key length, then stncmp of the key name.
>>>>
>>>> This means that essentially we are indexing off of the name of the
>>>> entry rather than a position in the database.
>>>>
>>>> So how could inserting a new entry between readdir requests cause a
>>>> duplicate? The old entry that is stored in the pcache should still
>>>> be valid. If the newly inserted entry comes after it (according to
>>>> the keyval_comare() sort order), then we should see it as we
>>>> continue iterating. If the new entry comes before it, then it
>>>> should not show up (we don't back up in the directory listing). It
>>>> doesn't seem like there should be any combination that causes it to
>>>> show up twice.
>>>>
>>>> Is c_get() not traversing the db in the order defined by the
>>>> keyval_comare() function?
>>>>
>>>> The only other danger that I see is that if the pcache_lookup()
>>>> fails, the code falls back to stepping linearly through the db to
>>>> the token position which I could imagine might have ordering
>>>> implications. However, I am only talking to the server from a
>>>> single client, so I don't see why it would ever miss the pcache
>>>> lookup.
>>>>
>>>> I just want to confirm that there is actually an algorithm problem
>>>> here rather than just a bug in the code somewhere.
>>> Oh, or is the problem in how the end of the directory is detected?
>>> Does the client do something like issuing a readdir until it gets a
>>> response with zero entries? I haven't looked at how this works yet,
>>> but I imagine that could throw a wrench into things if the directory
>>> gets additional entries between when the server first indicates that
>>> it has reached the end and when the client gives up on asking for more.
>>
>> I just tried repeating the test a few times, replacing the "ls" in my
>> test script with either "pvfs2-ls" or "pvfs2-ls -al". I cannot
>> trigger the problem when using pvfs2-ls.
>>
>> If I switch back to "ls" or "/bin/ls" the problem shows up reliably.
>>
>> Is there anything fundamentally different between how pvfs2-ls works
>> and how the vfs readdir path works, or is pvfs2-ls somehow getting
>> luckier with the timing?
>
> I think I may see the problem. There's some bits in the
> dir.c:pvfs2_readdir kernel module code that check the directory
> version (the mtime) and if its different than the previous value, we
> start over from position 0, except the filldir callback (what the vfs
> reports back to the user for each entry) has already been called for
> the previous 32 entries, so you will see those entries twice. The
> number of duplicates you got confirms this theory (5024, 5344, 5120,
> and 6048 are all multiples of 32).
>
> Rob mentioned that this has been around for a while. Maybe the
> changes I made to the request scheduler cause it to occur more
> frequently, but even then, we don't queue crdirents between readdir
> calls, so it seems like it would still have been possible. On the
> same machine (same kernel module), I would imagine the directory
> version could get updated on the crdirent (file create), but then you
> would miss the newly created files -- in fact the PVFS_sys_create
> doesn't return the parent's mtime anyway so it can't get updated right
> now anyway.
>
> It seems like this directory_version check is a little unnecessary to
> be honest. If I do an ls, I don't mind only seeing the results from
> the start of the command, even if I don't see the entries added while
> the command is running. Better to miss some than report some twice,
> but maybe I've got the semantics all wrong.
>
> An interesting side effect of having the pcache (I know I just bagged
> on it earlier today) is that it allows us to sort of keep track of
> readdirs that are "in progress". For a crdirent, we might be able to
> tack onto all the pcache entries with that handle the new entry,
> allowing it to be returned in the next readdir. This would avoid the
> need for a client side version check entirely, and no entries would be
> "missed". It wouldn't be perfect of course. The pcache is an
> in-memory structure so server restarts and LRU expulsion will probably
> cause missed entries again.
>
> This is probably obvious to everyone by now, but using the component
> name instead of an index for the position doesn't fix this particular
> bug/behavior.
>
> -sam
>
>
>>
>> -Phil
>>
>
> _______________________________________________
> Pvfs2-developers mailing list
> Pvfs2-developers at beowulf-underground.org
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
-------------- next part --------------
Index: src/kernel/linux-2.6/dir.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/kernel/linux-2.6/dir.c,v
retrieving revision 1.45
diff -u -r1.45 dir.c
--- src/kernel/linux-2.6/dir.c 20 Sep 2006 22:59:53 -0000 1.45
+++ src/kernel/linux-2.6/dir.c 10 Oct 2006 05:11:10 -0000
@@ -125,7 +125,6 @@
pvfs2_kernel_op_t *new_op = NULL;
pvfs2_inode_t *pvfs2_inode = PVFS2_I(dentry->d_inode);
- restart_readdir:
pos = (PVFS_ds_position)file->f_pos;
/* are we done? */
@@ -133,14 +132,11 @@
{
gossip_debug(GOSSIP_DIR_DEBUG, "Skipping to graceful termination path since we are done\n");
pvfs2_inode->directory_version = 0;
- pvfs2_inode->num_readdir_retries =
- PVFS2_NUM_READDIR_RETRIES;
return 0;
}
gossip_debug(GOSSIP_DIR_DEBUG, "pvfs2_readdir called on %s (pos=%d, "
- "retry=%d, v=%llu)\n", dentry->d_name.name, (int)pos,
- (int)pvfs2_inode->num_readdir_retries,
+ "v=%llu)\n", dentry->d_name.name, (int)pos,
llu(pvfs2_inode->directory_version));
switch (pos)
@@ -268,29 +264,11 @@
rhandle.readdir_response.directory_version;
}
- if (pvfs2_inode->num_readdir_retries > -1)
+ if (pvfs2_inode->directory_version !=
+ rhandle.readdir_response.directory_version)
{
- if (pvfs2_inode->directory_version !=
- rhandle.readdir_response.directory_version)
- {
- gossip_debug(GOSSIP_DIR_DEBUG, "detected directory change on listing; "
- "starting over\n");
-
- file->f_pos = 0;
- pvfs2_inode->directory_version =
- rhandle.readdir_response.directory_version;
-
- readdir_handle_dtor(&rhandle);
- op_release(new_op);
- pvfs2_inode->num_readdir_retries--;
- goto restart_readdir;
- }
- }
- else
- {
- gossip_debug(GOSSIP_DIR_DEBUG, "Giving up on readdir retries to avoid "
- "possible livelock (%d tries attempted)\n",
- PVFS2_NUM_READDIR_RETRIES);
+ pvfs2_inode->directory_version =
+ rhandle.readdir_response.directory_version;
}
for (i = 0; i < rhandle.readdir_response.pvfs_dirent_outcount; i++)
@@ -307,8 +285,6 @@
{
graceful_termination_path:
pvfs2_inode->directory_version = 0;
- pvfs2_inode->num_readdir_retries = PVFS2_NUM_READDIR_RETRIES;
-
ret = 0;
break;
}
@@ -521,7 +497,6 @@
filldirplus_lite = info->u.plus_lite.filldirplus_lite;
}
-restart_readdir:
pos = (PVFS_ds_position)file->f_pos;
/* are we done? */
@@ -529,13 +504,10 @@
{
gossip_debug(GOSSIP_DIR_DEBUG, "Skipping to graceful termination path since we are done\n");
pvfs2_inode->directory_version = 0;
- pvfs2_inode->num_readdir_retries =
- PVFS2_NUM_READDIR_RETRIES;
return 0;
}
gossip_debug(GOSSIP_DIR_DEBUG, "pvfs2_readdirplus called on %s (pos=%d, "
- "retry=%d, v=%llu)\n", dentry->d_name.name, (int)pos,
- (int)pvfs2_inode->num_readdir_retries,
+ "v=%llu)\n", dentry->d_name.name, (int)pos,
llu(pvfs2_inode->directory_version));
switch (pos)
@@ -722,29 +694,11 @@
rhandle.readdirplus_response.directory_version;
}
- if (pvfs2_inode->num_readdir_retries > -1)
- {
- if (pvfs2_inode->directory_version !=
- rhandle.readdirplus_response.directory_version)
- {
- gossip_debug(GOSSIP_DIR_DEBUG, "detected directory change on listing; "
- "starting over\n");
-
- file->f_pos = 0;
- pvfs2_inode->directory_version =
- rhandle.readdirplus_response.directory_version;
-
- readdirplus_handle_dtor(&rhandle);
- op_release(new_op);
- pvfs2_inode->num_readdir_retries--;
- goto restart_readdir;
- }
- }
- else
+ if (pvfs2_inode->directory_version !=
+ rhandle.readdirplus_response.directory_version)
{
- gossip_debug(GOSSIP_DIR_DEBUG, "Giving up on readdirplus retries to avoid "
- "possible livelock (%d tries attempted)\n",
- PVFS2_NUM_READDIR_RETRIES);
+ pvfs2_inode->directory_version =
+ rhandle.readdirplus_response.directory_version;
}
for (i = 0; i < rhandle.readdirplus_response.pvfs_dirent_outcount; i++)
@@ -861,8 +815,6 @@
{
graceful_termination_path:
pvfs2_inode->directory_version = 0;
- pvfs2_inode->num_readdir_retries =
- PVFS2_NUM_READDIR_RETRIES;
ret = 0;
break;
Index: src/kernel/linux-2.6/pvfs2-kernel.h
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/kernel/linux-2.6/pvfs2-kernel.h,v
retrieving revision 1.134
diff -u -r1.134 pvfs2-kernel.h
--- src/kernel/linux-2.6/pvfs2-kernel.h 29 Sep 2006 16:48:13 -0000 1.134
+++ src/kernel/linux-2.6/pvfs2-kernel.h 10 Oct 2006 05:11:11 -0000
@@ -144,7 +144,6 @@
#define PVFS2_SEEK_END 0x00000002
#define PVFS2_MAX_NUM_OPTIONS 0x00000004
#define PVFS2_MAX_MOUNT_OPT_LEN 0x00000080
-#define PVFS2_NUM_READDIR_RETRIES 0x0000000A
#define PVFS2_MAX_FSKEY_LEN 64
#define MAX_DEV_REQ_UPSIZE (2*sizeof(int32_t) + \
@@ -358,9 +357,8 @@
typedef struct
{
PVFS_object_ref refn;
- int num_readdir_retries;
- uint64_t directory_version;
char *link_target;
+ uint64_t directory_version;
/*
* Reading/Writing Extended attributes need to acquire the appropriate
* reader/writer semaphore on the pvfs2_inode_t structure.
Index: src/kernel/linux-2.6/pvfs2-utils.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/kernel/linux-2.6/pvfs2-utils.c,v
retrieving revision 1.141
diff -u -r1.141 pvfs2-utils.c
--- src/kernel/linux-2.6/pvfs2-utils.c 26 Sep 2006 03:44:17 -0000 1.141
+++ src/kernel/linux-2.6/pvfs2-utils.c 10 Oct 2006 05:11:12 -0000
@@ -1915,7 +1915,6 @@
pvfs2_inode->refn.fs_id = PVFS_FS_ID_NULL;
pvfs2_inode->last_failed_block_index_read = 0;
pvfs2_inode->link_target = NULL;
- pvfs2_inode->num_readdir_retries = PVFS2_NUM_READDIR_RETRIES;
pvfs2_inode->directory_version = 0;
pvfs2_inode->error_code = 0;
SetInitFlag(pvfs2_inode);
@@ -1931,7 +1930,6 @@
pvfs2_inode->refn.handle = PVFS_HANDLE_NULL;
pvfs2_inode->refn.fs_id = PVFS_FS_ID_NULL;
pvfs2_inode->last_failed_block_index_read = 0;
- pvfs2_inode->num_readdir_retries = PVFS2_NUM_READDIR_RETRIES;
pvfs2_inode->directory_version = 0;
pvfs2_inode->error_code = 0;
}
More information about the Pvfs2-developers
mailing list