[Pvfs2-developers] duplicate entries in directory listing

Sam Lang slang at mcs.anl.gov
Mon Oct 9 19:23:05 EDT 2006


On Oct 9, 2006, at 3:23 PM, Phil Carns wrote:

> Phil Carns wrote:
>>> I started thinking about some more possible ideas, but I realized  
>>> after looking closer at the code that I don't actually see why  
>>> duplicates would occur in the first place with the algorithm that  
>>> is being used :)  I apologize if this has been discussed a few  
>>> times already, but could we walk through it one more time?
>>>
>>> I know that the request protocol uses a token (integer based) to  
>>> keep track of position.  However, the pcache converts this into a  
>>> particular key based on where the last iteration left off.  This  
>>> key contains the handle as well as the alphanumeric name of the  
>>> entry.
>>>
>>> Trove then does a c_get on that key with the DB_SET flag, which  
>>> should put the cursor at the proper position.  If the entry has  
>>> been deleted (which is not happening in my case- I am only  
>>> creating files), then it retries the c_get with the DB_SET_RANGE  
>>> flag which should set the cursor at the next position.  "next" in  
>>> this case is defined by the comparison function,  
>>> PINT_trove_dbpf_keyval_compare().
>>>
>>> The keyval_comare() function sorts the keys based on handle  
>>> value, then key length, then stncmp of the key name.
>>>
>>> This means that essentially we are indexing off of the name of  
>>> the entry rather than a position in the database.
>>>
>>> So how could inserting a new entry between readdir requests cause  
>>> a duplicate?  The old entry that is stored in the pcache should  
>>> still be valid.  If the newly inserted entry comes after it  
>>> (according to the keyval_comare() sort order), then we should see  
>>> it as we continue iterating.  If the new entry comes before it,  
>>> then it should not show up (we don't back up in the directory  
>>> listing).  It doesn't seem like there should be any combination  
>>> that causes it to show up twice.
>>>
>>> Is c_get() not traversing the db in the order defined by the  
>>> keyval_comare() function?
>>>
>>> The only other danger that I see is that if the pcache_lookup()  
>>> fails, the code falls back to stepping linearly through the db to  
>>> the token position which I could imagine might have ordering  
>>> implications. However, I am only talking to the server from a  
>>> single client, so I don't see why it would ever miss the pcache  
>>> lookup.
>>>
>>> I just want to confirm that there is actually an algorithm  
>>> problem here rather than just a bug in the code somewhere.
>> Oh, or is the problem in how the end of the directory is  
>> detected?  Does the client do something like issuing a readdir  
>> until it gets a response with zero entries?  I haven't looked at  
>> how this works yet, but I imagine that could throw a wrench into  
>> things if the directory gets additional entries between when the  
>> server first indicates that it has reached the end and when the  
>> client gives up on asking for more.
>
> I just tried repeating the test a few times, replacing the "ls" in  
> my test script with either "pvfs2-ls" or "pvfs2-ls -al".  I cannot  
> trigger the problem when using pvfs2-ls.
>
> If I switch back to "ls" or "/bin/ls" the problem shows up reliably.
>
> Is there anything fundamentally different between how pvfs2-ls  
> works and how the vfs readdir path works, or is pvfs2-ls somehow  
> getting luckier with the timing?

I think I may see the problem.  There's some bits in the  
dir.c:pvfs2_readdir kernel module code that check the directory  
version (the mtime) and if its different than the previous value, we  
start over from position 0, except the filldir callback (what the vfs  
reports back to the user for each entry) has already been called for  
the previous 32 entries, so you will see those entries twice.  The  
number of duplicates you got confirms this theory (5024, 5344, 5120,  
and 6048 are all multiples of 32).

Rob mentioned that this has been around for a while.  Maybe the  
changes I made to the request scheduler cause it to occur more  
frequently, but even then, we don't queue crdirents between readdir  
calls, so it seems like it would still have been possible.  On the  
same machine (same kernel module), I would imagine the directory  
version could get updated on the crdirent (file create), but then you  
would miss the newly created files -- in fact the PVFS_sys_create  
doesn't return the parent's mtime anyway so it can't get updated  
right now anyway.

It seems like this directory_version check is a little unnecessary to  
be honest.  If I do an ls, I don't mind only seeing the results from  
the start of the command, even if I don't see the entries added while  
the command is running.  Better to miss some than report some twice,  
but maybe I've got the semantics all wrong.

An interesting side effect of having the pcache (I know I just bagged  
on it earlier today) is that it allows us to sort of keep track of  
readdirs that are "in progress".  For a crdirent, we might be able to  
tack onto all the pcache entries with that handle the new entry,  
allowing it to be returned in the next readdir.  This would avoid the  
need for a client side version check entirely, and no entries would  
be "missed".  It wouldn't be perfect of course.  The pcache is an in- 
memory structure so server restarts and LRU expulsion will probably  
cause missed entries again.

This is probably obvious to everyone by now, but using the component  
name instead of an index for the position doesn't fix this particular  
bug/behavior.

-sam


>
> -Phil
>



More information about the Pvfs2-developers mailing list