[Pvfs2-developers] Server crash bug(s)

Sam Lang slang at mcs.anl.gov
Thu May 29 11:49:45 EDT 2008


Great find, guys.  It looks like this was introduced with the SM
changes a while back. Maybe no one removes the execute bit from
their directories, or we would hopefully have seen this sooner.
Another motivating case for getting a good unit testing framework
and code coverage analysis set up.

Can you commit the fix to head?

Thanks,

-sam

On May 28, 2008, at 3:29 PM, Nicholas Mills wrote:

> OK, we narrowed it down to the lookup state machine. It seems that one
> of the states was returning complete (1) after posting a job. As a
> result, the state machine was being freed while the job was still in
> progress.
>
> We changed the return value from SM_ACTION_COMPLETE to the return
> value of the job and the server stopped crashing in all of our
> previous test cases. A patch against HEAD is attached.
>
> --Nick
>
> On Wed, May 28, 2008 at 2:47 PM, David Bonnie <dbonnie at parl.clemson.edu> wrote:
>> Hey all -
>>
>> Nick and I seem to have found a fairly hefty bug with the server crashing
>> when copying to/from a directory.  Obviously this could cause some serious
>> problems if someone were to crash the server in the middle of writing
>> files.
>>
>> Here's what we've got so far:
>>
>> Copying to a PVFS folder (using pvfs2-cp) from both local and pvfs2
>> share space:
>> Permissions (of destination folder) / Result / Error
>>
>> 000 / Failure / server crashes on an assert(0)
>> 100 / Success / NA
>> 200 / Failure / server crashes with a "double free or corruption" error
>> 300 / Success / NA
>> 400 / Failure / server crashes on an assert(0)
>> 500 / Success / NA
>> 600 / Failure / server crashes on an assert(0)
>> 700 / Success / NA
>>
>> For 400 and 600, the server debug log says the following:
>> "SM current state or trtbl is invalid"
>> "state-machine-fns.c:241 PINT_state_machine_next assertion(0)"
>>
>> As you can see, any write to a folder without execute permissions will
>> crash the server.
>>
>>
>> We checked the same things for reading from a PVFS folder (using pvfs2-cp):
>> Permissions (of source folder) / Result / Error
>>
>> 000 / Failure / server crashes on an assert(0)
>> 100 / Success / NA
>> 200 / Failure / server crashes on the same assertion on line 241 as above
>> 300 / Failure / server doesn't crash, but client will segfault
>> 400 / Failure / server crashes on the same assertion on line 241 as above
>> 500 / Success / NA
>> 600 / Failure / server crashes on the same assertion on line 241 as above
>> 700 / Success / NA
>>
>> pvfs2-ls -l completes as normal for any combination of permissions.
>>
>> It seems like one (or more) of the state machines is dumping out early
>> and throwing the whole thing out of whack.  We recreated the storage
>> space between each run that failed to ensure that we weren't working
>> with a corrupted filespace (since the server was aborting).  Any ideas?
>>
>> This is happening with the code from HEAD on Red Hat Enterprise 5.
>>
>> - Dave
>>
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>
> <lookup.patch>
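
For readers following the thread, the failure mode Nick describes (a state
returns complete after posting a job, so the state machine is freed while the
job is still in flight) can be illustrated with a toy model. This is only a
sketch under stated assumptions, not the PVFS2 code or the attached patch:
the struct, the driver, post_job(), and the SM_ACTION_DEFERRED name and value
are all hypothetical stand-ins. Only SM_ACTION_COMPLETE being 1 comes from the
message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative return codes. SM_ACTION_COMPLETE == 1 matches the thread;
 * SM_ACTION_DEFERRED and its value are assumptions for this sketch. */
enum sm_action { SM_ACTION_DEFERRED = 0, SM_ACTION_COMPLETE = 1 };

/* Hypothetical stand-in for a state machine control block. */
struct toy_smcb {
    bool freed;        /* set when the driver releases the state machine */
    bool job_pending;  /* set while an asynchronous job is outstanding */
};

/* Toy job post: queues the job and reports that it has not finished yet. */
enum sm_action post_job(struct toy_smcb *sm)
{
    sm->job_pending = true;
    return SM_ACTION_DEFERRED;
}

/* Buggy state action: posts a job but claims the state is complete. */
enum sm_action state_buggy(struct toy_smcb *sm)
{
    post_job(sm);
    return SM_ACTION_COMPLETE;
}

/* Fixed state action: passes the job's return value back to the driver. */
enum sm_action state_fixed(struct toy_smcb *sm)
{
    return post_job(sm);
}

/* Toy driver for a one-state machine: it releases the state machine as soon
 * as the state reports complete. Returns true if that release happened while
 * a job was still pending, which is the condition behind the crash. */
bool freed_with_job_pending(enum sm_action (*state)(struct toy_smcb *))
{
    struct toy_smcb sm = { false, false };

    if (state(&sm) == SM_ACTION_COMPLETE) {
        sm.freed = true;
    }
    return sm.freed && sm.job_pending;
}
```

In this model, freed_with_job_pending(state_buggy) reports a premature free,
while freed_with_job_pending(state_fixed) does not, because the driver only
releases the state machine once the state really is complete.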


