[PVFS2-developers] pvfs2 failover almost there

Phil Carns pcarns at parl.clemson.edu
Fri Jun 18 09:33:46 EDT 2004


> When one node goes down, the client (pvfs2-cp) reports 
> 
> Error: bmi_tcp: Connection reset by peer
> Warning: BMI attempting reconnect.
> Error: bmi_tcp: Connection refused
> Error: poorly formatted protocol message received.
>    Protocol version mismatch: received version 0 when expecting version 501.
>    Please verify your PVFS2 installation and make sure that the version is
>    consistent.
> msgpairarray decode error: Protocol not supported
> PVFS_sys_write: Protocol not supported
> Error: short write
>
> Any suggestions? 

I think that you will need to turn on some more debugging messages (or 
maybe add new ones) to tell what is going on.  The "poorly formatted..." 
and "decode error..." messages indicate a problem with the request 
decoder interpreting an acknowledgement.  Since there isn't another 
"attempting reconnect" message between the last communication failure 
and the attempt to decode the response, I would guess that there is a 
bug somewhere in the error handling path.  The decoder should 
not be called after getting a communication error.  Adding a 
debugging message where the recv is interpreted (one that prints the job 
error code) would help to determine for sure what is going on.  It 
is also possible that there is a bug further down that causes the job 
interface to return garbage or something.

Just a guess if you are fishing for bug theories :)
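
To make the theory above concrete, here is a minimal sketch of the kind of 
guard and debugging message I have in mind.  None of these names are real 
PVFS2 symbols: struct recv_status, decode_response() and handle_recv() are 
hypothetical stand-ins, and the real job interface surely looks different.  
The only point is that the error code gets printed and the decoder is 
skipped whenever the receive failed.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for a completed receive; not a PVFS2 type. */
struct recv_status
{
    int error_code;      /* 0 on success, negative on communication failure */
    const void *buffer;  /* response bytes; only meaningful if error_code == 0 */
    size_t size;         /* number of bytes in buffer */
};

/* Hypothetical decoder stub: counts its calls so the guard can be checked. */
static int decode_calls = 0;

static int decode_response(const void *buffer, size_t size)
{
    (void)buffer;
    (void)size;
    decode_calls++;
    return 0;
}

/* Print the error code, then decode only if the receive succeeded. */
static int handle_recv(const struct recv_status *st)
{
    fprintf(stderr, "recv completed: error_code=%d, size=%zu\n",
            st->error_code, st->size);

    if (st->error_code != 0)
    {
        /* Propagate the failure; never decode the buffer of a failed recv. */
        return st->error_code;
    }
    return decode_response(st->buffer, st->size);
}
```

If the debugging output shows a nonzero error code immediately before the 
"poorly formatted protocol message" complaint, that would point at a missing 
check like the one in handle_recv().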

The PVFS2 system interface doesn't have any concept of a "short write". 
That message came out of pvfs2-cp:

ret = generic_write(&dest, buffer, total_written,
     buffer_size, &credentials);
if (ret != current_size)
{
     fprintf(stderr, "Error: short write\n");
     ret = -1;
     goto main_out;
}

That error message is a little misleading because the generic_write 
function in there could be returning a negative error code; I suspect it 
should have just reported that the write failed, rather than that the 
write was short, for the pvfs2 case.
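
A rough sketch of what a less misleading check could look like follows.  
This is not the actual pvfs2-cp code: classify_write() and the enum are 
hypothetical, and I am assuming only that a negative return value means an 
error code while a non-negative value is the number of bytes written.

```c
#include <stdio.h>

/* Hypothetical outcome codes; not part of pvfs2-cp. */
enum write_outcome
{
    WRITE_OK,
    WRITE_SHORT,
    WRITE_FAILED
};

/*
 * Classify the return value of a write that was asked to transfer
 * `expected` bytes.  A negative value is treated as an error code,
 * and a smaller non-negative value as a genuine short write.
 */
static enum write_outcome classify_write(long ret, long expected)
{
    if (ret < 0)
    {
        fprintf(stderr, "Error: write failed (error code %ld)\n", ret);
        return WRITE_FAILED;
    }
    if (ret != expected)
    {
        fprintf(stderr, "Error: short write (%ld of %ld bytes)\n",
                ret, expected);
        return WRITE_SHORT;
    }
    return WRITE_OK;
}
```

With something like this, the failover case in the original report would 
print "write failed" along with the error code instead of "short write".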

-Phil

