[PVFS2-developers] pvfs2 failover almost there
pcarns at parl.clemson.edu
Fri Jun 18 09:33:46 EDT 2004
> When one node goes down, the client (pvfs2-cp) reports
> Error: bmi_tcp: Connection reset by peer
> Warning: BMI attempting reconnect.
> Error: bmi_tcp: Connection refused
> Error: poorly formatted protocol message received.
> Protocol version mismatch: received version 0 when expecting version 501.
> Please verify your PVFS2 installation and make sure that the version is
> msgpairarray decode error: Protocol not supported
> PVFS_sys_write: Protocol not supported
> Error: short write
> Any suggestions?
I think that you will need to turn on some more debugging messages (or
maybe add new ones) to tell what is going on. The "poorly formatted.."
and "decode error..." messages indicate a problem with the request
decoder interpreting an acknowledgement. Since there isn't another
"attempting reconnect" message between the last communication failure
and the attempt to decode the response, I would guess that there is a
bug in the error handling path in there somewhere. The decoder should
not be called after getting a communication error. Maybe adding a
debugging message where the recv is interpreted (that prints the job
error code) would be helpful to determine for sure what is going on. It
is also possible that there is a bug further down that causes the job
interface to return garbage or something.
Just a guess if you are fishing for bug theories :)
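To make the idea concrete, here is a rough sketch of the kind of check I
mean. None of these names come from the PVFS2 tree: recv_completion,
decode_response, and handle_recv are hypothetical stand-ins, and I am
assuming the error code is 0 on success and a negative errno-style value
on failure. The point is only that the error code gets printed where the
recv is interpreted, and that the decoder is skipped when it is nonzero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a completed receive; not a real PVFS2 type. */
struct recv_completion {
    int error_code;   /* assumed: 0 on success, negative errno on failure */
    const void *buf;  /* received bytes, only meaningful on success */
    size_t len;
};

/* Counts decoder invocations so a caller can see whether it ran. */
static int decode_calls = 0;

/* Hypothetical decoder; a real one would parse the protocol header. */
static int decode_response(const void *buf, size_t len)
{
    assert(buf != NULL);
    (void)len;
    decode_calls++;
    return 0;
}

/* Print the error code at the point where the recv is interpreted, and
 * only hand the buffer to the decoder if the receive succeeded. */
static int handle_recv(const struct recv_completion *c)
{
    fprintf(stderr, "recv completed: error_code = %d\n", c->error_code);
    if (c->error_code != 0) {
        return c->error_code;  /* propagate; do not touch the buffer */
    }
    return decode_response(c->buf, c->len);
}
```

If the debugging output shows a nonzero error code right before the
"poorly formatted" message, that would point at the error handling path
rather than at the decoder itself.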
The PVFS2 system interface doesn't have any concept of a "short write".
That message came out of pvfs2-cp:
    ret = generic_write(&dest, buffer, total_written,
    if (ret != current_size)
        fprintf(stderr, "Error: short write\n");
        ret = -1;
That error message is a little misleading because the generic_write
function in there could be returning a negative error code; in the
pvfs2 case I suspect it should have just reported that the write failed
rather than that the write was short.
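Something along these lines would separate the two cases. This is only a
sketch and not what pvfs2-cp does today: classify_write and report_write
are hypothetical helpers, and I am assuming the write wrapper returns a
negative value on failure and a byte count otherwise.

```c
#include <stdio.h>

enum write_status { WRITE_OK, WRITE_SHORT, WRITE_FAILED };

/* Hypothetical helper: `ret` is what the write wrapper returned and
 * `expected` is the number of bytes that was requested. */
static enum write_status classify_write(long ret, long expected)
{
    if (ret < 0) {
        return WRITE_FAILED;  /* negative return is an error code */
    }
    if (ret != expected) {
        return WRITE_SHORT;   /* some bytes written, but not all */
    }
    return WRITE_OK;
}

/* Report the outcome on stderr; returns 0 on success and -1 otherwise. */
static int report_write(long ret, long expected)
{
    switch (classify_write(ret, expected)) {
    case WRITE_FAILED:
        fprintf(stderr, "Error: write failed (code %ld)\n", ret);
        return -1;
    case WRITE_SHORT:
        fprintf(stderr, "Error: short write (%ld of %ld bytes)\n",
                ret, expected);
        return -1;
    default:
        return 0;
    }
}
```

With that, a failed PVFS_sys_write would be reported as a failure along
with its code instead of as a short write.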