[Pvfs2-developers] Unexpected flow protocol error using unequally
distribution of data with MPI
Sam Lang
slang at mcs.anl.gov
Mon Mar 12 10:52:01 EST 2007
Hi Julian,
Those flow error messages are coming from either BMI or Trove. Based
on the error, my guess is that request processing on the server tells
flow to expect more data (BMI messages), but it doesn't match the
client's request processing, and the client has already sent
everything it has to the server. That's just a guess though.
The error code is EINVAL, so maybe the request processing actually
fails in the flow code. Could you set the server debug level to
'all' and send us the output?
Thanks,
-sam
On Mar 12, 2007, at 6:37 AM, Julian Martin Kunkel wrote:
> Hi guys,
> I found another unexpected behavior :(
> This time I run into trouble when I create an unbalanced distribution over
> the datafiles with MPI_Type_struct. I tried with 5 dataservers and with 2
> dataservers; the example I give here is for 2 dataservers.
> The datatype I use for the view places 64KByte on one server and 128KByte
> on another server.
> blocklens[0] = 1;
> blocklens[1] = 128*1024;
> blocklens[2] = 64*1024;
> blocklens[3] = 1;
> indices[0] = 0;
> indices[1] = 0;
> indices[2] = (128+64)*1024;
> indices[3] = (128+128)*1024;
> old_types[0] = MPI_LB;
> old_types[1] = MPI_BYTE;
> old_types[2] = MPI_BYTE;
> old_types[3] = MPI_UB;
>
> I attached a program which demonstrates the problem for 2 dataservers; it
> writes 100MByte per iteration.
> Once I write more than 1500MByte with MPI_File_write I always get on the
> server machines:
> [E 12:18:30.682262] handle_io_error: flow proto error cleanup started on 0x81669f0, error_code: -1073742095
> [E 12:18:30.682312] handle_io_error: flow proto 0x81669f0 canceled 0 operations, will clean up.
> [E 12:18:30.682326] handle_io_error: flow proto 0x81669f0 error cleanup finished, error_code: -1073742095
> [E 12:18:30.709711] handle_io_error: flow proto error cleanup started on 0x81508c8, error_code: -1073742095
> [E 12:18:30.710381] handle_io_error: flow proto 0x81508c8 canceled 1 operations, will clean up.
> [E 12:18:30.710544] handle_io_error: flow proto 0x81508c8 error cleanup finished, error_code: -1073742095
>
> This is reproducible; I tried maybe 10 times with different programs using
> this pattern. With this program the flow error occurs on iteration 15, when
> the file is about 2GByte in size.
> On disk, ls shows 1GByte per datafile on the 2 dataservers; with du I can
> see the holes: about 1.1 GByte is used on one machine and 514MByte on the
> other server, which seems to reflect the 2:1 distribution correctly...
>
> With 5 dataservers I tried the following imbalanced distribution:
> 10,10,10,10,9 (which means the last server gets 10% less data per
> iteration) and got the same problem once the file is bigger than 2GByte...
> This does not occur if the amount of data is distributed evenly in each
> iteration...
>
> Thanks for helping me out :)
> julian
> <unexpected-pvfs2-flow-error.c>