[Pvfs2-developers] Strange behavior with level2 (MPI-IO.c)
Julian Martin Kunkel
Julian.Kunkel at web.de
Thu Mar 8 11:50:52 EST 2007
Hi,
We see strange and incorrect behavior with PVFS2 when using a file view with
MPI-IO at different access levels :)
mpiexec -np 2 ./MPI-IO -i 4 -f pvfs2://pvfs2/test -s 10 level0
0000000 0000 0000 0000 0000 0000 0101 0101 0101
0000010 0101 0101 0101 0101 0101 0101 0101 0101
*
0000030 0101 0000 0000 0000 0000 0000 0000 0000
0000040 0000 0000 0000
0000046
mpiexec -np 2 ./MPI-IO -i 4 -f pvfs2://pvfs2/test -s 10 level2
0000000 0000 0000 0000 0000 0000 0101 0101 0101
0000010 0101 0101 0000 0000 0000 0000 0000 0101
0000020 0101 0101 0101 0101 0000 0000 0000 0000
0000030 0000 0101 0101 0101 0101 0101 0101 0101
0000040 0101 0101 0101
0000046
In addition, with this level the number of bytes transferred between client
and servers does not match the amount of data that should be transferred...
With level3 (non-contiguous, collective) and level1 (contiguous, collective)
the output looks correct:
0000000 0000 0000 0000 0000 0000 0101 0101 0101
0000010 0101 0101 0000 0000 0000 0000 0000 0101
0000020 0101 0101 0101 0101 0000 0000 0000 0000
0000030 0000 0101 0101 0101 0101 0101 0000 0000
0000040 0000 0000 0000 0101 0101 0101 0101 0101
0000050
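For reference, the layout that the correct dump suggests can be sketched as
below. This is only a sketch under assumptions read off the output: each rank
writes blocks filled with its own rank number, and the blocks interleave
round-robin, so rank r in iteration i lands at offset (i * nprocs + r) * block.
The name expected_pattern is hypothetical and not part of MPI-IO.c. For
-np 2 -i 4 -s 10 this gives 2 * 4 * 10 = 80 bytes, which matches the final
offset 0000050 (hex) of the correct dump, whereas the level0 and level2 dumps
end at 0000046 (70 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from MPI-IO.c): fill buf with the interleaved
 * pattern inferred from the level1/level3 dump. Assumption: rank r writes
 * bytes of value r, and blocks are laid out round-robin over the ranks.
 * Returns the total number of bytes in the pattern. */
size_t expected_pattern(unsigned char *buf, size_t cap,
                        int nprocs, int iterations, size_t block)
{
    size_t total = (size_t)nprocs * (size_t)iterations * block;

    assert(buf != NULL && total <= cap);
    for (int it = 0; it < iterations; it++) {
        for (int rank = 0; rank < nprocs; rank++) {
            /* Offset of the block written by this rank in this iteration. */
            size_t off = ((size_t)it * (size_t)nprocs + (size_t)rank) * block;
            memset(buf + off, rank, block);
        }
    }
    return total;
}
```

Calling expected_pattern(buf, sizeof buf, 2, 4, 10) on a buffer of at least
80 bytes should reproduce the byte sequence of the correct dump.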
The minimum setup where this error occurred was with 3 data servers. However,
sometimes, for example with 4 data servers, the bug may disappear. Using 5
data servers and a bigger file (500K) (mpiexec -np 4 ./MPI-IO -i 10 -f
pvfs2://pvfs2/test -s 50K level2) shows that the content of the file
differs between runs. The md5sum might be, for example:
c809928d82ca72e00469283f2450c5f0
7d215f060b113f81c2210ac6e8e4c6d9
b4ca34c8a8a7b06a9b6d29e4b78964c3
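A checksum only shows that the runs differ, not where. A byte-level check
against the expected layout could narrow that down. The sketch below assumes
the same interleaved layout as the level1/level3 dump (the byte at offset off
should equal (off / block) % nprocs); that assumption and the name
first_mismatch are mine, not taken from MPI-IO.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the offset of the first byte that deviates
 * from the assumed interleaved pattern, or -1 if all len bytes match.
 * Assumption: the byte at offset off belongs to rank (off / block) % nprocs
 * and carries that rank number as its value. */
long first_mismatch(const unsigned char *data, size_t len,
                    int nprocs, size_t block)
{
    assert(data != NULL && nprocs > 0 && block > 0);
    for (size_t off = 0; off < len; off++) {
        unsigned char want = (unsigned char)((off / block) % (size_t)nprocs);
        if (data[off] != want)
            return (long)off;
    }
    return -1;
}
```

Applied to the 70 bytes of the small level2 dump with nprocs = 2 and
block = 10, this reports offset 60 (0x3c), where 0101 appears although the
pattern expects 0000. A file that is too short, as that dump also is, has to
be detected separately by comparing lengths.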
Software: PVFS2 03/08/07 CVS and the new tiled-types-for-mkuhn.diff patch with
the current mpich2-1.0.5-p3...
I did some runs of the levels under valgrind. Among other reported issues,
this showed the following in level0 and level2:
==18294== Invalid read of size 4
==18294== at 0x80EF461: ADIOI_PVFS2_WriteStrided (ad_pvfs2_write.c:392)
==18294== by 0x80AA299: MPIOI_File_write (write.c:156)
==18294== by 0x80A9C80: PMPI_File_write (write.c:52)
==18294== by 0x8056706: ??? (log_mpi_io.c:871)
==18294== by 0x804ACDA: Test_level0 (MPI-IO.c:75)
==18294== by 0x804B699: main (MPI-IO.c:309)
==18294== Address 0x4771460 is 0 bytes after a block of size 8 alloc'd
==18294== at 0x401B867: malloc (vg_replace_malloc.c:149)
==18294== by 0x80B505C: ADIOI_Malloc_fn (malloc.c:50)
==18294== by 0x80B4D66: ADIOI_Optimize_flattened (flatten.c:759)
==18294== by 0x80B3036: ADIOI_Flatten_datatype (flatten.c:79)
==18294== by 0x80BF8C8: ADIO_Set_view (ad_set_view.c:52)
==18294== by 0x80AA85A: PMPI_File_set_view (set_view.c:138)
==18294== by 0x8055CDE: MPI_File_set_view (log_mpi_io.c:611)
==18294== by 0x804AC80: Test_level0 (MPI-IO.c:70)
==18294== by 0x804B699: main (MPI-IO.c:309)
The same shows up for reads in ReadStrided...
These issues are not reported for the other levels and look rather suspicious
to me...
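To spell out what the report means: "0 bytes after a block of size 8" with a
read of size 4 is what valgrind prints when, for instance, element [2] of a
two-element array of 4-byte values is read. The sketch below is purely
illustrative. The flat_list type and flat_get are hypothetical, they are not
ROMIO's flattened-type structure, and whether ad_pvfs2_write.c:392 really
indexes one past the end in this way is not established here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a flattened datatype: count block lengths. */
typedef struct {
    size_t count;
    const int32_t *blocklens;
} flat_list;

/* Bounds-checked access: returns 0 and stores the value in *out when i is
 * valid, -1 otherwise. Reading blocklens[count] directly would touch the
 * 4 bytes that start exactly at the end of the allocation. */
int flat_get(const flat_list *f, size_t i, int32_t *out)
{
    assert(f != NULL && out != NULL);
    if (i >= f->count)
        return -1;
    *out = f->blocklens[i];
    return 0;
}
```

With count = 2 the array occupies 8 bytes, so index 2 is the first invalid
one, which would match the sizes in the report above.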
The following issue is common to all levels:
==18315== Conditional jump or move depends on uninitialised value(s)
==18315== at 0x8121869: PINT_distribute (pint-request.c:740)
==18315== by 0x811FB0B: PINT_process_request (pint-request.c:322)
==18315== by 0x8139641: small_io_completion_fn (sys-small-io.sm:257)
==18315== by 0x8180DD9: msgpairarray_completion_fn (msgpairarray.sm:547)
==18315== by 0x812A648: PINT_state_machine_next (state-machine-fns.h:158)
==18315== by 0x8129D3D: PINT_client_state_machine_test (client-state-machine.c:559)
==18315== by 0x812A1C3: PINT_client_wait_internal (client-state-machine.c:733)
==18315== by 0x812A3C5: PVFS_sys_wait (client-state-machine.c:861)
==18315== by 0x813300A: PVFS_sys_io (sys-io.sm:351)
==18315== by 0x80ECCCD: ADIOI_PVFS2_ReadStrided (ad_pvfs2_read.c:500)
==18315== by 0x80A9571: MPIOI_File_read (read.c:151)
==18315== by 0x80A8F58: PMPI_File_read (read.c:52)
Thanks,
Julian