[Pvfs2-developers] Re: noncontig-test
Scott Atchley
atchley at myri.com
Mon Aug 6 14:49:38 EDT 2007
Walt,
It does not crash, but it clearly should not have 0 or negative
values for bytes. Kyle's original post is that this succeeds when he
uses TCP but does not succeed when using MX. He thought (and I agree)
that changing the BMI method, in general, should not change the result.
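
To illustrate the point about zero or negative values, here is a minimal sketch of a stricter sanity check on the result limits. It is not the PVFS2 code: the struct below is a hypothetical stand-in whose field names follow the gdb printout quoted further down, the field types are assumed, and sketch_check_result is an invented name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the result structure shown by gdb in the
 * quoted trace. Field names follow that printout; the types are assumed. */
struct sketch_result {
    int64_t *offset_array;
    int64_t *size_array;
    int32_t  segmax;   /* maximum number of segments the caller accepts */
    int32_t  segs;     /* segments produced so far */
    int64_t  bytemax;  /* maximum number of bytes the caller accepts */
    int64_t  bytes;    /* bytes produced so far */
};

/* Returns 0 when the limits are usable and -1 otherwise.
 * A test of the form `!r->bytemax` only catches zero; this one also
 * rejects negative limits and counters that exceed their limits. */
int sketch_check_result(const struct sketch_result *r)
{
    if (r == NULL)
        return -1;
    if (r->segmax <= 0 || r->bytemax <= 0)
        return -1;
    if (r->segs < 0 || r->segs > r->segmax)
        return -1;
    if (r->bytes < 0 || r->bytes > r->bytemax)
        return -1;
    return 0;
}
```

With the values from the trace (segmax = 1, segs = 0, bytemax = -1291, bytes = 0), a nonzero test on bytemax passes because -1291 is nonzero, while the sketch above returns -1.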
I have seen applications break when moving from TCP to MX when the
developer overlooks possible race conditions (e.g. assume that all
nodes progress in lock-step when, in fact, some may progress faster
than others).
I do not know where the values are set (BMI, PVFS2 or MPICH2). I plan
to do a little more debugging to try to narrow it down.
Scott
On Aug 6, 2007, at 1:20 PM, walt wrote:
> What do you mean when you say "fails?" What you have shown here
> SHOULD produce an error - it should not crash. The bytemax should
> not be less than bytes, and in any case should not be negative. It
> seems that the caller has for some reason passed an improperly set
> up result structure.
>
> I haven't checked the bmi code, but this appears to be a module that
> is trying to decide which servers have part of the data for this
> request. For this we usually set the bytemax to 1 (which says if
> there is at least one byte on this server, stop and let us know).
> Maybe we should add an error check for a negative bytemax, but at
> least in this case it should have called gossip_error.
>
> Walt
>
> Scott Atchley wrote:
>> Hi Sam,
>> Kyle sent me the code and I compiled it this morning.
>> First, I was using mpich2-mx compiled with PVFS2 support. It
>> failed with the error that MX was already initialized. Both
>> mpich2-mx and bmi_mx are calling mx_init(). I changed bmi_mx to
>> ignore MX_ALREADY_INITIALIZED.
>> Second, I do not see any errors returned in bmi_mx. It fails in
>> PINT_process_request (see call trace below). The request has segs
>> = 0, bytemax = -1291, and bytes = 0.
>> It could well be that these values are incorrect due to a bug in
>> bmi_mx that is not flagging an error, but I have no idea.
>> Can you take a look at this?
>> Thanks,
>> Scott
>> 0: (gdb) b PINT_process_request
>> 0: Breakpoint 2 at 0x4701c8: file src/io/description/pint-request.c, line 72.
>> 0: (gdb) run -fname pvfs2://mnt/pvfs2/atchley/blah -fsize 1 -timing
>> 0: Continuing.
>> 0: ========= Parameter space dump =========
>> 0: filename: pvfs2://mnt/pvfs2/atchley/blah ionodes
>> 0: file size (MB): 1 buffer size 0
>> 0: vector length: 10 element count: 1 vector count: 0
>> 0: striping factor: 0 striping size: -1 collective buffer size: 0
>> 0: loops: 1 displacement 0
>> 0: ========= Dump done =========
>> 0: #* no verification possible!
>> 0: calling noncontigmem_noncontigfile(pvfs2://mnt/pvfs2/atchley/blah, 0x0x2aaaaaaab010, 1048560)
>> 0:
>> 0: # testing noncontiguous in memory, noncontiguous in file using independent I/O
>> 0: # vector count = 26214 - access count = 26214
>> 0: calling MPI_File_open(pvfs2://mnt/pvfs2/atchley/blah)
>> 0: calling MPI_File_set_view()
>> 0: calling MPI_File_seek()
>> 0: calling MPI_File_write()
>> 0: [New Thread 1082132816 (LWP 29290)]
>> 0: [New Thread 1090525520 (LWP 29291)]
>> 0:
>> 0: Breakpoint 2, PINT_process_request (req=0x6aea50, mem=0x6aeb00,
>> 0: rfdata=0x7fffd112b880, result=0x7fffd112b850, mode=2)
>> 0: at src/io/description/pint-request.c:72
>> 0: 72 void *temp_space = NULL; /* temp copy of req state for size call */
>> 0: (gdb)
>> 0: (gdb) bt
>> 0: #0 PINT_process_request (req=0x6aea50, mem=0x6aeb00, rfdata=0x7fffd112b880,
>> 0: result=0x7fffd112b850, mode=2) at src/io/description/pint-request.c:72
>> 0: #1 0x00000000004844e0 in io_find_target_datafiles (mem_req=0x6ad160,
>> 0: file_req=0x6ae960, file_req_offset=0, dist_p=0x6ae9c0, fs_id=1825963815,
>> 0: io_type=PVFS_IO_WRITE, input_handle_array=0x6b9510, input_handle_count=4,
>> 0: handle_index_array=0x6b9240, handle_index_out_count=0x7fffd112b944,
>> 0: sio_handle_index_array=0x6aea30, sio_handle_index_count=0x7fffd112b940)
>> 0: at src/client/sysint/sys-io.sm:2320
>> 0: #2 0x0000000000480010 in io_datafile_setup_msgpairs (sm_p=0x6ba4a0,
>> 0: js_p=0x7fffd112b9f0) at src/client/sysint/sys-io.sm:489
>> 0: #3 0x0000000000476a66 in PINT_state_machine_next (s=0x6ba4a0,
>> 0: r=0x7fffd112b9f0) at ./src/common/misc/state-machine-fns.h:158
>> 0: #4 0x0000000000476645 in PINT_client_state_machine_post (sm_p=0x6ba4a0,
>> 0: pvfs_sys_op=6, op_id=0x7fffd112bb30, user_ptr=0x0)
>> 0: at src/client/sysint/client-state-machine.c:312
>> 0: #5 0x000000000047f9fc in PVFS_isys_io (ref=
>> 0: {handle = 1048563, fs_id = 1825963815, __pad1 = 0}, file_req=0x6ae960,
>> 0: file_req_offset=0, buffer=0x0, mem_req=0x6ad160, credentials=0x6b8ea0,
>> 0: resp_p=0x7fffd112bba0, io_type=PVFS_IO_WRITE, op_id=0x7fffd112bb30,
>> 0: user_ptr=0x0) at src/client/sysint/sys-io.sm:328
>> 0: #6 0x000000000047facf in PVFS_sys_io (ref=
>> 0: {handle = 1048563, fs_id = 1825963815, __pad1 = 0}, file_req=0x6ae960,
>> 0: file_req_offset=0, buffer=0x0, mem_req=0x6ad160, credentials=0x6b8ea0,
>> 0: resp_p=0x7fffd112bba0, io_type=PVFS_IO_WRITE)
>> 0: at src/client/sysint/sys-io.sm:351
>> 0: #7 0x0000000000458cb2 in ADIOI_PVFS2_WriteStrided (fd=0x6b8d00,
>> 0: buf=0x2aaaaaaab010, count=26214, datatype=-1946157050, file_ptr_type=101,
>> 0: offset=0, status=0x7fffd112be30, error_code=0x7fffd112bd70)
>> 0: at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/adio/ad_pvfs2/ad_pvfs2_write.c:1001
>> 0: #8 0x000000000041afcb in MPIOI_File_write (mpi_fh=0x6b8d00, offset=0,
>> 0: file_ptr_type=101, buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,
>> 0: myname=0x63ac74 "MPI_FILE_WRITE", status=0x7fffd112be30)
>> 0: at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/mpi-io/write.c:156
>> 0: #9 0x000000000041aafd in PMPI_File_write (mpi_fh=0x6b8d00,
>> 0: buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,
>> 0: status=0x7fffd112be30)
>> 0: at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/mpi-io/write.c:52
>> 0: #10 0x000000000040461e in noncontigmem_noncontigfile (
>> 0: filename=0x668110 "pvfs2://mnt/pvfs2/atchley/blah", buf=0x2aaaaaaab010,
>> 0: bufsize=1048560, dtype=-1946157050, offset=0, displs=0, finfo=-1677721600,
>> 0: veclen=10, elmtcount=1, veccount=26214) at noncontig.c:185
>> 0: #11 0x000000000040738d in main (argc=1, argv=0x7fffd112c608)
>> 0: at noncontig.c:1020
>> 0: (gdb) s
>> 0: 74 PVFS_offset contig_offset = 0; /* temp for offset of a contig region */
>> 0: (gdb)
>> 0: 78 if (!PINT_IS_MEMREQ(mode))
>> 0: (gdb)
>> 0: 79 gossip_debug(GOSSIP_REQUEST_DEBUG,
>> 0: (gdb)
>> 0: 81 gossip_debug(GOSSIP_REQUEST_DEBUG,"PINT_process_request\n");
>> 0: (gdb)
>> 0: 83 if (!req)
>> 0: (gdb)
>> 0: 88 if (!result || !result->segmax || !result->bytemax)
>> 0: (gdb) p *result
>> 0: $1 = {offset_array = 0x7fffd112b8a8, size_array = 0x7fffd112b8a0, segmax = 1,
>> 0: segs = 0, bytemax = -1291, bytes = 0}