[Pvfs2-developers] Re: noncontig-test

Scott Atchley atchley at myri.com
Mon Aug 6 14:49:38 EDT 2007


Walt,

It does not crash, but it clearly should not have 0 or negative
values for bytes. Kyle's original post said that this succeeds when he
uses TCP but does not succeed when using MX. He thought (and I agree)
that changing the BMI method should not, in general, change the result.

I have seen applications break when moving from TCP to MX when the
developer overlooks possible race conditions (e.g. assuming that all
nodes progress in lock-step when, in fact, some may progress faster
than others).

I do not know where the values are set (BMI, PVFS2 or MPICH2). I plan to
do a little more debugging to try to narrow it down.
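
For reference, the test gdb shows at pint-request.c line 88
(!result || !result->segmax || !result->bytemax) only catches a zero
bytemax; -1291 is nonzero, so it passes that test. A minimal sketch of a
stricter check might look like the following. The struct here is a
hypothetical stand-in holding only the scalar fields printed by gdb, the
field types are assumptions, and check_result() is an illustrative name,
not an existing PVFS2 function.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the result structure printed by gdb.
 * Only the scalar fields are modeled; the widths are assumptions. */
typedef struct {
    int32_t segmax;   /* maximum number of segments the caller accepts */
    int32_t segs;     /* segments produced so far */
    int64_t bytemax;  /* maximum number of bytes the caller accepts */
    int64_t bytes;    /* bytes produced so far */
} result_sketch;

/* Illustrative check: returns 0 if the structure looks usable, -1 if not.
 * Unlike a plain "!bytemax" test, this also rejects negative limits. */
static int check_result(const result_sketch *r)
{
    if (r == NULL) {
        fprintf(stderr, "check_result: NULL result\n");
        return -1;
    }
    if (r->segmax <= 0 || r->bytemax <= 0) {
        fprintf(stderr, "check_result: bad limits segmax=%ld bytemax=%lld\n",
                (long)r->segmax, (long long)r->bytemax);
        return -1;
    }
    if (r->segs < 0 || r->bytes < 0) {
        fprintf(stderr, "check_result: negative counters segs=%ld bytes=%lld\n",
                (long)r->segs, (long long)r->bytes);
        return -1;
    }
    return 0;
}
```

With the values from the trace (segmax = 1, segs = 0, bytemax = -1291,
bytes = 0) this sketch returns -1, while the same structure with
bytemax = 1 is accepted. In PVFS2 itself the message would presumably go
through gossip rather than stderr.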

Scott

On Aug 6, 2007, at 1:20 PM, walt wrote:

> What do you mean when you say "fails"? What you have shown here
> SHOULD produce an error - it should not crash. The bytemax should
> not be less than bytes, and in any case should not be negative. It
> seems that the caller has for some reason passed an improperly set
> up result structure.
>
> I haven't checked the bmi code, but this appears to be a module that
> is trying to decide which servers have part of the data for this  
> request. For this we usually set the bytemax to 1 (which says if  
> there is at least one byte on this server, stop and let us know).   
> Maybe we should add an error check for a negative bytemax, but at  
> least in this case it should have called gossip_error.
>
> Walt
>
> Scott Atchley wrote:
>> Hi Sam,
>> Kyle sent me the code and I compiled it this morning.
>> First, I was using mpich2-mx compiled with PVFS2 support. It
>> failed with the error that MX was already initialized. Both
>> mpich2-mx and bmi_mx are calling mx_init(). I changed bmi_mx to ignore
>> MX_ALREADY_INITIALIZED.
>> Second, I do not see any errors returned in bmi_mx. It fails in  
>> PINT_process_request (see call trace below). The request has segs  
>> = 0,  bytemax = -1291, and bytes = 0.
>> It could well be that these values are incorrect due to a bug in  
>> bmi_mx that is not flagging an error, but I have no idea.
>> Can you take a look at this?
>> Thanks,
>> Scott
>> 0:  (gdb) b PINT_process_request
>> 0:  Breakpoint 2 at 0x4701c8: file src/io/description/pint-request.c, line 72.
>> 0:  (gdb) run -fname pvfs2://mnt/pvfs2/atchley/blah -fsize 1 -timing
>> 0:  Continuing.
>> 0:  ========= Parameter space dump =========
>> 0:  filename: pvfs2://mnt/pvfs2/atchley/blah  ionodes
>> 0:  file size (MB): 1 buffer size 0
>> 0:  vector length: 10 element count: 1 vector count: 0
>> 0:  striping factor: 0 striping size: -1 collective buffer size: 0
>> 0:  loops: 1 displacement 0
>> 0:  ========= Dump done            =========
>> 0:  #* no verification possible!
>> 0:  calling noncontigmem_noncontigfile(pvfs2://mnt/pvfs2/atchley/ 
>> blah, 0x0x2aaaaaaab010, 1048560)
>> 0:
>> 0:  # testing noncontiguous in memory, noncontiguous in file using  
>> independent I/O
>> 0:  # vector count = 26214 - access count = 26214
>> 0:  calling MPI_File_open(pvfs2://mnt/pvfs2/atchley/blah)
>> 0:  calling MPI_File_set_view()
>> 0:  calling MPI_File_seek()
>> 0:  calling MPI_File_write()
>> 0:  [New Thread 1082132816 (LWP 29290)]
>> 0:  [New Thread 1090525520 (LWP 29291)]
>> 0:
>> 0:  Breakpoint 2, PINT_process_request (req=0x6aea50, mem=0x6aeb00,
>> 0:      rfdata=0x7fffd112b880, result=0x7fffd112b850, mode=2)
>> 0:      at src/io/description/pint-request.c:72
>> 0:  72          void *temp_space = NULL;    /* temp copy of req  
>> state for size call */
>> 0:  (gdb) 0:  (gdb) bt
>> 0:  #0  PINT_process_request (req=0x6aea50, mem=0x6aeb00, rfdata=0x7fffd112b880,
>> 0:      result=0x7fffd112b850, mode=2) at src/io/description/pint-request.c:72
>> 0:  #1  0x00000000004844e0 in io_find_target_datafiles  
>> (mem_req=0x6ad160,
>> 0:      file_req=0x6ae960, file_req_offset=0, dist_p=0x6ae9c0,  
>> fs_id=1825963815,
>> 0:      io_type=PVFS_IO_WRITE, input_handle_array=0x6b9510,  
>> input_handle_count=4,
>> 0:      handle_index_array=0x6b9240,  
>> handle_index_out_count=0x7fffd112b944,
>> 0:      sio_handle_index_array=0x6aea30,  
>> sio_handle_index_count=0x7fffd112b940)
>> 0:      at src/client/sysint/sys-io.sm:2320
>> 0:  #2  0x0000000000480010 in io_datafile_setup_msgpairs  
>> (sm_p=0x6ba4a0,
>> 0:      js_p=0x7fffd112b9f0) at src/client/sysint/sys-io.sm:489
>> 0:  #3  0x0000000000476a66 in PINT_state_machine_next (s=0x6ba4a0,
>> 0:      r=0x7fffd112b9f0) at ./src/common/misc/state-machine-fns.h:158
>> 0:  #4  0x0000000000476645 in PINT_client_state_machine_post  
>> (sm_p=0x6ba4a0,
>> 0:      pvfs_sys_op=6, op_id=0x7fffd112bb30, user_ptr=0x0)
>> 0:      at src/client/sysint/client-state-machine.c:312
>> 0:  #5  0x000000000047f9fc in PVFS_isys_io (ref=
>> 0:        {handle = 1048563, fs_id = 1825963815, __pad1 = 0},  
>> file_req=0x6ae960,
>> 0:      file_req_offset=0, buffer=0x0, mem_req=0x6ad160,  
>> credentials=0x6b8ea0,
>> 0:      resp_p=0x7fffd112bba0, io_type=PVFS_IO_WRITE,  
>> op_id=0x7fffd112bb30,
>> 0:      user_ptr=0x0) at src/client/sysint/sys-io.sm:328
>> 0:  #6  0x000000000047facf in PVFS_sys_io (ref=
>> 0:        {handle = 1048563, fs_id = 1825963815, __pad1 = 0},  
>> file_req=0x6ae960,
>> 0:      file_req_offset=0, buffer=0x0, mem_req=0x6ad160,  
>> credentials=0x6b8ea0,
>> 0:      resp_p=0x7fffd112bba0, io_type=PVFS_IO_WRITE)
>> 0:      at src/client/sysint/sys-io.sm:351
>> 0:  #7  0x0000000000458cb2 in ADIOI_PVFS2_WriteStrided (fd=0x6b8d00,
>> 0:      buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,  
>> file_ptr_type=101,
>> 0:      offset=0, status=0x7fffd112be30, error_code=0x7fffd112bd70)
>> 0:      at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/adio/ad_pvfs2/ad_pvfs2_write.c:1001
>> 0:  #8  0x000000000041afcb in MPIOI_File_write (mpi_fh=0x6b8d00, offset=0,
>> 0:      file_ptr_type=101, buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,
>> 0:      myname=0x63ac74 "MPI_FILE_WRITE", status=0x7fffd112be30)
>> 0:      at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/mpi-io/write.c:156
>> 0:  #9  0x000000000041aafd in PMPI_File_write (mpi_fh=0x6b8d00,
>> 0:      buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,
>> 0:      status=0x7fffd112be30)
>> 0:      at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/mpi-io/write.c:52
>> 0:  #10 0x000000000040461e in noncontigmem_noncontigfile (
>> 0:      filename=0x668110 "pvfs2://mnt/pvfs2/atchley/blah",  
>> buf=0x2aaaaaaab010,
>> 0:      bufsize=1048560, dtype=-1946157050, offset=0, displs=0,  
>> finfo=-1677721600,
>> 0:      veclen=10, elmtcount=1, veccount=26214) at noncontig.c:185
>> 0:  #11 0x000000000040738d in main (argc=1, argv=0x7fffd112c608)
>> 0:      at noncontig.c:1020
>> 0:  (gdb) s
>> 0:  74          PVFS_offset  contig_offset = 0; /* temp for offset  
>> of a contig region */
>> 0:  (gdb)
>> 0:  78          if (!PINT_IS_MEMREQ(mode))
>> 0:  (gdb)
>> 0:  79          gossip_debug(GOSSIP_REQUEST_DEBUG,
>> 0:  (gdb)
>> 0:  81          gossip_debug 
>> (GOSSIP_REQUEST_DEBUG,"PINT_process_request\n");
>> 0:  (gdb)
>> 0:  83          if (!req)
>> 0:  (gdb)
>> 0:  88          if (!result || !result->segmax || !result->bytemax)
>> 0:  (gdb) p *result
>> 0:  $1 = {offset_array = 0x7fffd112b8a8, size_array =  
>> 0x7fffd112b8a0, segmax = 1,
>> 0:    segs = 0, bytemax = -1291, bytes = 0}
>> _______________________________________________
>> Pvfs2-developers mailing list
>> Pvfs2-developers at beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


