[PVFS-users] Scaling problems with PVFS 1.6.2

Kent F. Milfeld milfeld at tacc.utexas.edu
Tue May 4 23:08:21 EDT 2004


Hi,

  I was wondering if anybody has encountered a scalability problem
  with PVFS like the one we are seeing.

  We have a 282-node Pentium 4 Xeon system with 12 iods (one iod on
  each of 12 Xeon nodes), running Linux kernel 2.4.20-30.7smp.
  (The kernel has been updated from the RH 7.3 distribution in Rocks.)
  We are running PVFS 1.6 with the ~Jan. patches.

 

    [root@lonestar d.test110]# rpm -qa | grep pvfs
    pvfs-1.6.2-2
    contrib-pvfs-config-1.6.2-1
    pvfs-kernel-1.6.2-1

 

  

  Up to 128 clients, all seems to go well for a simple write of 64MB
  from each client using MPI-I/O.  However, when writing from 256
  clients, some processes get a write error of 8288 and the processes
  generally hang.  The processes that report the error remain in a
  "do_sel" state.  I can split the node list in half and run the code
  on each half separately (but not simultaneously) with 64MB writes,
  and each run works fine.  If I then launch two mpiruns at the same
  time, each with a different half of the original host list, I get
  the same write errors (and even a new error*).  Every time I run a
  256-processor job (or simultaneously execute two 128-processor runs)
  the error occurs on different nodes.

 

  So, from the above experiment I assume that nothing is wrong with
  the nodes themselves, since the two separate 128-processor runs,
  which together exercise all 256 nodes, work fine.
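
For scale, the aggregate load on the 12 iods in the 128- and 256-client cases can be worked out from the figures above. This is only a rough sketch: it assumes the file is striped evenly over all iods and that every client holds one TCP connection to every iod, neither of which is confirmed in this message. The program name and constants are illustrative.

```fortran
program iod_load
   implicit none
   ! Figures taken from the message: 12 iods, 64 MB written per client.
   integer, parameter :: NIODS = 12, MB_PER_CLIENT = 64
   integer, parameter :: nclients(2) = (/ 128, 256 /)
   integer :: k, total_mb

   do k = 1, 2
      total_mb = nclients(k)*MB_PER_CLIENT
      ! ASSUMPTION: even striping over all iods, and one TCP socket
      ! per client on each iod.
      write(*,'(i4," clients:",i6," MB total,",f8.1," MB per iod,",i5," sockets per iod")') &
           nclients(k), total_mb, real(total_mb)/NIODS, nclients(k)
   end do
end program iod_load
```

Under those assumptions, going from 128 to 256 clients doubles both the data per iod (roughly 683 MB to 1365 MB) and the number of simultaneous connections each iod has to service.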

 

  The error (it usually occurs on about 6-8 processors) is:

    pvfs_write: build_rw_jobs failed
    rank=         115  WRITE IOERR=        8288    host=compute-10-1

    pvfs_write: build_rw_jobs failed
    rank=          85  WRITE IOERR=        8288    host=compute-9-20

    pvfs_write: build_rw_jobs failed
    rank=         172  WRITE IOERR=        8288    host=compute-7-21

 

 

The new error* from the two simultaneous 128-processor executions is:

    rank=           6  WRITE IOERR=        8288    host=compute-5-12

    nbrecv: recv: Connection reset by peer
    unable to receive ack from 0x6bfb0692:22555 on 11 within timeout
    pvfs_write: build_rw_jobs failed
    rank=          54  WRITE IOERR=        8288    host=compute-5-16
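
The peer in the "unable to receive ack" line can be decoded to see which server reset the connection. The sketch below prints the address and port both as written and byte-swapped. The byte-swapped reading assumes the library printed the raw network-order values on a little-endian x86 client, which this message does not confirm. The program and variable names are illustrative.

```fortran
program decode_peer
   implicit none
   integer, parameter :: I8 = selected_int_kind(12)
   integer(kind=I8)   :: addr
   integer            :: port, b(4), k

   addr = int(z'6bfb0692', kind=I8)   ! address exactly as printed in the error
   port = 22555                       ! port exactly as printed in the error

   ! Split the 32-bit value into four bytes, most significant first.
   do k = 1, 4
      b(k) = int(ibits(addr, 8*(4-k), 8))
   end do

   write(*,'("as printed   : ",i0,3(".",i0),":",i0)') b(1), b(2), b(3), b(4), port
   ! ASSUMPTION: raw network-order values printed on a little-endian host,
   ! so both the address bytes and the two port bytes are reversed.
   write(*,'("byte-swapped : ",i0,3(".",i0),":",i0)') b(4), b(3), b(2), b(1), &
        ishft(iand(port, 255), 8) + ishft(port, -8)
end program decode_peer
```

The byte-swapped port comes out as 7000, which appears to match the default iod port in PVFS 1.x (worth confirming against the local iod.conf). If so, the reset came from an iod rather than from the manager, and the swapped address can be checked against the list of iod nodes.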


 

 

A ps shows that the processes that have an I/O error are waiting on
the do_sel event (WCHAN column):

  F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY  TIME     CMD
000 S milfeld  19148 19147  0  69   0    - 25208 do_sel 15:33 ?    00:00:03 ./a.out64
000 R milfeld  19175 19174 91  79   0    - 25219 -      15:33 ?    00:27:15 ./a.out64

 

 

 

I don't find any errors in the /var/log/kern and /var/log/messages files
on the nodes that exhibit the I/O error.

We are running over a switched GigE network.

 

------------------------------------------------------------------------

 

Questions:

  o Has anybody seen this behavior before?
  o Where in PVFS is the execution failing?
  o What would be the next step in the analysis, if there is no
    apparent cause for the problem?
  o Should we dive into version 2? (Does it "scale" better?)

 

Thanks for your help.

Kent Milfeld

------------------------------------------------------------------------

Fortran Source for simple write.

 

 

module comm_mpi
   include 'mpif.h'
   integer                            :: ntasks, irank, ierr_mpi
   integer                            :: namelen
   ! MPI_Get_processor_name needs a buffer of at least
   ! MPI_MAX_PROCESSOR_NAME characters.
   character(MPI_MAX_PROCESSOR_NAME)  :: name
end module

module comm_mpi_io
    integer, parameter                 :: KI8_comm=selected_int_kind(12)
    integer, parameter                 :: NTIMES=1
    integer, parameter                 :: NTEST  =          64
    integer, parameter                 :: NBYTES4=1024*1024*64
    integer(kind=KI8_comm), parameter  :: NBYTES8=1024*1024*64
    real*8, dimension(NBYTES4/8)       :: buf
    real*8                             :: bcalc

    character(8 )                      :: csuffix
    character(31)                      :: filename
end module

module comm
    integer, parameter                 :: KI8=selected_int_kind(12)
end module

 

 

program block_sequence

   use comm
   use comm_mpi
   use comm_mpi_io

   implicit none

   integer     :: nargs, iargc, length_fn
   integer     :: i, knt, j

   integer                          :: fh, status(MPI_STATUS_SIZE)
   integer(kind=KI8)                :: offset

   integer :: istart,iend,itar0,itar1,irate
   real*8  :: tdiff,fmbps

                        !MPI Setup

   call mpi_init(ierr_mpi)
   call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierr_mpi)
   call mpi_comm_rank(MPI_COMM_WORLD,  irank, ierr_mpi)
   call MPI_Get_processor_name(name, namelen, ierr_mpi)

   call file_name(csuffix,NTEST,ntasks)
   filename = "test_"//csuffix
                        ! The next line overrides the one above.
   filename = "/mnt/pvfs/milfeld/test_"//csuffix

                        ! Each rank writes its own contiguous 64MB block.
   offset = irank*NBYTES8

                        ! itar1-itar0 estimates the timer overhead.
   call system_clock(count_rate=irate)
   call system_clock(count=itar0)
   call system_clock(count=itar1)

                        ! Computed in real*8: the integer expression
                        ! j + irank*(NBYTES4/8) reaches 2**31 on rank 255.
   do j = 1,NBYTES4/8
      buf(j) = dble(j) + dble(irank)*dble(NBYTES4/8)
   end do

                        !  call system("hostname; ls -l /mnt/pvfs")
   do i=1, NTIMES

                        ! MPI_MODE_CREATE+MPI_MODE_RDWR = 8 + 4

      call MPI_FILE_OPEN(MPI_COMM_WORLD, filename,                     &
     &                   MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, &
     &                   fh, ierr_mpi)
      print*,"rank=",irank," OPEN IOERR=",ierr_mpi," filename=",filename
      call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
      call system_clock(count=istart)

      call MPI_FILE_SEEK( fh, offset,  MPI_SEEK_SET, ierr_mpi)
      call MPI_FILE_WRITE(fh, buf, NBYTES4, MPI_BYTE, status, ierr_mpi)
      print*,"rank=",irank," WRITE IOERR=",ierr_mpi,"   host=",name
      call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
      call MPI_FILE_CLOSE(fh, ierr_mpi)
      print*,"rank=",irank," CLOSE IOERR=",ierr_mpi

      call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
      call system_clock(count=iend)

      do j = 1,NBYTES4/8
         buf(j) = 0.0D0
      end do

                        ! Read-back check (disabled).
!     call MPI_FILE_OPEN(MPI_COMM_WORLD, filename,                     &
!    &                   MPI_MODE_RDWR, MPI_INFO_NULL,                 &
!    &                   fh, ierr_mpi)
!
!     call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
!     call system_clock(count=istart)
!
!     call MPI_FILE_SEEK(fh, offset,  MPI_SEEK_SET,         ierr_mpi)
!     call MPI_FILE_READ(fh, buf, NBYTES4, MPI_BYTE, status, ierr_mpi)
!
!!    do j = 1,NBYTES4/8
!!       bcalc = dble(j) + dble(irank)*dble(NBYTES4/8)
!!
!!       if( buf(j) .ne. bcalc ) then
!!          print*, "task=",irank,"  diff read/calc =",buf(j),bcalc
!!          stop
!!       end if
!!    end do
!
!     call MPI_FILE_CLOSE(fh, ierr_mpi)
!     call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
!     call system_clock(count=iend)

   end do

   if(irank .eq. 0) then
      tdiff = real(iend-istart - (itar1-itar0))/real(irate)
      fmbps = NTIMES*(NBYTES4/(1024*1024))*ntasks/tdiff
      print*,' time = ',tdiff, '  MB/sec =',fmbps,'  for ',ntasks,' tasks.'
      write(*,'(" tasks sec MB/s: ",i4,f9.3,f8.3)') ntasks, tdiff,fmbps
   endif

   call MPI_Finalize(ierr_mpi)

end program
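
As a quick check on the offset arithmetic in the program above, the following standalone sketch evaluates `irank*NBYTES8` for a few ranks and reports whether the result would still fit in a default 32-bit integer. It uses the same `selected_int_kind(12)` kind as the test code; the program name and the choice of ranks are illustrative.

```fortran
program offsets
   implicit none
   integer, parameter           :: KI8 = selected_int_kind(12)
   integer(kind=KI8), parameter :: NBYTES8 = 1024*1024*64
   integer, parameter           :: ranks(3) = (/ 31, 32, 255 /)
   integer(kind=KI8)            :: offset
   integer                      :: k

   do k = 1, 3
      ! Default-integer rank times 64-bit constant: the product is
      ! evaluated in the wider (64-bit) kind, so it does not overflow.
      offset = ranks(k)*NBYTES8
      write(*,'("rank ",i3,"  offset ",i12,"  fits in 32-bit signed: ",l1)') &
           ranks(k), offset, offset <= huge(1)
   end do
end program offsets
```

Every rank from 32 upward writes beyond the 2 GB mark, so the 128-client runs that succeed already exercise 64-bit offsets; the offset computation itself does not look like the source of the failure at 256 clients. For portability, the kind `MPI_OFFSET_KIND` from mpif.h is the one MPI defines for the `MPI_FILE_SEEK` offset argument.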

 

 

subroutine file_name(csuffix,itest,irank)

   ! Builds a suffix of the form TTTT_RRR (zero padded), e.g. 0064_256.
   ! Note: the caller passes ntasks as the third argument.

   integer           :: itest,irank
   character(LEN= 8) :: csuffix

   character(LEN=19) format
   character(LEN= 9) crank

   crank="        "
   if(irank .ge.   0 .and. irank .lt.   10) crank = '"_00",i1)'
   if(irank .ge.  10 .and. irank .lt.  100) crank = ' "_0",i2)'
   if(irank .ge. 100 .and. irank .lt. 1000) crank = '  "_",i3)'

   format="                 "
   if(itest .ge.    0 .and. itest .lt.    10) format = '("000",i1,'//crank
   if(itest .ge.   10 .and. itest .lt.   100) format = '( "00",i2,'//crank
   if(itest .ge.  100 .and. itest .lt.  1000) format = '(  "0",i3,'//crank
   if(itest .ge. 1000 .and. itest .lt. 10000) format = '(      i4,'//crank

   !print*,"FORMAT:",format
   write(csuffix,format) itest,irank

end subroutine
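
The zero-padded suffix built by file_name can also be produced with the `Iw.m` edit descriptor, which pads an integer with leading zeros to at least `m` digits. This is a minimal alternative sketch, not part of the original test; the program name is illustrative and the values 64 and 256 stand in for NTEST and ntasks.

```fortran
program suffix_demo
   implicit none
   character(len=8) :: csuffix

   ! i4.4 -> four digits, zero padded; i3.3 -> three digits, zero padded.
   write(csuffix,'(i4.4,"_",i3.3)') 64, 256
   print '(a)', csuffix
end program suffix_demo
```

For these inputs it prints 0064_256, the same suffix the range-by-range format selection yields.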

 

Kent Milfeld  Ph.D.  Research Associate
Texas Advanced Computing Center
The University of Texas at Austin
http://www.tacc.utexas.edu/  

(512) 475-9411 (main)
(512) 475-9458 (direct)
(512) 475-9445 (fax)
milfeld at tacc.utexas.edu 

 

 

 
