[PVFS-users] Scaling problems with PVFS 1.6.2
Kent F. Milfeld
milfeld at tacc.utexas.edu
Tue May 4 23:08:21 EDT 2004
Hi,
I was wondering if anybody has encountered a scalability
problem with PVFS like the one we are encountering.
We have a 282-node Pentium 4 Xeon system with 12 iods (one iod on each
of 12 Xeon nodes), running Linux kernel 2.4.20-30.7smp.
(The kernel has been updated from the RH 7.3 distribution in Rocks.)
We are running PVFS 1.6 with the ~Jan. patches:
[root@lonestar d.test110]# rpm -qa | grep pvfs
pvfs-1.6.2-2
contrib-pvfs-config-1.6.2-1
pvfs-kernel-1.6.2-1
Up to 128 clients, all seems to go well for a simple write
of 64MB from each client using MPI-I/O. However, when
writing from 256 clients, some processes have a write error of
8288 and the processes generally hang. The processes that show
the offending error remain in a "do_sel" state. I can split
the node list in half, and execute the code on each half separately
(but not simultaneously) with 64MB writes, and each run works
fine. I can then launch two mpiruns, each with a different half
of the original host list, and I get the same write errors
(and even a new error*). Every time I run a 256-processor job
(or simultaneously execute two 128-processor runs) the error
occurs on different nodes.
So, from the above experiment I can assume that nothing is wrong
with the nodes, since separate 128-processor runs that together
exercise the full set of 256 nodes work fine.
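To put numbers on the write pattern described above, here is a minimal Python sketch (not part of the Fortran test itself). It only mirrors the offset calculation `offset = irank*NBYTES8` from the source further down; the function names are made up for illustration.

```python
MIB = 1024 * 1024
NBYTES = 64 * MIB  # bytes written per client (NBYTES4 / NBYTES8 in the Fortran source)


def write_offset(rank: int) -> int:
    """Byte offset where a rank starts writing (offset = irank*NBYTES8)."""
    return rank * NBYTES


def total_bytes(ntasks: int) -> int:
    """Size of the shared file once every client has written its block."""
    return ntasks * NBYTES


if __name__ == "__main__":
    for ntasks in (128, 256):
        last = write_offset(ntasks - 1)
        print(f"{ntasks} clients: file size {total_bytes(ntasks) // 1024**3} GiB, "
              f"last offset {last}, beyond 2**31: {last >= 2**31}")
```

Both the working 128-client case and the failing 256-client case write at offsets beyond 2 GiB, so by this arithmetic the offset width alone does not distinguish the two cases.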
The error is (it usually occurs on about 6-8 processors):
pvfs_write: build_rw_jobs failed
rank= 115 WRITE IOERR= 8288 host=compute-10-1
pvfs_write: build_rw_jobs failed
rank= 85 WRITE IOERR= 8288 host=compute-9-20
pvfs_write: build_rw_jobs failed
rank= 172 WRITE IOERR= 8288 host=compute-7-21
The new error* from the two simultaneous 128-processor executions is:
rank= 6 WRITE IOERR= 8288 host=compute-5-12
nbrecv: recv: Connection reset by peer
unable to receive ack from 0x6bfb0692:22555 on 11 within timeout
pvfs_write: build_rw_jobs failed
rank= 54 WRITE IOERR= 8288 host=compute-5-16
A ps shows that the processes that have an io error are waiting on
the do_sel event:
  F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
000 S milfeld  19148 19147  0  69   0    - 25208 do_sel 15:33 ?        00:00:03 ./a.out64
000 R milfeld  19175 19174 91  79   0    - 25219 -      15:33 ?        00:27:15 ./a.out64
I don't find any errors in the /var/log/kern and /var/log/messages files
on the nodes that exhibit the IO error.
We are running over a GigE switch network.
------------------------------------------------------------------------
Questions:
o Has anybody seen this behavior before?
o Where in PVFS is the execution failing?
o What would be the next step in the analysis,
if there is no apparent cause for the problem?
o Should we dive into version 2? (Does it "scale" better?)
Thanks for your help.
Kent Milfeld
------------------------------------------------------------------------
Fortran Source for simple write.
module comm_mpi
include 'mpif.h'
integer :: ntasks, irank, ierr_mpi
integer :: namelen
character(32) :: name
end module
module comm_mpi_io
integer, parameter :: KI8_comm=selected_int_kind(12)
integer, parameter :: NTIMES=1
integer, parameter :: NTEST = 64
integer, parameter :: NBYTES4=1024*1024*64
integer(kind=KI8_comm), parameter :: NBYTES8=1024*1024*64
real*8, dimension(NBYTES4/8) :: buf
real*8 :: bcalc
character(8 ) :: csuffix
character(31) :: filename
end module
module comm
integer, parameter :: KI8=selected_int_kind(12)
end module
program block_sequence
use comm
use comm_mpi
use comm_mpi_io
implicit none
integer :: nargs, iargc, length_fn
integer :: i, knt, j
integer :: fh, status(MPI_STATUS_SIZE)
integer(kind=KI8) :: offset
integer :: istart,iend,itar0,itar1,irate
real*8 :: tdiff,fmbps
!MPI Setup
call mpi_init(ierr_mpi)
call mpi_comm_size(MPI_COMM_WORLD, ntasks, ierr_mpi)
call mpi_comm_rank(MPI_COMM_WORLD, irank, ierr_mpi)
call MPI_Get_processor_name(name, namelen, ierr_mpi);
call file_name(csuffix,NTEST,ntasks)
filename = "test_"//csuffix
filename = "/mnt/pvfs/milfeld/test_"//csuffix
offset = irank*NBYTES8
call system_clock(count_rate=irate)
call system_clock(count=itar0)
call system_clock(count=itar1)
do j = 1,NBYTES4/8
buf(j) = j + irank*(NBYTES4/8)
end do
! call system("hostname; ls -l /mnt/pvfs")
do i=1, NTIMES
! MPI_MODE_CREATE+MPI_MODE_RDWR = 8 + 4
call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, &
& MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, &
& fh, ierr_mpi)
print*,"rank=",irank," OPEN IOERR=",ierr_mpi," filename=",filename
call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
call system_clock(count=istart)
call MPI_FILE_SEEK( fh, offset, MPI_SEEK_SET, ierr_mpi)
call MPI_FILE_WRITE(fh, buf, NBYTES4, MPI_BYTE, status, ierr_mpi)
print*,"rank=",irank," WRITE IOERR=",ierr_mpi," host=",name
call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
call MPI_FILE_CLOSE(fh, ierr_mpi)
print*,"rank=",irank," CLOSE IOERR=",ierr_mpi
call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
call system_clock(count=iend)
do j = 1,NBYTES4/8
buf(j) = 0.0D0
end do
! call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, &
! & MPI_MODE_RDWR, MPI_INFO_NULL, &
! & fh, ierr_mpi)
!
! call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
! call system_clock(count=istart)
!
! call MPI_FILE_SEEK(fh, offset, MPI_SEEK_SET, ierr_mpi)
! call MPI_FILE_READ(fh, buf, NBYTES4, MPI_BYTE, status, ierr_mpi)
!
!! do j = 1,NBYTES4/8
!! bcalc = j + irank*(NBYTES4/8)
!!
!! if( buf(j) .ne. bcalc ) then
!! print*, "task=",irank," diff read/calc =",buf(j),bcalc
!! stop
!! end if
!! end do
!
! call MPI_FILE_CLOSE(fh, ierr_mpi)
! call MPI_BARRIER(MPI_COMM_WORLD, ierr_mpi)
! call system_clock(count=iend)
end do
if(irank .eq. 0) then
tdiff = real(iend-istart - (itar1-itar0))/real(irate)
fmbps = NTIMES*(NBYTES4/(1024*1024))*ntasks/tdiff
print*,' time = ',tdiff, ' MB/sec =',fmbps,' for ',ntasks,' tasks.'
write(*,'(" tasks sec MB/s: ",i4,f9.3,f8.3)') ntasks, tdiff,fmbps
endif
call MPI_Finalize(ierr_mpi)
end program
subroutine file_name(csuffix,itest,irank)
integer :: itest,irank
character(LEN= 8) :: csuffix
character(LEN=19) format
character(LEN= 9) crank
crank=" "
if(irank .ge. 0 .and. irank .lt. 10) crank = '"_00",i1)'
if(irank .ge. 10 .and. irank .lt. 100) crank = ' "_0",i2)'
if(irank .ge. 100 .and. irank .lt. 1000) crank = ' "_",i3)'
format=" "
if(itest .ge. 0 .and. itest .lt. 10) format = '("000",i1,'//crank
if(itest .ge. 10 .and. itest .lt. 100) format = '( "00",i2,'//crank
if(itest .ge. 100 .and. itest .lt. 1000) format = '(  "0",i3,'//crank
if(itest .ge. 1000 .and. itest .lt. 10000) format = '(      i4,'//crank
!print*,"FORMAT:",format
write(csuffix,format) itest,irank
end subroutine
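For readers who do not want to trace the format strings by hand, the following Python sketch shows what the listing above appears to compute: the file suffix built by `file_name` (a zero-padded 4-digit test number, an underscore, and a zero-padded 3-digit task count) and the aggregate rate printed by rank 0. The function names and the elapsed time in the usage lines are hypothetical, not taken from an actual run.

```python
def file_suffix(itest: int, ntasks: int) -> str:
    """Mirror of file_name: e.g. itest=64, ntasks=256 gives '0064_256'."""
    # The Fortran routine only builds a format for these ranges.
    if not (0 <= itest < 10000 and 0 <= ntasks < 1000):
        raise ValueError("itest must be in 0..9999 and ntasks in 0..999")
    return f"{itest:04d}_{ntasks:03d}"


def aggregate_mb_per_s(ntasks: int, seconds: float,
                       mb_per_task: int = 64, ntimes: int = 1) -> float:
    """Mirror of fmbps = NTIMES*(NBYTES4/(1024*1024))*ntasks/tdiff."""
    return ntimes * mb_per_task * ntasks / seconds


if __name__ == "__main__":
    # Path prefix as in the Fortran source; NTEST = 64 there.
    print("/mnt/pvfs/milfeld/test_" + file_suffix(64, 256))
    # 64.0 s is a made-up elapsed time, used only to exercise the formula.
    print(aggregate_mb_per_s(128, 64.0))
```

Note that the routine is called with `ntasks` rather than the rank, so every process in a run opens the same file name, which is what the collective MPI_FILE_OPEN expects.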
Kent Milfeld Ph.D. Research Associate
Texas Advanced Computing Center
The University of Texas at Austin
http://www.tacc.utexas.edu/
(512) 475-9411 (main)
(512) 475-9458 (direct)
(512) 475-9445 (fax)
milfeld at tacc.utexas.edu