[PVFS-users] random ll_pvfs_file_write ...downcall

Kent Milfeld milfeld at tacc.utexas.edu
Fri Feb 13 15:39:03 EST 2004


Hi Rob,
   1.) We (John Casu) rebuilt PVFS 1.6.2 with the patches you sent out
       on 1/29, and reinstalled the iods, pvfsds, and mgr (with defaults).
       (We run two iods on each of 15 servers; the code runs on 16
       processors, with MPI-IO writing 32 MB from each processor. Errors
       are intermittent.)

   2.) The nodes are Dell PowerEdge 1750/2650 machines: 3.06 GHz
       dual-processor Xeons on ServerWorks boards with the
       "Grand Champ" chipset.
       OS: Linux 2.4.18-27.7.xsmp #1 SMP

   3.) We are still getting the following errors:

Feb 13 13:55:47 compute-1-0 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
Feb 13 13:55:48 compute-1-0 kernel: (ll_pvfs.c, 659): ll_pvfs_file_write failed on 2600347

And

Feb 13 14:42:10 compute-1-1 kernel: (ll_pvfs.c, 464): ll_pvfs_getmeta failed on downcall for 146.6.250.1:3000/pvfs-meta/test26
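
Because the errors are intermittent and spread over several nodes, it may
help to tally them per host and per failing function. The following is only
a minimal sketch in Python, assuming the kernel messages have been collected
one per line (unwrapped) in the format shown above. The names LINE_RE,
tally, and SAMPLE are hypothetical and are not part of PVFS.

```python
import re
from collections import Counter

# Assumed line format (one message per line, not wrapped by the mailer):
# Feb 13 13:55:47 compute-1-0 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\((?P<src>[\w.]+),\s*(?P<line>\d+)\):\s+(?P<msg>.*)$"
)


def tally(lines):
    """Count messages per (host, first word of the message)."""
    counts = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m is None:
            continue  # skip lines that do not carry a (file, line) tag
        words = m.group("msg").split()
        if not words:
            continue
        func = words[0].rstrip(":")
        counts[(m.group("host"), func)] += 1
    return counts


# Sample input copied from the messages quoted in this post.
SAMPLE = [
    "Feb 13 13:55:47 compute-1-0 kernel: (ll_pvfs.c, 665): "
    "ll_pvfs_file_write got error in downcall",
    "Feb 13 13:55:48 compute-1-0 kernel: (ll_pvfs.c, 659): "
    "ll_pvfs_file_write failed on 2600347",
    "Feb 13 14:42:10 compute-1-1 kernel: (ll_pvfs.c, 464): "
    "ll_pvfs_getmeta failed on downcall for 146.6.250.1:3000/pvfs-meta/test26",
]

if __name__ == "__main__":
    for (host, func), n in sorted(tally(SAMPLE).items()):
        print(host, func, n)
```

Run over the concatenated kernel logs from all compute nodes, a tally like
this could show whether the failures cluster on particular hosts or are
spread evenly, which might help separate a node-local problem from a
server-side one.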

   4.) Could this be a time-out problem that can be fixed by
       increasing a timeout parameter in the code?


Kent
Texas Advanced Computing Center
www.tacc.utexas.edu/general/staff
Please use consulting form at:  www.tacc.utexas.edu/consulting
 
>-----Original Message-----
>From: Rob Ross [mailto:rross at mcs.anl.gov]
>Sent: Friday, February 13, 2004 1:19 AM
>To: Kent F. Milfeld
>Cc: pvfs-users at beowulf-underground.org
>Subject: Re: [PVFS-users] random ll_pvfs_file_write ...downcall
>
>Hi,
>
>What kind of processors do you have?  Have you applied the recent patch
>posted to pvfs-users?
>
>Thanks,
>
>Rob
>
>On Fri, 13 Feb 2004, Kent F. Milfeld wrote:
>
>> Hi,
>>
>>   We just recently installed 1.6.2 (after successfully running 1.5.x).
>>   When I run 16-processor mpi-io jobs, the IO will sometimes fail
>>   with the following error information in the code (usually /mnt/pvfs
>>   will become unmounted):
>>
>> ...
>>  rank=           9  CLOSE IOERR=           0
>>  rank=          10  WRITE IOERR=           0    host=compute-9-23
>>  rank=          10  CLOSE IOERR=           0
>>  rank=           8  WRITE IOERR=           0    host=compute-9-30
>>  rank=           8  CLOSE IOERR=           0
>>  rank=           6  WRITE IOERR=        8288    host=compute-1-7
>> ...
>>
>>   In the /var/log/kern on compute-1-7 I found the following information:
>>
>> ********************************************************************************
>>
>> Two runs, one at about 00:09:30 and the other at about 00:17.
>>
>> compute-1-11
>> Feb 13 00:09:30 compute-1-11 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
>> Feb 13 00:14:42 compute-1-11 kernel: (ll_pvfs.c, 459): ll_pvfs_getmeta failed on enqueue for 146.6.250.1:3000/pvfs-meta
>> compute-2-28
>> compute-1-0
>> compute-4-31
>> compute-2-4
>> compute-1-9
>> compute-1-12
>> Feb 13 00:18:56 compute-1-12 kernel: (pvfsdev.c, 1118): pvfsdev: setup_buffer() failure.
>> Feb 13 00:18:56 compute-1-12 kernel: (ll_pvfs.c, 659): ll_pvfs_file_write failed on 2600340
>> compute-2-31
>>
>> Some results from two days earlier:
>>
>> Feb 10 15:27:25 compute-1-7 kernel: pvfs: debug = 0x0, maxsz = 16777216 bytes, buffer = dynamic, major = 0
>> Feb 11 16:04:14 compute-1-7 kernel: (ll_pvfs.c, 233): ll_pvfs_create failed on enqueue for 146.6.250.1:3000/pvfs-meta/test18
>> Feb 11 16:04:14 compute-1-7 kernel: (ll_pvfs.c, 87): ll_pvfs_lookup failed on enqueue for 146.6.250.1:3000/pvfs-meta/test18
>> Feb 12 17:56:32 compute-1-7 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
>>
>> *********************************************************
>>
>> [root at compute-1-30 root]# rpm -qa | grep pvfs
>> pvfs-1.6.2-1
>> contrib-pvfs-config-1.6.2-1
>> pvfs-kernel-1.6.2-1
>>
>> Any idea of what might be happening?
>>
>> Thanks,
>>
>> Kent Milfeld
>> TACC, Texas Advanced Computing Center
>>
>> Kent Milfeld  Ph.D.  Research Associate
>> Texas Advanced Computing Center
>> The University of Texas at Austin
>> http://www.tacc.utexas.edu/
>>
>> (512) 475-9411 (main)
>> (512) 475-9458 (direct)
>> (512) 475-9445 (fax)
>> milfeld at tacc.utexas.edu


