[PVFS-users] random ll_pvfs_file_write ...downcall
Kent Milfeld
milfeld at tacc.utexas.edu
Fri Feb 13 15:39:03 EST 2004
Hi Rob,
1.) We (John Casu) rebuilt PVFS 1.6.2 with the patches you sent out
on 1/29, and reinstalled the iod's, pvfsd's and mgr (with defaults).
(We run two iod's on each of 15 servers; the code runs on 16
processors, with mpi-io writing 32MB from each processor. Errors
are intermittent.)
2.) The nodes are PowerEdge 1750/2650 nodes from Dell:
3.06 GHz dual-processor Xeon nodes with ServerWorks boards
("Grand Champ" chipset).
OS: Linux 2.4.18-27.7.xsmp #1 SMP
3.) We are still getting the following errors:
Feb 13 13:55:47 compute-1-0 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
Feb 13 13:55:48 compute-1-0 kernel: (ll_pvfs.c, 659): ll_pvfs_file_write failed on 2600347
and
Feb 13 14:42:10 compute-1-1 kernel: (ll_pvfs.c, 464): ll_pvfs_getmeta failed on downcall for 146.6.250.1:3000/pvfs-meta/test26
4.) Is it possible that this is a time-out problem that could be fixed
by increasing a time-out parameter in the code?
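
In case it helps with correlating these messages across nodes, here is a
minimal sketch for tallying them per host and per reporting function. The
regular expression is only inferred from the kernel lines quoted above, and
the names LINE_RE, tally and SAMPLE are illustrative; none of this is part
of PVFS.

```python
import re
from collections import Counter

# Pattern inferred from the kernel messages quoted in this thread; it is an
# assumption, not a log format documented by PVFS.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"\((?P<file>[\w.]+),\s*(?P<line>\d+)\):\s+(?P<msg>.*)$"
)


def tally(lines):
    """Count matching kernel messages per (host, first word of message)."""
    counts = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip lines that do not carry a "(file, line):" tag
        func = m.group("msg").split()[0].rstrip(":")
        counts[(m.group("host"), func)] += 1
    return counts


# The three messages from item 3, with the mail wrapping undone.
SAMPLE = [
    "Feb 13 13:55:47 compute-1-0 kernel: (ll_pvfs.c, 665): "
    "ll_pvfs_file_write got error in downcall",
    "Feb 13 13:55:48 compute-1-0 kernel: (ll_pvfs.c, 659): "
    "ll_pvfs_file_write failed on 2600347",
    "Feb 13 14:42:10 compute-1-1 kernel: (ll_pvfs.c, 464): "
    "ll_pvfs_getmeta failed on downcall for "
    "146.6.250.1:3000/pvfs-meta/test26",
]

if __name__ == "__main__":
    for (host, func), n in sorted(tally(SAMPLE).items()):
        print(host, func, n)
```

Feeding it the kernel log from each compute node (one line per message)
should show whether the failures cluster on particular hosts.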
Kent
Texas Advanced Computing Center
www.tacc.utexas.edu/general/staff
Please use consulting form at: www.tacc.utexas.edu/consulting
>-----Original Message-----
>From: Rob Ross [mailto:rross at mcs.anl.gov]
>Sent: Friday, February 13, 2004 1:19 AM
>To: Kent F. Milfeld
>Cc: pvfs-users at beowulf-underground.org
>Subject: Re: [PVFS-users] random ll_pvfs_file_write ...downcall
>
>Hi,
>
>What kind of processors do you have? Have you applied the recent patch
>posted to pvfs-users?
>
>Thanks,
>
>Rob
>
>On Fri, 13 Feb 2004, Kent F. Milfeld wrote:
>
>> Hi,
>>
>> We just recently installed 1.6.2 (after successfully running 1.5.x).
>> When I run 16-processor mpi-io jobs, the IO will sometimes fail
>> with the following error information in the code (usually /mnt/pvfs
>> will become unmounted):
>>
>> ...
>>
>> rank= 9 CLOSE IOERR= 0
>> rank= 10 WRITE IOERR= 0 host=compute-9-23
>> rank= 10 CLOSE IOERR= 0
>> rank= 8 WRITE IOERR= 0 host=compute-9-30
>> rank= 8 CLOSE IOERR= 0
>> rank= 6 WRITE IOERR= 8288 host=compute-1-7
>> ...
>>
>> In the /var/log/kern on compute-1-7 I found the following information:
>>
>> ********************************************************************************
>>
>> Two runs, one at about 00:09:30 and the other at about 00:17.
>>
>> compute-1-11
>> Feb 13 00:09:30 compute-1-11 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
>> Feb 13 00:14:42 compute-1-11 kernel: (ll_pvfs.c, 459): ll_pvfs_getmeta failed on enqueue for 146.6.250.1:3000/pvfs-meta
>> compute-2-28
>> compute-1-0
>> compute-4-31
>> compute-2-4
>> compute-1-9
>> compute-1-12
>> Feb 13 00:18:56 compute-1-12 kernel: (pvfsdev.c, 1118): pvfsdev: setup_buffer() failure.
>> Feb 13 00:18:56 compute-1-12 kernel: (ll_pvfs.c, 659): ll_pvfs_file_write failed on 2600340
>> compute-2-31
>>
>> Some results from two days earlier:
>>
>> Feb 10 15:27:25 compute-1-7 kernel: pvfs: debug = 0x0, maxsz = 16777216 bytes, buffer = dynamic, major = 0
>> Feb 11 16:04:14 compute-1-7 kernel: (ll_pvfs.c, 233): ll_pvfs_create failed on enqueue for 146.6.250.1:3000/pvfs-meta/test18
>> Feb 11 16:04:14 compute-1-7 kernel: (ll_pvfs.c, 87): ll_pvfs_lookup failed on enqueue for 146.6.250.1:3000/pvfs-meta/test18
>> Feb 12 17:56:32 compute-1-7 kernel: (ll_pvfs.c, 665): ll_pvfs_file_write got error in downcall
>>
>> *********************************************************
>>
>> [root@compute-1-30 root]# rpm -qa | grep pvfs
>> pvfs-1.6.2-1
>> contrib-pvfs-config-1.6.2-1
>> pvfs-kernel-1.6.2-1
>>
>> Any idea of what might be happening?
>>
>> Thanks,
>> Kent Milfeld
>> TACC, Texas Advanced Computing Center
>>
>> Kent Milfeld Ph.D. Research Associate
>> Texas Advanced Computing Center
>> The University of Texas at Austin
>> http://www.tacc.utexas.edu/
>>
>> (512) 475-9411 (main)
>> (512) 475-9458 (direct)
>> (512) 475-9445 (fax)
>> milfeld at tacc.utexas.edu
>>