[PVFS-users] setup_buffer() failure 8288 error on write pvfsdev: cannot allocate memory
Kent Milfeld
milfeld@tacc.utexas.edu
Fri, 30 Jan 2004 16:49:04 -0600
Hi,
On our first installation of PVFS we used "buffer=mapped", and
a simple PVFS test produced the memory problems below.
After receiving advice not to use mapped (and after seeing
some earlier problems at the archive site), we backed off from using
the mapping.
So, I unmounted the file system, stopped the client daemons, and
removed the pvfs module. Using the command
insmod pvfs
I obtained the defaults, a dynamic buffer of 16MB, on the
282 nodes. After restarting the daemons, remounting /mnt/pvfs,
and running the test again (the same MPI-IO code, checking
each node singly), I discovered that 37 of the 282 nodes
still consistently produced the memory (buffer) error
(seen in the /var/log/kern file). (Something problematic
probably happened on these nodes when PVFS was
running earlier in the "mapped" mode.)
BY REBOOTING THOSE PROBLEM NODES, THE PROBLEM WENT AWAY.
So I'm very happy about that!
But it bothers me that I don't know what happened
to the problem nodes. I did discover that if I wrote records that
were below 1MB, they worked just fine, but if the write calls were
sending 1MB or over, I would get the error. (See the kernel log below.)
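
A minimal sketch of a size probe along these lines, assuming plain POSIX write() calls rather than the MPI-IO test code used above. The function name and the probe file path are hypothetical, and the behavior on a failing node is not something this sketch can confirm:

```python
import os

MB = 1024 * 1024


def probe_write_sizes(path, sizes):
    """Try one write() of each size to `path`.

    Returns a dict mapping each size to None on success, or to an
    error string if the write failed or came up short.
    """
    results = {}
    for size in sizes:
        payload = b"\0" * size
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                # A single write() call of `size` bytes per probe.
                written = os.write(fd, payload)
            finally:
                os.close(fd)
            if written == size:
                results[size] = None
            else:
                results[size] = "short write: %d bytes" % written
        except OSError as exc:
            results[size] = exc.strerror
    return results
```

For example, probe_write_sizes("/mnt/pvfs/probe.dat", [MB // 2, MB - 1, MB, 2 * MB]) would bracket the 1MB boundary described above; the file name probe.dat is only an illustration.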
I did not reboot two of the problem nodes so that I can possibly run
some more tests to find out why. I've looked at the slab information
in /proc and found that on both the good nodes, and the bad ones, the
pvfs space is about 18MB, and nothing else is using space
excessively. ipcs shows two 32MB segments in use.
This appears on both good and bad nodes:
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       600        1056768    9          dest
0x00000000 32769      root       600        33554432   9          dest
0x00000000 65538      root       600        33554432   9          dest
0x00000000 98307      root       600        46084      9          dest
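
As a cross-check on the segment sizes listed above, here is a small sketch that totals the bytes column. It assumes the seven-column layout shown in this listing (the size in the fifth column); the function name is hypothetical:

```python
def shm_bytes(ipcs_text):
    """Return the size in bytes of each segment in `ipcs -m` style text."""
    sizes = []
    for line in ipcs_text.splitlines():
        fields = line.split()
        # Data rows start with a hex key; the fifth column is the size.
        if len(fields) >= 5 and fields[0].startswith("0x") and fields[4].isdigit():
            sizes.append(int(fields[4]))
    return sizes


# The listing from this message, used as sample input.
SAMPLE = """\
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 0 root 600 1056768 9 dest
0x00000000 32769 root 600 33554432 9 dest
0x00000000 65538 root 600 33554432 9 dest
0x00000000 98307 root 600 46084 9 dest
"""

sizes = shm_bytes(SAMPLE)
total = sum(sizes)
# Segments of 32MB or more, counted separately.
large = [s for s in sizes if s >= 32 * 1024 * 1024]
print(total, len(large), sum(large) // (1024 * 1024))
```

The two large segments are 33554432 bytes each, which is exactly 32MB apiece.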
So I don't think anything is wrong with these segments. (Although I
would like to find out later what is consuming 64MB of the shared
memory space (we have 2GB on each node); it might be mpich-gm.)
My question is:
Are there any utilities that I can use, or any log files I can check,
or any experiments I can run to determine what has happened to the
memory on the "problem nodes" that makes a write fail when it needs
buffer space? (I know that the mapped option probably caused the
problem, but why did it occur on only 37 nodes and not the rest? And
why can't I find anything to indicate the problem?)
Thanks for your help. (Please don't waste time on this, because PVFS is
working fine after a reboot; if you have a quick answer or advice I would
appreciate your information.)
Kent
Jan 13 10:33:24 compute-10-17 kernel: pvfs: debug = 0x0, maxsz = 16777216 bytes, buffer = dynamic, major = 0
...
Jan 27 20:14:18 compute-10-17 kernel: (pvfsdev.c, 1356): pvfsdev: could not allocate memory!
Jan 27 20:14:18 compute-10-17 kernel: (pvfsdev.c, 1032): pvfsdev: setup_buffer() failure.
Jan 27 20:14:18 compute-10-17 kernel: (ll_pvfs.c, 553): ll_pvfs_file_write failed on 2600628
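
To pick out which nodes log this failure, a rough sketch that counts setup_buffer() failures per host. It assumes each log entry sits on one physical line, as it would in /var/log/kern before mail wrapping, and that entries follow the syslog layout seen above; the function name is hypothetical:

```python
import re
from collections import Counter

# Assumed layout: "Mon DD HH:MM:SS host kernel: (file.c, line): message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+"
    r"\((?P<src>[\w.]+),\s*(?P<line>\d+)\):\s+(?P<msg>.*)$"
)


def pvfs_failures(log_lines):
    """Count 'setup_buffer() failure' entries per host name."""
    counts = Counter()
    for raw in log_lines:
        m = LINE_RE.match(raw.strip())
        if m and "setup_buffer() failure" in m.group("msg"):
            counts[m.group("host")] += 1
    return counts
```

Run over the collected kern logs from all nodes, the resulting counter would list the hosts that still hit the error and how often.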
Kent
Texas Advanced Computing Center
www.tacc.utexas.edu/general/staff
Please use consulting form at: www.tacc.utexas.edu/consulting
>-----Original Message-----
>From: pvfs-users-admin@beowulf-underground.org
>[mailto:pvfs-users-admin@beowulf-underground.org] On Behalf Of Nathan Poznick
>Sent: Tuesday, January 27, 2004 12:09 AM
>To: pvfs-users@beowulf-underground.org
>Subject: Re: [PVFS-users] setup_buffer() failure 8288 error on write
>pvfsdev: cannot allocate memory
>
>Thus spake milfeld@tacc.utexas.edu:
>>
>> The Question:
>> Is there a fix for this problem?
>> If so, can it be used for this version (possibly by not using a
>> "mapped" buffer?).
>
>I would certainly try using buffer=static, since the kernel you are
>using has CONFIG_HIGHMEM4G=y, and the documentation states that mapped
>is only available for machines without CONFIG_HIGHMEM[4|64]G enabled.
>If buffer=static doesn't work, there's always buffer=dynamic to fall
>back on as well.
>
>> Do we need to upgrade to fix the problem?
>
>Perhaps not to fix this specific problem, but you should upgrade if
>possible to gain the many other fixes which have gone into the
>filesystem since 1.5.6 (of which there have been more than a few).
>
>
>--
>Nathan Poznick <poznick@conwaycorp.net>
>
>Modern Americans are so exposed, peered at, inquired about, and spied
>upon as to be increasingly without privacy-members of a naked society
>and denizens of a goldfish bowl. - Edward V. Long