[PVFS-developers] Re: [PVFS-users] Recompile pvfs module for SuSE 2.4.19-NUMA
Claude Pignol
cpignol at seismiccity.com
Tue Mar 9 11:18:23 EST 2004
Rob,
I ran strace on the iod to see where the time is going.
Two cases: 128K I/O and 512K I/O.
First run, reading from a client node:
dd if=/pvfs/test bs=128k of=/dev/null count=16
Second run:
dd if=/pvfs/test bs=512k of=/dev/null count=4
During each run I recorded what one of the iods was doing, then compared
the two straces.
128K I/O
18:34:58.816754 mmap(NULL, 4194304, PROT_READ, MAP_SHARED, 7, 0) = 0x2a95da2000 <0.000020>
18:34:58.816794 madvise(0x2a95da2000, 4194304, MADV_SEQUENTIAL|0x1) = 0 <0.000260>
18:34:58.817072 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left {20, 0}) <0.000006>
18:34:58.817145 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000003>
18:34:58.817165 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 \351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000116>
18:34:58.817312 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left {19, 990000}) <0.009769>
512K I/O
18:35:35.328624 mmap(NULL, 4194304, PROT_READ, MAP_SHARED, 7, 0) = 0x2a95da2000 <0.000019>
18:35:35.328662 madvise(0x2a95da2000, 4194304, MADV_SEQUENTIAL|0x1) = 0 <0.000261>
18:35:35.328940 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left {20, 0}) <0.000006>
18:35:35.329014 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000003>
18:35:35.329034 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 \351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000112>
18:35:35.329176 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left {19, 740000}) <0.258920>
The timings are similar except for the select that follows the 64KB sendto:
0.01s for the 128K I/O
0.25s for the 512K I/O
The timing is about the same for every 64KB sendto in each run.
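
For anyone who wants to repeat the comparison on a full trace, here is a
minimal sketch in Python (not part of PVFS) that totals the time strace
reports per system call. It assumes the record format shown above, as
produced by strace -tt -T (timestamp first, duration in <seconds> last), and
it tolerates records that the mailer wrapped across lines. The names RECORD,
SAMPLE and time_per_syscall are made up for illustration. A record with no
trailing duration would be merged into the next one, so treat the totals as
approximate on traces that contain interrupted calls.

```python
import re
from collections import defaultdict

# One strace record: "HH:MM:SS.usec name(args...) = result <seconds>".
# re.S lets "." cross line breaks, so records wrapped by the mailer still match.
RECORD = re.compile(r"\d\d:\d\d:\d\d\.\d+ (\w+)\(.*?<(\d+\.\d+)>", re.S)


def time_per_syscall(trace):
    """Return {syscall name: total seconds} summed over all records in trace."""
    totals = defaultdict(float)
    for name, seconds in RECORD.findall(trace):
        totals[name] += float(seconds)
    return dict(totals)


# Excerpt of the 128K trace above (raw string keeps the backslashes literal).
SAMPLE = r'''
18:34:58.817072 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left
{20, 0}) <0.000006>
18:34:58.817165 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0
\351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000116>
18:34:58.817312 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left
{19, 990000}) <0.009769>
'''

if __name__ == "__main__":
    # Print one line per syscall, sorted by name.
    for name, total in sorted(time_per_syscall(SAMPLE).items()):
        print(f"{name:8s} {total:.6f}s")
```

To use it on a real capture, read the strace output file into a string and
pass it to time_per_syscall(); comparing the "select" totals of the two runs
should show the same gap as the numbers above.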
Claude
Rob Ross wrote:
>Hey,
>
>What's your strip size default?
>
>So adjusting those parameters did have a positive effect for many cases,
>but the 256KB read case is still bad?
>
>Is it consistently bad for ever-larger sizes, or is that particular size a
>bad one?
>
>Thanks,
>
>Rob
>
>On Mon, 8 Mar 2004, Claude Pignol wrote:
>
>>Rob,
>>
>>
>>I/O 64KB no problem
>>I/O 128KB no problem
>>I/O 256KB: write no problem, but read is 10 times slower.
>>Tuning the parameters gives better performance when pvfs works
>>normally, but with 256K I/O pvfs doesn't behave normally.
>>The current parameters are
>>r(w)mem_max 1048575
>>write_buf 4096
>>access_size 4096
>>socket_buf 1024
>>No error message in the pvfs log
>>
>>Disks: raid disk that can deliver 30MB/s
>>Dedicated to pvfs data
>>
>>Regards
>>Claude
>>
>>Rob Ross wrote:
>>
>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
>>>
>>>>Rob Ross wrote:
>>>>
>>>>>Oh, I misunderstood what you were saying before. I thought that the "few
>>>>>MB" was your file size, not your access size.
>>>>>
>>>>The problem is the I/O size not the file size.
>>>>
>>>>>How many I/O servers do you have in the system? How much memory do you
>>>>>have in your client?
>>>>>
>>>>10 I/O servers, 1GB (dedicated for iod)
>>>>
>>>Clients have this much RAM too?
>>>
>>>>>These four /proc values are the default and maximum socket buffer sizes,
>>>>>if I understand things correctly:
>>>>>/proc/sys/net/core/rmem_default
>>>>>/proc/sys/net/core/rmem_max
>>>>>/proc/sys/net/core/wmem_default
>>>>>/proc/sys/net/core/wmem_max
>>>>>
>>>>r(w)mem_default is 65535
>>>>r(w)mem_max is 131071
>>>>
>>>I would adjust these up significantly. I've seen suggestions of as much
>>>as 8MB for wide area; maybe try 1MB and see how that goes? We're much
>>>nicer about socket usage now, so it shouldn't be too much of a resource
>>>hog.
>>>
>>>I don't think the client adjusts these, so it's going to use the default.
>>>The iod *does* adjust these -- see below.
>>>
>>>>>Also, you might want to adjust the following in your iod.conf file (see
>>>>>man pages for details): socket_buf, access_size.
>>>>>
>>>>write_buf 512
>>>>access_size 512
>>>>socket_buf 64
>>>>
>>>I would adjust access_size up to some multiple of the new wmem_max so that
>>>there is a large enough memory mapped region to fill the buffer with one
>>>send. Likewise for write_buf.
>>>
>>>I would adjust socket_buf to be the same as r(w)mem_max, because that is
>>>what the iod will use.
>>>
>>>>>About where does the dropoff start to occur?
>>>>>
>>>>I/O size of 256KB
>>>>
>>>>The read rate is around 4MB/s for I/O of 1024K
>>>>
>>>>Thanks
>>>>Claude
>>>>
>>>Let me know if this helps. Also, as a kick-start for the next stage, what
>>>sort of storage do you have on those nodes (single disks, SW RAID, FC
>>>attached, ...)?
>>>
>>>Thanks,
>>>
>>>Rob
>>>
>>>>>Regards,
>>>>>
>>>>>Rob
>>>>>
>>>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
>>>>>
>>>>>>Thanks Rob,
>>>>>>
>>>>>>Another fact:
>>>>>>I found that the read works very well with 64K I/O: the read speed is
>>>>>>better than the write speed.
>>>>>>The read performance starts degrading when I increase the I/O size.
>>>>>>
>>>>>>I agree that there is a startup cost, but there is also the read-ahead
>>>>>>mechanism that speeds up disk access.
>>>>>>I am testing with files of at least 1GB.
>>>>>>
>>>>>>I have tested with dynamic buffering (the default) and the static buffering.
>>>>>>Same problem.
>>>>>>How do you increase tcp buffer size?
>>>>>>net.ipv4.tcp_rmem
>>>>>>net.ipv4.tcp_wmem
>>>>>>net.ipv4.tcp_mem
>>>>>>
>>>>>>
>>>>>>Claude
>>>>>>
>>>>>>
>>>>>>Rob Ross wrote:
>>>>>>
>>>>>>>Hi Claude,
>>>>>>>
>>>>>>>Sorry we didn't get back to you sooner. I'm glad that the kernel update
>>>>>>>fixed the problem.
>>>>>>>
>>>>>>>What block size (bs=XXX) are you using in your tests?
>>>>>>>
>>>>>>>Note that when reading no I/O can start until data is read off disk, while
>>>>>>>in the write case data can start moving right away. So you may just be
>>>>>>>seeing startup costs.
>>>>>>>
>>>>>>>You could look at increasing TCP buffer sizes on your system as a first
>>>>>>>step.
>>>>>>>
>>>>>>>Regards,
>>>>>>>
>>>>>>>Rob
>>>>>>>
>>>>>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
>>>>>>>
>>>>>>>>Greetings,
>>>>>>>>
>>>>>>>>An upgrade to 2.4.21 fixes the problem.
>>>>>>>>Compile and start OK.
>>>>>>>>I have noticed a performance problem when reading from PVFS.
>>>>>>>>With big I/O (a few MB), reading runs at around 1/3 of the write speed.
>>>>>>>>PVFS daemons are running with default parameters.
>>>>>>>>Reading/writing from one node to pvfs using dd.
>>>>>>>>I have verified the disk performance of all the 10 I/O nodes
>>>>>>>>I have also verified the network perf to all the nodes.
>>>>>>>>What is the best strategy/tools to address this kind of problem?
>>>>>>>>Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>Claude Pignol wrote:
>>>>>>>>
>>>>>>>>>Greetings,
>>>>>>>>>
>>>>>>>>>I am trying to benchmark pvfs with the SuSE 2.4.19-NUMA kernel
>>>>>>>>>to compare it with the SuSE 2.4.19-SMP kernel.
>>>>>>>>>The pvfs.o module compiles and loads without problems on the SMP kernel.
>>>>>>>>>
>>>>>>>>>With the NUMA kernel I get 3 undefined symbols when I try to load the
>>>>>>>>>module
>>>>>>>>>pvfs.o: unresolved symbol __pollwait
>>>>>>>>>pvfs.o: unresolved symbol mem_map
>>>>>>>>>pvfs.o: unresolved symbol iget4
>>>>>>>>>
>>>>>>>>>The kernel source is installed.
>>>>>>>>>Any idea?
>>>>>>>>>Thanks in advance
>>>>>>>>>Claude
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>_______________________________________________
>>>>>>>>>PVFS-users mailing list
>>>>>>>>>PVFS-users at www.beowulf-underground.org
>>>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-users
>>>>>>>>>
>>>>>>>>_______________________________________________
>>>>>>>>PVFS-developers mailing list
>>>>>>>>PVFS-developers at www.beowulf-underground.org
>>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-developers