[PVFS-developers]
Re: [PVFS-users] Recompile pvfs module for SuSE 2.4.19-NUMA
Rob Ross
rross at mcs.anl.gov
Sun Apr 4 21:41:44 EDT 2004
Hi Claude,
Sorry for the delay.
Ok. You've still got a 64K strip size set as default, and that is going
to keep the file system from doing as well as you might like, so I would
increase that, for one thing.
What is happening in your system with the current configuration is that
when you move from 64K to 128K to 256K reads, you're increasing the number
of servers sending data to that single node (because of the small strip
size). My first guess is that the client is slow to process all the
incoming packets and ACK them, whereas in the write case the client would
be generating a very nice stream of outgoing packets, with far fewer
interrupts and such. This would explain why the select() is taking longer
to come back in the bad case.
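To make the striping arithmetic concrete, here is a minimal sketch of how many I/O servers a single read touches. The 64K strip size and the 10 servers come from this thread; the simple round-robin layout starting at server 0 and the function name `servers_touched` are assumptions made for illustration, not a description of the actual PVFS code.

```python
def servers_touched(offset, size, strip_size, n_servers):
    """Return the set of server indices that hold any byte of
    [offset, offset + size), assuming round-robin striping from server 0."""
    if size <= 0:
        return set()
    first_strip = offset // strip_size
    last_strip = (offset + size - 1) // strip_size
    # Once the request covers at least n_servers strips, every server is hit.
    if last_strip - first_strip + 1 >= n_servers:
        return set(range(n_servers))
    return {s % n_servers for s in range(first_strip, last_strip + 1)}


KB = 1024
for bs in (64, 128, 256, 512, 1024):
    n = len(servers_touched(0, bs * KB, 64 * KB, 10))
    print(f"{bs:>5}K read -> {n} server(s)")
```

Under these assumptions an aligned 64K read is served by one iod, while a 256K read pulls data from four at once, which fits the picture of several servers sending to one client simultaneously.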
You could use tcpdump to look at the packet streams in order to see
windows closing and the like; that's what I think is going on. You should
probably do that. If that *is* the problem, there's nothing at the PVFS
layer that we can do. You might be able to adjust the parameters for your
network device driver to get better performance out of it, etc.
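One possible way to look for closing windows is to save tcpdump's text output and scan it for small advertised window values. The sketch below does only that. The sample lines, addresses, and ports are made up for illustration and are not from this system, and the helper name `small_windows` is hypothetical. The `win` field is the raw advertised value, so if window scaling is in use it has to be multiplied by the scale factor negotiated in the SYN packets.

```python
import re

# tcpdump prints the advertised TCP window as "win <number>".
WIN_RE = re.compile(r"\bwin (\d+)")


def small_windows(lines, threshold=0):
    """Yield (line_number, window) for packets advertising win <= threshold."""
    for lineno, line in enumerate(lines, 1):
        match = WIN_RE.search(line)
        if match and int(match.group(1)) <= threshold:
            yield lineno, int(match.group(1))


# Hypothetical sample lines, invented for illustration only.
sample = [
    "IP 10.0.0.1.40000 > 10.0.0.2.50000: Flags [.], ack 65537, win 5840, length 0",
    "IP 10.0.0.1.40000 > 10.0.0.2.50000: Flags [.], ack 131073, win 0, length 0",
]

for lineno, win in small_windows(sample):
    print(f"line {lineno}: advertised window {win}")
```

A run of zero or near-zero windows from the client during the slow reads would support the guess that the client is not draining its receive buffers fast enough.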
Or it might be something else entirely?
Rob
On Tue, 9 Mar 2004, Claude Pignol wrote:
> Rob,
>
> I ran strace on the iod to see where we are wasting time.
> Two cases: 128K I/O and 512K I/O.
> Reads from a client node: dd if=/pvfs/test bs=128k of=/dev/null count=16
> and a second run with dd if=/pvfs/test bs=512k of=/dev/null count=4
> During each run I recorded what one of the iods was doing and compared the strace output.
>
> 128K I/O
> 18:34:58.816754 mmap(NULL, 4194304, PROT_READ, MAP_SHARED, 7, 0) = 0x2a95da2000 <0.000020>
> 18:34:58.816794 madvise(0x2a95da2000, 4194304, MADV_SEQUENTIAL|0x1) = 0 <0.000260>
> 18:34:58.817072 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left {20, 0}) <0.000006>
> 18:34:58.817145 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000003>
> 18:34:58.817165 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 \351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000116>
> 18:34:58.817312 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left {19, 990000}) <0.009769>
>
> 512K I/O
> 18:35:35.328624 mmap(NULL, 4194304, PROT_READ, MAP_SHARED, 7, 0) = 0x2a95da2000 <0.000019>
> 18:35:35.328662 madvise(0x2a95da2000, 4194304, MADV_SEQUENTIAL|0x1) = 0 <0.000261>
> 18:35:35.328940 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left {20, 0}) <0.000006>
> 18:35:35.329014 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000003>
> 18:35:35.329034 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 \351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000112>
> 18:35:35.329176 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left {19, 740000}) <0.258920>
>
> The timings are similar except for the select that follows the 64KB sendto:
> 0.01s for the 128K I/O
> 0.25s for the 512K I/O
>
> It is the same kind of timing for all the 64KB sendto calls.
>
> Claude
> Rob Ross wrote:
>
> >Hey,
> >
> >What's your strip size default?
> >
> >So adjusting those parameters did have a positive effect for many cases,
> >but the 256KB read case is still bad?
> >
> >Is it consistently bad for ever-larger sizes, or is that particular size a
> >bad one?
> >
> >Thanks,
> >
> >Rob
> >
> >On Mon, 8 Mar 2004, Claude Pignol wrote:
> >
> >>Rob,
> >>
> >>I/O 64KB no problem
> >>I/O 128KB no problem
> >>I/O 256KB write no problem and read 10 times slower.
> >>Tuning the parameters gives better performance when things work
> >>normally, but with 256K I/O pvfs doesn't behave normally.
> >>The current parameters are
> >>r(w)mem_max 1048575
> >>write_buf 4096
> >>access_size 4096
> >>socket_buf 1024
> >>No error message in the pvfs log
> >>
> >>Disks: raid disk that can deliver 30MB/s
> >>Dedicated to pvfs data
> >>
> >>Regards
> >>Claude
> >>
> >>Rob Ross wrote:
> >>
> >>>On Mon, 8 Mar 2004, Claude Pignol wrote:
> >>>
> >>>>Rob Ross wrote:
> >>>>
> >>>>>Oh, I misunderstood what you were saying before. I thought that the "few
> >>>>>MB" was your file size, not your access size.
> >>>>>
> >>>>The problem is the I/O size not the file size.
> >>>>
> >>>>>How many I/O servers do you have in the system? How much memory do you
> >>>>>have in your client?
> >>>>>
> >>>>10 I/O servers 1GB (dedicated for iod)
> >>>>
> >>>Clients have this much RAM too?
> >>>
> >>>>>These four /proc values are the default and maximum socket buffer sizes,
> >>>>>if I understand things correctly:
> >>>>>/proc/sys/net/core/rmem_default
> >>>>>/proc/sys/net/core/rmem_max
> >>>>>/proc/sys/net/core/wmem_default
> >>>>>/proc/sys/net/core/wmem_max
> >>>>>
> >>>>r(w)mem_default is 65535
> >>>>r(w)mem_max is 131071
> >>>>
> >>>I would adjust these up significantly. I've seen suggestions of as much
> >>>as 8MB for wide area; maybe try 1MB and see how that goes? We're much
> >>>nicer about socket usage now, so it shouldn't be too much of a resource
> >>>hog.
> >>>
> >>>I don't think the client adjusts these, so it's going to use the default.
> >>>The iod *does* adjust these -- see below.
> >>>
> >>>>>Also, you might want to adjust the following in your iod.conf file (see
> >>>>>man pages for details): socket_buf, access_size.
> >>>>>
> >>>>write_buf 512
> >>>>access_size 512
> >>>>socket_buf 64
> >>>>
> >>>I would adjust access_size up to some multiple of the new wmem_max so that
> >>>there is a large enough memory mapped region to fill the buffer with one
> >>>send. Likewise for write_buf.
> >>>
> >>>I would adjust socket_buf to be the same as r(w)mem_max, because that is
> >>>what the iod will use.
> >>>
> >>>>>About where does the dropoff start to occur?
> >>>>>
> >>>>I/O size of 256KB
> >>>>
> >>>>The read rate is around 4MB/s for I/O of 1024K
> >>>>
> >>>>Thanks
> >>>>Claude
> >>>>
> >>>Let me know if this helps. Also, as a kick-start for the next stage, what
> >>>sort of storage do you have on those nodes (single disks, SW RAID, FC
> >>>attached, ...)?
> >>>
> >>>Thanks,
> >>>
> >>>Rob
> >>>
> >>>>>Regards,
> >>>>>
> >>>>>Rob
> >>>>>
> >>>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
> >>>>>
> >>>>>>Thanks Rob,
> >>>>>>
> >>>>>>Another fact:
> >>>>>>I found that the read works very well with 64K I/O: the read speed is
> >>>>>>better than the write speed.
> >>>>>>Read performance starts degrading when I increase the I/O size.
> >>>>>>
> >>>>>>I agree that there is a startup cost, but there is the read-ahead
> >>>>>>mechanism that speeds up disk access.
> >>>>>>I am testing with files of at least 1GB.
> >>>>>>
> >>>>>>I have tested with dynamic buffering (the default) and the static buffering.
> >>>>>>Same problem.
> >>>>>>How do you increase tcp buffer size?
> >>>>>>net.ipv4.tcp_rmem
> >>>>>>net.ipv4.tcp_wmem
> >>>>>>net.ipv4.tcp_mem
> >>>>>>
> >>>>>>Claude
> >>>>>>
> >>>>>>Rob Ross wrote:
> >>>>>>
> >>>>>>>Hi Claude,
> >>>>>>>
> >>>>>>>Sorry we didn't get back to you sooner. I'm glad that the kernel update
> >>>>>>>fixed the problem.
> >>>>>>>
> >>>>>>>What block size (bs=XXX) are you using in your tests?
> >>>>>>>
> >>>>>>>Note that when reading no I/O can start until data is read off disk, while
> >>>>>>>in the write case data can start moving right away. So you may just be
> >>>>>>>seeing startup costs.
> >>>>>>>
> >>>>>>>You could look at increasing TCP buffer sizes on your system as a first
> >>>>>>>step.
> >>>>>>>
> >>>>>>>Regards,
> >>>>>>>
> >>>>>>>Rob
> >>>>>>>
> >>>>>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
> >>>>>>>
> >>>>>>>>Greetings,
> >>>>>>>>
> >>>>>>>>An upgrade to 2.4.21 fixes the problem.
> >>>>>>>>Compile and start OK.
> >>>>>>>>I have noticed a performance problem when reading from PVFS.
> >>>>>>>>With big I/O (a few MB), reading gets around 1/3 of the performance of writing.
> >>>>>>>>The pvfs daemons run with default parameters.
> >>>>>>>>I am reading/writing from one node to pvfs using dd.
> >>>>>>>>I have verified the disk performance of all the 10 I/O nodes
> >>>>>>>>I have also verified the network perf to all the nodes.
> >>>>>>>>What is the best strategy/tools to address this kind of problem?
> >>>>>>>>Thanks
> >>>>>>>>
> >>>>>>>>Claude Pignol wrote:
> >>>>>>>>
> >>>>>>>>>Greetings,
> >>>>>>>>>
> >>>>>>>>>I am trying to benchmark pvfs with the SuSE 2.4.19-NUMA kernel
> >>>>>>>>>to compare with the SuSE 2.4.19-SMP kernel.
> >>>>>>>>>The pvfs.o module compiles and loads without problems with the SMP kernel.
> >>>>>>>>>
> >>>>>>>>>With the NUMA kernel I get 3 undefined symbols when I try to load the
> >>>>>>>>>module:
> >>>>>>>>>pvfs.o: unresolved symbol __pollwait
> >>>>>>>>>pvfs.o: unresolved symbol mem_map
> >>>>>>>>>pvfs.o: unresolved symbol iget4
> >>>>>>>>>
> >>>>>>>>>The kernel source is installed.
> >>>>>>>>>Any idea?
> >>>>>>>>>Thanks in advance
> >>>>>>>>>Claude
> >>>>>>>>>
> >>>>>>>>>_______________________________________________
> >>>>>>>>>PVFS-users mailing list
> >>>>>>>>>PVFS-users at www.beowulf-underground.org
> >>>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-users
> >>>>>>>>>
> >>>>>>>>_______________________________________________
> >>>>>>>>PVFS-developers mailing list
> >>>>>>>>PVFS-developers at www.beowulf-underground.org
> >>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-developers