[PVFS-developers] Re: [PVFS-users] Recompile pvfs module for SuSE 2.4.19-NUMA

Claude Pignol cpignol at seismiccity.com
Tue Mar 9 11:18:23 EST 2004


Rob,

I ran strace on the iod to see where we are losing time.
Two cases: 128K I/O and 512K I/O.
The first run reads from a client node with dd if=/pvfs/test bs=128k of=/dev/null count=16,
and the second run uses dd if=/pvfs/test bs=512k of=/dev/null count=4.
During each run I recorded what one of the iods was doing, then compared the straces:

128K I/O
18:34:58.816754 mmap(NULL, 4194304, PROT_READ, MAP_SHARED, 7, 0) = 0x2a95da2000 <0.000020>
18:34:58.816794 madvise(0x2a95da2000, 4194304, MADV_SEQUENTIAL|0x1) = 0 <0.000260>
18:34:58.817072 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left {20, 0}) <0.000006>
18:34:58.817145 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000003>
18:34:58.817165 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 \351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000116>
18:34:58.817312 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left {19, 990000}) <0.009769>

512K I/O
18:35:35.328624 mmap(NULL, 4194304, PROT_READ, MAP_SHARED, 7, 0) = 0x2a95da2000 <0.000019>
18:35:35.328662 madvise(0x2a95da2000, 4194304, MADV_SEQUENTIAL|0x1) = 0 <0.000261>
18:35:35.328940 select(9, [4 5], [8], NULL, {20, 0}) = 1 (out [8], left {20, 0}) <0.000006>
18:35:35.329014 fcntl(8, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) <0.000003>
18:35:35.329034 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 \351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000112>
18:35:35.329176 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left {19, 740000}) <0.258920>
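
For anyone who wants to repeat the comparison on a longer trace, here is a minimal sketch in Python that pulls the syscall name and the duration in angle brackets out of trace lines like the ones above. It is an illustration only: the function name parse_strace and both regular expressions are my own, and it assumes the timestamp and <duration> layout shown here (what strace -tt -T produces). Lines wrapped by the mail client are joined back to the record they belong to.

```python
import re

# Two records from the 512K trace, deliberately left wrapped to show
# that continuation lines are handled.
TRACE = r'''
18:35:35.329034 sendto(8, "\0\0\0\0\217\4\6\0\374\362\v\0\351.\f\0 
\351\r\0\260\3"..., 65536, 0, NULL, 0) = 65536 <0.000112>
18:35:35.329176 select(9, [4 5 8], [], NULL, {20, 0}) = 1 (in [8], left 
{19, 740000}) <0.258920>
'''

# A new record starts with an HH:MM:SS.micros timestamp.
START = re.compile(r'^\d{2}:\d{2}:\d{2}\.\d{6} ')
# timestamp, syscall name, anything, then <seconds> at the end of the record.
RECORD = re.compile(r'^(\S+) (\w+)\(.*<(\d+\.\d+)>$')


def parse_strace(trace):
    """Return a list of (syscall, seconds) tuples, one per trace record."""
    records = []
    for line in trace.splitlines():
        if not line.strip():
            continue
        if START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += line  # wrapped continuation of the previous record
    calls = []
    for rec in records:
        m = RECORD.match(rec.strip())
        if m:
            calls.append((m.group(2), float(m.group(3))))
    return calls


for name, secs in parse_strace(TRACE):
    print(f"{name:8s} {secs:.6f}s")
```

Sorting the returned list by the second field gives the slowest calls first, which is enough to spot the long select.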

The timings are similar except for the select that follows the 64KB sendto:
about 0.01s for the 128K I/O
about 0.26s for the 512K I/O

It is the same kind of timing for every 64KB sendto.
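
As a rough sketch of what these numbers imply, the snippet below turns the sendto and select durations from the two traces into a per-iod transfer rate. It is back-of-the-envelope arithmetic, not a measurement: it assumes every 64KB chunk pays the same sendto plus select cost, and the helper name chunk_rate_mb_s is made up for the example.

```python
CHUNK = 65536  # bytes per sendto, taken from the traces


def chunk_rate_mb_s(send_s, wait_s, chunk=CHUNK):
    """MB/s (1 MB = 2**20 bytes) if each chunk costs send_s + wait_s seconds."""
    return chunk / (send_s + wait_s) / 2**20


# Durations copied from the sendto/select pairs in the two traces.
rate_128k = chunk_rate_mb_s(0.000116, 0.009769)
rate_512k = chunk_rate_mb_s(0.000112, 0.258920)

print(f"128K I/O: {rate_128k:.2f} MB/s per iod")
print(f"512K I/O: {rate_512k:.2f} MB/s per iod")
print(f"slowdown: {rate_128k / rate_512k:.1f}x")
```

Under that assumption the select wait, not the sendto itself, accounts for nearly all of the time per chunk in both runs.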

Claude
Rob Ross wrote:

>Hey,
>
>What's your strip size default?
>
>So adjusting those parameters did have a positive effect for many cases, 
>but the 256KB read case is still bad?
>
>Is it consistently bad for ever-larger sizes, or is that particular size a 
>bad one?
>
>Thanks,
>
>Rob
>
>On Mon, 8 Mar 2004, Claude Pignol wrote:
>
>>Rob,
>>
>>
>>I/O 64KB no problem
>>I/O 128KB no problem
>>I/O 256KB write no problem and read 10 times slower.
>>Tuning the parameters gives better performance when pvfs works normally,
>>but with 256KB I/O pvfs doesn't behave normally.
>>The current parameters are
>>r(w)mem_max 1048575
>>write_buf 4096
>>access_size 4096
>>socket_buf 1024
>>No error message in the pvfs log
>>
>>Disks: RAID disk that can deliver 30MB/s,
>>dedicated to pvfs data
>>
>>Regards
>>Claude
>>
>>Rob Ross wrote:
>>
>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
>>>
>>>>Rob Ross wrote:
>>>>
>>>>>Oh, I misunderstood what you were saying before.  I thought that the "few 
>>>>>MB" was your file size, not your access size.
>>>>>
>>>>The problem is the I/O size not the file size.
>>>>
>>>>>How many I/O servers do you have in the system?  How much memory do you 
>>>>>have in your client?
>>>>>
>>>>10 I/O servers 1GB (dedicated for iod)
>>>>   
>>>Clients have this much RAM too?
>>>
>>>>>These four /proc values are the default and maximum socket buffer sizes, 
>>>>>if I understand things correctly:
>>>>>/proc/sys/net/core/rmem_default
>>>>>/proc/sys/net/core/rmem_max
>>>>>/proc/sys/net/core/wmem_default
>>>>>/proc/sys/net/core/wmem_max
>>>>>
>>>>r(w)mem_default is 65535
>>>>r(w)mem_max is 131071
>>>>   
>>>I would adjust these up significantly.  I've seen suggestions of as much 
>>>as 8MB for wide area; maybe try 1MB and see how that goes?  We're much 
>>>nicer about socket usage now, so it shouldn't be too much of a resource 
>>>hog.
>>>
>>>I don't think the client adjusts these, so it's going to use the default.  
>>>The iod *does* adjust these -- see below.
>>>
>>>>>Also, you might want to adjust the following in your iod.conf file (see 
>>>>>man pages for details): socket_buf, access_size.
>>>>>
>>>>write_buf 512
>>>>access_size 512
>>>>socket_buf 64
>>>>   
>>>I would adjust access_size up to some multiple of the new wmem_max so that 
>>>there is a large enough memory mapped region to fill the buffer with one 
>>>send.  Likewise for write_buf.
>>>
>>>I would adjust socket_buf to be the same as r(w)mem_max, because that is 
>>>what the iod will use.
>>>
>>>>>About where does the dropoff start to occur?
>>>>>
>>>>I/O size of 256KB
>>>>
>>>>The read rate is around 4MB/s for I/O of 1024K
>>>>
>>>>Thanks
>>>>Claude
>>>>   
>>>Let me know if this helps.  Also, as a kick-start for the next stage, what 
>>>sort of storage do you have on those nodes (single disks, SW RAID, FC 
>>>attached, ...)?
>>>
>>>Thanks,
>>>
>>>Rob
>>>
>>>>>Regards,
>>>>>
>>>>>Rob
>>>>>
>>>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
>>>>>
>>>>>>Thanks Rob,
>>>>>>
>>>>>>Another fact:
>>>>>>I found that the read works very well with 64K I/O: the read speed is 
>>>>>>better than the write speed.
>>>>>>The read perf starts degrading when I increase the I/O size.
>>>>>>
>>>>>>I agree that there is a startup cost, but there is the read-ahead mechanism
>>>>>>that speeds up the disk access.
>>>>>>I am testing with files of at least 1GB.
>>>>>>
>>>>>>I have tested with dynamic buffering (the default) and the static buffering.
>>>>>>Same problem.
>>>>>>How do you increase tcp buffer size?
>>>>>>net.ipv4.tcp_rmem
>>>>>>net.ipv4.tcp_wmem
>>>>>>net.ipv4.tcp_mem
>>>>>>
>>>>>>
>>>>>>Claude
>>>>>>
>>>>>>
>>>>>>Rob Ross wrote:
>>>>>>
>>>>>>>Hi Claude,
>>>>>>>
>>>>>>>Sorry we didn't get back to you sooner.  I'm glad that the kernel update 
>>>>>>>fixed the problem.
>>>>>>>
>>>>>>>What block size (bs=XXX) are you using in your tests?
>>>>>>>
>>>>>>>Note that when reading no I/O can start until data is read off disk, while 
>>>>>>>in the write case data can start moving right away.  So you may just be 
>>>>>>>seeing startup costs.
>>>>>>>
>>>>>>>You could look at increasing TCP buffer sizes on your system as a first 
>>>>>>>step.
>>>>>>>
>>>>>>>Regards,
>>>>>>>
>>>>>>>Rob
>>>>>>>
>>>>>>>On Mon, 8 Mar 2004, Claude Pignol wrote:
>>>>>>>
>>>>>>>>Greetings,
>>>>>>>>
>>>>>>>>An upgrade to 2.4.21 fixes the problem.
>>>>>>>>Compile and start OK.
>>>>>>>>I have noticed a performance problem in reading from PVFS.
>>>>>>>>With big I/O (a few MB) reading gets around 1/3 of the write performance.
>>>>>>>>The pvfs daemons run with default parameters.
>>>>>>>>I am reading/writing from one node to pvfs using dd.
>>>>>>>>I have verified the disk performance of all the 10 I/O nodes
>>>>>>>>I have also verified the network perf to all the nodes.
>>>>>>>>What is the best strategy/tools to address this kind of problem?
>>>>>>>>Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>Claude Pignol wrote:
>>>>>>>>
>>>>>>>>>Greetings,
>>>>>>>>>
>>>>>>>>>I am trying to benchmark pvfs with the SuSE 2.4.19-NUMA kernel
>>>>>>>>>to compare it with the SuSE 2.4.19-SMP kernel.
>>>>>>>>>The pvfs.o module compiles and loads without problems on the SMP kernel.
>>>>>>>>>
>>>>>>>>>With the NUMA kernel I get 3 undefined symbols when I try to load the 
>>>>>>>>>module:
>>>>>>>>>pvfs.o: unresolved symbol __pollwait
>>>>>>>>>pvfs.o: unresolved symbol mem_map
>>>>>>>>>pvfs.o: unresolved symbol iget4
>>>>>>>>>
>>>>>>>>>The kernel source is installed.
>>>>>>>>>Any idea?
>>>>>>>>>Thanks in advance
>>>>>>>>>Claude
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>_______________________________________________
>>>>>>>>>PVFS-users mailing list
>>>>>>>>>PVFS-users at www.beowulf-underground.org
>>>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-users
>>>>>>>>>
>>>>>>>>_______________________________________________
>>>>>>>>PVFS-developers mailing list
>>>>>>>>PVFS-developers at www.beowulf-underground.org
>>>>>>>>http://www.beowulf-underground.org/mailman/listinfo/pvfs-developers
>>>>>>>>
