[Pvfs2-developers] libpvfs2 usage

Sam Lang slang at mcs.anl.gov
Thu Oct 19 13:12:14 EDT 2006


On Oct 19, 2006, at 11:22 AM, Brett Bode wrote:

> It appears that the memory registration code works best if the  
> buffer size is an integer multiple of the strip_size. Making sure  
> that is the case has allowed me to run on our EHCA without  
> exceeding its registration limit. However I am not seeing much of  
> any speedup by going to larger strip and/or buffer sizes. One odd  
> thing that I have noticed in a new debug run is that even with a  
> strip_size of 1M and a 6M buffer (we have 6 servers) it appears data  
> is sent in 256K chunks.

I have been referring to 'flow buffer size' and 'flow buffer count',  
and this may be cause for some confusion.  The 'flow buffers' are  
used on the server to pull data off the network and push it to disk  
(or vice-versa).  The size of each buffer and number of buffers can  
be changed with config options.  The defaults are 256K and 8  
respectively.

Also just to avoid confusion, I usually use 'request size' to refer  
to the size of the IO request being made to PVFS.  So there are three  
things at play here:  request size, flow buffer size, and strip size.

So the 256K chunks you see are from that default flow buffer size.   
We only post BMI operations in flow-buffer-sized chunks, so at the  
BMI method layer (IB in this case) on the server, you see those  
memory registrations.  On the client the buffer passed in through the  
system interfaces is used, so I would bet that the memory  
registrations don't get chunked into flow buffer sized regions like  
they do on the server.
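
To make the chunking concrete, here is a rough sketch of the arithmetic (not the actual flow code). The helper name flow_chunks is made up for illustration, and the even round-robin striping of the request across servers is an assumption:

```python
KIB = 1024

def flow_chunks(region_bytes, flow_buffer_bytes=256 * KIB):
    """Split one contiguous region into flow-buffer-sized posts.

    Returns the size of each post; only the last one may be short.
    (Hypothetical helper, not a PVFS function.)
    """
    if region_bytes < 0 or flow_buffer_bytes <= 0:
        raise ValueError("region must be >= 0 and flow buffer must be > 0")
    full, rest = divmod(region_bytes, flow_buffer_bytes)
    return [flow_buffer_bytes] * full + ([rest] if rest else [])

# A 6M request over 6 servers with a 1M strip size puts one 1M strip
# on each server (assuming simple round-robin striping).
per_server = (6 * 1024 * KIB) // 6
posts = flow_chunks(per_server)
print(len(posts), posts[0])  # prints: 4 262144
```

With the default 256K flow buffer, each 1M strip would therefore show up as four 262144-byte posts, which matches the size in your log.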

To change the flow buffer size, you can add this to your fs.conf:

FlowBufferSizeBytes 1446864

It needs to be added within the <Filesystem> context.
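
Roughly, the placement would look like the fragment below. This is only a sketch: the comment line stands in for whatever options your fs.conf already has in that section, and I am assuming the config parser accepts '#' comments.

```
<Filesystem>
    # ... existing options for this file system stay unchanged ...
    FlowBufferSizeBytes 1446864
</Filesystem>
```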

-sam


> I have posted a new log for this run at
> http://www.scl.ameslab.gov/~brett/pvfs2-1M.log. The sends look like:
> [D 10:51:33.141315] test_rq: rq 0x15501580 completed 24 from da8:3336.
> [D 10:51:33.141343] BMI_testcontext completing: 91
> [D 10:51:33.141415] BMI_post_send_list: addr: 11, count: 1, total_size: 262144, tag: 20
> [D 10:51:33.141443]    element 0: offset: 0x14ea1688, size: 262144
> [D 10:51:33.141470] BMI_ib_post_send_list: listlen 1 tag 20.
> [D 10:51:33.141505] memcache_register: miss [0] 0x14ea1688 len 262144.
> [D 10:51:33.141580] BMI_post_recv: addr: 11, offset: 0x15501730, size: 24, tag:
>
> So the first question is why am I sending in 256K chunks instead of  
> the full 1M strip size? In addition in the default config. with a  
> 64K strip size I see:
>
> [D 10:38:29.272429] BMI_post_send_list: addr: 12, count: 3, total_size: 196608, tag: 21
> [D 10:38:29.272465]    element 0: offset: 0x14eb16a8, size: 65536
> [D 10:38:29.272491]    element 1: offset: 0x14f116a8, size: 65536
> [D 10:38:29.272517]    element 2: offset: 0x14f716a8, size: 65536
> [D 10:38:29.272544] BMI_ib_post_send_list: listlen 3 tag 21.
> [D 10:38:29.272578] memcache_register: miss [0] 0x14eb16a8 len 65536.
> [D 10:38:29.272763] memcache_register: miss [1] 0x14f116a8 len 65536.
> [D 10:38:29.272941] memcache_register: miss [2] 0x14f716a8 len 65536.
> [D 10:38:29.273136] BMI_post_recv: addr: 12, offset: 0x15024eb0, size: 24, tag:
>
> I assume this indicates some level of parallel sends for the
> 64K chunks. I don't see that with the larger stripes. Perhaps
> that loss of parallelization is why I don't see much speedup with
> larger stripes?
>
> Thanks,
> Brett
> On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:
>
>> brett at scl.ameslab.gov wrote on Wed, 18 Oct 2006 11:04 -0500:
>>> Ok I got some debugging output finally by hardcoding in the
>>> gossip... calls. I have posted a log file at:
>>> http://www.scl.ameslab.gov/~brett/pvfs2.log
>>>
>>> The app in this case is using a 1MB IO buffer to write a ~62MB file
>>> once and then read it back in several times. The pvfs2 debug output
>>> is mixed in with the application output, but I think it's still not
>>> too hard to follow.
>>
>> Thanks, that's very helpful.  Here's a quick summary of what's going
>> on in the memory caching.
>>
>> Your app runs 11 seconds.  For the first 2.5 sec, the cache misses
>> 902 times, on 902 different buffer addresses, most all 64 kB.  For
>> the remainder of the runtime there are no misses.  All of these
>> previous misses generated cached registrations which are then
>> reused.  Most are reused exactly 10 times, but three are used
>> hundreds of times, perhaps control buffers used internally by pvfs.
>> The reason it is 64 kB is that that is probably the stripe size
>> you're using for transfers.
>>
>> I think why we're seeing so much time in the memcache_* functions
>> must be due to the length of this list of registrations.  That's a
>> lot of pointer chasing to get down to the on-average 451st element.
>> One thing I can do is put in a more reasonable data structure, but
>> it will still be a time-consuming function.
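
The list-walk cost described above can be sketched as follows. This is a toy model, not the memcache_* code: the function names and the made-up addresses are assumptions, and it only counts how many entries a front-to-back scan examines on a hit.

```python
def linear_probes(entries, key):
    """Count entries examined by a front-to-back scan until key is found."""
    for i, entry in enumerate(entries, start=1):
        if entry == key:
            return i
    return len(entries)  # miss: the whole list was walked

def average_hit_probes(n, chunk=65536):
    """Average scan length over one lookup of each of n cached chunks."""
    # Addresses are invented; only their distinctness matters here.
    entries = [(0x10000000 + i * chunk, chunk) for i in range(n)]
    total = sum(linear_probes(entries, e) for e in entries)
    return total / n

print(average_hit_probes(902))  # prints: 451.5
```

A keyed structure (a hash table or a balanced tree on the address) would cut the number of entries examined per lookup, though the lookup itself would not become free.
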
>>
>>> It appears to me that despite always being passed the same buffer the
>>> memcache_register function almost always misses for the write. Note
>>> that the output for a run on one of the EHCAs is very similar. On
>>> the EHCA I can write up to about 220MB before it dies with the too
>>> much memory registered error.
>>
>> This also probably explains your EHCA problem.  Those registrations
>> show up separately on the NIC, and maybe hit a limit there.
>>
>> The bigger problem is the same one seen by most applications that
>> use networks that require memory registration:  program semantics do
>> not require users to register memory but underlying hardware does,
>> thus something has to patch that gap.  If you reg/dereg around every
>> transfer, things are very slow.  Hence we go with caching in some
>> middle layer to fix this up.  The same is true for MPI as well.
>> (The Netpipe guys had a way to cause lots of damage by sending lots
>> of little buffers rather than one big one, I recall.)
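
As a rough illustration of why the caching layer pays off, the sketch below replays the pattern from the log summary: 902 distinct 64 kB chunks, each used once and then reused 10 times. The class and function names are hypothetical, the addresses are invented, and the dictionary stands in for whatever structure the real cache uses.

```python
class RegistrationCache:
    """Toy cache keyed on (address, length); a miss models one NIC registration."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def register(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = object()  # placeholder for a registration handle
        return self.entries[key]

def replay(chunks, passes):
    """Register every chunk once per pass; return (misses, hits)."""
    cache = RegistrationCache()
    for _ in range(passes):
        for addr, length in chunks:
            cache.register(addr, length)
    return cache.misses, cache.hits

chunks = [(0x14000000 + i * 65536, 65536) for i in range(902)]
print(replay(chunks, 11))  # prints: (902, 9020)
```

Under these assumptions, registering and deregistering around every transfer would cost 9922 registrations, where the cache needs only 902.
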
>>
>> You probably see the buffer as a single thing, not 902 little 64 kB
>> chunks.  Somehow we have to communicate this information to the
>> message passing layer.  Fortunately you are calling PVFS_sys_write
>> just once with a single big buffer, not lots of times with
>> individual chunks of the big buffer, so we have the information down
>> in PVFS land.  But, we have to figure out how to get this
>> information down to the networking layer.  The way the internal
>> abstractions are set up, there's no place where the network can find
>> out what buffer the user actually passed in.  I'm going to look
>> around and see if I can figure something out.
>>
>> By the way, various groups keep rediscovering this problem but there
>> are no real appealing fixes.  When was the last time you saw anybody
>> use MPI_Alloc_mem?  :)  We discovered it ourselves in the context of
>> PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
>> didn't quite complete the work needed to fully integrate it.
>> (Wuj's Unifier framework (CCGrid04):
>>     http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
>> )
>>
>> 		-- Pete


