[Pvfs2-developers] libpvfs2 usage

Brett Bode brett at scl.ameslab.gov
Thu Oct 19 12:22:12 EDT 2006


It appears that the memory registration code works best if the buffer
size is an integer multiple of the strip_size. Making sure that is
the case has allowed me to run on our EHCA without exceeding its
registration limit. However, I am not seeing much speedup from
going to larger strip and/or buffer sizes. One odd thing that I have
noticed in a new debug run is that even with a strip_size of 1M and a
6M buffer (we have 6 servers) it appears data is sent in 256K chunks.
I have posted a new log for this run at
http://www.scl.ameslab.gov/~brett/pvfs2-1M.log. The sends look like:
[D 10:51:33.141315] test_rq: rq 0x15501580 completed 24 from da8:3336.
[D 10:51:33.141343] BMI_testcontext completing: 91
[D 10:51:33.141415] BMI_post_send_list: addr: 11, count: 1, total_size: 262144, tag: 20
[D 10:51:33.141443]    element 0: offset: 0x14ea1688, size: 262144
[D 10:51:33.141470] BMI_ib_post_send_list: listlen 1 tag 20.
[D 10:51:33.141505] memcache_register: miss [0] 0x14ea1688 len 262144.
[D 10:51:33.141580] BMI_post_recv: addr: 11, offset: 0x15501730, size: 24, tag:

So the first question is: why am I sending in 256K chunks instead of
the full 1M strip size? In addition, in the default config with a 64K
strip size I see:

[D 10:38:29.272429] BMI_post_send_list: addr: 12, count: 3, total_size: 196608, tag: 21
[D 10:38:29.272465]    element 0: offset: 0x14eb16a8, size: 65536
[D 10:38:29.272491]    element 1: offset: 0x14f116a8, size: 65536
[D 10:38:29.272517]    element 2: offset: 0x14f716a8, size: 65536
[D 10:38:29.272544] BMI_ib_post_send_list: listlen 3 tag 21.
[D 10:38:29.272578] memcache_register: miss [0] 0x14eb16a8 len 65536.
[D 10:38:29.272763] memcache_register: miss [1] 0x14f116a8 len 65536.
[D 10:38:29.272941] memcache_register: miss [2] 0x14f716a8 len 65536.
[D 10:38:29.273136] BMI_post_recv: addr: 12, offset: 0x15024eb0, size: 24, tag:

I assume this indicates some level of parallel sends for the
64K chunks. I don't see that with the larger stripes. Perhaps
that loss of parallelization is why I don't see much speedup from
larger stripes?
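
For reference, the three element offsets in the 64K log are each
0x60000 bytes apart, i.e. 6 x 65536, which is what a simple round-robin
layout over 6 servers would produce. Below is a minimal Python sketch of
that arithmetic. It assumes plain round-robin striping starting at
server 0 and that the buffer begins at the address of element 0; the
function name and its parameters are illustrative only and are not part
of libpvfs2.

```python
def strips_for_server(base, length, strip_size, num_servers, server):
    """Return (address, size) pieces of a buffer that land on one server.

    Assumes simple round-robin striping that starts at server 0
    (an assumption for illustration, not taken from the PVFS2 source).
    """
    pieces = []
    offset = server * strip_size
    while offset < length:
        size = min(strip_size, length - offset)
        pieces.append((base + offset, size))
        # The next strip for this server is one full round later.
        offset += num_servers * strip_size
    return pieces


# Hypothetical 1 MB buffer starting at the address of element 0 in the log.
pieces = strips_for_server(0x14eb16a8, 1 << 20, 64 * 1024, 6, 0)
for addr, size in pieces:
    print(hex(addr), size)
```

Under the same assumptions, a 6M buffer with a 1M strip_size gives
exactly one 1M piece per server, so a list length of 1 would be expected
in that run. That accounts for "count: 1" but not for the 256K size.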

Thanks,
Brett

On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:

> brett at scl.ameslab.gov wrote on Wed, 18 Oct 2006 11:04 -0500:
>> Ok I got some debugging output finally by hardcoding in the gossip...
>> calls. I have posted a log file at:
>> http://www.scl.ameslab.gov/~brett/pvfs2.log
>>
>> The app in this case is using a 1MB IO buffer to write a ~62MB file
>> once and then read it back in several times. The pvfs2 debug output
>> is mixed in with the application output, but I think it's still not
>> too hard to follow.
>
> Thanks, that's very helpful.  Here's a quick summary of what's going
> on in the memory caching.
>
> Your app runs 11 seconds.  For the first 2.5 sec, the cache misses
> 902 times, on 902 different buffer addresses, almost all 64 kB.  For
> the remainder of the runtime there are no misses.  All of these
> previous misses generated cached registrations which are then
> reused.  Most are reused exactly 10 times, but three are used
> hundreds of times, perhaps control buffers used internally by pvfs.
> The reason it is 64 kB is that that is probably the stripe size
> you're using for transfers.
>
> I think why we're seeing so much time in the memcache_* functions
> must be due to the length of this list of registrations.  That's a
> lot of pointer chasing to get down to the on-average 451st element.
> One thing I can do is put in a more reasonable data structure, but
> it will still be a time-consuming function.
>
>> It appears to me that despite always being passed the same buffer the
>> memcache_register function almost always misses for the write. Note
>> that the output for a run on one of the EHCAs is very similar. On
>> the EHCA I can write up to about 220MB before it dies with the too
>> much memory registered error.
>
> This also probably explains your EHCA problem.  Those registrations
> show up separately on the NIC, and maybe hit a limit there.
>
> The bigger problem is the same one seen by most applications that
> use networks that require memory registration:  program semantics do
> not require users to register memory but underlying hardware does,
> thus something has to patch that gap.  If you reg/dereg around every
> transfer, things are very slow.  Hence we go with caching in some
> middle layer to fix this up.  The same is true for MPI as well.
> (The Netpipe guys had a way to cause lots of damage by sending lots
> of little buffers rather than one big one, I recall.)
>
> You probably see the buffer as a single thing, not 902 little 64 kB
> chunks.  Somehow we have to communicate this information to the
> message passing layer.  Fortunately you are calling PVFS_sys_write
> just once with a single big buffer, not lots of times with
> individual chunks of the big buffer, so we have the information down
> in PVFS land.  But, we have to figure out how to get this
> information down to the networking layer.  The way the internal
> abstractions are set up, there's no place where the network can find
> out what buffer the user actually passed in.  I'm going to look
> around and see if I can figure something out.
>
> By the way, various groups keep rediscovering this problem but there
> are no real appealing fixes.  When was the last time you saw anybody
> use MPI_Alloc_mem?  :)  We discovered it ourselves in the context of
> PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
> didn't quite complete the work needed to fully integrate it.
> (Wuj's Unifier framework (CCGrid04):
>     http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
> )
>
> 		-- Pete
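
To make the caching discussion above concrete, here is a toy Python
model of a registration cache. It is only a sketch under stated
assumptions and is not the BMI memcache code: the class and method names
are hypothetical, a sorted array with binary search stands in for the
"more reasonable data structure", cached regions are assumed not to
overlap, and the expensive NIC registration is replaced by a miss
counter. It compares registering 902 separate 64 kB strips with
registering the whole user buffer once.

```python
import bisect


class RegistrationCache:
    """Toy registration cache keyed by region start address (hypothetical)."""

    def __init__(self):
        self.starts = []    # sorted start addresses of cached regions
        self.lengths = {}   # start address -> registered length
        self.misses = 0     # times a real registration would be needed
        self.hits = 0       # times a cached registration is reused

    def lookup(self, addr, length):
        """Return the start of a cached region covering the range, or None."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0:
            start = self.starts[i]
            if addr + length <= start + self.lengths[start]:
                return start
        return None

    def register(self, addr, length):
        """Reuse a covering registration if one exists, else record a miss."""
        start = self.lookup(addr, length)
        if start is not None:
            self.hits += 1
            return start
        self.misses += 1
        if addr not in self.lengths:
            bisect.insort(self.starts, addr)
        self.lengths[addr] = length  # new region, or a longer one at addr
        return addr


STRIP = 64 * 1024
N = 902

# Case 1: every strip is registered on its own, then reused once.
per_strip = RegistrationCache()
for _ in range(2):
    for k in range(N):
        per_strip.register(k * STRIP, STRIP)

# Case 2: the whole buffer is registered once, strips are looked up in it.
whole = RegistrationCache()
whole.register(0, N * STRIP)
for k in range(N):
    whole.register(k * STRIP, STRIP)

print(per_strip.misses, per_strip.hits)
print(whole.misses, whole.hits)
```

In this model the per-strip case records 902 misses before any reuse,
while the whole-buffer case records a single miss. The hard part
described above, getting the user's buffer bounds down to the network
layer, is exactly what the sketch takes for granted.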


