[Pvfs2-developers] libpvfs2 usage
Brett Bode
brett at scl.ameslab.gov
Thu Oct 19 12:22:12 EDT 2006
It appears that the memory registration code works best if the buffer
size is an integer multiple of the strip_size. Making sure that is
the case has allowed me to run on our EHCA without exceeding its
registration limit. However, I am not seeing much speedup from
going to larger strip and/or buffer sizes. One odd thing that I have
noticed in a new debug run is that even with a strip_size of 1M and a
6M buffer (we have 6 servers), data appears to be sent in 256K chunks.
I have posted a new log for this run at
http://www.scl.ameslab.gov/~brett/pvfs2-1M.log. The sends look like:
[D 10:51:33.141315] test_rq: rq 0x15501580 completed 24 from da8:3336.
[D 10:51:33.141343] BMI_testcontext completing: 91
[D 10:51:33.141415] BMI_post_send_list: addr: 11, count: 1, total_size: 262144, tag: 20
[D 10:51:33.141443] element 0: offset: 0x14ea1688, size: 262144
[D 10:51:33.141470] BMI_ib_post_send_list: listlen 1 tag 20.
[D 10:51:33.141505] memcache_register: miss [0] 0x14ea1688 len 262144.
[D 10:51:33.141580] BMI_post_recv: addr: 11, offset: 0x15501730,
size: 24, tag:
So the first question is: why am I sending in 256K chunks instead of
the full 1M strip size? In addition, with the default config and a 64K
strip size, I see:
[D 10:38:29.272429] BMI_post_send_list: addr: 12, count: 3, total_size: 196608, tag: 21
[D 10:38:29.272465] element 0: offset: 0x14eb16a8, size: 65536
[D 10:38:29.272491] element 1: offset: 0x14f116a8, size: 65536
[D 10:38:29.272517] element 2: offset: 0x14f716a8, size: 65536
[D 10:38:29.272544] BMI_ib_post_send_list: listlen 3 tag 21.
[D 10:38:29.272578] memcache_register: miss [0] 0x14eb16a8 len 65536.
[D 10:38:29.272763] memcache_register: miss [1] 0x14f116a8 len 65536.
[D 10:38:29.272941] memcache_register: miss [2] 0x14f716a8 len 65536.
[D 10:38:29.273136] BMI_post_recv: addr: 12, offset: 0x15024eb0,
size: 24, tag:
I assume this indicates some level of parallel sends for the
64K chunks. I don't see that with the larger stripes. Perhaps
that loss of parallelization is why I don't see much speedup with
larger stripes?
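[Editor's sketch] The element addresses in the 64K log above are 0x60000
bytes apart, which is 6 x 64K, so each server seems to receive every
sixth strip of the buffer. The Python below is only an illustration of
that arithmetic, assuming strips are dealt round-robin starting at
server 0. The function names and the base address 0 are hypothetical
and are not taken from the PVFS2 sources.

```python
def round_up(size, strip_size):
    """Smallest multiple of strip_size that is >= size."""
    return -(-size // strip_size) * strip_size


def server_segments(buf_addr, buf_len, strip_size, num_servers, server):
    """Return the (address, length) pieces of a buffer that go to one server.

    Assumption: strips are assigned round-robin, strip 0 to server 0.
    """
    segments = []
    offset = server * strip_size
    while offset < buf_len:
        segments.append((buf_addr + offset, min(strip_size, buf_len - offset)))
        offset += strip_size * num_servers
    return segments


KB = 1024
MB = 1024 * KB

# 1M buffer, 64K strips, 6 servers: server 0 gets three 64K pieces,
# spaced 6 * 64K = 0x60000 bytes apart.
small_strips = server_segments(0, 1 * MB, 64 * KB, 6, 0)

# 6M buffer, 1M strips, 6 servers: each server gets one 1M piece.
large_strips = server_segments(0, 6 * MB, 1 * MB, 6, 3)
```

Under these assumptions the 64K case yields a three-element list per
server and the 1M case a single element. The sketch does not model
whatever splits the 1M strip into 256K sends.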
Thanks,
Brett
On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:
> brett at scl.ameslab.gov wrote on Wed, 18 Oct 2006 11:04 -0500:
>> Ok I got some debugging output finally by hardcoding in the gossip...
>> calls. I have posted a log file at:
>> http://www.scl.ameslab.gov/~brett/pvfs2.log
>>
>> The app in this case is using a 1MB IO buffer to write a ~62MB file
>> once and then read it back in several times. The pvfs2 debug output
>> is mixed in with the application output, but I think it's still not
>> too hard to follow.
>
> Thanks, that's very helpful. Here's a quick summary of what's going
> on in the memory caching.
>
> Your app runs 11 seconds. For the first 2.5 sec, the cache misses
> 902 times, on 902 different buffer addresses, almost all 64 kB. For
> the remainder of the runtime there are no misses. All of these
> previous misses generated cached registrations which are then
> reused. Most are reused exactly 10 times, but three are used
> hundreds of times, perhaps control buffers used internally by pvfs.
> The reason it is 64 kB is that that is probably the stripe size
> you're using for transfers.
>
> I think the reason we're seeing so much time in the memcache_*
> functions is the length of this list of registrations. That's a
> lot of pointer chasing to get down to the on-average 451st element.
> One thing I can do is put in a more reasonable data structure, but
> it will still be a time-consuming function.
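[Editor's sketch] To illustrate the "on-average 451st element" remark:
scanning a list of 902 entries for an exact match costs (902 + 1) / 2 =
451.5 comparisons on average if every entry is looked up equally often.
The Python below counts those comparisons and contrasts the scan with a
hash-table index. It is a toy model. The key layout (address, length)
and the addresses are assumptions, not the actual memcache structures.

```python
def linear_lookup(entries, key):
    """Scan entries in order; return (index, comparisons made)."""
    for i, candidate in enumerate(entries):
        if candidate == key:
            return i, i + 1
    return None, len(entries)


# 902 hypothetical registrations of 64 kB each, keyed by (address, length).
entries = [(0x10000000 + i * 0x60000, 65536) for i in range(902)]

# Average cost when each entry is looked up exactly once.
total_comparisons = sum(linear_lookup(entries, key)[1] for key in entries)
average_comparisons = total_comparisons / len(entries)

# One possible "more reasonable data structure": a dict gives an
# expected constant-time exact-match lookup instead of a scan.
index = {key: i for i, key in enumerate(entries)}
position = index[entries[450]]
```

A dict only handles exact (address, length) matches. A lookup that must
also find a registration containing a sub-range would need something
ordered by address instead.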
>
>> It appears to me that despite always being passed the same buffer the
>> memcache_register function almost always misses for the write. Note
>> that the output for a run on one of the EHCA's is very similar. On
>> the EHCA I can write up to about 220MB before it dies with the too
>> much memory registered error.
>
> This also probably explains your EHCA problem. Those registrations
> show up separately on the NIC, and maybe hit a limit there.
>
> The bigger problem is the same one seen by most applications that
> use networks that require memory registration: program semantics do
> not require users to register memory but underlying hardware does,
> thus something has to patch that gap. If you reg/dereg around every
> transfer, things are very slow. Hence we go with caching in some
> middle layer to fix this up. The same is true for MPI as well.
> (The Netpipe guys had a way to cause lots of damage by sending lots
> of little buffers rather than one big one, I recall.)
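[Editor's sketch] The trade-off described above can be shown by counting
registration calls. The Python below uses a stand-in "NIC" object that
only counts calls. It is not an InfiniBand or BMI API, and every name in
it is hypothetical. The workload (902 buffers, each sent 10 times) is a
simplification of the numbers quoted from the log.

```python
class CountingNic:
    """Stand-in for the hardware: it only counts (de)registrations."""

    def __init__(self):
        self.registrations = 0
        self.deregistrations = 0

    def register(self, addr, length):
        self.registrations += 1
        return (addr, length)  # opaque handle

    def deregister(self, handle):
        self.deregistrations += 1


def send_uncached(nic, transfers):
    """Register and deregister around every single transfer."""
    for addr, length in transfers:
        handle = nic.register(addr, length)
        # ... the transfer itself would happen here ...
        nic.deregister(handle)


def send_cached(nic, cache, transfers):
    """Register on first use and keep the registration for reuse."""
    for key in transfers:
        if key not in cache:          # cache miss
            cache[key] = nic.register(*key)
        # ... the transfer itself would happen here ...


buffers = [(0x10000000 + i * 0x60000, 65536) for i in range(902)]
transfers = buffers * 10  # each buffer is sent 10 times

uncached_nic = CountingNic()
send_uncached(uncached_nic, transfers)

cached_nic = CountingNic()
cache = {}
send_cached(cached_nic, cache, transfers)
```

In this model the uncached path registers 9020 times and the cached
path 902 times. The cached path also leaves all 902 registrations live
at once, which is the kind of accumulation that could run into a
per-NIC limit.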
>
> You probably see the buffer as a single thing, not 902 little 64 kB
> chunks. Somehow we have to communicate this information to the
> message passing layer. Fortunately you are calling PVFS_sys_write
> just once with a single big buffer, not lots of times with
> individual chunks of the big buffer, so we have the information down
> in PVFS land. But, we have to figure out how to get this
> information down to the networking layer. The way the internal
> abstractions are set up, there's no place where the network can find
> out what buffer the user actually passed in. I'm going to look
> around and see if I can figure something out.
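[Editor's sketch] One way to picture the idea of telling the network
layer about the whole user buffer: if the full buffer were registered
once, each strip could be satisfied by a containment lookup instead of
its own registration. The Python below is a hypothetical model of such a
cache, assuming registrations never overlap. It does not describe how
PVFS2 or BMI works, and the hook for passing the user buffer down is
exactly the part the message says does not exist yet.

```python
import bisect


class RangeCache:
    """Toy registration cache with containment lookup.

    Assumption: registered ranges do not overlap one another.
    """

    def __init__(self):
        self.starts = []    # start addresses, kept sorted
        self.lengths = []   # lengths, parallel to self.starts
        self.misses = 0

    def find(self, addr, length):
        """Return the index of a range containing [addr, addr+length)."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and addr + length <= self.starts[i] + self.lengths[i]:
            return i
        return None

    def register(self, addr, length):
        """Reuse a containing range, or record a new one on a miss."""
        if self.find(addr, length) is None:
            self.misses += 1
            i = bisect.bisect_left(self.starts, addr)
            self.starts.insert(i, addr)
            self.lengths.insert(i, length)


BASE = 0x10000000          # hypothetical user buffer address
BUF_LEN = 1024 * 1024      # 1M user buffer
STRIP = 64 * 1024          # 64K strips
strips = [(BASE + off, STRIP) for off in range(0, BUF_LEN, STRIP)]

# Strips registered one by one: every strip is a separate miss.
per_strip = RangeCache()
for addr, length in strips:
    per_strip.register(addr, length)

# Whole buffer registered first: every strip is then a hit.
whole_buffer = RangeCache()
whole_buffer.register(BASE, BUF_LEN)
for addr, length in strips:
    whole_buffer.register(addr, length)
```

With these numbers the per-strip cache records 16 misses and the
whole-buffer cache records 1.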
>
> By the way, various groups keep rediscovering this problem but there
> are no real appealing fixes. When was the last time you saw anybody
> use MPI_Alloc_mem? :) We discovered it ourselves in the context of
> PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
> didn't quite complete the work needed to fully integrate it.
> (Wuj's Unifier framework (CCGrid04):
> http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
> )
>
> -- Pete