[Pvfs2-developers] libpvfs2 usage

Rob Ross rross at mcs.anl.gov
Thu Oct 19 13:00:15 EDT 2006


Hi Brett,

We need to get you and Kyle talking more :)! The 256K chunks are a 
result of the FlowBufferSizeBytes parameter, which defaults to 256K. 
Basically that value determines the size of the packets that are sent by 
BMI over the wire during I/O operations. Kyle has been playing with that 
value off and on and has seen significant improvements for larger values 
on the IB hardware.

That value can be set in the config files and is unrelated to the strip 
size.
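
To make the arithmetic concrete, here is a rough Python model (not PVFS code) of how a contiguous user buffer could be carved up. It assumes simple round-robin striping starting at server 0, and it assumes a flow packs whatever strips belong to one server into sends of at most FlowBufferSizeBytes. The names strips_for_server and sends_for_server are made up for illustration.

```python
KIB = 1 << 10
MIB = 1 << 20


def strips_for_server(buf_len, strip_size, num_servers, server):
    """Return (offset, size) pieces of a contiguous buffer that land on one
    server, assuming round-robin striping that starts at server 0."""
    pieces = []
    offset = server * strip_size
    while offset < buf_len:
        pieces.append((offset, min(strip_size, buf_len - offset)))
        offset += num_servers * strip_size
    return pieces


def sends_for_server(pieces, flow_buffer_size):
    """Group pieces into sends of at most flow_buffer_size bytes each.
    A piece larger than the flow buffer is split across several sends."""
    sends, current, used = [], [], 0
    for offset, size in pieces:
        while size > 0:
            take = min(size, flow_buffer_size - used)
            current.append((offset, take))
            offset += take
            size -= take
            used += take
            if used == flow_buffer_size:
                sends.append(current)
                current, used = [], 0
    if current:
        sends.append(current)
    return sends


# 1M strips, 6M buffer, 6 servers: one strip per server, sent in 256K chunks.
big = sends_for_server(strips_for_server(6 * MIB, MIB, 6, 0), 256 * KIB)

# 64K strips, 6 servers, assuming a 1M application buffer: three strips for
# server 0, spaced 6 * 64K apart, small enough to fit in one send.
small = sends_for_server(strips_for_server(MIB, 64 * KIB, 6, 0), 256 * KIB)
```

Under those assumptions the 1M case gives four single-element sends of 262144 bytes, and the 64K case gives one send with three 65536-byte elements (196608 bytes total) whose offsets are 0x60000 apart, which is the same spacing as the element addresses in your second log excerpt.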

Do we know how many slots are available for memory registrations on that 
card? It would be helpful in avoiding these situations (if possible).

Regards,

Rob

Brett Bode wrote:
> It appears that the memory registration code works best if the buffer 
> size is an integer multiple of the strip_size. Making sure that is the 
> case has allowed me to run on our EHCA without exceeding its 
> registration limit. However I am not seeing much of any speedup by going 
> to larger strip and/or buffer sizes. One odd thing that I have noticed 
> in a new debug run is that even with a strip_size of 1M and a 6M buffer 
> (we have 6 servers) it appears data is sent in 256K chunks. I have 
> posted a new log for this run at 
> http://www.scl.ameslab.gov/~brett/pvfs2-1M.log. The sends look like:
> [D 10:51:33.141315] test_rq: rq 0x15501580 completed 24 from da8:3336.
> [D 10:51:33.141343] BMI_testcontext completing: 91
> [D 10:51:33.141415] BMI_post_send_list: addr: 11, count: 1, total_size: 262144, tag: 20
> [D 10:51:33.141443]    element 0: offset: 0x14ea1688, size: 262144
> [D 10:51:33.141470] BMI_ib_post_send_list: listlen 1 tag 20.
> [D 10:51:33.141505] memcache_register: miss [0] 0x14ea1688 len 262144.
> [D 10:51:33.141580] BMI_post_recv: addr: 11, offset: 0x15501730, size: 24, tag:
> 
> So the first question is: why am I sending in 256K chunks instead of the 
> full 1M strip size? In addition, in the default config with a 64K strip 
> size I see:
> 
> [D 10:38:29.272429] BMI_post_send_list: addr: 12, count: 3, total_size: 196608, tag: 21
> [D 10:38:29.272465]    element 0: offset: 0x14eb16a8, size: 65536
> [D 10:38:29.272491]    element 1: offset: 0x14f116a8, size: 65536
> [D 10:38:29.272517]    element 2: offset: 0x14f716a8, size: 65536
> [D 10:38:29.272544] BMI_ib_post_send_list: listlen 3 tag 21.
> [D 10:38:29.272578] memcache_register: miss [0] 0x14eb16a8 len 65536.
> [D 10:38:29.272763] memcache_register: miss [1] 0x14f116a8 len 65536.
> [D 10:38:29.272941] memcache_register: miss [2] 0x14f716a8 len 65536.
> [D 10:38:29.273136] BMI_post_recv: addr: 12, offset: 0x15024eb0, size: 24, tag:
> 
> I assume this indicates some level of parallel sends for the 64K 
> chunks. I don't see that with the larger stripes. Perhaps that loss 
> of parallelization is why I don't see much speedup with larger stripes?
> 
> Thanks,
> Brett
> On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:
> 
>> brett at scl.ameslab.gov wrote on Wed, 18 Oct 2006 11:04 -0500:
>>> Ok I got some debugging output finally by hardcoding in the gossip...
>>> calls. I have posted a log file at:
>>> http://www.scl.ameslab.gov/~brett/pvfs2.log
>>>
>>> The app in this case is using a 1MB IO buffer to write a ~62MB file
>>> once and then read it back in several times. The pvfs2 debug output
>>> is mixed in with the application output, but I think it's still not
>>> too hard to follow.
>>
>> Thanks, that's very helpful.  Here's a quick summary of what's going
>> on in the memory caching.
>>
>> Your app runs 11 seconds.  For the first 2.5 sec, the cache misses
>> 902 times, on 902 different buffer addresses, most all 64 kB.  For
>> the remainder of the runtime there are no misses.  All of these
>> previous misses generated cached registrations which are then
>> reused.  Most are reused exactly 10 times, but three are used
>> hundreds of times, perhaps control buffers used internally by pvfs.
>> They are 64 kB probably because that is the stripe size you're
>> using for transfers.
>>
>> I think the time we're seeing in the memcache_* functions must be
>> due to the length of this list of registrations.  That's a lot of
>> pointer chasing to get down to the on-average 451st element.
>> One thing I can do is put in a more reasonable data structure, but
>> it will still be a time-consuming function.
>>
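
As a rough illustration of the lookup cost Pete describes, here is a small Python sketch. It is not the BMI memcache code: ListCache, DictCache and average_hit_cost are hypothetical names, and the only figures taken from the thread are the 902 registrations and the 64 kB size.

```python
class ListCache:
    """Registrations kept in a plain list and searched front to back."""

    def __init__(self):
        self.entries = []      # (addr, length) pairs in insertion order
        self.comparisons = 0   # entries examined across all lookups

    def register(self, addr, length):
        for entry in self.entries:
            self.comparisons += 1
            if entry == (addr, length):
                return "hit"
        self.entries.append((addr, length))
        return "miss"


class DictCache:
    """Same interface, but keyed lookup instead of a list walk."""

    def __init__(self):
        self.entries = {}

    def register(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            return "hit"
        self.entries[key] = True
        return "miss"


def average_hit_cost(num_buffers, chunk=64 * 1024):
    """Fill a ListCache, then reuse every buffer once and report the
    average number of entries examined per hit."""
    cache = ListCache()
    for i in range(num_buffers):
        cache.register(i * chunk, chunk)
    cache.comparisons = 0
    for i in range(num_buffers):
        cache.register(i * chunk, chunk)
    return cache.comparisons / num_buffers


cost = average_hit_cost(902)
```

With 902 entries, each reused once, the list walk examines about 451 entries per hit on average, which lines up with the figure quoted above. A keyed structure removes the walk but, as Pete notes, does nothing about the number of separate registrations the NIC has to hold.
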
>>> It appears to me that despite always being passed the same buffer, the
>>> memcache_register function almost always misses for the write. Note
>>> that the output for a run on one of the EHCAs is very similar. On
>>> the EHCA I can write up to about 220MB before it dies with the "too
>>> much memory registered" error.
>>
>> This also probably explains your EHCA problem.  Those registrations
>> show up separately on the NIC, and maybe hit a limit there.
>>
>> The bigger problem is the same one seen by most applications that
>> use networks that require memory registration:  program semantics do
>> not require users to register memory but underlying hardware does,
>> thus something has to patch that gap.  If you reg/dereg around every
>> transfer, things are very slow.  Hence we go with caching in some
>> middle layer to fix this up.  The same is true for MPI as well.
>> (The Netpipe guys had a way to cause lots of damage by sending lots
>> of little buffers rather than one big one, I recall.)
>>
>> You probably see the buffer as a single thing, not 902 little 64 kB
>> chunks.  Somehow we have to communicate this information to the
>> message passing layer.  Fortunately you are calling PVFS_sys_write
>> just once with a single big buffer, not lots of times with
>> individual chunks of the big buffer, so we have the information down
>> in PVFS land.  But, we have to figure out how to get this
>> information down to the networking layer.  The way the internal
>> abstractions are set up, there's no place where the network can find
>> out what buffer the user actually passed in.  I'm going to look
>> around and see if I can figure something out.
>>
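
One way to picture what Pete is after is a cache that answers "is this chunk inside something already registered?" rather than "have I seen exactly this address and length?". The Python sketch below is only a model under stated assumptions: RangeCache is a hypothetical name, the regions handed to it are assumed not to overlap (it does no merging), and the contiguous 902 x 64 kB layout is invented for the example.

```python
import bisect


class RangeCache:
    """Registered regions stored as sorted (start, end) intervals.
    Assumes callers never register overlapping regions."""

    def __init__(self):
        self.starts = []
        self.ends = []
        self.misses = 0   # each miss stands for one new NIC registration

    def covers(self, addr, length):
        # Find the last region starting at or before addr, then check
        # that the requested range ends inside it.
        i = bisect.bisect_right(self.starts, addr) - 1
        return i >= 0 and addr + length <= self.ends[i]

    def register(self, addr, length):
        if self.covers(addr, length):
            return "hit"
        self.misses += 1
        i = bisect.bisect_left(self.starts, addr)
        self.starts.insert(i, addr)
        self.ends.insert(i, addr + length)
        return "miss"


BASE, CHUNK, COUNT = 0x10000000, 64 * 1024, 902

# Chunk-at-a-time registration: every chunk is a separate miss.
per_chunk = RangeCache()
for i in range(COUNT):
    per_chunk.register(BASE + i * CHUNK, CHUNK)

# Whole user buffer registered once up front: every chunk is then a hit.
whole = RangeCache()
whole.register(BASE, COUNT * CHUNK)
results = [whole.register(BASE + i * CHUNK, CHUNK) for i in range(COUNT)]
```

In this model the per-chunk cache ends up with 902 registrations while the whole-buffer cache holds one. The catch is the one described above: the networking layer would have to learn the extent of the user's buffer from somewhere.
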
>> By the way, various groups keep rediscovering this problem but there
>> are no real appealing fixes.  When was the last time you saw anybody
>> use MPI_Alloc_mem?  :)  We discovered it ourselves in the context of
>> PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
>> didn't quite complete the work needed to fully integrate it.
>> (Wuj's Unifier framework (CCGrid04):
>>     http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
>> )
>>
>>         -- Pete
> 

