[Pvfs2-developers] libpvfs2 usage

Brett Bode brett at scl.ameslab.gov
Wed Oct 18 14:15:51 EDT 2006


Yes, I think you are correct about the memcache usage. The good news
is that increasing the strip size seems to have resolved the memcache
miss issue: setting the strip_size to the same value as the buffer
size (1MB) almost completely eliminated the misses. Indeed, I can now
run the bigger run on our EHCA without blowing up. I suspect that the
buffer size needs to be evenly divisible by the strip_size to avoid
the memory registration issue. However, this setting does not appear
to give great performance, so it looks like I will need to play with
the buffer and stripe sizes some more to get good performance.
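
To illustrate the arithmetic behind the divisibility guess above, here is a minimal sketch in C. The helper names pieces_per_buffer and divides_evenly are made up for illustration and are not part of libpvfs2. The sketch only counts how many strip-sized pieces a buffer is cut into, assuming the transfer starts on a strip boundary and each piece ends up registered separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PVFS call): number of strip-sized pieces a
 * buffer is cut into, assuming the transfer starts on a strip boundary. */
size_t pieces_per_buffer(size_t buf_size, size_t strip_size)
{
    assert(strip_size > 0);
    return (buf_size + strip_size - 1) / strip_size;   /* round up */
}

/* Nonzero when the buffer is a whole number of strips (no short tail piece). */
int divides_evenly(size_t buf_size, size_t strip_size)
{
    assert(strip_size > 0);
    return buf_size % strip_size == 0;
}
```

With a 1 MB buffer (1048576 bytes), 64 kB strips (65536) give 16 pieces and a 1 MB strip gives 1 piece. A 96 kB strip (98304) would give 11 pieces, the last one short, which is the uneven case suspected above.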

Brett
On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:

> brett at scl.ameslab.gov wrote on Wed, 18 Oct 2006 11:04 -0500:
>> Ok I got some debugging output finally by hardcoding in the gossip...
>> calls. I have posted a log file at:
>> http://www.scl.ameslab.gov/~brett/pvfs2.log
>>
>> The app in this case is using a 1MB IO buffer to write a ~62MB file
>> once and then read it back in several times. The pvfs2 debug output
>> is mixed in with the application output, but I think it's still not
>> too hard to follow.
>
> Thanks, that's very helpful.  Here's a quick summary of what's going
> on in the memory caching.
>
> Your app runs 11 seconds.  For the first 2.5 sec, the cache misses
> 902 times, on 902 different buffer addresses, almost all 64 kB.  For
> the remainder of the runtime there are no misses.  All of these
> previous misses generated cached registrations which are then
> reused.  Most are reused exactly 10 times, but three are used
> hundreds of times, perhaps control buffers used internally by pvfs.
> The reason it is 64 kB is that that is probably the stripe size
> you're using for transfers.
>
> I think why we're seeing so much time in the memcache_* functions
> must be due to the length of this list of registrations.  That's a
> lot of pointer chasing to get down to the on-average 451st element.
> One thing I can do is put in a more reasonable data structure, but
> it will still be a time-consuming function.
>
>> It appears to me that despite always being passed the same buffer the
>> memcache_register function almost always misses for the write. Note
>> that the output for a run on one of the EHCAs is very similar. On
>> the EHCA I can write up to about 220MB before it dies with the too
>> much memory registered error.
>
> This also probably explains your EHCA problem.  Those registrations
> show up separately on the NIC, and maybe hit a limit there.
>
> The bigger problem is the same one seen by most applications that
> use networks that require memory registration:  program semantics do
> not require users to register memory but underlying hardware does,
> thus something has to patch that gap.  If you reg/dereg around every
> transfer, things are very slow.  Hence we go with caching in some
> middle layer to fix this up.  The same is true for MPI as well.
> (The Netpipe guys had a way to cause lots of damage by sending lots
> of little buffers rather than one big one, I recall.)
>
> You probably see the buffer as a single thing, not 902 little 64 kB
> chunks.  Somehow we have to communicate this information to the
> message passing layer.  Fortunately you are calling PVFS_sys_write
> just once with a single big buffer, not lots of times with
> individual chunks of the big buffer, so we have the information down
> in PVFS land.  But, we have to figure out how to get this
> information down to the networking layer.  The way the internal
> abstractions are set up, there's no place where the network can find
> out what buffer the user actually passed in.  I'm going to look
> around and see if I can figure something out.
>
> By the way, various groups keep rediscovering this problem but there
> are no real appealing fixes.  When was the last time you saw anybody
> use MPI_Alloc_mem?  :)  We discovered it ourselves in the context of
> PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
> didn't quite complete the work needed to fully integrate it.
> (Wuj's Unifier framework (CCGrid04):
>     http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
> )
>
> 		-- Pete
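
As a rough illustration of the registration caching described in the quoted message, here is a toy sketch in C. It is not the PVFS memcache code, which is not shown in this thread. All names here (reg_cache, cache_register, register_in_chunks) are hypothetical, the cache is a flat array scanned linearly, and it matches only exact (address, length) pairs, which is an assumption. It shows why registering a buffer chunk by chunk creates many cache entries, while registering the whole buffer creates one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_MAX 1024   /* arbitrary capacity for this toy */

/* One cached "registration": a (start address, length) pair. */
struct reg_entry {
    uintptr_t addr;
    size_t    len;
};

struct reg_cache {
    struct reg_entry entries[CACHE_MAX];
    size_t count;    /* entries in use */
    size_t hits;
    size_t misses;
};

/* Look for an exact (addr, len) match by linear scan; on a miss, record
 * a new entry.  Returns 1 on hit, 0 on miss, -1 if the toy cache is full. */
int cache_register(struct reg_cache *c, const void *buf, size_t len)
{
    uintptr_t addr = (uintptr_t)buf;
    for (size_t i = 0; i < c->count; i++) {
        if (c->entries[i].addr == addr && c->entries[i].len == len) {
            c->hits++;
            return 1;
        }
    }
    if (c->count == CACHE_MAX)
        return -1;
    c->entries[c->count].addr = addr;
    c->entries[c->count].len = len;
    c->count++;
    c->misses++;
    return 0;
}

/* Register a buffer in chunk-sized pieces, as a striped transfer might. */
void register_in_chunks(struct reg_cache *c, const char *buf,
                        size_t buf_len, size_t chunk)
{
    assert(chunk > 0);
    for (size_t off = 0; off < buf_len; off += chunk) {
        size_t n = (buf_len - off < chunk) ? buf_len - off : chunk;
        cache_register(c, buf + off, n);
    }
}
```

In this toy, passing a 1 MB buffer through register_in_chunks with 64 kB chunks records 16 misses on the first pass and 16 hits on the second, while a chunk size equal to the buffer size records a single miss. The linear scan also shows where the lookup cost comes from as the number of entries grows.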


