[Pvfs2-developers] libpvfs2 usage
Brett Bode
brett at scl.ameslab.gov
Wed Oct 18 14:15:51 EDT 2006
Yes, I think you are correct about the memcache usage. The good news
is that increasing the strip size seems to have resolved the memcache
miss issue: setting the strip_size to the same as the buffer size
(1MB) almost completely eliminated the misses. Indeed I
can now run the bigger run on our EHCA without blowing up. I suspect
that the buffer size needs to be evenly divisible by the strip_size
to avoid the memory registration issue. However, this does not appear
to give great performance. So it looks like I will need to play with
the buffer and stripe sizes some more to get good performance.
Brett
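
The divisibility hunch above can be pictured with a small toy model. It
assumes (this is not taken from the PVFS source) that a sequential write
is split at strip boundaries and that each resulting piece of the reused
buffer is registered separately, keyed by its (buffer offset, length).
The function names are hypothetical. It only illustrates the
divisibility idea and does not try to reproduce the logged behavior.

```python
KB = 1024
MB = 1024 * KB


def pieces(file_offset, buf_len, strip_size):
    """Split one write into (buffer_offset, length) pieces at strip boundaries."""
    out = []
    pos = file_offset
    end = file_offset + buf_len
    while pos < end:
        # First strip boundary strictly after pos.
        boundary = (pos // strip_size + 1) * strip_size
        stop = min(boundary, end)
        out.append((pos - file_offset, stop - pos))
        pos = stop
    return out


def distinct_pieces(buf_len, strip_size, n_writes):
    """Count distinct pieces seen when the same buffer is reused
    for n_writes back-to-back sequential writes (assumed access pattern)."""
    seen = set()
    for k in range(n_writes):
        seen.update(pieces(k * buf_len, buf_len, strip_size))
    return len(seen)


if __name__ == "__main__":
    for strip in (64 * KB, 96 * KB, 1 * MB):
        print(strip // KB, "kB strips:", distinct_pieces(1 * MB, strip, 62))
```

In this model, when the buffer size is a multiple of the strip size,
every write produces the same set of pieces, so the count stays fixed no
matter how many writes are issued. When it is not a multiple, the pieces
shift from one write to the next and the count keeps growing.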
On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:
> brett at scl.ameslab.gov wrote on Wed, 18 Oct 2006 11:04 -0500:
>> Ok I got some debugging output finally by hardcoding in the gossip...
>> calls. I have posted a log file at:
>> http://www.scl.ameslab.gov/~brett/pvfs2.log
>>
>> The app in this case is using a 1MB IO buffer to write a ~62MB file
>> once and then read it back in several times. The pvfs2 debug output
>> is mixed in with the application output, but I think it's still not
>> too hard to follow.
>
> Thanks, that's very helpful. Here's a quick summary of what's going
> on in the memory caching.
>
> Your app runs 11 seconds. For the first 2.5 sec, the cache misses
> 902 times, on 902 different buffer addresses, almost all 64 kB. For
> the remainder of the runtime there are no misses. All of these
> previous misses generated cached registrations which are then
> reused. Most are reused exactly 10 times, but three are used
> hundreds of times, perhaps control buffers used internally by pvfs.
> The chunks are probably 64 kB because that is the stripe size
> you're using for transfers.
>
> I think the reason we're seeing so much time in the memcache_*
> functions is the length of this list of registrations. That's a
> lot of pointer chasing to get down to the on-average 451st element.
> One thing I can do is put in a more reasonable data structure, but
> it will still be a time-consuming function.
>
>> It appears to me that despite always being passed the same buffer the
>> memcache_register function almost always misses for the write. Note
>> that the output for a run on one of the EHCAs is very similar. On
>> the EHCA I can write up to about 220MB before it dies with the "too
>> much memory registered" error.
>
> This also probably explains your EHCA problem. Those registrations
> show up separately on the NIC, and maybe hit a limit there.
>
> The bigger problem is the same one seen by most applications that
> use networks that require memory registration: program semantics do
> not require users to register memory but underlying hardware does,
> thus something has to patch that gap. If you reg/dereg around every
> transfer, things are very slow. Hence we go with caching in some
> middle layer to fix this up. The same is true for MPI as well.
> (The Netpipe guys had a way to cause lots of damage by sending lots
> of little buffers rather than one big one, I recall.)
>
> You probably see the buffer as a single thing, not 902 little 64 kB
> chunks. Somehow we have to communicate this information to the
> message passing layer. Fortunately you are calling PVFS_sys_write
> just once with a single big buffer, not lots of times with
> individual chunks of the big buffer, so we have the information down
> in PVFS land. But, we have to figure out how to get this
> information down to the networking layer. The way the internal
> abstractions are set up, there's no place where the network can find
> out what buffer the user actually passed in. I'm going to look
> around and see if I can figure something out.
>
> By the way, various groups keep rediscovering this problem but there
> are no real appealing fixes. When was the last time you saw anybody
> use MPI_Alloc_mem? :) We discovered it ourselves in the context of
> PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
> didn't quite complete the work needed to fully integrate it.
> (Wuj's Unifier framework (CCGrid04):
> http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
> )
>
> -- Pete
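
The caching behavior described in the quoted message can be sketched
with a toy registration cache. This is a rough illustration, not the
PVFS implementation: the class and function names, the base address, and
the containment-based hit rule are all assumptions made for the example.
It keeps registrations in a plain list, counts misses and list elements
examined, and compares registering 902 separate 64 kB chunks against
registering the whole user buffer once.

```python
class RegCache:
    """Toy registration cache: a plain list scanned front to back."""

    def __init__(self):
        self.entries = []  # (addr, length) of each registered region
        self.misses = 0    # lookups that had to create a new registration
        self.steps = 0     # list elements examined across all lookups

    def lookup_or_register(self, addr, length):
        for i, (base, size) in enumerate(self.entries):
            # Assumed hit rule: an existing region fully contains the request.
            if base <= addr and addr + length <= base + size:
                self.steps += i + 1
                return
        # Miss: the whole list was scanned, then a new entry is appended.
        self.steps += len(self.entries)
        self.misses += 1
        self.entries.append((addr, length))


def run(cache, base, n_chunks, chunk, passes):
    """Look up every chunk of a contiguous buffer, `passes` times over."""
    for _ in range(passes):
        for i in range(n_chunks):
            cache.lookup_or_register(base + i * chunk, chunk)


CHUNK = 64 * 1024
N = 902
BASE = 0x10000000  # arbitrary illustrative address

# Case 1: each 64 kB chunk becomes its own registration.
per_chunk = RegCache()
run(per_chunk, BASE, N, CHUNK, passes=1)
first_pass_misses = per_chunk.misses
steps_before = per_chunk.steps
run(per_chunk, BASE, N, CHUNK, passes=1)
avg_hit_depth = (per_chunk.steps - steps_before) / N

# Case 2: the user's whole buffer is registered once up front.
whole = RegCache()
whole.lookup_or_register(BASE, N * CHUNK)
run(whole, BASE, N, CHUNK, passes=2)

if __name__ == "__main__":
    print("per-chunk misses:", per_chunk.misses, "avg hit depth:", avg_hit_depth)
    print("whole-buffer registrations:", len(whole.entries))
```

With per-chunk registration, the first pass misses on every chunk and
later passes hit, but each hit walks about half of the list on average.
With a single registration covering the whole buffer, every chunk lookup
is satisfied by the first entry. A keyed structure (a hash table or
sorted tree) would shorten the lookups in the first case, but it would
not reduce the number of separate registrations.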