[Pvfs2-developers] bmi questions

Phil Carns pcarns at wastedcycles.org
Fri Aug 18 10:27:51 EDT 2006


Sam Lang wrote:
> 
> On Aug 17, 2006, at 7:49 PM, Pete Wyckoff wrote:
> 
>> slang at mcs.anl.gov wrote on Thu, 17 Aug 2006 18:14 -0500:
>>
>>> * BMI memory allocation.  Do we place any restrictions on when or how
>>> frequently BMI_memalloc is called?  In the pvfs code, we always call
>>> BMI_memalloc for a post_send or post_recv.  Would it be possible to
>>> avoid the malloc on the client for a write and just use the user
>>> buffer?  Or should we mandate that calls to post_send and post_recv
>>> always pass in a pointer from BMI_memalloc?  (as a side note, if we
>>> make that mandate, maybe we should have a BMI_buffer type that
>>> memalloc returns and post_send/post_recv accept).
>>
>>
>> Both bmi_ib and bmi_gm define the BMI memalloc method to do
>> something other than simply malloc().  In the IB case, it pins the
>> memory early, and never unpins it until the corresponding
>> BMI_memfree() happens.  This is better than letting BMI do the
>> pinning explicitly, as it moves some of the messaging work out of
>> the critical path, if you can arrange to alloc/free before you do
>> send/recv.
>>
>> Note that these alloc routines only do something special if the
>> buffer is big enough to be "worth it" (8 kB for IB).
>>
>> There are no restrictions on how frequently you can call these things.
>> Each pinned memory region has some overhead in terms of in-pvfs data
>> structures, in-kernel data structures, and on-NIC data structures.
>> Ideally we'd try to limit the growth of these things and force old
>> entries to be freed, but in practice they mostly just grow and it's
>> not a big problem (unless you have lots of pvfs apps on a single
>> box, for instance).
>>
>> You can certainly avoid the malloc and use the user buffer when you
>> have one instead.  I think this is the common case for MPI-IO
>> operations.  Point out what case you're talking about and I'll take
>> a look.
> 
> 
> It looks like the mem_to_bmi code (client write) in flow always does a
> memalloc for the intermediate buffer and then copies the user buffer
> into that.

This is actually a corner case that you are looking at, not the default 
behavior.  There are two different buffer handling approaches in this 
function:

/* was MAX_REGIONS enough to satisfy this step? */
if(!PINT_REQUEST_DONE(flow_data->parent->file_req_state) &&
    q_item->result_chain.result.bytes < flow_data->parent->buffer_size)
{
     /* create an intermediate buffer */

<.... code snipped - this is where the BMI_memalloc() occurs >

}
else
{
     /* normal case */

<.... code snipped - no BMI_memalloc() occurs here, and the existing 
buffer is used>

}

In the case where an intermediate buffer is used, we have detected that 
the memory regions being accessed are so discontiguous that it would 
take more than MAX_REGIONS (64) offsets and sizes to represent them in 
this iteration (up to BUFFER_SIZE, normally 256K, per iteration).  Rather 
than build an arbitrarily long offset and size list to get to the 256K 
total that we want to transmit, this is the cutoff point at which the 
flow code gives up, makes a contiguous intermediate buffer, and copies 
everything into that.  In all other cases we use the existing buffers 
rather than allocating something new.
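
The decision described above can be sketched roughly as follows.  This is 
not the actual flow code: struct region, needs_intermediate_buffer() and 
pack_regions() are hypothetical names made up for illustration, and plain 
memcpy() into a caller-supplied buffer stands in for whatever the real code 
does with the BMI_memalloc()'d buffer.  Only the MAX_REGIONS and BUFFER_SIZE 
values and the shape of the test come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REGIONS 64             /* region list limit mentioned above */
#define BUFFER_SIZE (256 * 1024)   /* usual per-iteration size mentioned above */

/* Hypothetical stand-in for one (offset, size) pair of user memory. */
struct region {
    const char *ptr;
    size_t len;
};

/*
 * Mirrors the shape of the test in the snippet above: if the request is
 * not finished and the region list filled up before BUFFER_SIZE bytes
 * were gathered, fall back to an intermediate buffer.
 */
static int needs_intermediate_buffer(int request_done, size_t bytes_gathered)
{
    return !request_done && bytes_gathered < BUFFER_SIZE;
}

/*
 * Copy discontiguous regions into one contiguous buffer, stopping once
 * cap bytes have been packed.  Returns the number of bytes copied.
 */
static size_t pack_regions(char *dst, size_t cap,
                           const struct region *regions, size_t count)
{
    size_t used = 0;

    assert(dst != NULL);
    for (size_t i = 0; i < count && used < cap; i++) {
        size_t n = regions[i].len;
        if (n > cap - used)
            n = cap - used;        /* clip the last region to fit */
        memcpy(dst + used, regions[i].ptr, n);
        used += n;
    }
    return used;
}
```

In the normal case needs_intermediate_buffer() would return 0 and the region 
list would be handed to BMI as is; the copy in pack_regions() only corresponds 
to the corner case.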

> On reads (bmi_to_mem), flow does use the client's buffer,
> so I guess that's a case that doesn't do memalloc.  I wonder if the
> copy on a client write could be avoided as well though.

We do try to avoid the copy on write if possible, although the choice of 
the cutoff point where we give up on list operations and start copying 
is kind of arbitrary.  I don't think anyone has ever tested to see if 
that value makes sense.  Modifying it requires some synchronization with 
BMI and Trove as well, though, to make sure that they can also handle 
list I/O of up to MAX_REGIONS without breaking the lists apart.

-Phil

