[Pvfs2-developers] Re: noncontig-test

Sam Lang slang at mcs.anl.gov
Wed Aug 8 19:47:41 EDT 2007


Hi All,

Sorry for not being around earlier to participate in this  
discussion.  I agree that the bits of code in  
io_find_target_datafiles are nasty and I'll be sure to clean that up,  
but the cause of this bug isn't with small IO.  The check total_bytes  
<= max_unexp_payload will still do the right thing (not enable small  
IO) even if max_unexp_payload is negative.  In fact, the problem  
occurs because the noncontig request makes the normal IO request  
larger than 4K, and when the sys-io state machine tries to post the  
unexpected send, the BMI mx layer returns EINVAL because the
request is larger than its specified unexpected size (see mx.c:1391).
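
To illustrate why the small IO check is not the culprit, here is a minimal sketch of that comparison. It assumes both values are signed 64-bit integers and that total_bytes is never negative; the function name use_small_io is hypothetical and is not the actual code in io_find_target_datafiles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the small IO eligibility test.
 * Assumption: both operands are signed, so a negative
 * max_unexp_payload simply makes the comparison false. */
static int use_small_io(int64_t total_bytes, int64_t max_unexp_payload)
{
    assert(total_bytes >= 0);
    return total_bytes <= max_unexp_payload;
}
```

Under those assumptions, a negative max_unexp_payload can never be greater than or equal to total_bytes, so small IO stays disabled and the request goes down the normal IO path.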

We do the same thing in other BMI methods (gm and ib), but the  
unexpected limits there are bigger (8K for ib, 16K for gm, 16K for  
tcp), and so we've never actually hit them with unexpected requests.   
With a large indexed request, I think we would see the same errors  
with ib and gm, unless we hit the limit of max request segments  
first.  Each individual segment of an indexed request takes 80 bytes,  
so we would need to have an indexed request with about 100 segments  
before hitting the max for ib, and somewhere around 200 for tcp and  
gm.  The limit on segments for a request is hardcoded to 100 right  
now.
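
The arithmetic behind those estimates can be sketched as follows. The 80-byte figure comes from the paragraph above; the function name is hypothetical, and the calculation ignores whatever fixed overhead the rest of the request carries, so the real thresholds would be somewhat lower.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of indexed-request segments whose encoding alone
 * fits within an unexpected-message limit.
 * Assumption: every segment encodes to the same number of bytes and
 * the fixed part of the request is not counted. */
static size_t max_segments_within(size_t unexpected_limit, size_t segment_bytes)
{
    assert(segment_bytes > 0);
    return unexpected_limit / segment_bytes;
}
```

With 80 bytes per segment this gives 51 segments for a 4K limit (mx), 102 for 8K (ib), and 204 for 16K (gm and tcp), which matches the rough figures of 100 and 200 above.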

At this point, it seems like the best fix is the one Scott chose:  
just increase the max unexpected size for MX.  The alternative is to  
split up unexpected requests if they're above a certain size into  
multiple unexpecteds, and join them on the server.  Messy and a lot  
of work.  If the unexpected message size isn't card specific, could  
we make it something like 32K or 64K?  Are there drawbacks to making  
it that big?

Also, should we increase the limit of request segments allowed?  It  
might be inefficient for a user to create an MPI indexed datatype with  
that many elements, but there are users that will probably do it  
anyway.  Alternatively, we could consider more efficiently encoding  
each request segment in PVFS.

As an aside, the other methods return -EMSGSIZE, whereas mx returns  
EINVAL, which may have made this harder to debug.
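
For reference, a size check that reports the more descriptive error might look like the sketch below. The function name and signature are hypothetical and are not taken from any BMI method; only the -EMSGSIZE convention comes from the behavior described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical check run before posting an unexpected send.
 * Returns 0 if the message fits within the limit, or -EMSGSIZE
 * if it is larger than the unexpected-message limit. */
static int check_unexpected_size(size_t size, size_t unexpected_limit)
{
    assert(unexpected_limit > 0);
    if (size > unexpected_limit)
        return -EMSGSIZE;
    return 0;
}
```

Returning -EMSGSIZE rather than EINVAL points the caller at the message size instead of at a generically invalid argument.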

-sam

On Aug 7, 2007, at 3:21 PM, Pete Wyckoff wrote:

> atchley at myri.com wrote on Tue, 07 Aug 2007 15:21 -0400:
>> I assumed that small_io_size is fixed in PVFS2 and was greater than 4
>> KB, which is why I volunteered to change bmi_mx. I chose 4 KB for bmi_mx
>> simply because that was the value I used in Lustre (kernel page
>> size). I am not wedded to 4 KB.
>
> Okay, good reasoning.  We'll let Sam tell us what he thinks.  He did
> the small io work.  I can't think of a reason why any device could
> not support a minimum of 8k, like you say, if that would make more
> sense for the small io implementation.
>
> 		-- Pete



