[Pvfs2-developers] Re: noncontig-test

Scott Atchley atchley at myri.com
Wed Aug 8 21:09:37 EDT 2007


Hi Sam,

Welcome back!

I copied EINVAL from ib.c. I can change it to return BMI_EMSGSIZE  
since it is more specific.
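
For illustration, the kind of size check being discussed might look roughly like the sketch below. The function and macro names are hypothetical (they are not copied from mx.c), the 4 KB limit is the bmi_mx value mentioned in this thread, and plain errno EMSGSIZE stands in for BMI_EMSGSIZE:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical name; 4 KB is the bmi_mx unexpected limit from this thread. */
#define SKETCH_UNEXP_MAX ((size_t)4096)

/*
 * Hypothetical pre-post check for an unexpected send.
 * Returns 0 if the message fits, or -EMSGSIZE if it is too large.
 * Plain errno EMSGSIZE is an assumption standing in for BMI_EMSGSIZE.
 */
int sketch_check_unexpected_size(size_t total_bytes)
{
    if (total_bytes > SKETCH_UNEXP_MAX) {
        return -EMSGSIZE;   /* names the actual problem, unlike -EINVAL */
    }
    return 0;
}
```

With this shape, a caller that sees -EMSGSIZE knows the request outgrew the unexpected limit rather than having to guess which argument was invalid.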

The trade-off with a 32-64 KB unexpected size is memory footprint. To  
avoid mallocing a buffer for each incoming unexpected message, I pre-alloc  
100 rx structs on the server to catch initial connect messages. For  
each peer that connects, I alloc another 20 rx structs. For each rx  
struct, I alloc two buffers of unexpected size (so I can repost the  
rx after handing the first buffer off in BMI_mx_testunexpected()).  
This behavior can be changed but it really helps reduce latency.
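
As a rough way to size that footprint, the counts above can be plugged into a small helper. This is only a sketch: the helper and macro names are hypothetical, and it counts only the unexpected buffers, not the rx structs themselves or any other per-peer state:

```c
#include <stddef.h>

/* Counts taken from the description above; the names are hypothetical. */
#define SKETCH_RX_INITIAL   100  /* rx structs pre-allocated on the server */
#define SKETCH_RX_PER_PEER   20  /* extra rx structs per connected peer */
#define SKETCH_BUFS_PER_RX    2  /* buffers of unexpected size per rx */

/* Bytes of unexpected-message buffers held by one server. */
size_t sketch_unexp_footprint(size_t num_peers, size_t unexp_size)
{
    size_t rx_count = SKETCH_RX_INITIAL + SKETCH_RX_PER_PEER * num_peers;
    return rx_count * SKETCH_BUFS_PER_RX * unexp_size;
}
```

For example, with 100 connected peers this gives 34,406,400 bytes (about 33 MiB) at 8 KB and 275,251,200 bytes (about 262 MiB) at 64 KB.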

Also, since 32 KB is the starting size for rendezvous messages in MX,  
I would prefer to use 8 or 16 KB so that the unexpected message is  
actually sent eagerly.

As for number of segments, MX will accept up to 256. For messages  
less than 32 KB, there is not a penalty since eager messages are  
buffered before sending on the wire. For messages over 32 KB using  
more than one segment, MX will copy them into a contiguous buffer  
before sending, which will greatly reduce throughput.
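
The rules in the last two paragraphs can be summarized as a small classifier. This is a sketch of the behavior as described here, not MX code: every name is hypothetical, and treating a message of exactly 32 KB as rendezvous is an assumption based on 32 KB being the starting size:

```c
#include <stddef.h>

/* Values from the description above; the names are hypothetical. */
#define SKETCH_RNDV_START    ((size_t)32 * 1024) /* rendezvous starts here */
#define SKETCH_MAX_SEGMENTS  256                 /* segments MX will accept */

enum sketch_send_path {
    SKETCH_INVALID_SEGMENTS, /* fewer than 1 or more than 256 segments */
    SKETCH_EAGER,            /* buffered before the wire; no segment penalty */
    SKETCH_RNDV_DIRECT,      /* rendezvous, single contiguous segment */
    SKETCH_RNDV_WITH_COPY    /* rendezvous, copied into a contiguous buffer */
};

enum sketch_send_path sketch_classify_send(size_t total_bytes, int num_segments)
{
    if (num_segments < 1 || num_segments > SKETCH_MAX_SEGMENTS) {
        return SKETCH_INVALID_SEGMENTS;
    }
    if (total_bytes < SKETCH_RNDV_START) {
        return SKETCH_EAGER;
    }
    /* Assumption: exactly 32 KB already counts as rendezvous. */
    return (num_segments == 1) ? SKETCH_RNDV_DIRECT : SKETCH_RNDV_WITH_COPY;
}
```

Under these rules an 8 or 16 KB unexpected message is eager regardless of segment count, while a 64 KB message in four segments takes the slower copy path.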

Scott

On Aug 8, 2007, at 7:47 PM, Sam Lang wrote:

>
> Hi All,
>
> Sorry for not being around earlier to participate in this  
> discussion.  I agree that the bits of code in  
> io_find_target_datafiles are nasty and I'll be sure to clean that  
> up, but the cause of this bug isn't with small IO.  The check  
> total_bytes <= max_unexp_payload will still do the right thing (not  
> enable small IO) even if max_unexp_payload is negative.  In fact,  
> the problem occurs because the noncontig request makes the normal  
> IO request larger than 4K, and when the sys-io state machine tries  
> to post the unexpected send, the BMI mx layer returns EINVAL  
> because the request is larger than its specified unexpected size  
> (see mx.c:1391).
>
> We do the same thing in other BMI methods (gm and ib), but the  
> unexpected limits there are bigger (8K for ib, 16K for gm, 16K for  
> tcp), and so we've never actually hit them with unexpected  
> requests.  With a large indexed request, I think we would see the  
> same errors with ib and gm, unless we hit the limit of max request  
> segments first.  Each individual segment of an indexed request takes  
> 80 bytes, so we would need to have an indexed request with about  
> 100 segments before hitting the max for ib, and somewhere around  
> 200 for tcp and gm.  The limit on segments for a request is  
> hardcoded to 100 right now.
>
> At this point, it seems like the best fix is the one Scott chose:  
> just increase the max unexpected size for MX.  The alternative is  
> to split up unexpected requests if they're above a certain size  
> into multiple unexpecteds, and join them on the server.  Messy and  
> a lot of work.  If the unexpected message size isn't card specific,  
> could we make it something like 32K or 64K?  Are there drawbacks to  
> making it that big?
>
> Also, should we increase the limit of request segments allowed?  It  
> might be inefficient for a user to create an MPI indexed datatype  
> with that many elements, but there are users that will probably do  
> it anyway.  Alternatively, we could consider more efficiently  
> encoding each request segment in PVFS.
>
> As an aside, the other methods return -EMSGSIZE, while mx is  
> returning EINVAL, which may have made this harder to debug.
>
> -sam
>
> On Aug 7, 2007, at 3:21 PM, Pete Wyckoff wrote:
>
>> atchley at myri.com wrote on Tue, 07 Aug 2007 15:21 -0400:
>>> I assumed that small_io_size is fixed in PVFS2 and was greater than
>>> 4 KB, which is why I volunteered to change bmi_mx. I chose 4 KB for
>>> bmi_mx simply because that was the value I used in Lustre (kernel
>>> page size). I am not wedded to 4 KB.
>>
>> Okay, good reasoning.  We'll let Sam tell us what he thinks.  He did
>> the small io work.  I can't think of a reason why any device could
>> not support a minimum of 8k, like you say, if that would make more
>> sense for the small io implementation.
>>
>> 		-- Pete
>>
>
>



