[Pvfs2-developers] Re: noncontig-test

Sam Lang slang at mcs.anl.gov
Wed Aug 8 22:18:29 EDT 2007


On Aug 8, 2007, at 8:09 PM, Scott Atchley wrote:

> Hi Sam,
>
> Welcome back!

Hi Scott,

Thanks!

>
> I copied EINVAL from ib.c. I can change it to return BMI_EMSGSIZE  
> since it is more specific.

Ah ok.  EMSGSIZE might help us know what the problem is if/when it  
happens again.
>
> The trade-off to 32-64 KB unexpected size is memory footprint. To  
> avoid mallocing a buffer for each incoming unexpected, I pre-alloc  
> 100 rx structs on the server to catch initial connect messages. For  
> each peer that connects, I alloc another 20 rx structs. For each rx  
> struct, I alloc two buffers of unexpected size (so I can repost the  
> rx after handing the first buffer off in BMI_mx_testunexpected()).  
> This behavior can be changed but it really helps reduce latency.
>
> Also, since 32 KB is the starting size for rendezvous messages in  
> MX, I would prefer to use 8 or 16 KB so that the unexpected message  
> is actually sent eagerly.

Can we make it 16K?
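
For a rough sense of the footprint at each candidate size, here is a
minimal sketch using only the counts Scott gives above (100 rx structs up
front, 20 more per peer, two unexpected-size buffers per rx struct). The
function name, the parameter names, and the 50-peer figure are made up for
illustration, and the size of the rx struct itself is ignored.

```python
KB = 1024


def unexpected_buffer_bytes(unexp_size, peers, base_rx=100,
                            rx_per_peer=20, bufs_per_rx=2):
    """Bytes of pre-allocated unexpected buffers on a server.

    Counts only the buffers, not the rx structs that own them.
    """
    rx_structs = base_rx + rx_per_peer * peers
    return rx_structs * bufs_per_rx * unexp_size


if __name__ == "__main__":
    # 50 connected peers is an arbitrary example, not a measured value.
    for size_kb in (4, 8, 16, 32, 64):
        total = unexpected_buffer_bytes(size_kb * KB, peers=50)
        print(f"{size_kb:2d} KB unexpected size: "
              f"{total / (KB * KB):.1f} MB of buffers")
```

The cost grows linearly with both the unexpected size and the number of
connected peers, so going from 16 KB to 64 KB quadruples the buffer memory
for the same peer count.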

>
> As for number of segments, MX will accept up to 256. For messages  
> less than 32 KB, there is not a penalty since eager messages are  
> buffered before sending on the wire. For messages over 32 KB using  
> more than one segment, MX will copy them into a contiguous buffer  
> before sending, which will greatly reduce throughput.

Sorry - bad terminology there.  By 'segments', I meant base types of  
a PVFS datatype (called a PVFS Request).  They all get encoded into  
the same buffer.
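
As a rough check on the numbers in the earlier message quoted below, this
sketch divides each method's unexpected limit by the 80 bytes that one
request segment is said to take. The 4 KB figure for mx comes from Scott's
quoted message. The names are made up for illustration, and fixed_overhead
is a hypothetical knob for whatever header bytes the rest of the request
uses; it defaults to 0 because the thread does not give that size.

```python
SEGMENT_BYTES = 80  # per-segment encoding cost quoted in the thread

# Unexpected message limits mentioned in the thread, in bytes.
UNEXPECTED_LIMITS = {
    "mx": 4 * 1024,
    "ib": 8 * 1024,
    "gm": 16 * 1024,
    "tcp": 16 * 1024,
}


def max_segments(limit_bytes, segment_bytes=SEGMENT_BYTES, fixed_overhead=0):
    """Largest segment count whose encoding still fits in limit_bytes."""
    return max(0, (limit_bytes - fixed_overhead) // segment_bytes)


if __name__ == "__main__":
    for method, limit in UNEXPECTED_LIMITS.items():
        print(f"{method}: up to {max_segments(limit)} segments "
              f"in {limit} bytes")
```

With zero overhead this gives about 51 segments for mx, 102 for ib, and
204 for gm and tcp, which matches the "about 100" and "around 200"
estimates and shows why mx hits its limit first.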

-sam

>
> Scott
>
> On Aug 8, 2007, at 7:47 PM, Sam Lang wrote:
>
>>
>> Hi All,
>>
>> Sorry for not being around earlier to participate in this  
>> discussion.  I agree that the bits of code in  
>> io_find_target_datafiles are nasty and I'll be sure to clean that  
>> up, but the cause of this bug isn't with small IO.  The check  
>> total_bytes <= max_unexp_payload will still do the right thing  
>> (not enable small IO) even if max_unexp_payload is negative.  In  
>> fact, the problem occurs because the noncontig request makes the  
>> normal IO request larger than 4K, and when the sys-io state  
>> machine tries to post the unexpected send, the BMI mx layer  
>> returns EINVAL because the request is larger than its specified  
>> unexpected size (see mx.c:1391).
>>
>> We do the same thing in other BMI methods (gm and ib), but the  
>> unexpected limits there are bigger (8K for ib, 16K for gm, 16K for  
>> tcp), and so we've never actually hit them with unexpected  
>> requests.  With a large indexed request, I think we would see the  
>> same errors with ib and gm, unless we hit the limit of max request  
>> segments first.  Each individual segment of an indexed request  
>> takes 80 bytes, so we would need to have an indexed request with  
>> about 100 segments before hitting the max for ib, and somewhere  
>> around 200 for tcp and gm.  The limit of segments for a request  
>> is hardcoded to 100 right now.
>>
>> At this point, it seems like the best fix is the one Scott chose:  
>> just increase the max unexpected size for MX.  The alternative is  
>> to split up unexpected requests if they're above a certain size  
>> into multiple unexpecteds, and join them on the server.  Messy and  
>> a lot of work.  If the unexpected message size isn't card  
>> specific, could we make it something like 32K or 64K?  Are there  
>> drawbacks to making it that big?
>>
>> Also, should we increase the limit of request segments allowed?   
>> It might be inefficient for a user to create an MPI indexed  
>> datatype with that many elements, but there are users that will  
>> probably do it anyway.  Alternatively, we could consider more  
>> efficiently encoding each request segment in PVFS.
>>
>> As an aside, the other methods return -EMSGSIZE, while mx is  
>> returning EINVAL, which may have made this harder to debug.
>>
>> -sam
>>
>> On Aug 7, 2007, at 3:21 PM, Pete Wyckoff wrote:
>>
>>> atchley at myri.com wrote on Tue, 07 Aug 2007 15:21 -0400:
>>>> I assumed that small_io_size is fixed in PVFS2 and was greater
>>>> than 4 KB, which is why I volunteered to change bmi_mx. I chose
>>>> 4 KB for bmi_mx simply because that was the value I used in
>>>> Lustre (kernel page size). I am not wedded to 4 KB.
>>>
>>> Okay, good reasoning.  We'll let Sam tell us what he thinks.  He did
>>> the small io work.  I can't think of a reason why any device could
>>> not support a minimum of 8k, like you say, if that would make more
>>> sense for the small io implementation.
>>>
>>> 		-- Pete
>>>
>>
>>
>
