[Pvfs2-developers] Re: noncontig-test
Sam Lang
slang at mcs.anl.gov
Wed Aug 8 22:18:29 EDT 2007
On Aug 8, 2007, at 8:09 PM, Scott Atchley wrote:
> Hi Sam,
>
> Welcome back!
Hi Scott,
Thanks!
>
> I copied EINVAL from ib.c. I can change it to return BMI_EMSGSIZE
> since it is more specific.
Ah ok. EMSGSIZE might help us know what the problem is if/when it
happens again.
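
For readers following along, here is a minimal sketch of the kind of size check being discussed. It is not the actual bmi_mx code: the function name check_unexpected_size and the UNEXPECTED_LIMIT_BYTES constant are made up for illustration, and it uses the plain errno value EMSGSIZE rather than any BMI-specific error code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical limit for illustration; the thread discusses values
 * ranging from 4 KB up to 64 KB. */
#define UNEXPECTED_LIMIT_BYTES (4 * 1024)

/* Hypothetical helper, not taken from mx.c. Returns 0 if the message
 * fits in an unexpected buffer, or -EMSGSIZE if it is too large.
 * Returning -EMSGSIZE instead of -EINVAL tells the caller that the
 * size, rather than some other argument, was the problem. */
int check_unexpected_size(size_t total_bytes)
{
    if (total_bytes > UNEXPECTED_LIMIT_BYTES) {
        return -EMSGSIZE;
    }
    return 0;
}
```

A caller could pass the negated return value to strerror() to get a "Message too long" style diagnostic, which points at the size limit directly.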
>
> The trade-off to 32-64 KB unexpected size is memory footprint. To
> avoid mallocing a buffer for each incoming unexpected, I pre-alloc
> 100 rx structs on the server to catch initial connect messages. For
> each peer that connects, I alloc another 20 rx structs. For each rx
> struct, I alloc two buffers of unexpected size (so I can repost the
> rx after handing the first buffer off in BMI_mx_testunexpected()).
> This behavior can be changed but it really helps reduce latency.
>
> Also, since 32 KB is the starting size for rendezvous messages in
> MX, I would prefer to use 8 or 16 KB so that the unexpected message
> is actually sent eagerly.
Can we make it 16K?
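
To put rough numbers on the footprint trade-off, here is a back-of-the-envelope sketch based only on the counts Scott gives above (100 initial rx structs, 20 more per peer, two buffers per rx struct). The function name is hypothetical, and the estimate ignores the size of the rx structs themselves and any allocator overhead.

```c
#include <stddef.h>

#define INITIAL_RX   100  /* rx structs pre-allocated for initial connects */
#define RX_PER_PEER   20  /* additional rx structs per connected peer */
#define BUFS_PER_RX    2  /* two unexpected-size buffers per rx struct */

/* Hypothetical helper: bytes of unexpected-message buffer space on a
 * server with `peers` connected peers. Counts buffers only. */
size_t unexpected_buffer_bytes(size_t peers, size_t unexpected_size)
{
    size_t rx_count = INITIAL_RX + RX_PER_PEER * peers;
    return rx_count * BUFS_PER_RX * unexpected_size;
}
```

Taking 100 connected peers as an arbitrary example, this gives about 16.4 MiB of buffers at 4 KB, about 65.6 MiB at 16 KB, and about 262.5 MiB at 64 KB.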
>
> As for number of segments, MX will accept up to 256. For messages
> less than 32 KB, there is not a penalty since eager messages are
> buffered before sending on the wire. For messages over 32 KB using
> more than one segment, MX will copy them into a contiguous buffer
> before sending, which will greatly reduce throughput.
Sorry - bad terminology there. By 'segments', I meant base types of
a PVFS datatype (called a PVFS Request). They all get encoded into
the same buffer.
-sam
>
> Scott
>
> On Aug 8, 2007, at 7:47 PM, Sam Lang wrote:
>
>>
>> Hi All,
>>
>> Sorry for not being around earlier to participate in this
>> discussion. I agree that the bits of code in
>> io_find_target_datafiles are nasty and I'll be sure to clean that
>> up, but the cause of this bug isn't with small IO. The check
>> total_bytes <= max_unexp_payload will still do the right thing
>> (not enable small IO) even if max_unexp_payload is negative. In
>> fact, the problem occurs because the noncontig request makes the
>> normal IO request larger than 4K, and when the sys-io state
>> machine tries to post the unexpected send, the BMI mx layer
>> returns EINVAL because the request is larger than its specified
>> unexpected size (see mx.c:1391).
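
A small sketch of why the small IO check stays safe with a negative limit. This assumes both values are signed 64-bit integers, which is an assumption about the types and not something confirmed from the PVFS source; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the check described above. With signed
 * operands, a non-negative total_bytes can never be <= a negative
 * max_unexp_payload, so small IO is simply not enabled. */
int small_io_allowed(int64_t total_bytes, int64_t max_unexp_payload)
{
    return total_bytes <= max_unexp_payload;
}
```

The caveat is that this only holds for signed types. If either operand were an unsigned type of the same width, the negative value would be converted to a very large unsigned number and the comparison would succeed.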
>>
>> We do the same thing in other BMI methods (gm and ib), but the
>> unexpected limits there are bigger (8K for ib, 16K for gm, 16K for
>> tcp), and so we've never actually hit them with unexpected
>> requests. With a large indexed request, I think we would see the
>> same errors with ib and gm, unless we hit the limit of max request
>> segments first. Each individual segment of an indexed request
>> takes 80 bytes, so we would need to have an indexed request with
>> about 100 segments before hitting the max for ib, and somewhere
>> around 200 for tcp and gm. The limit on segments for a request
>> is hardcoded to 100 right now.
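
The segment estimates above can be reproduced with a short calculation. This is a sketch that uses only the 80 bytes per segment figure from the message; it ignores any fixed header in the encoded request, so the real counts would be slightly lower. The function name is hypothetical.

```c
#include <stddef.h>

#define BYTES_PER_SEGMENT 80  /* per-segment encoding cost quoted above */

/* Hypothetical helper: how many encoded segments fit in an unexpected
 * message of the given size, ignoring fixed header overhead. */
size_t max_segments_for_limit(size_t unexpected_limit)
{
    return unexpected_limit / BYTES_PER_SEGMENT;
}
```

Under these assumptions an 8 KB limit (ib) holds 102 segments and a 16 KB limit (gm, tcp) holds 204, matching the "about 100" and "around 200" estimates. A 4 KB limit holds only 51, which is below the hardcoded cap of 100 segments and is consistent with MX being the method that hit the error.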
>>
>> At this point, it seems like the best fix is the one Scott chose:
>> just increase the max unexpected size for MX. The alternative is
>> to split up unexpected requests if they're above a certain size
>> into multiple unexpecteds, and join them on the server. Messy and
>> a lot of work. If the unexpected message size isn't card
>> specific, could we make it something like 32K or 64K? Are there
>> drawbacks to making it that big?
>>
>> Also, should we increase the limit of request segments allowed?
>> It might be inefficient for a user to create an MPI indexed
>> datatype with that many elements, but there are users that will
>> probably do it anyway. Alternatively, we could consider more
>> efficiently encoding each request segment in PVFS.
>>
>> As an aside, the other methods return -EMSGSIZE, while mx is
>> returning EINVAL, which may have made this harder to debug.
>>
>> -sam
>>
>> On Aug 7, 2007, at 3:21 PM, Pete Wyckoff wrote:
>>
>>> atchley at myri.com wrote on Tue, 07 Aug 2007 15:21 -0400:
>>>> I assumed that small_io_size is fixed in PVFS2 and was greater than
>>>> 4 KB, which is why I volunteered to change bmi_mx. I chose 4 KB for
>>>> bmi_mx simply because that was the value I used in Lustre (kernel
>>>> page size). I am not wedded to 4 KB.
>>>
>>> Okay, good reasoning. We'll let Sam tell us what he thinks. He did
>>> the small io work. I can't think of a reason why any device could
>>> not support a minimum of 8k, like you say, if that would make more
>>> sense for the small io implementation.
>>>
>>> -- Pete
>>>
>>
>>
>