[Pvfs2-developers] Re: noncontig-test
Scott Atchley
atchley at myri.com
Wed Aug 8 21:09:37 EDT 2007
Hi Sam,
Welcome back!
I copied EINVAL from ib.c. I can change it to return BMI_EMSGSIZE
since it is more specific.
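A minimal sketch of the kind of check involved, not the actual bmi_mx code: the function name is hypothetical, and it returns a plain negative errno value, whereas BMI's own error encoding (BMI_EMSGSIZE) may differ.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: reject an unexpected send whose total size
 * exceeds the method's unexpected-message limit. Returning -EMSGSIZE
 * rather than -EINVAL tells the caller that the size, not some other
 * argument, was the problem.
 */
static int sketch_check_unexpected_size(size_t total_size,
                                        size_t max_unexpected)
{
    if (total_size > max_unexpected)
        return -EMSGSIZE;
    return 0;
}
```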
The trade-off with a 32-64 KB unexpected size is memory footprint. To
avoid mallocing a buffer for each incoming unexpected, I pre-alloc
100 rx structs on the server to catch initial connect messages. For
each peer that connects, I alloc another 20 rx structs. For each rx
struct, I alloc two buffers of unexpected size (so I can repost the
rx after handing the first buffer off in BMI_mx_testunexpected()).
This behavior can be changed but it really helps reduce latency.
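To put rough numbers on that footprint, here is a small sketch using the counts above (100 initial rx structs, 20 more per peer, two buffers each). The names are hypothetical, and it counts only the unexpected buffers, not the rx structs themselves.

```c
#include <stddef.h>

/* Counts taken from the description above; the names are hypothetical. */
#define SKETCH_INITIAL_RX   100  /* rx structs pre-allocated on the server */
#define SKETCH_RX_PER_PEER   20  /* extra rx structs per connected peer */
#define SKETCH_BUFS_PER_RX    2  /* unexpected-size buffers per rx struct */

/* Bytes of unexpected buffers held by a server with num_peers peers. */
static size_t sketch_unexpected_footprint(size_t num_peers,
                                          size_t unexpected_size)
{
    size_t rx_structs = SKETCH_INITIAL_RX + SKETCH_RX_PER_PEER * num_peers;
    return rx_structs * SKETCH_BUFS_PER_RX * unexpected_size;
}
```

As an illustration, with 100 connected peers (an arbitrary figure), a 4 KB unexpected size comes to roughly 16 MiB of buffers, while 64 KB comes to roughly 262 MiB.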
Also, since 32 KB is the starting size for rendezvous messages in MX,
I would prefer to use 8 or 16 KB so that the unexpected message is
actually sent eagerly.
As for number of segments, MX will accept up to 256. For messages
less than 32 KB, there is not a penalty since eager messages are
buffered before sending on the wire. For messages over 32 KB using
more than one segment, MX will copy them into a contiguous buffer
before sending, which will greatly reduce throughput.
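The rules above can be summarized in a short sketch. This is only an illustration of the behavior as described, not MX code: the enum and function names are hypothetical, and treating a message of exactly 32 KB as rendezvous is an assumption based on 32 KB being the starting size for rendezvous.

```c
#include <stddef.h>

#define SKETCH_RNDV_THRESHOLD (32 * 1024) /* rendezvous starts here */
#define SKETCH_MAX_SEGMENTS   256         /* most segments MX accepts */

enum sketch_send_path {
    SKETCH_EAGER,             /* buffered before the wire; segments cost nothing extra */
    SKETCH_RNDV_DIRECT,       /* rendezvous with a single segment */
    SKETCH_RNDV_COPIED,       /* rendezvous; segments copied into one buffer first */
    SKETCH_TOO_MANY_SEGMENTS  /* more segments than MX will accept */
};

/* Classify a send by total length and segment count. */
static enum sketch_send_path sketch_classify(size_t length,
                                             size_t num_segments)
{
    if (num_segments > SKETCH_MAX_SEGMENTS)
        return SKETCH_TOO_MANY_SEGMENTS;
    if (length < SKETCH_RNDV_THRESHOLD)
        return SKETCH_EAGER;
    if (num_segments > 1)
        return SKETCH_RNDV_COPIED; /* the slow case: extra copy */
    return SKETCH_RNDV_DIRECT;
}
```

Only the SKETCH_RNDV_COPIED case pays the throughput penalty, which is why large messages are best kept to a single segment.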
Scott
On Aug 8, 2007, at 7:47 PM, Sam Lang wrote:
>
> Hi All,
>
> Sorry for not being around earlier to participate in this
> discussion. I agree that the bits of code in
> io_find_target_datafiles are nasty and I'll be sure to clean that
> up, but the cause of this bug isn't with small IO. The check
> total_bytes <= max_unexp_payload will still do the right thing (not
> enable small IO) even if max_unexp_payload is negative. In fact,
> the problem occurs because the noncontig request makes the normal
> IO request larger than 4K, and when the sys-io state machine tries
> to post the unexpected send, the BMI mx layer returns EINVAL
> because the request is larger than its specified unexpected size
> (see mx.c:1391).
>
> We do the same thing in other BMI methods (gm and ib), but the
> unexpected limits there are bigger (8K for ib, 16K for gm, 16K for
> tcp), and so we've never actually hit them with unexpected
> requests. With a large indexed request, I think we would see the
> same errors with ib and gm, unless we hit the limit of max request
> segments first. Each individual segment of an indexed request takes
> 80 bytes, so we would need to have an indexed request with about
> 100 segments before hitting the max for ib, and somewhere around
> 200 for tcp and gm. The limit on segments for a request is
> hardcoded to 100 right now.
>
> At this point, it seems like the best fix is the one Scott chose,
> just increase the max unexpected size for MX. The alternative is
> to split up unexpected requests if they're above a certain size
> into multiple unexpecteds, and join them on the server. Messy and
> a lot of work. If the unexpected message size isn't card specific,
> could we make it something like 32K or 64K? Are there drawbacks to
> making it that big?
>
> Also, should we increase the limit of request segments allowed? It
> might be inefficient for a user to create an MPI indexed datatype
> with that many elements, but there are users that will probably do
> it anyway. Alternatively, we could consider more efficiently
> encoding each request segment in PVFS.
>
> As an aside, the other methods return -EMSGSIZE, whereas mx returns
> EINVAL, which may have made this harder to debug.
>
> -sam
>
> On Aug 7, 2007, at 3:21 PM, Pete Wyckoff wrote:
>
>> atchley at myri.com wrote on Tue, 07 Aug 2007 15:21 -0400:
>>> I assumed that small_io_size is fixed in PVFS2 and was greater than
>>> 4 KB, which is why I volunteered to change bmi_mx. I chose 4 KB for
>>> bmi_mx simply because that was the value I used in Lustre (kernel
>>> page size). I am not wedded to 4 KB.
>>
>> Okay, good reasoning. We'll let Sam tell us what he thinks. He did
>> the small io work. I can't think of a reason why any device could
>> not support a minimum of 8k, like you say, if that would make more
>> sense for the small io implementation.
>>
>> -- Pete
>>
>
>