[Pvfs2-developers] BMI send implemtations

Scott Atchley atchley at myri.com
Tue Aug 22 16:18:35 EDT 2006


On Aug 22, 2006, at 3:34 PM, Pete Wyckoff wrote:

> atchley at myri.com wrote on Tue, 22 Aug 2006 14:21 -0400:
>> Ok. MX does this underneath the covers for me. If my send is small
>> (<32 KB by default), it buffers the data and the send can complete
>> immediately. Larger sends are held on the sender until a matching
>> receive is posted. If the receive is not posted before the sender
>> cancels the send, there is nothing else to clean up.
>
> Lucky you.  :)

See, Patrick's rant was worthwhile. ;-)

>> This brings up an possible case. What if peer A sends what it thinks
>> is an expected small message to B and B never posts a receive for it?
>> Since it is small, A's MX will buffer the message and send to B's MX
>> lib. The sender will get an immediate completion. If the B never
>> posts a matching receive, it will sit in the MX unexpected queue
>> indefinitely. How long should I let a message sit in the unexpected
>> queue before deleting it? I do not want to delete it immediately in
>> case the local BMI is about to post a matching receive for it.
>
> That's tricky.  In practice it's not an issue because all BMI users
> do a request/response from client to server currently.  You'll have
> to hang onto it as long as the application is up, maybe you'll be
> able to clear it when BMI_DROP_ADDR gets called for that particular
> client.

Hanging on to it is the default. :-)

I could add the BMI_DROP_ADDR function and clean up. If it is never  
called, the message will just sit in the MX unexpected queue and take  
up space. I think I can make it so that on shutdown I can dump the MX  
state including messages in the various queues. It could be helpful  
for debugging.

>> Also, MX does not limit the number of unexpected messages, but it
>> does place a limit on memory used for unexpected messages (~2MB by
>> default? and is tunable). There are no queue pairs for limiting per
>> peer messages. If rate limiting (flow control) is a requirement, let
>> me know.
>
> The reason I fret about flow control is how IB dies horribly if you
> don't pay attention to it.  As long as MX silently drops the
> message, but doesn't kill the network, the client will happily retry
> the unexpected message later.

Trust me, I would prefer to avoid it given the hoops I have to jump  
through in Lustre. In GM, the sender would flood the network until  
the receiver posted a receive. This is not a problem in MX. I will  
start without flow control and we will see how BMI behaves. We can  
always retrofit it after the fact.

>> MX is connection-less. I will pre-post a bunch of receives for
>> unexpected messages with a special bit mask. For expected receives, I
>> will post using the BMI tag as well as a identifier for the remote
>> peer. I can then test on a specific, per-peer receive (mx_recv()
>> followed by mx_test()) or on any available (expected or not) receive
>> (mx_recv() followed by mx_test_any()).
>
> That sounds fine.  If you have enough bits you can store the pointer
> for the peer state structure itself, rather than some identifier, to
> maybe avoid a hash lookup.
>
> In practice, BMI_testcontext() is most frequently used so you'll
> probably end up polling for any rather than a specific peer.  If
> that matters.
>
>>> To a BMI user, an unexpected message always signals the start of a
>>> new transaction, while an expected message continues an existing
>>> transaction for which you've got outstanding state.  Phil said in
>>> his CAC paper, "This reduces complexity on the server side because
>>> the server does not have to anticipate buffer use in advance."
>>
>> Can you clarify the "you" in the "while an expected message continues
>> an existing
>> transaction for which you've got outstanding state"? Does "you" mean
>> the BMI/MX method or BMI/PVFS? From what it looks like, I have no
>> state (unless I need to rate limit sends).
>
> Heh.  "you" being the pvfs2-server process, for instance.  A
> transaction like an IO flow consists of many messages that are all
> related to the same request/response pair.  The server will remember
> things like the file being used, offset for new data to be written,
> etc.  This is above BMI.
>
> You will need a teensy bit of per-peer state to translate BMI's
> struct method_addr into whatever MX uses to address remote nodes and
> processes (and back).  Hangs off a void* in there.  Maybe want to
> hold onto a user-understandable reprentation of the peer identity
> for error or debug messages (char *peername in IB).

Sorry, I did not mean _any_ state in the MX method. Sam, Murali and I  
discussed these. I meant to say that the MX method would not have any  
knowledge of how one BMI send (or receive) may or may not relate to  
another BMI send (or receive).

> Presumably you'll have a queue of outstanding sends and receives
> somewhere too, but this doesn't have to be per-peer in your case.

I will probably have per-peer send queues so that I can queue  
messages until I can build my MX endpoint address for that peer (call  
mx_iconnect()). MX will manage the receive queue (I call mx_isend())  
and I simply test for completions using mx_test() or mx_test_any(). I  
will need a queue of canceled requests to return in  
BMI_method_testcontext().

>>>> Is the purpose of the RTS/CTS messages then to stall the sending
>>>> until the receiver has posted the receive?
>>>
>>> Yes.   (I'm just talking about IB again.  GM may be similar.)
>>
>> This is unnecessary then for MX. If the send is expected, I can
>> simply call mx_isend() (same semantics as MPI_Isend()). If it is
>> small, MX will buffer and send it. If large, it will wait until the
>> peer posts a receive. If the peer fails to post a receive or if it is
>> gone (crashed, etc.), can I assume that BMI or higher will manage the
>> timeout and call BMI_method_cancel() on the send?
>>
>> If so, I do not need either RTS or CTS messages.
>
> That's a safe assumption.  You will have to get the tag matching
> correct to make sure messages demux into the right spots for the
> case of multiple concurrent clients, or even a multi-threaded
> client doing multiple requests, but that's hopefully doable.

MX provides 64 bits of match info for sends/recvs. I plan to  
partition the bits as follows:

bits	comment
0-3	msg_type (conn_req, conn_ack, expected, unexpected, ...?)
4-7	reserved for credits (if used)
8-15 	reserved
15-31	peer id (16 bits => 65K peers)
32-63	BMI tag

The conn_req and con_ack message types allow me to do a simple  
handshake to establish MX state (hostname, NIC index, endpoint index)  
and agree on a version number. I'm reserving bits for credits should  
flow control be necessary and another 8 bits for future use if  
needed. I am assuming that 65K peers will be enough for awhile. If  
you think that PVFS will be deployed on systems with more than 65K  
peers, I can pull from the reserved bits to increase this id. Lastly,  
I will pass the BMI tag, which is 32 bits.

I am assuming that a call to BMI_post_send() will have a matching  
call to BMI_post_recv() on the remote peer and that both will share  
the same BMI tag. If not, then please let me know.

>>>> If so, would the receiver
>>>> ever send a CTS to indicate that a match is not forthcoming?
>>>
>>> No.  Not sure how such information would help the sender.
>>
>> This is a possible case in Lustre. If the receiver cannot find a
>> matching buffer (either bad request or lack of resources), it will
>> let me know to NAK the send request (send a CTS that indicates  
>> failure).
>
> That's reasonable.  In PVFS this is figured out before BMI gets told
> to send a big message, but arguably the network layer could be more
> involved in the storage protocol.
>
>> Who manages the timeout? The BMI method (i.e. me) or something higher
>> in BMI/PVFS? Can I assume all send and receives are subject to  
>> timeout?
>
> Higher in PVFS (not BMI).  Can't really assume it though.  Sometimes
> we turn it off for debugging.  But normal use wants a configurable
> timeout to detect peer failure.  It's 30 sec by default, in fs.conf.
>
> 		-- Pete

Excellent. This is all I think I need for now (except for  
confirmation of my assumption above about the sender and receiver  
using matching tags).

Thanks,

Scott


More information about the Pvfs2-developers mailing list