[Pvfs2-developers] BMI/MX design draft

Scott Atchley atchley at myri.com
Thu Sep 14 13:01:22 EDT 2006


Hi all,

I have outlined what I intend to write for bmi_mx. Since MX was
designed to support MPI, its API closely matches the major
functions of PVFS (e.g. send, recv, test, test_any (aka testcontext),
etc.). Like MPI, MX matches messages on tags. Specifically, it
provides 64 bits of match information. I will use some of these bits
to indicate the message type (EXPECTED or UNEXPECTED for PVFS
messages, plus the connection message types internal to bmi_mx).

MX also does dynamic memory registration. There are no calls to
[un]register memory. BMI_mx_memalloc() and BMI_mx_memfree() will
simply be malloc() and free().

In the text below, I do not mention posting unexpected receives. MX  
has an unexpected callback handler. I will register a simple function  
that gets an idle rx descriptor and posts a matching receive  
immediately. This is preferred to pre-posting a bunch of generic,  
unexpected receives, which slows down matching of expected receives.

I talked with another developer at lunch about PVFS and the lack of
a dedicated thread for bmi_mx, and he will add a completion handler
function to the MX API. This will allow me to progress my internal
connection management messages without waiting for calls to
BMI_mx_testcontext() or BMI_mx_testunexpected().

Lastly, cancellation of posted sends (and partially matched receives)  
is a problematic operation. MX cannot guarantee that a peer has not  
actually received a message when we try to cancel a send. The only  
thing we can guarantee is that we can release the local buffer (i.e.  
the peer will not be able to read from it from this point on).  
Internally, MX needs to clean up a lot of state to handle a send
cancellation, so we recently added a new function to the API called
mx_disconnect(). This call cancels all outstanding operations with  
the peer (posted sends and matched receives). I will call it when I  
am asked to cancel a send or a receive that has already been matched  
but not completed (i.e. partially received) and when a peer is  
sending an internal bmi_mx connect message.

Any comments or suggestions?

Scott

--
Scott Atchley
Myricom Inc.
http://www.myri.com

Partition of Match Bits

bits	comment
4	msg_type	/* MX connect msgs, bmi_mx connect
			msgs (conn_req, conn_ack), expected,
			unexpected, ... */
4	credits		/* reserve in case credits are used */
4	reserved	/* for future use */
20	id		/* receiver assigned id for the peer
			for posting rxs for a specific peer to
			distinguish rxs with the same BMI tag */
32	tag		/* the BMI tag */
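
The partition above could be packed into a 64-bit match value roughly
as follows. This is only a sketch: the table fixes the field widths,
but the bit positions (msg_type in the top nibble, tag in the low 32
bits), the numeric msg_type values and all bmx_* names are assumptions
made for illustration. The masked comparison shows how a test on the
msg_type bits alone would select one class of messages.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths come from the table; the shifts are an assumed layout
 * that packs the fields in listed order, msg_type most significant. */
#define BMX_TAG_BITS       32
#define BMX_ID_BITS        20
#define BMX_RESERVED_BITS  4
#define BMX_CREDITS_BITS   4

#define BMX_ID_SHIFT       BMX_TAG_BITS                              /* 32 */
#define BMX_RESERVED_SHIFT (BMX_ID_SHIFT + BMX_ID_BITS)              /* 52 */
#define BMX_CREDITS_SHIFT  (BMX_RESERVED_SHIFT + BMX_RESERVED_BITS)  /* 56 */
#define BMX_TYPE_SHIFT     (BMX_CREDITS_SHIFT + BMX_CREDITS_BITS)    /* 60 */

#define BMX_TYPE_MASK      (UINT64_C(0xF) << BMX_TYPE_SHIFT)
#define BMX_ID_MAX         ((UINT32_C(1) << BMX_ID_BITS) - 1)

/* Hypothetical numeric values; only the names appear in the draft. */
enum bmx_msg_type {
    BMX_MSG_ICONN_CR = 1,
    BMX_MSG_ICONN_CA,
    BMX_MSG_CONN_REQ,
    BMX_MSG_CONN_ACK,
    BMX_MSG_EXPECTED,
    BMX_MSG_UNEXPECTED
};

/* Build match bits from msg_type, peer id and BMI tag. */
static uint64_t bmx_make_match(unsigned type, uint32_t id, uint32_t tag)
{
    assert(type <= 0xF);
    assert(id <= BMX_ID_MAX);
    return ((uint64_t)type << BMX_TYPE_SHIFT)
         | ((uint64_t)id << BMX_ID_SHIFT)
         | (uint64_t)tag;
}

static unsigned bmx_match_type(uint64_t m)
{
    return (unsigned)(m >> BMX_TYPE_SHIFT);
}

static uint32_t bmx_match_id(uint64_t m)
{
    return (uint32_t)((m >> BMX_ID_SHIFT) & BMX_ID_MAX);
}

static uint32_t bmx_match_tag(uint64_t m)
{
    return (uint32_t)(m & UINT64_C(0xFFFFFFFF));
}

/* Masked comparison: only the bits set in mask take part in the match. */
static int bmx_matches(uint64_t incoming, uint64_t info, uint64_t mask)
{
    return (incoming & mask) == info;
}
```

Testing with BMX_TYPE_MASK and an info value that carries only the
msg_type would then pick up any EXPECTED (or any UNEXPECTED) message
regardless of peer id and tag.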


Conn Req msg:
bmi_mx version		/* change if protocol changes */
name			/* MX hostname excluding board */
board			/* MX board */
endpoint		/* MX endpoint */
peer_id	(for B)		/* Have peer use this id when sending to me */
[credits]		/* max msgs in-flight */

Conn Ack msg:
peer_id (for A)		/* or negative value if version mismatch, etc. */

Initial Send Unex

A                       B
|    mx_iconnect()      |
|---------------------->|  /* msg_type bits = ICONN_CR */
|    conn_req msg       |  /* msg_type bits = CONN_REQ */
|---------------------->|
|                       |
|    mx_iconnect()      |
|<----------------------|  /* msg_type bits = ICONN_CA */
|    conn_ack msg       |  /* stuff id in bits, use 0 byte msg? */
|<----------------------|  /* msg_type bits = CONN_ACK */
|                       |

Peer State:
peername		/* mx://host:board:endpoint */
name			/* host */
board			/* board */
endpoint		/* endpoint */
my_id			/* id assigned to me by peer */
peer_id			/* id I assigned to peer */
mx_nic_id		/* peer's MX nic_id */
mx_endpoint_addr_t	/* MX endpoint address */
state			/* INIT, WAIT, READY, DISCONNECT */
qlist queued_sends	/* need connect, need tx descriptor */
qlist queued_recvs	/* in DISCONNECT, wait until state back to INIT,
			   then post */
qlist pending_recvs	/* in-flight recvs (in case of cancel) */
qlist peers		/* for hanging on the global list of peers */
lock			/* for serialization */

Msg Descriptor (tx/rx):
type			/* TX/RX */
msg_type		/* CONN_REQ, CONN_ACK, EXPECTED, UNEXPECTED */
qlist global_list	/* hang on global list of TX or RX for cleanup */
qlist list		/* hang on idle, queued and pending lists */
method_op		/* BMI op */
state			/* IDLE, PREP, PENDING, COMPLETED, CANCELED */
peer			/* owning peer */
tag			/* BMI supplied 32-bit msg id */
match_info		/* MX match info (msg type [ | peer_id]) */
mxseg			/* mx_segment_t for small messages */
buffer			/* void * for small msg */
*mxsegs			/* array of mx_segments pointing to list of bufs */
nseg			/* number of segments */
nob			/* number of bytes */
lock			/* might not be needed */

Global State:
peername		/* mx://host:board:endpoint */
name			/* host */
board			/* board */
endpoint		/* endpoint */
qlist peers		/* list of peers */
qlist txs		/* list of txs (for cleanup) */
qlist idle_txs		/* available txs */
qlist rxs		/* list of rxs (for cleanup) */
qlist idle_rxs		/* available rxs */
qlist cancelled_reqs	/* called mx_cancel(), return in bmi_testcontext() */
next_id			/* id for next peer [1...2^20 - 1], fits the 20-bit id */
lock			/* for serialization */


Peer State on Client
Peer state starts at INIT. When calling mx_iconnect(), set state to  
WAIT. When the mx_iconnect() request returns, post a receive for  
CONN_ACK and send a CONN_REQ msg. When the CONN_ACK completes, set  
state to READY. For each tx and rx, add a reference for the peer.  
When any tx or rx completes, decrement the reference. If a request to  
cancel a msg sets the state to DISCONNECT, wait until all pending txs  
and rxs complete and decrement their references, set the state to  
INIT and start over.

Peer State on Server
When a CONN_REQ rx completes, retrieve the peer info from the  
endpoint addr context. If none is found, create a new peer. If found,  
set state to DISCONNECT and cancel pending_recvs. Set the peer state  
to INIT. Call mx_iconnect() and set the state to WAIT. When the  
mx_iconnect() request returns, send a CONN_ACK and set state to  
READY. For each tx and rx, add a reference for the peer. When any tx
or rx completes, decrement the reference. If a request to cancel a msg
sets the state to DISCONNECT, wait until all pending txs and rxs
complete and decrement their references, then set the state to INIT
and start over.
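
The reference counting described for both sides can be sketched as a
small state machine. This is a single-threaded illustration with
hypothetical bmx_* names; the real code would take the peer lock
around each of these steps.

```c
#include <assert.h>

enum bmx_peer_state {
    BMX_PEER_INIT,
    BMX_PEER_WAIT,
    BMX_PEER_READY,
    BMX_PEER_DISCONNECT
};

struct bmx_peer {
    enum bmx_peer_state state;
    int refs;                   /* one reference per in-flight tx or rx */
};

/* Take a reference when a tx or rx is posted. Nothing may be posted
 * while the peer is draining in DISCONNECT. */
static void bmx_peer_addref(struct bmx_peer *p)
{
    assert(p->state != BMX_PEER_DISCONNECT);
    p->refs++;
}

/* Drop a reference when a tx or rx completes. The last completion
 * in DISCONNECT returns the peer to INIT so it can start over. */
static void bmx_peer_decref(struct bmx_peer *p)
{
    assert(p->refs > 0);
    p->refs--;
    if (p->state == BMX_PEER_DISCONNECT && p->refs == 0)
        p->state = BMX_PEER_INIT;
}

/* A cancel request moves the peer to DISCONNECT, or straight back
 * to INIT when nothing is in flight. */
static void bmx_peer_begin_disconnect(struct bmx_peer *p)
{
    p->state = (p->refs == 0) ? BMX_PEER_INIT : BMX_PEER_DISCONNECT;
}
```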


BMI_mx_post_send_common()
	get idle tx
	lookup peer using peername
	if no peer
	    /* should happen on client only */
	    create peer
	assign msg_type, peer, tag, method_op, nob
	create match_info (msg_type | peer_id | tag)
	map segment(s)
	if unexpected
	    ensure length < EAGER_SIZE
	switch peer state
	    case READY
		add reference count on peer
		send tx
		break
	    case INIT
		call mx_iconnect
		/* fall through */
	    case WAIT
	    case DISCONNECT
		append to queued_sends
		break
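
The switch at the end of BMI_mx_post_send_common() might look like
this in C. The sketch only models the decision; the queue is reduced
to a counter, the actual mx_iconnect() and send calls are left to the
caller, and every bmx_* name is hypothetical.

```c
#include <assert.h>

enum bmx_peer_state {
    BMX_PEER_INIT,
    BMX_PEER_WAIT,
    BMX_PEER_READY,
    BMX_PEER_DISCONNECT
};

/* What the caller should do with the tx after dispatch. */
enum bmx_send_action {
    BMX_SEND_NOW,               /* peer READY: send immediately */
    BMX_SEND_CONNECT_AND_QUEUE, /* peer INIT: start connect, tx queued */
    BMX_SEND_QUEUED             /* peer WAIT/DISCONNECT: tx queued */
};

struct bmx_peer {
    enum bmx_peer_state state;
    int refs;                   /* in-flight txs and rxs */
    int queued_sends;           /* stand-in for the queued_sends list */
};

static enum bmx_send_action bmx_dispatch_send(struct bmx_peer *p)
{
    assert(p != 0);
    switch (p->state) {
    case BMX_PEER_READY:
        p->refs++;                      /* reference held until completion */
        return BMX_SEND_NOW;
    case BMX_PEER_INIT:
        p->state = BMX_PEER_WAIT;       /* caller issues mx_iconnect() */
        p->queued_sends++;
        return BMX_SEND_CONNECT_AND_QUEUE;
    case BMX_PEER_WAIT:
    case BMX_PEER_DISCONNECT:
        p->queued_sends++;
        return BMX_SEND_QUEUED;
    }
    return BMX_SEND_QUEUED;             /* not reached for valid states */
}
```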

BMI_mx_post_send()
	call BMI_mx_post_send_common()

BMI_mx_post_send_list()
	call BMI_mx_post_send_common()

BMI_mx_post_sendunexpected()
	call BMI_mx_post_send_common() with unexpected flag

BMI_mx_post_sendunexpected_list()
	call BMI_mx_post_send_common() with unexpected flag

BMI_mx_post_recv()
	get idle rx
	lookup peer using peername
	if no peer
	    /* should happen on client only */
	    create peer
	assign msg_type, peer, tag, method_op, nob
	create match_info (msg_type | peer_id | tag)
	map segment(s)
	if unexpected
	    ensure length < EAGER_SIZE
	switch peer state
	    case INIT
	    case WAIT
	    case READY
		add reference count on peer
		queue on pending_recvs
		post rx
		break
	    case DISCONNECT
		/* we can't post it and add a ref if in DISCONNECT
		   because we need the ref count to go to 0 before
		   the state goes back to INIT */
		queue on queued_recvs
		break

BMI_mx_post_recv_list()
	call BMI_mx_post_recv()

BMI_mx_test()
	mx_test()

BMI_mx_testcontext()
	handle_conn_reqs()
	for 1 to incount
	    dequeue from cancelled_reqs
	    set outid, err, user_ptr
	    queue idle tx/rx
	for completed to incount
	    mx_test_any() with EXPECTED bit mask
	    set outid, err, size, user_ptr
	    if rx
		dequeue from pending_recvs
	    queue idle tx/rx
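
The two loops in BMI_mx_testcontext() report cancelled requests first
and then fill the remaining slots with completions. A minimal sketch
of that ordering follows; the array-backed queues are stand-ins (the
second one replaces repeated mx_test_any() calls), and the names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Array-backed stand-in for a list of request ids. */
struct bmx_queue {
    const int *ids;
    size_t len;
    size_t next;                /* index of the next id to hand out */
};

/* Pop one id; returns 0 when the queue is empty. */
static int bmx_queue_pop(struct bmx_queue *q, int *id)
{
    if (q->next == q->len)
        return 0;
    *id = q->ids[q->next++];
    return 1;
}

/* Fill out_ids with at most incount entries: cancelled requests
 * first, then completed ones. Returns the number reported. */
static size_t bmx_testcontext_sketch(struct bmx_queue *cancelled,
                                     struct bmx_queue *completed,
                                     int *out_ids, size_t incount)
{
    size_t n = 0;
    int id;

    assert(out_ids != NULL || incount == 0);
    while (n < incount && bmx_queue_pop(cancelled, &id))
        out_ids[n++] = id;
    /* In bmi_mx this loop would poll MX instead of a local queue. */
    while (n < incount && bmx_queue_pop(completed, &id))
        out_ids[n++] = id;
    return n;
}
```

Because each loop checks the remaining room before popping, a request
is never taken off a queue unless there is a slot to report it in.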

BMI_mx_testunexpected()
	handle_conn_reqs()
	mx_test_any() with UNEXPECTED bit mask
	if found
	    update UI struct
	    queue idle rx
	    return 1
	else
	    return 0

handle_conn_reqs()
	do
	    mx_test_any() with ICONN_CR or ICONN_CA bit mask
	    switch type
		case ICONN_CR
		    if success
			get idle tx
			send CONN_REQ
		    else
			set peer state to DISCONNECT
			drop queued rxs and txs
		case ICONN_CA
		    if success
			get idle tx
			set peer state to READY
			send CONN_ACK
			send queued txs
		    else
			set peer state to DISCONNECT
			drop queued rxs and txs
	while (request returned)
	do
	    mx_test_any() with CONN_REQ or CONN_ACK bit mask
	    switch type
		case TX
		    handle CONN TX completion
		case RX
		    handle CONN RX completion
	while (request returned)

handle CONN TX completion
	if failed
	    set peer state to DISCONNECT
	    drop queued rxs and txs
	put idle tx

handle CONN RX completion
	if CONN_REQ
	    parse msg
	    mx_iconnect() with ICONN_CA
	    if the values don't match
		set peer state to DISCONNECT
	if CONN_ACK
	    if success
		get my_id from match_info
		set peer state to READY
		send queued txs
	    else
		set peer state to DISCONNECT
		drop pending rxs and txs
	put idle rx


BMI_mx_cancel()
	if rx
	    mx_cancel(rx)
	    if SUCCESS, return SUCCESS
	    else
		mx_test(rx)
		if SUCCESS, return FAIL /* rx completed */
		else
			set peer state to DISCONNECT
			mx_disconnect()
			cancel pending_recvs
	else /* tx */
	    set peer state to DISCONNECT
	    mx_disconnect()
	    cancel pending_recvs
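
The decision in BMI_mx_cancel() reduces to a small table, sketched
here without any MX calls. The matched and completed flags are
stand-ins for what mx_cancel() and mx_test() would report, and all
bmx_* names are hypothetical.

```c
#include <assert.h>

enum bmx_cancel_result {
    BMX_CANCEL_OK,              /* rx was not matched yet, cancelled cleanly */
    BMX_CANCEL_TOO_LATE,        /* rx had already completed */
    BMX_CANCEL_DISCONNECTED     /* had to disconnect from the peer */
};

/* Simulated request status. */
struct bmx_req {
    int is_rx;
    int matched;                /* rx has been matched to an incoming msg */
    int completed;              /* rx has fully completed */
};

struct bmx_peer {
    int disconnecting;          /* stand-in for state == DISCONNECT */
    int disconnects;            /* how often mx_disconnect() would be called */
};

static enum bmx_cancel_result bmx_cancel_sketch(const struct bmx_req *req,
                                                struct bmx_peer *peer)
{
    assert(req != 0 && peer != 0);
    if (req->is_rx) {
        if (!req->matched)
            return BMX_CANCEL_OK;       /* mx_cancel() would succeed */
        if (req->completed)
            return BMX_CANCEL_TOO_LATE; /* mx_test() would report completion */
    }
    /* Any tx, or a partially received rx: tear down the connection
     * and let the pending receives be cancelled with it. */
    peer->disconnecting = 1;
    peer->disconnects++;
    return BMX_CANCEL_DISCONNECTED;
}
```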

BMI_mx_method_addr_lookup()
	parse id
	lookup peer in peers list
	if !found
	    create a new peer
	return method_addr *
	

BMI_mx_rev_lookup()
	return peer's peername

BMI_mx_set_info()		/* drop_addr (probe for unmatched, expected
				   messages and drop them) */

BMI_mx_get_info()		/* unexpected size, drop_addr (probe for
				   unmatched, expected messages and drop them) */

BMI_mx_initialize()
	alloc global peer state
	alloc pool of rxs and txs
	mx_init()
	mx_open_endpoint()
	mx_register_unexp_handler()

BMI_mx_finalize()
	mx_wakeup()
	mx_finalize()

BMI_mx_memalloc()		/* malloc() */
BMI_mx_memfree()		/* free() */
BMI_mx_open_context()		/* return 0 */
BMI_mx_close_context()		/* return 0 */

