[Pvfs2-developers] BMI questions

Sam Lang slang at mcs.anl.gov
Fri Dec 1 04:33:12 EST 2006


On Nov 30, 2006, at 6:58 PM, Scott Atchley wrote:

> On Nov 30, 2006, at 4:31 PM, Sam Lang wrote:
>
>> Right now all our operations (or transactions, as you call them)  
>> start with an unexpected message from the client, and end with an  
>> expected message from the server.  I don't know if that's a design  
>> requirement of BMI though, or just an artifact of how we use it in  
>> PVFS.  I _think_ the BMI interfaces were meant to allow expected  
>> messages in either direction in any order, and it's left up to the  
>> upper layers to make sure they get posted right, but again, I  
>> would have to defer to one of the BMI sages.
>
> Hmmm. I assumed that for any operation there would be a back  
> and forth between client and server ending with an expected send  
> from server to the client:
>
> Client          Server
>    |     unex      |
>    |-------------->|
>    |               |
>    |      ex       |
>    |<--------------|
>    |               |
>    |      ex       |
>    |-------------->|
>    |               |
>    |      ex       |
>    |<--------------|
>    |               |
>
> with a minimum of unexpected client to server followed by an  
> expected from server to client. If this is the case I might be able  
> to do a simple flow control on the client using a reference count  
> (increment on send to server S and decrement on receive from S).
>
> Are you saying that a single operation may not ping pong back and  
> forth but have multiple expected sends in a single direction?
>
> Client          Server
>    |     unex      |
>    |-------------->|
>    |               |
>    |      ex       |
>    |<--------------|
>    |               |
>    |      ex       |
>    |-------------->|
>    |               |
>    |      ex       |
>    |-------------->|
>    |               |
>    |      ex       |
>    |-------------->|
>    |               |
>    |      ex       |
>    |<--------------|
>    |               |
>
> If so, would each of the receives (and matching sends) use  
> different tags? Also, this case presents a resource starvation  
> risk. Since the BMI method does not know about the entire operation  
> (how many sends/receives), it is possible that it could start the  
> operation but not be able to get the additional resources for the  
> subsequent sends/receives to complete it.

Your example above is currently how writes work.  The client sends an
unexpected message to the server (a control message for the IO: file
info, size of the IO, etc.).  The server posts an expected receive for
the first piece of IO data and then sends an expected message back to
the client.  The client posts a receive for that expected message
before sending the unexpected one.  After the receive of the expected
message completes at the client (this is a 'ready for IO' message from
the server), it posts a send of the actual IO (this will be up to
FlowBufferSize).  Once that send completes, it posts another one, and
assumes that the server has already posted another receive (based on
the size of the entire IO).  Once all the IO has completed at the
server (including pushing the data to disk), the server sends a
response ack message, for which the client posted a receive before
doing any of the actual IO.

I think the ordering of posts goes something like this for a write:

client:                                 server:
----------------------------------------------------------------

                                        post_unexp
post_recv(ready_ack)
post_send(IO_request)
                                        wait(IO_request)
                                        post_recv(IO1)
                                        post_send(ready_ack)
wait(ready_ack)
post_send(IO1)
post_recv(write_ack)
                                        wait(IO1)
                                        post_recv(IO2)
wait_for_send_completion(IO1)
post_send(IO2)
                                        wait(IO2)
                                        post_recv(IO3)
...                                     ...
post_send(ION)
                                        wait(ION)
                                        post_send(write_ack)
wait(write_ack)
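
In case it helps to poke at that ordering, here is a rough Python sketch that replays the table as a list of events and reports any send that was posted before the peer had posted the matching receive. This is only a bookkeeping model, not BMI code: the function names are made up for illustration, post_unexp is treated as covering IO_request, and repeating wait_for_send_completion for every chunk after IO1 is my assumption about what the "..." rows stand for.

```python
def write_post_order(num_chunks):
    """Build the (side, call, message) sequence from the table above."""
    events = [
        ("server", "post_unexp", "IO_request"),   # unexpected recv, assumed to cover IO_request
        ("client", "post_recv", "ready_ack"),
        ("client", "post_send", "IO_request"),
        ("server", "wait", "IO_request"),
        ("server", "post_recv", "IO1"),
        ("server", "post_send", "ready_ack"),
        ("client", "wait", "ready_ack"),
        ("client", "post_send", "IO1"),
        ("client", "post_recv", "write_ack"),
    ]
    # Assumed repetition of the IO1/IO2 rows for the chunks hidden behind "...".
    for i in range(1, num_chunks):
        events.append(("server", "wait", f"IO{i}"))
        events.append(("server", "post_recv", f"IO{i + 1}"))
        events.append(("client", "wait_for_send_completion", f"IO{i}"))
        events.append(("client", "post_send", f"IO{i + 1}"))
    events += [
        ("server", "wait", f"IO{num_chunks}"),
        ("server", "post_send", "write_ack"),
        ("client", "wait", "write_ack"),
    ]
    return events


def early_sends(events):
    """Return messages whose post_send came before the matching receive was posted."""
    posted = set()
    early = []
    for _side, call, msg in events:
        if call in ("post_recv", "post_unexp"):
            posted.add(msg)
        elif call == "post_send" and msg not in posted:
            early.append(msg)
    return early


events = write_post_order(3)
print(early_sends(events))          # table order: every receive is pre-posted

# Delay the server's post_recv(IO2) until just after the client's post_send(IO2).
i = events.index(("server", "post_recv", "IO2"))
j = events.index(("client", "post_send", "IO2"))
events.insert(j, events.pop(i))
print(early_sends(events))          # IO2 is now sent before its receive exists
```

With the table's ordering the list comes back empty; delaying a single server-side post_recv is enough to make the corresponding send show up in it.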


It looks like the flow code on the server doesn't actually post the
next recv of IO (IO2) until the first recv (IO1) has completed, so
it's possible that the client posts (and starts) the next send before
the server posts the next receive, although it's probably unlikely.
The server posts the next recv (IO2) once the first recv completes,
and it posts another recv (IO3) if necessary once the disk write of
the data received for IO1 finishes.  Receives therefore start being
posted before the current receive completes, which lets the server
post receives before the client posts the associated sends.  This is
essentially what the flow looks like on the server:

time
-------------->

[---BMI RECV IO1---][----BMI RECV IO2----][----------BMI RECV IO4----------][-------------------BMI RECV IO7----------------]
                    [-DISK WRITE IO1-][------BMI RECV IO3------][--------BMI RECV IO5--------][---BMI RECV...
                                          [-DISK WRITE IO2-][------------------BMI RECV IO6---------------][---BMI RECV...
                                                                [-DISK WRITE IO3-][---BMI RECV...
                                                                            [-DISK WRITE IO4-][---BMI RECV...
                                                                                              [-DISK WRITE IO5-]
                                                                                                           [-DISK WRITE IO6-]
                                                                                                                             [-DISK WRITE IO7-]

(I hope the columns match up ok there, you may need to resize your  
window for best viewing :-)).

The [---] show the post and completion times of BMI receive
operations and of the associated writes of the received data to disk.
Each BMI receive uses a separate buffer (up to a max of 8 buffers).
Every time a BMI recv completes, two things happen: the associated
trove write is posted, and a new BMI recv is posted.  So over time,
BMI receives will get posted at the server before the matching BMI
sends get posted at the client, but the second and maybe third BMI
receives may be posted after the BMI sends at the client.
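
A small event-driven model of that server behavior, in Python, might look like the sketch below. It is only an illustration under some loud assumptions: every receive and every disk write takes a fixed amount of time (in reality concurrent receives share the network), a buffer is held from the time its receive is posted until its disk write finishes, and all the names (simulate_server_flow, recv_time, write_time) are invented here rather than taken from the flow code.

```python
import heapq


def simulate_server_flow(num_chunks, recv_time=3.0, write_time=2.0, max_buffers=8):
    """Replay the server-side flow; returns (log, peak buffers in use)."""
    events = []      # heap of (completion time, sequence number, kind, chunk)
    log = []         # (time, action, chunk) in the order things happen
    seq = 0          # tie-breaker so simultaneous events pop in posting order
    next_chunk = 1
    in_use = 0
    peak = 0

    def post_recv(now):
        nonlocal seq, next_chunk, in_use, peak
        # Nothing to post if the IO is fully covered or no buffer is free.
        if next_chunk > num_chunks or in_use >= max_buffers:
            return
        chunk = next_chunk
        next_chunk += 1
        in_use += 1
        peak = max(peak, in_use)
        log.append((now, "post_recv", chunk))
        heapq.heappush(events, (now + recv_time, seq, "recv_done", chunk))
        seq += 1

    post_recv(0.0)
    while events:
        now, _, kind, chunk = heapq.heappop(events)
        if kind == "recv_done":
            # A completed receive posts its disk write and a new receive.
            log.append((now, "post_write", chunk))
            heapq.heappush(events, (now + write_time, seq, "write_done", chunk))
            seq += 1
            post_recv(now)
        else:
            # A completed disk write frees its buffer, which can take a new receive.
            in_use -= 1
            log.append((now, "write_done", chunk))
            post_recv(now)
    return log, peak


log, peak = simulate_server_flow(num_chunks=7)
for when, action, chunk in log:
    print(f"t={when:4.1f}  {action:<10} IO{chunk}")
print("peak buffers in use:", peak)
```

The log shows one receive outstanding at first and more in flight as disk writes start completing, which is the same ramp-up as in the timeline above.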

To answer your specific questions:

The same bmi tag is passed to each of the post_send and post_recv  
calls for the entire IO operation.

As to hitting resource limits, the client doesn't post the next send  
until the previous send has completed.  I think with enough IO  
operations from different clients happening concurrently, it may be  
possible to run into the resource issues you speak of, but I need to  
verify that.
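
On the reference-count idea (increment on send to server S, decrement on receive from S), here is a rough Python sketch of what the counter would see from the client side. It is not BMI code, and the helper names are hypothetical. It just applies the counting rule to a strict ping-pong exchange and to the write sequence described above, where the IO sends all go in one direction.

```python
from collections import Counter


def refcount_after(messages):
    """Apply increment-on-send / decrement-on-receive, keyed by server."""
    outstanding = Counter()
    for direction, server in messages:
        if direction == "send":
            outstanding[server] += 1
        else:
            outstanding[server] -= 1
    return outstanding


def ping_pong(server, rounds):
    """Strictly alternating exchange: one send, then one receive, per round."""
    return [("send", server), ("recv", server)] * rounds


def write_op(server, num_chunks):
    """Client-side view of one write, following the sequence described above."""
    return (
        [("send", server), ("recv", server)]   # IO_request out, ready_ack back
        + [("send", server)] * num_chunks      # IO1..ION, all client to server
        + [("recv", server)]                   # write_ack back
    )


print(refcount_after(ping_pong("S", 2))["S"])   # balanced exchange
print(refcount_after(write_op("S", 4))["S"])    # one write with 4 data sends
```

The ping-pong case ends at zero, while a write with N data sends ends at N - 1, so a counter like this would need to know about the one-directional sends to stay balanced.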

>
>> Are you able to do some kind of pre-posting if you know there's  
>> always an expected coming back?
>>
>> -sam
>
> I assumed that BMI always posted a receive for an expected incoming  
> send? Does it not? I would hope that BMI or a higher layer would  
> pre-post the receive before calling the send function. If not, let  
> me know.

Yes, it always posts a receive for an expected message.  For most
expected messages the receive is guaranteed to be posted before the
peer posts the send.  That doesn't appear to be guaranteed in the IO
case though, as I mentioned above.

Hope this helps.

-sam

>
> Scott
>


