[Pvfs2-developers] bmi multiple address endpoints

Sam Lang slang at mcs.anl.gov
Wed Nov 7 17:21:13 EST 2007


I discussed the desired behavior we want out of this fail-over code  
with folks offline, and we came up with a plan.

At the moment, there are two conflicting fail-over schemes: multiple
addresses and multiple protocols.  The multiple protocols case
isn't ideal for failure handling, since both protocols on the same
host are active and listening, so a timeout on one protocol makes the
client switch automatically to the second (often slower) one, without
any admin fail-over procedures.

The plan we have is to allow multiple addresses to be specified in an  
ordered list as before:

Alias hosta mx://hosta:0:3 mx://hostb:0:4 ... tcp://hosta:3334 tcp://hostb:3335 ....

For the server, a command line option specifies which addresses from  
this list to listen on.

pvfs2-server fs.conf -a hosta -e "mx://hosta:0:3" -e "tcp://hosta:3334"

For the client, the contact order is based on the addresses in the  
Alias from left to right, with a few caveats:

1.  Once an address fails, the client attempts the following
addresses in the list in round-robin fashion until one succeeds.
Once an attempt succeeds, the cursor is reset to the beginning of the
list.  This allows fail-over to a server to be 'reset' at some point,
so the client doesn't get stuck contacting (and succeeding on) a
different protocol.

2.  An option to the mount entry specifies which protocols should be  
"filtered out" of the list.  This allows clients to control the  
secondary endpoints that are attempted from the list.  The order  
remains the same, but for example, nodes that only want to use mx can  
set a mount option to bmi=mx, and the list of endpoints in the above  
example just becomes:

Alias hosta mx://hosta:0:3 mx://hostb:0:4 ...

This prevents a client from connecting over tcp when it should keep
attempting mx addresses until the HA infrastructure has had time to
do the proper fail-over.  In some cases admins may choose to
interleave protocol addresses:

Alias hosta mx://hosta:0:3 tcp://hosta:3334 mx://hostb:0:4 tcp://hostb:3335 ....

The default behavior here would be to fail-over to the tcp address if  
the mx address failed, which the admin may actually want.  With the  
mount option, the user/admin can then further constrain the list to  
be only the mx addresses (or tcp), and allow the fail-over to only  
occur on one protocol.  Also, the mount option can specify an  
ordering to the protocols as well, so that even if the different  
addresses of the protocols are interleaved in the list, the mx  
addresses will always be attempted first, then the tcp addresses.
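
As a rough sketch of the cursor behavior in item 1 (this is not the
code in the patch; the struct and function names below are made up
for illustration), the client side could track its position in the
list like this:

```c
/* Hypothetical endpoint: an ordered address list plus a cursor. */
struct endpoint {
    const char **addrs;   /* addresses in Alias order, left to right */
    int count;            /* number of entries in addrs */
    int current;          /* index of the address to try next */
};

/* Address the next message should be sent to. */
static const char *endpoint_addr(const struct endpoint *ep)
{
    return ep->addrs[ep->current];
}

/*
 * A message to the current address failed: move to the next address,
 * wrapping around at the end of the list (round-robin).  The
 * parentheses matter, since % binds tighter than +.
 */
static void endpoint_failed(struct endpoint *ep)
{
    ep->current = (ep->current + 1) % ep->count;
}

/*
 * A message succeeded: reset the cursor so that the next attempt
 * starts again from the first address in the list.
 */
static void endpoint_succeeded(struct endpoint *ep)
{
    ep->current = 0;
}
```

With the list "mx://hosta:0:3 mx://hostb:0:4", a failure on hosta
moves the cursor to hostb, and the first success after that puts it
back on hosta.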

Hopefully this covers all the scenarios we plan to see.  At the  
moment I'm considering the mount option as bmi=<proto1>:<proto2>:...
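
To make the filtering and ordering concrete, here is a minimal sketch
of how a bmi=<proto1>:<proto2>:... value might be applied to the
Alias list.  The function names are hypothetical and the parsing is
only an assumption about how the option could work; it also assumes
each protocol appears at most once in the option.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if addr starts with "<proto>://", proto being len bytes long. */
static int addr_has_proto(const char *addr, const char *proto, size_t len)
{
    return strncmp(addr, proto, len) == 0 &&
           strncmp(addr + len, "://", 3) == 0;
}

/*
 * Build the client's contact list from the Alias addresses and the
 * value of the bmi= option (e.g. "mx" or "mx:tcp").  Addresses whose
 * protocol is not named are dropped.  Protocols are taken in the
 * order given by the option; within one protocol the Alias order is
 * kept.  A NULL or empty option leaves the list unchanged.
 * out must have room for count entries.  Returns the number written.
 */
static size_t filter_addrs(const char **addrs, size_t count,
                           const char *protos, const char **out)
{
    size_t n = 0;

    if (protos == NULL || *protos == '\0') {
        for (size_t i = 0; i < count; i++)
            out[n++] = addrs[i];
        return n;
    }

    const char *p = protos;
    while (*p != '\0') {
        size_t len = strcspn(p, ":");   /* length of this protocol name */
        if (len > 0) {
            for (size_t i = 0; i < count; i++) {
                if (n < count && addr_has_proto(addrs[i], p, len))
                    out[n++] = addrs[i];
            }
        }
        p += len;
        if (*p == ':')
            p++;                        /* skip the separator */
    }
    return n;
}
```

For the interleaved Alias above, bmi=mx would leave only the two mx
addresses, and bmi=mx:tcp would try both mx addresses before either
tcp address.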

I'm going to start modifying the patch to get this behavior.  Let me  
know what you think.

Thanks,
-sam


On Nov 6, 2007, at 6:18 PM, Sam Lang wrote:

>
> Here's take2.  Hopefully a little cleaner.
>
> The more I think about the issue with multiple protocols and  
> primary/secondary addresses, the more complicated it gets.  I've  
> added an option to the server to specify the index in the set of  
> primary/secondary addresses to listen on, allowing a server to be  
> started and listen on the 3rd set of addresses in the endpoint  
> string.  This doesn't fix the problem on the client though, as we  
> don't really want to iterate over both the primary/secondary  
> endpoints and the different protocols.  I guess maybe the protocol  
> on the client has to be chosen based on the protocol specified in  
> the mntent.
>
> -sam
>
> <bmi-maddrs-take2.patch>
>
> On Nov 6, 2007, at 2:27 PM, Sam Lang wrote:
>
>>
>> On Nov 6, 2007, at 1:41 PM, Pete Wyckoff wrote:
>>
>>> slang at mcs.anl.gov wrote on Tue, 06 Nov 2007 10:50 -0600:
>>>> The attached patch implements BMI multiple address endpoints that
>>>> we talked about some time ago.  To refresh everyone's memory, this
>>>> allows a set of addresses to be specified:
>>>>
>>>> tcp://host1:3334/pvfs2-fs,tcp://host2:3335/pvfs2-fs,tcp://host3:3336/pvfs2-fs
>>>>
>>>> In the config file for a given storage endpoint.  The BMI code
>>>> manages the endpoint, setting the currently used address to the
>>>> first one in the list.  On message failure, the endpoint is
>>>> transitioned to point to the next address in the list.  This
>>>> continues in a round-robin fashion.
>>>
>>> This is good stuff.  I'd like to help review it.  Can you do some
>>> trivial things first to make that easier?
>>>
>>> 1.
>>>> +    struct bmi_endpoint_ref_s *newref;
>>> [..]
>>>> +    /* haven't seen any of these addresses before, add a new endpoint */
>>>> +    newref = malloc(sizeof(struct bmi_addr_ref_s));
>>>
>>> Please change all malloc(sizeof()) and memset(,,sizeof()) to use the
>>> variable name, not its type.  The bug you did above (and other
>>> places) is way too common.  We just have to stop doing that.  Like
>>> this instead:
>>>
>>> 	newref = malloc(sizeof(*newref));
>>>
>>> 2.
>>> There's a bunch of stuff that seems out of place.  Can you check
>>> that in or push it to the side so we can concentrate on the core?
>>> 179 kB is a big patch.  :)
>>
>> The bmi-addr.[ch] files are new files, and the critical part of
>> the patch.  They obsolete the ref_st calls, so the
>> reference-list.[ch] files have been removed in the patch.  That may
>> be adding to the number of lines.
>>
>>>
>>> Renaming all the method_ops is a big uninteresting part.
>>
>> Yeah, I got tired of calling them over and over with the  
>> BMI_method_ tagged on the front.  I can try to pull that stuff out.
>>
>>>
>>> Adding bmi_ in front of everything too, but it's too hard to rip
>>> that out mid-patch now.  Random whitespace fixes too.  Good changes,
>>> just hard to read.
>>>
>>> 3.
>>> Some trivial bugs.
>>>
>>>> +    ref->current = ref->current + 1 % ref->count;
>>>
>>> Check your precedence table.
>>>
>>>> + * (C) 2001 Clemson University and The University of Chicago
>>>
>>> New files go back in time.
>>>
>>> Can you put some comments above each of the three new structs in
>>> bmi-addr.c?  I keep getting confused on which is the old-style
>>> addr and which is the comma-separated list.  And what the various
>>> "link" and "refs" fields point to.
>>
>> Sure thing.
>>
>>>
>>>> I've done some basic testing, but there's still more to do.  The
>>>> client IO state machine is a bear, and testing all the cases where
>>>> things could failover (requests, flows, acks, etc.) is going to
>>>> take some more work.  I wanted to get the patch out there to allow
>>>> others to provide feedback.
>>>
>>> Yeah, totally.  But it can be made to work.
>>>
>>> What did we decide to do with mixed method usage?  The old semantic
>>> was "ib://foo:2345/pvfs2-fs,tcp://foo:2347/pvfs2-fs" means try to
>>> use IB, but if you don't have an IB nic, switch to TCP.  I agree we
>>> decided that was less interesting.  Do we just add docs that say
>>> that this comma is now for multi-pathing?  If people try this
>>> example with the new code, it will flip from IB to TCP at every
>>> timeout.  The old behavior was to stick with the first one where
>>> you had the hardware.  In other words, probably some docs somewhere
>>> should be added to this patch.
>>
>> Right, I thought we had decided to go the multi-path route.  I  
>> guess there could be a config option that would set the flipping  
>> from round-robin to try-once, giving the old behavior.
>>
>>>
>>> On the server side, the ,-separated addresses mean "listen on all
>>> these interfaces".  What do servers do now when they see your
>>> tcp://host1,tcp://host2 example string above?  Looks like they would
>>> fail to listen on anything (host1 can't bind to host2 address?).
>>> This has to be in the fs.conf as the Alias string for each server so
>>> that clients can find the IO servers, not just in the pvfs2tab to
>>> find the config server.
>>
>> Yeah that formatting isn't going to work.  I wanted to keep it  
>> simple, but I guess that's not possible.  Should we separate  
>> fallback addresses with ';' instead?
>>
>> tcp://host1:3331,ib://host1:3335;tcp://host2:3332,ib://host2:3336
>>
>> Something like that?
>> -sam
>>
>>>
>>> 		-- Pete
>>>
>>
>


