Support network ingress on arbitrary ports #9454
openshift-bot merged 1 commit into openshift:master from marun:tcp-routes
Conversation
|
@jsafrane I'd love it if you could review this work... |
|
[test] |
pkg/externalip/api/v1/types.go
Outdated
// ServiceRef references the service (in any namespace) that is
// bound to this external ip address.
ServiceRef kapi.ObjectReference `json:"serviceRef"`
Should be in a status struct, just like PVs. The pattern here is exactly like PVC and PV, except we want to avoid the mistake we made there (making the name a user would see confusing). How does a user request one of these IPs?
Can you write out in the description of the PR how you expect these to be used (the exact sequence of interactions that leads to one of these being allocated)?
|
My goal was to allow automatic allocation of an 'ingress ip' to a service, as follows:
I was not intending that allocation of an ingress ip be performed directly (at least not initially). Either an ingress ip would be allocated for a service in response to an annotation, or a service would specify one or more externalIPs and the admission controller would allocate the corresponding ingress ips. In response to your comment about avoiding the use of the admission controller to allocate an ingress ip for each externalIP defined on a service: how would a standard controller be able to prevent multiple services from allocating the same ingress ip? I imagine a controller could record allocation status on the service, but services don't have a status field. Given that the ingress object appears poised to absorb responsibility for ingress ip allocation, should one be added? By using an admission controller, I was hoping to avoid needing one, since clashing allocations would be prevented on creation and update. This solution is essentially a stop-gap while we wait for upstream ingress to evolve to meet the same requirement. Do you think this is a reasonable goal? Is there another way you would suggest accomplishing it? |
|
Maybe a different question - if I had a Service of type LoadBalancer and I was on bare metal, would I just expect it to get allocated one of these new-fangled magic external IPs? And then if I went to the cloud, would I expect to get a real AWS/GCE load balancer (because I'm kind of assuming that the trickery we do for external IPs wouldn't work on AWS/GCE)? Or would I want to use this in GCE/AWS as well as a load balancer? It kind of sounds like this is bare-metal L3/L4 load balancing, which in theory services should do since we already have an IP. Controllers are single threaded, so they keep their own state consistently and allocate (that's how the security allocator gives out unique UID blocks to each namespace that is created). |
|
The security allocator uses namespaces as the authoritative source of "which ips are handed out" and then uses the "allocator" utility code to manage tracking that. That's our standard "I want to define a range in config, and then hand out values out of it". An admin can override the security allocator by setting the annotation, in which case the allocator just marks that value taken and moves on. I don't want to get too much into allocation yet - @pravisankar knows a lot about it and can help explain the code if necessary; I just want to understand the exact use case we have to solve so we can align it with existing APIs. |
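For readers unfamiliar with that pattern, here is a minimal, self-contained sketch of a range-backed allocator in the spirit of the allocator utility code mentioned above; the type and method names are invented for illustration and it uses only the Go standard library, not the actual openshift/origin code.

```go
// Illustrative sketch: define a range in config and hand out values out of it.
package main

import (
	"errors"
	"fmt"
	"net"
)

type ipAllocator struct {
	network   *net.IPNet
	allocated map[string]bool // set of IPs already handed out
}

func newIPAllocator(cidr string) (*ipAllocator, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	return &ipAllocator{network: network, allocated: map[string]bool{}}, nil
}

// AllocateNext walks the range and returns the first free address.
func (a *ipAllocator) AllocateNext() (net.IP, error) {
	ip := a.network.IP.Mask(a.network.Mask)
	for ; a.network.Contains(ip); ip = nextIP(ip) {
		if !a.allocated[ip.String()] {
			a.allocated[ip.String()] = true
			return ip, nil
		}
	}
	return nil, errors.New("range is full")
}

// Mark records an address an admin assigned manually so it is not reused.
func (a *ipAllocator) Mark(ip net.IP) { a.allocated[ip.String()] = true }

// nextIP returns the address immediately following ip.
func nextIP(ip net.IP) net.IP {
	next := make(net.IP, len(ip))
	copy(next, ip)
	for i := len(next) - 1; i >= 0; i-- {
		next[i]++
		if next[i] != 0 {
			break
		}
	}
	return next
}

func main() {
	alloc, _ := newIPAllocator("172.29.0.0/24")
	ip, _ := alloc.AllocateNext()
	fmt.Println(ip) // 172.29.0.0
}
```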
|
Ok, some concrete thoughts: We have a big choice here:
I think the first is simpler than the latter, because we can make some assumptions:
If we do the latter, we have to add more API objects and do a bit more work:
Adding API objects is always expensive - I think just from that measure if we can do "bare metal service loadbalancer" we get the end user feature without having to deal with some of the future compatibility, and we could change directions more easily later on. |
|
Re: pool vs claim - all allocators have to synchronize on something to ensure no duplicates. For IP addresses (like ports) we generally would use a single API object that contains the range and a bitmap of everything that has been allocated. The controller has to perform two duties (can be one or two controllers) - verify the bitmap is accurate (no double allocations), and give out IPs as they are needed. The claim is necessary so the user has an anchor (I own this IP). Unlike PVC, nothing is special about the IP except its value, so we don't need one object per IP. The binder has to go in a specific order - allocate from the pool (so that if a second controller starts it can't use that IP), then mark the user's IP as fulfilled (by setting status). It also has to handle rollback if the update failed for any reason. The "repair" responsibility has to periodically run and verify that nothing funny has happened - usually that happens on controller start. The problem with repair is that it can't resolve any problems - an admin has to step in and fix them (since it could be data loss). That's the strongest advantage to the simple model above - user doesn't get special IPs, so the controller can be aggressive about fixing problems it finds. |
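A rough sketch of that binder ordering, with hypothetical interfaces standing in for the real pool allocator and API client (not the code in this PR): allocate from the shared pool first, then record the result on the user's object, and roll back the allocation if that update fails.

```go
// Sketch of the binder ordering described above; all types are stand-ins.
package main

import "fmt"

type allocator interface {
	AllocateNext() (string, error)
	Release(ip string)
}

type serviceClient interface {
	// UpdateStatus persists the allocated ip on the service.
	UpdateStatus(serviceName, ip string) error
}

// bind allocates from the shared pool first (so a second controller cannot
// hand out the same ip), then records the ip on the user's object, and rolls
// the allocation back if that update fails for any reason.
func bind(pool allocator, client serviceClient, serviceName string) error {
	ip, err := pool.AllocateNext()
	if err != nil {
		return fmt.Errorf("pool exhausted: %v", err)
	}
	if err := client.UpdateStatus(serviceName, ip); err != nil {
		// Rollback so a later repair pass does not have to clean this up.
		pool.Release(ip)
		return fmt.Errorf("failed to record %s on %s: %v", ip, serviceName, err)
	}
	return nil
}

func main() {}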
|
@smarterclayton I think option 1 is preferable for all the reasons you describe. Would the following be acceptable?
Open questions:
|
|
One silly question... I was talking to @DirectXMan12 and he wondered if the admission controller could do the allocation of the externalIP directly and change the service. We weren't sure if that violated any rules about what an admission controller should do. And we weren't sure whether, if we allocated the IP in the admission controller and then had to rewind, we would ever free the allocated IP. Would that work? Or is it illegal? Thanks. |
|
@knobunc There's nothing technically preventing the approach you're suggesting, but one of @smarterclayton's comments (#9454 (comment)) suggests that it's preferable to do as little as possible in admission control. |
|
In general, we should only implement an admission controller if there is no way to accomplish the particular goal using a regular controller. Our goal is to have zero admission controllers in the long run.
The controller only has to allocate from memory; it does not need to persist allocations to etcd and read them back, as long as a couple of invariants are met. One, the controller on startup has to read the current state, check the status of all currently allocated services, allocate them in a deterministic order that another controller would observe (creation date, probably), and then start handling normal allocations. We have code to handle "wait for the initial list" through DeltaFIFO. Once you do that, you only need an in-memory copy because you're working in a consistent order - another controller running at the same time would be making the same decisions as you.
We should upgrade the externalIP admission controller as you note, and also add the thing that was added to the new endpoints admission controller that lets an effective cluster admin set any IP they want into external (bypass the change). We can do that as a second part, but I think it's important so that cluster admins can fix things when they go wrong.
Generally we allocate from the first open spot, so while I think this is a reasonable concern, I'd say we could punt on this for a future iteration (where we keep a list of recently recovered IPs).
What does the existing service load balancer allocator do for GCE in the event the ip is not available? Fail, or allocate another one? The controller can do the check to see whether the IP is free and use it, and we can let admins mutate the service status if necessary. We could also gate this with a permission check (the same controller as externalIP) where admins only can set this value. It's important to remember that the allocation controller has to take into account when an admin allocates an IP directly, and that we need to ensure it isn't confused about whether this was a bug or intentional. Arguably, we could tell admins to create a second IP range that is outside of the allocatable range and have them set themselves - the allocation controller must ignore ingress IPs not in its range anyway. |
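The "ignore ingress IPs not in the allocatable range" rule can be expressed with the standard library alone; this is a hedged sketch, not the controller code from this PR.

```go
// Standard-library sketch of the range check discussed above.
package main

import (
	"fmt"
	"net"
)

// inIngressRange reports whether an admin-assigned ip falls inside the
// configured ingress CIDR; addresses outside it are left to the admin.
func inIngressRange(cidr, ipStr string) (bool, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, fmt.Errorf("invalid cidr %q: %v", cidr, err)
	}
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return false, fmt.Errorf("invalid ip %q", ipStr)
	}
	return network.Contains(ip), nil
}

func main() {
	ok, _ := inIngressRange("172.29.0.0/16", "172.29.10.4")
	fmt.Println(ok) // true: the controller would manage this address
}
```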
|
@smarterclayton Allocating from memory in a consistent order requires that the ordering of watch events be reproducible from a list call. I don't see how that is possible. The order of watch events is determined by when a change is accepted by etcd, but creationTimestamp (the obvious sorting key) is set in code. The two orderings have no deterministic relationship. Here's a scenario assuming 2 controllers allocating from memory:
It would be possible for both serviceA and serviceB to be assigned the same ip if the sorted order of controllerY's list result happened to be (serviceB, serviceA) (e.g. creationTimestamp of B was less than A). Given this, I don't think allocating from memory would be supportable for more than a single controller. It would be necessary to resort to persistent storage to coordinate multiple controllers. Which is more important - multiple controllers or avoiding persistent storage? |
|
Sort the result of the list call before it is returned by the controller. In general the server is already sorting, so I don't think order is incorrect. In your watch calls you need to order based on something deterministic (create date) to determine whether to replace. In your initial setup you need to make decisions based on that order. |
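As an illustration of that startup invariant, assuming simplified stand-in types rather than the real cache/DeltaFIFO machinery: list the existing services, replay recorded allocations in creation-time order with a deterministic tie-break, and only then begin normal processing.

```go
// Simplified illustration of the deterministic startup ordering discussed above.
package main

import (
	"sort"
	"time"
)

type service struct {
	Name              string
	CreationTimestamp time.Time
	AllocatedIP       string // ingress ip already recorded on the service, if any
}

type allocator interface {
	Mark(ip string) // record an ip as taken in the in-memory allocator
}

// replayExisting seeds the in-memory allocator from the initial list in a
// deterministic order, so two controllers observing the same list converge on
// the same in-memory state before handling new events.
func replayExisting(services []service, alloc allocator) {
	sort.Slice(services, func(i, j int) bool {
		if services[i].CreationTimestamp.Equal(services[j].CreationTimestamp) {
			return services[i].Name < services[j].Name // stable tie-break
		}
		return services[i].CreationTimestamp.Before(services[j].CreationTimestamp)
	})
	for _, svc := range services {
		if svc.AllocatedIP != "" {
			alloc.Mark(svc.AllocatedIP)
		}
	}
}

func main() {}
```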
|
The latest update is rough (and testing is still in progress), but feedback is welcome. |
pkg/cmd/server/api/v1/types.go
Outdated
ExternalIPNetworkCIDRs []string `json:"externalIPNetworkCIDRs"`
// IngressIPNetworkCIDRs controls the range to assign ingress ips from for services of type LoadBalancer on bare
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRs reserved for external ips,
|
Nice work Maru. It looks good to me now, but I'll reexamine in the morning to see if I spot anything. |
}
if len(config.NetworkConfig.IngressIPNetworkCIDRs) > 1 {
	validationResults.AddErrors(field.Invalid(fldPath.Child("networkConfig", "ingressIPNetworkCIDRs"), config.NetworkConfig.IngressIPNetworkCIDRs, "must be a single valid CIDR notation IP range (e.g. 172.30.0.0/16)"))
}
else if len(config.NetworkConfig.IngressIPNetworkCIDRs) == 1 {
No longer necessary with the switch to IngressIPNetworkCIDR
|
@smarterclayton I added permission for the controller to write events, but I need guidance as to when/where/why it would be appropriate to use events instead of logging/HandleError. |
Use HandleError if you are going to eat an unexpected error in a control loop (if I see the text "this should never happen", the next line should be HandleError). Use v(4/5) logging for debugging of tricky code that you suspect to be broken due to logical errors in your code at some point in the future. Use events when you need to communicate to an admin or end users that a specific thing happened that is relevant for them to react / know. Do not use events if you're in a tight loop or a failure mode could cause you to spam events forever. We use events to report things like "we pulled this image to the node" and "a node was detected to be down" (relevant to user and admin), but generally not "I succeeded at doing something that I should generally be doing" (who needs to know that, etc). |
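Sketching that guidance in code (import paths are approximate, since the relevant packages have moved between Kubernetes releases, and the event reason is invented, so treat this as illustrative rather than the PR's controller code):

```go
package main

import (
	"fmt"

	"github.com/golang/glog"
	v1 "k8s.io/api/core/v1"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/record"
)

func handleAllocationResult(recorder record.EventRecorder, svc *v1.Service, err error, full bool) {
	switch {
	case err == nil:
		// Normal success: debug-level logging only, nobody needs an event.
		glog.V(4).Infof("allocated ingress ip for %s/%s", svc.Namespace, svc.Name)
	case full:
		// Relevant to users and admins: surface it as an event on the service.
		recorder.Eventf(svc, v1.EventTypeWarning, "IngressIPRangeFull",
			"no available ingress ip to allocate: %v", err)
	default:
		// Unexpected error swallowed by the control loop ("should never happen").
		utilruntime.HandleError(fmt.Errorf("ingress ip allocation for %s/%s failed: %v",
			svc.Namespace, svc.Name, err))
	}
}

func main() {}
```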
|
@smarterclayton @marun Just like the security/vnid controllers, do we also want to add a repair controller to ensure that an ingressIP is assigned to only one service, that a service of type LoadBalancer always has an ingressIP (when the configured ingress cidr is not nil), and to detect ingressIP leaks? |
|
Since the controller here resyncs, it's its own repair controller. |
|
case <-deleted:
case <-time.After(time.Duration(30 * time.Second)):
	t.Fatalf("failed to see expected service deletion")
}
Would like to see a table driven test that sets up all the race conditions we discussed and shows that the controller resolves them if it runs or resyncs.
This test is for sanity purposes - the (currently failing) integration test will target the race conditions in question. Now that the design is mostly shaken out I'm going to ensure better unit test coverage and then work on integration testing.
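For reference, a table-driven test of the kind requested above might look roughly like this; the scenario names and the runScenario helper are hypothetical stand-ins for driving the controller against a fake client, not code from this PR.

```go
package ingressip

import "testing"

// runScenario stands in for running the controller against a fake client and
// reporting whether the service ended up with an allocated ingress ip.
func runScenario(requestedIP string, alreadyUsed bool) bool {
	// A real test would exercise the controller; this stub only encodes the
	// expected outcomes so the example compiles.
	return requestedIP == "172.29.0.5"
}

func TestControllerResolvesConflicts(t *testing.T) {
	tests := map[string]struct {
		requestedIP   string
		alreadyUsed   bool
		wantAllocated bool
	}{
		"free ip is allocated":            {requestedIP: "172.29.0.5", wantAllocated: true},
		"duplicate ip is reallocated":     {requestedIP: "172.29.0.5", alreadyUsed: true, wantAllocated: true},
		"ip outside the range is ignored": {requestedIP: "10.0.0.1", wantAllocated: false},
	}
	for name, tc := range tests {
		t.Run(name, func(t *testing.T) {
			if got := runScenario(tc.requestedIP, tc.alreadyUsed); got != tc.wantAllocated {
				t.Errorf("allocated = %v, want %v", got, tc.wantAllocated)
			}
		})
	}
}
```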
func (c *MasterConfig) RunIngressIPController(client *kclient.Client) {
	// TODO need to disallow if a cloud provider is configured
	if len(c.Options.NetworkConfig.IngressIPNetworkCIDR) == 0 {
		glog.V(2).Infof("Ingress ip controller will not start - IngressIPNetworkCIDR not configured")
No need for this message (we don't do it in other places).
Done. You're clearly the authority here, but I did add this message to be consistent with some of the other controllers not starting due to missing configuration:
https://github.com/marun/origin/blob/32c0ad3977d08318cdd13da01152308411e62910/pkg/cmd/server/origin/run_components.go#L90
https://github.com/marun/origin/blob/32c0ad3977d08318cdd13da01152308411e62910/pkg/cmd/server/origin/run_components.go#L109
https://github.com/marun/origin/blob/32c0ad3977d08318cdd13da01152308411e62910/pkg/cmd/server/origin/run_components.go#L417
All old ones we want to remove.
|
Do we want to turn this on by default for a random range and then make…? It is incredibly powerful for testing. |
|
@smarterclayton I'm satisfied with the unit test coverage. The integration test is now validating multiple controllers. There is the possibility of thrashing if multiple controllers coincide with ip exhaustion, but I'm assuming that's a rare enough case not to spend further effort on it.
|
|
@smarterclayton More questions:
|
Yes. We should pick something similar enough to the existing ranges that it's recognizable.
I think an event on ErrFull is relevant, although when we are in that state, how much work do we do before we find out the state is full? I.e. are we making a bunch of calls to the API server and then discovering we're full? I'm thinking about resync mostly - I don't want to write an event once per item on every resync. However, it's not the end of the world to just send the event on ErrFull and then open an issue to deal with it later.
An event on automatic reallocation in the user's namespace is appropriate (since it should be a one time event). |
|
@smarterclayton If the configured ip range is small relative to the number of services requiring allocation, the queue could cycle endlessly and generate an ErrFull event for each service pending allocation on each cycle. At present the only thing that would limit the rate of events being sent is that adding back to the work queue is rate limited. Is that sufficient? |
|
Yes that's sufficient. |
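A small sketch of the rate-limited requeue behaviour being relied on here, using the client-go workqueue package (the import path reflects current client-go; the PR vendored an older copy under Kubernetes):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Failed allocations are re-added through AddRateLimited, so a full range
	// causes exponentially backed-off retries rather than a hot loop.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	key := "default/my-service"
	queue.AddRateLimited(key)

	item, shutdown := queue.Get()
	if shutdown {
		return
	}
	fmt.Println("processing", item)
	// On success, forget the key so its backoff resets.
	queue.Forget(item)
	queue.Done(item)
}
```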
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRs reserved for external ips,
// nodes, pods, or services.
IngressIPNetworkCIDR string `json:"ingressIPNetworkCIDR"`
If I specify CIDR "0.0.0.0/32" that means no ingress - do we check for that?
Done. I've disallowed 0.0.0.0 (see validation/master.go).
If we're going to default this to on and provide a value we have to be able to distinguish between unset and set. You can do that either by allowing 0.0.0.0/32 to mean "set, but deliberately empty", or turn this value into a pointer.
However, we can do that in a follow up.
Given that this is a string, isn't "" enough to mean unset?
I'm working on a followup that sets a default and also validates against overlap between IngressIPNetworkCIDR, ExternalIPNetworkCIDRs, ClusterNetworkCIDR and ServiceNetworkCIDR.
Regarding the default, is this the right place to add it?
https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/network_args.go
If you set a default, you need a way to distinguish between default, and unset, so that if an admin doesn't want to use the default, they can say "no default".
No, defaulting happens in pkg/cmd/server/api/v1/conversions.go, and there are a few other steps to handle "default for existing clusters" vs. "default for new clusters"
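A standard-library sketch of the validation being discussed - treat the empty string as unset and reject an unspecified address such as 0.0.0.0/32 - not the exact code in validation/master.go:

```go
package main

import (
	"fmt"
	"net"
)

func validateIngressIPNetworkCIDR(value string) error {
	if len(value) == 0 {
		return nil // unset: the controller simply does not start
	}
	ip, _, err := net.ParseCIDR(value)
	if err != nil {
		return fmt.Errorf("must be a valid CIDR notation IP range (e.g. 172.30.0.0/16): %v", err)
	}
	if ip.IsUnspecified() {
		return fmt.Errorf("must not be the unspecified address %s", value)
	}
	return nil
}

func main() {
	fmt.Println(validateIngressIPNetworkCIDR(""))              // <nil>
	fmt.Println(validateIngressIPNetworkCIDR("0.0.0.0/32"))    // error
	fmt.Println(validateIngressIPNetworkCIDR("172.29.0.0/16")) // <nil>
}
```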
|
One last issue (the IP.IsUnspecified()) and I'll merge this. |
This change adds a new controller that allocates ips to services of type LoadBalancer from a range configured by IngressIPNetworkCIDR. This is intended to support ip-based traffic ingress on bare metal.
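To make the end-user view concrete, here is an illustrative construction of what an allocated service could look like, using the upstream core/v1 types; the names and the allocated address are invented for the example, not taken from the PR.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	svc := v1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "frontend", Namespace: "demo"},
		Spec: v1.ServiceSpec{
			Type:  v1.ServiceTypeLoadBalancer,
			Ports: []v1.ServicePort{{Name: "http", Port: 80}},
		},
		// The allocated ingress ip is surfaced through the same load balancer
		// status field a cloud provider integration would populate.
		Status: v1.ServiceStatus{
			LoadBalancer: v1.LoadBalancerStatus{
				Ingress: []v1.LoadBalancerIngress{{IP: "172.29.12.34"}},
			},
		},
	}
	fmt.Println(svc.Status.LoadBalancer.Ingress[0].IP)
}
```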
|
Evaluated for origin test up to fbca710 |
|
LGTM [merge] |
|
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/7653/) |
|
Evaluated for origin merge up to fbca710 |
Add a new ExternalIP type to enable tracking of allocated IPs and make it possible to have an external address be allocated to a service automatically from the range indicated by ExternalIPNetworkCIDRs. The type is global in scope, with its name being an ip address, to ensure an ip is allocated at most once. For simplicity, an IP will be dedicated to a single service rather than having the port space of an ip be shared between many services.
TODO:
- Update the admission controller (Done)
- Return an error if a requested external ip is already in use
- Create a new ExternalIP object for each requested external ip

cc: @openshift/networking @smarterclayton
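For context, a rough sketch of how the proposed ExternalIP type might look with the service reference moved into a status struct, as suggested in the review above; the field layout and import path are approximations, not code from this PR.

```go
package api

import kapi "k8s.io/kubernetes/pkg/api"

// ExternalIP is cluster-scoped and named after the ip address itself, so the
// apiserver's name uniqueness ensures each ip is allocated at most once.
type ExternalIP struct {
	kapi.ObjectMeta `json:"metadata,omitempty"`

	Status ExternalIPStatus `json:"status,omitempty"`
}

// ExternalIPStatus records the binding, mirroring the PV/PVC pattern.
type ExternalIPStatus struct {
	// ServiceRef references the service (in any namespace) that is
	// bound to this external ip address.
	ServiceRef kapi.ObjectReference `json:"serviceRef"`
}
```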