make login, project, and discovery work against kube with RBAC enabled by deads2k · Pull Request #11340 · openshift/origin

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10645/) (Base Commit: 8b8e813)

deads2k · 2016-10-28T11:54:56Z

@openshift/networking I'm going to guess that somehow this broke the networking test. Can you help me figure out how?

danwinship · 2016-10-31T14:01:30Z

tagging @marun since this is dind-related

If I check out that branch and do make clean; ./test/extended/networking.sh, I see the same deployment failure.

Doing . dind-nettest.sh; oc get nodes from another shell while it is waiting shows that the nodes are created and ready. But stracing the dind script shows that when it runs "oc get nodes ...", it gets back:

the server doesn't have a resource type "nodes"

???

danwinship · 2016-10-31T14:45:46Z

pkg/cmd/util/clientcmd/negotiate.go

-		if errors.IsNotFound(err) {
+		if errors.IsNotFound(err) || errors.IsForbidden(err) {
 			glog.V(4).Infof("Server path /oapi was not found, returning the requested group version %v", preferredGV)
 			return preferredGV, nil


This is the change that breaks the tests... Is there some other reason we could be getting a 403 here in some circumstances?

> This is the change that breaks the tests... Is there some other reason we could be getting a 403 here in some circumstances?

Are you hitting an openshift server or a kubernetes server? We allow all users (authenticated and unauthenticated) to hit our discovery endpoints. The only way I can think of to fail is to race with an initial cache priming, but that's a little crazy. You could wait for a zero exit code oc get --raw /oapi.

Actually, it might not be that; the failure doesn't seem to be 100% reliable, so it might just be luck that it passed without that change

And this is against an openshift server

smarterclayton · 2016-10-31T15:07:01Z

Why is a race with initial cache priming crazy?

marun · 2016-10-31T15:52:24Z

Has this patch been tested against a non-dind multinode cluster?

marun · 2016-10-31T15:56:17Z

~~I'm seeing the failure consistently after rebasing the PR, which is how it is being tested in CI.~~

After rm -rf ~/.kube, I am no longer able to replicate.

marun · 2016-10-31T16:28:40Z

Assuming stale cache in ~/.kube is the cause of the CI failure, consider updating test/extended/networking.sh to clear it and seeing if that allows the job to pass.

liggitt · 2016-10-31T16:45:56Z

> consider updating test/extended/networking.sh to clear it and seeing if that allows the job to pass

this confuses me... the cache shouldn't be storing anything in error cases, right?

deads2k · 2016-10-31T16:54:59Z

> this confuses me... the cache shouldn't be storing anything in error cases, right?

riddle me this. How does this pull produce this?

https://paste.fedoraproject.org/466900/14779326/

deads2k · 2016-10-31T16:58:20Z

It almost looks like networking is starting an openshift server that does not run the kubernetes API server, but instead tries to proxy it and somehow the proxy isn't proxying the discovery doc. That's insane though. Running an openshift API server like that won't work and I don't think this code will react differently in that case than the old code would have.

danwinship · 2016-10-31T17:09:55Z

> It almost looks like networking is starting an openshift server that does not run the kubernetes API server

dind runs openshift in several different modes while setting up the cluster. eg, first it runs openshift admin ca create-master-certs ..., then openshift start master --write-config ..., etc. (see images/dind/master/). And the "oc get nodes" loop to test if the cluster is ready starts before those config steps finish. So if one of those returns bad data, and oc caches it, then...

deads2k · 2016-10-31T17:18:16Z

> dind runs openshift in several different modes while setting up the cluster. eg, first it runs openshift admin ca create-master-certs ..., then openshift start master --write-config ..., etc. (see images/dind/master/). And the "oc get nodes" loop to test if the cluster is ready starts before those config steps finish. So if one of those returns bad data, and oc caches it, then...

Thing is, none of these are server-side changes, so the discovery information being served is identical and whatever command eventually saved the empty discovery doc, got that back from the server it queried. But, every other component that starts a master is serving "normal" discovery information from the discovery endpoints.

danwinship · 2016-10-31T17:30:05Z

I meant, they run before the master does, maybe they're binding to port 8443 and serving bogus data. But it looks like they don't.

Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like:

kube.StartMasterHTTPServer()
openshift.OhBTWHandleOpenShiftURLsToo()

deads2k · 2016-10-31T17:34:12Z

> I meant, they run before the master does, maybe they're binding to port 8443 and serving bogus data. But it looks like they don't.
>
> Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like:

Even so, this doesn't change that behavior one way or the other. Are you guys running an oc login --token or oc project command somewhere on a loop? That might start succeeding sooner than usual.

stevekuznetsov · 2016-10-31T17:38:24Z

@danwinship there is one handler chain for the server that serves everything, described in MasterConfig.Run()

deads2k · 2016-10-31T17:59:59Z

> Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like:

@danwinship That's a good theory. Can you link me to where the oc get nodes is done so I can switch it to a health check first?

danwinship · 2016-10-31T18:54:54Z

wait-for-cluster in hack/dind-cluster.sh

marun · 2016-10-31T22:01:11Z

I don't see how the master could be returning bad answers briefly at startup, because if I deploy a dind cluster after removing ~/.kube, everything works fine, including oc get nodes. It's only if I wait (about 5m) and then run oc get nodes that I see the failure shown in the fpaste.

marun · 2016-10-31T22:04:13Z

I don't think this issue is specific to dind deployment, and I think merging should wait until the networking job is passing.

deads2k · 2016-10-31T22:12:04Z

> I don't think this issue is specific to dind deployment, and that merging should wait until the networking job is passing.

I haven't proposed forcing it.

I'm not familiar with the job though. It would be nice to have a smaller, faster reproducer to help track this down. Is there a flag to prevent tear down and then scripts to debug using various kubeconfigs? The differences in this job make it harder to jump in and debug.

marun · 2016-11-01T16:08:30Z

@deads2k hack/dind-cluster.sh start -r reproduces the issue for me. Once the dind images have been built, deploying a cluster takes 20-30s. The only requirement is linux running recent (> 1.10) docker.

My comment about merge was in response to the bot trying to merge after your push. I realize now that the consistent job failure means it can't succeed.

deads2k · 2016-11-22T18:17:37Z

> @deads2k hack/dind-cluster.sh start -r reproduces the issue for me. Once the dind images have been built, deploying a cluster takes 20-30s. The only requirement is linux running recent (> 1.10) docker.

It doesn't reproduce it for me. That command always seems to work.

deads2k · 2016-11-22T18:48:01Z

> > @deads2k hack/dind-cluster.sh start -r reproduces the issue for me. Once the dind images have been built, deploying a cluster takes 20-30s. The only requirement is linux running recent (> 1.10) docker.
>
> It doesn't reproduce it for me. That command always seems to work.

@marun can you reproduce this reliably on an AWS instance I could use to try to diagnose it? The pastebin you made doesn't seem to appear in the jenkins test run (near as I can tell) and every other master we test it against works. This really looks like it's specific to networking test provisioning somehow: I can't make it fail locally, and I can't figure out where the failure happens and is logged so that I can instrument it while running in jenkins.

stevekuznetsov · 2016-11-22T19:14:38Z

hack/dind-cluster.sh

  oc="$(os::build::find-binary oc)"

  # wait for healthz to report ok before trying to get nodes
-  os::util::wait-for-condition "ok" "${oc} get --config=\"${kubeconfig}\" --raw=/healthz" "120"


The function in question claims it uses eval but in fact does this:

while ! $(${condition}); do

This means that yes, these quotes were incorrect and the actual value of --config that is passed to oc get is "${kubeconfig}", literal quotes and all. @marun this type of thing is why I feel so strongly about not having the provision_util.sh file, since we spent a lot of time and effort getting it right the first time in os::cmd.

provision_util.sh isn't used here. That's legacy and only used by vagrant. I'm assuming you meant images/dind/node/openshift-dind-lib.sh, which needs to be a separate file so it can be distributed in the dind image. I'm happy to have os::cmd take over responsibility for this use case, but it would have to be copied into the image file for distribution regardless.

Ah, whatever checkout of Origin I'm in right now has the os::util::wait-for-condition function in provision-util.sh.
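For illustration, a minimal sketch of the quoting behaviour described above. It is not the real os::util::wait-for-condition; `show_config` is a hypothetical stand-in for `oc` that only prints the value it received for --config, and the kubeconfig path is made up. It compares plain unquoted expansion of the condition string (what `$(${condition})` does) with `eval`.

```shell
# Hypothetical stand-in for `oc`: prints the value passed via --config=...
show_config() {
  for arg in "$@"; do
    case "${arg}" in
      --config=*) printf '%s\n' "${arg#--config=}" ;;
    esac
  done
}

kubeconfig="/tmp/admin.kubeconfig"  # assumed path, for illustration only

# The condition string carries escaped quotes, as in the call under review.
condition="show_config --config=\"${kubeconfig}\""

# Unquoted expansion only word-splits the string; the embedded quotes are
# not removed, so they reach the command as literal characters.
unquoted_result="$(${condition})"

# eval re-parses the string as shell input, so the quotes are removed.
eval_result="$(eval "${condition}")"

printf 'expansion: %s\n' "${unquoted_result}"
printf 'eval:      %s\n' "${eval_result}"
```

With plain expansion the value still has the double quotes around it; with `eval` it does not. That is why escaped quotes are only appropriate when the helper really does `eval` the condition.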

stevekuznetsov · 2016-11-22T19:15:49Z

hack/dind-cluster.sh

  template="$(echo "${template}" | tr -d '\n' | sed -e 's/} \+/}/g')"
  local count
-  count="$("${oc}" --config="${kubeconfig}" get nodes \
+  count="$("${oc}" --config=${kubeconfig} get nodes \


I can't get this one to fail for me, and these quotes look fine. Revert this change, please.

marun · 2016-11-22T19:39:21Z

hack/dind-cluster.sh

  local oc
  oc="$(os::build::find-binary oc)"

+  # wait for healthz to report ok before trying to get nodes


I don't see why this change would be necessary. Consider removing.

> I don't see why this change would be necessary. Consider removing.

I think it's reasonable to wait until the API server is ready before making real requests to it. This mirrors what other e2e tests do.
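A rough sketch of that kind of readiness wait, under stated assumptions: `wait_for_ok` is a hypothetical helper (not the existing os::util::wait-for-condition), it counts attempts rather than wall-clock seconds, and it uses `eval` so that quoting inside the command string is honoured.

```shell
# wait_for_ok CMD ATTEMPTS [INTERVAL_SECONDS]
# Hypothetical helper: re-runs CMD until its stdout is exactly "ok",
# giving up after ATTEMPTS tries. Returns 0 on success, 1 on timeout.
wait_for_ok() {
  cmd="$1"
  attempts="$2"
  interval="${3:-1}"
  i=0
  while [ "${i}" -lt "${attempts}" ]; do
    # eval re-parses the string, so quotes embedded in CMD are removed.
    if [ "$(eval "${cmd}" 2>/dev/null)" = "ok" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${interval}"
  done
  return 1
}

# Intended use, following the healthz check in this PR (not run here):
#   wait_for_ok "\"${oc}\" get --config=\"${kubeconfig}\" --raw=/healthz" 120
```

Only once this returns 0 would the script go on to the oc get nodes loop.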

deads2k · 2016-11-22T20:25:37Z

and it worked. I'll squash down and eliminate the "extra" unquoting.

openshift-bot · 2016-11-22T20:37:30Z

Evaluated for origin test up to 6c6ec1a

openshift-bot · 2016-11-22T22:02:16Z

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11643/) (Base Commit: fdd204a)

ncdc · 2016-11-23T01:54:36Z

@deads2k congrats, tests passed

openshift-bot · 2016-11-23T05:13:14Z

Evaluated for origin merge up to 6c6ec1a

openshift-bot · 2016-11-23T05:13:23Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11653/) (Base Commit: 00ec8d2) (Image: devenv-rhel7_5407)

deads2k assigned fabianofranz Oct 12, 2016

deads2k added this to the 1.4.0 milestone Oct 31, 2016

danwinship reviewed Oct 31, 2016

deads2k force-pushed the fix-login-again branch from e7a7bac to fe545df Compare October 31, 2016 19:59

deads2k force-pushed the fix-login-again branch from fe545df to 4d333e7 Compare November 22, 2016 16:31

deads2k force-pushed the fix-login-again branch from bb52e9c to 5d1d22b Compare November 22, 2016 18:54

stevekuznetsov reviewed Nov 22, 2016

marun reviewed Nov 22, 2016

make login, project, and discovery work against kube with RBAC enabled

6c6ec1a

deads2k force-pushed the fix-login-again branch from 5d1d22b to 6c6ec1a Compare November 22, 2016 20:26

openshift-bot merged commit 57fcd19 into openshift:master Nov 23, 2016

deads2k deleted the fix-login-again branch February 3, 2017 17:39

MikeSpreitzer mentioned this pull request Mar 14, 2017

Napkin design: bulk-namespace access control and/or RBAC kubernetes/kubernetes#40403

Closed