make login, project, and discovery work against kube with RBAC enabled#11340
make login, project, and discovery work against kube with RBAC enabled#11340openshift-bot merged 1 commit intoopenshift:masterfrom
Conversation
|
@fabianofranz @juanvallejo ping. |
|
@fabianofranz LGTM |
|
LGTM, @deads2k merge at will. |
|
[merge] |
|
[Test]ing while waiting on the merge queue |
|
yum re[test] re[merge] |
|
re[test] |
|
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10125/ failed on #11315 |
|
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10197/ failed on #11315 re[test] re[merge] |
|
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10265/ on #11315 re[test] re[merge] |
|
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10321/ on #11315 re[test] re[merge] |
|
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10413/ on #11315 re[test] re[merge] |
|
https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10553/ on #11094 re[test] re[merge] |
|
@stevekuznetsov does networking ever work? I keep getting #11315 |
|
Unclear -- I don't have a good answer for you. |
|
@mfojtik @smarterclayton do we we want to force this one into 1.4? |
|
[merge] |
|
[merge] unless DinD is horribly broken. On Tue, Oct 25, 2016 at 5:53 PM, OpenShift Bot notifications@github.com
|
|
@openshift/networking I'm going to guess that somehow this broke the networking test. Can you help me figure out how? |
|
tagging @marun since this is dind-related If I check out that branch and do Doing ??? |
| if errors.IsNotFound(err) { | ||
| if errors.IsNotFound(err) || errors.IsForbidden(err) { | ||
| glog.V(4).Infof("Server path /oapi was not found, returning the requested group version %v", preferredGV) | ||
| return preferredGV, nil |
There was a problem hiding this comment.
This is the change that breaks the tests... Is there some other reason we could be getting a 403 here in some circumstances?
There was a problem hiding this comment.
This is the change that breaks the tests... Is there some other reason we could be getting a 403 here in some circumstances?
Are you hitting an openshift server or a kubernetes server? We allow all users (authenticated and unauthenticated) to hit our discovery endpoints. The only way I can think of to fail is to race with an initial cache priming, but that's a little crazy. You could wait for a zero exit code oc get --raw /oapi.
There was a problem hiding this comment.
Actually, it might not be that; the failure doesn't seem to be 100% reliable, so it might just be luck that it passed without that change
There was a problem hiding this comment.
And this is against an openshift server
|
Why is a race with initial cache priming crazy? On Mon, Oct 31, 2016 at 10:52 AM, David Eads notifications@github.com
|
|
Has this patch been tested against a non-dind multinode cluster? |
|
After |
|
Assuming stale cache in |
this confuses me... the cache shouldn't be storing anything in error cases, right? |
riddle me this. How does this pull produce this? |
|
It almost looks like networking is starting an openshift server that does not run the kubernetes API server, but instead tries to proxy it and somehow the proxy isn't proxying the discovery doc. That's insane though. Running an openshift API server like that won't work and I don't think this code will react differently in that case than the old code would have. |
dind runs openshift in several different modes while setting up the cluster. eg, first it runs |
Thing is, none of these are server-side changes, so the discovery information being served is identical and whatever command eventually saved the empty discovery doc, got that back from the server it queried. But, every other component that starts a master is serving "normal" discovery information from the discovery endpoints. |
|
I meant, they run before the master does, maybe they're binding to port 8443 and serving bogus data. But it looks like they don't. Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like: |
Even so, this doesn't change that behavior one way or the other. Are you guys running an |
|
@danwinship there is one handler chain for the server that serves everything, described in |
@danwinship That's a good theory. Can you link me to where the |
|
|
e7a7bac to
fe545df
Compare
|
I don't see how the master could be returning bad answers briefly at startup, because if I am deploying a dind cluster after having removed |
|
I don't think this issue is specific to dind deployment, and that merging should wait until the networking job is passing. |
I haven't proposed forcing it. I'm not familiar with the job though. It would be nice to have a smaller, faster reproducer to help track this down. Is there a flag to prevent tear down and then scripts to debug using various kubeconfigs? The differences in this job make it harder to jump in and debug. |
|
@deads2k My comment about merge was in response to the bot trying to merge after your push. I realize now that the consistent job failure means it can't succeed. |
fe545df to
4d333e7
Compare
It doesn't reproduce it for me. That command always seems to work. |
@marun can you reproduce this reliably on an AWS instance I could use to try to diagnose it. The pastebin you made doesn't seem to appear in the jenkins test run (near as I can tell) and every other master we test it against works. This really looks like its specific to networking test provisioning somehow, I can't seem to make it fail locally, and I can't figure out where the failure happens and is logged for instrumenting while running jenkins. |
bb52e9c to
5d1d22b
Compare
hack/dind-cluster.sh
Outdated
| oc="$(os::build::find-binary oc)" | ||
|
|
||
| # wait for healthz to report ok before trying to get nodes | ||
| os::util::wait-for-condition "ok" "${oc} get --config=\"${kubeconfig}\" --raw=/healthz" "120" |
There was a problem hiding this comment.
The function in question claims it uses eval but in fact does this:
while ! $(${condition}); doThis means that yes, these quotes were incorrect and the actual value of --config that is passed to oc get is "${kubeconfig}", literal quotes and all. @marun this type of thing is why I feel so strongly about not having the provision_util.sh file, since we spent a lot of time and effort getting it right the first time in os::cmd.
There was a problem hiding this comment.
provision_util.sh isn't used here. That's legacy and only used by vagrant. I'm assuming you meant images/dind/node/openshift-dind-lib.sh, which needs to be a separate file so it can be distributed in the dind image. I'm happy to have os::cmd take over responsibility for this use case, but it would have to be copied into the image file for distribution regardless.
There was a problem hiding this comment.
Ah, whatever checkout of Origin I'm in right now has the os::util::wait-for-condition function in provision-util.sh.
hack/dind-cluster.sh
Outdated
| template="$(echo "${template}" | tr -d '\n' | sed -e 's/} \+/}/g')" | ||
| local count | ||
| count="$("${oc}" --config="${kubeconfig}" get nodes \ | ||
| count="$("${oc}" --config=${kubeconfig} get nodes \ |
There was a problem hiding this comment.
I can't get this one to fail for me, and these quotes look fine. Revert this change, please.
| local oc | ||
| oc="$(os::build::find-binary oc)" | ||
|
|
||
| # wait for healthz to report ok before trying to get nodes |
There was a problem hiding this comment.
I don't see why this change would be necessary. Consider removing.
There was a problem hiding this comment.
I don't see why this change would be necessary. Consider removing.
I think its reasonable to wait until the API server is ready before making real requests to it. This mirrors what other e2e tests do.
|
and it worked. I'll squash down and eliminate the "extra" unquoting. |
5d1d22b to
6c6ec1a
Compare
|
Evaluated for origin test up to 6c6ec1a |
|
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11643/) (Base Commit: fdd204a) |
|
@deads2k congrats, tests passed |
|
Evaluated for origin merge up to 6c6ec1a |
|
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11653/) (Base Commit: 00ec8d2) (Image: devenv-rhel7_5407) |
When RBAC is enabled (see kubernetes/kubernetes#34619), we have to tolerate 403s in addition to 404s.
@pweil- you'll probably need this if you want the tooling to work nicely (and it makes a difference)