
OCPBUGS-64847: USHIFT-6489: UPSTREAM: 136508: kubelet: Fix race condition when pods are rejected#2574

Open
pacevedom wants to merge 5 commits into openshift:master from pacevedom:USHIFT-6489

Conversation

@pacevedom

@pacevedom pacevedom commented Jan 25, 2026

Fix race condition in kubelet where rejected pods could be incorrectly counted as active during resource accounting, causing spurious rejections of subsequent pods.

When a pod fails admission (e.g., insufficient resources), the following sequence occurs in the kubelet:

  1. The pod status is set to Failed, with a message prefixed to indicate rejection.
  2. The pod is not added to pod_workers (there are no containers to manage).
  3. A sync of the pod's status to the API server is scheduled.
  4. A re-sync/UPDATE operation comes back from the API server.
  5. The pod is added to pod_workers as a first sync with Pending phase (the phase the API server still reports).
  6. The pod's local status is updated to Failed and terminatingAt is set. This marks the pod as actively terminating, even though it has no containers.
  7. A new pod comes in; while it goes through admission, active pods are filtered using filterOutInactivePods.
  8. A pod is considered active if it has containers running or is actively terminating. The rejected pod qualifies because terminatingAt is set.
  9. Race condition: the new pod may be rejected because resource accounting includes a pod that has no containers.

This only happens if a new pod arrives before the status sync round-trip to the API server has finished.
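The pre-fix filter condition that this sequence defeats can be sketched with a simplified model. The pod type, fields, and helpers below are illustrative stand-ins, not the kubelet's actual code:

```go
package main

import "fmt"

// pod is a simplified stand-in for the kubelet's view of a pod.
type pod struct {
	uid                  string
	phase                string // "Pending", "Running", "Failed", ...
	terminationRequested bool   // models podWorkers having terminatingAt set
}

func isAdmittedPodTerminal(p pod) bool {
	return p.phase == "Failed" || p.phase == "Succeeded"
}

// filterOutInactivePods mirrors the pre-fix condition: a pod is dropped
// only if it is terminal AND the pod worker has not requested termination.
func filterOutInactivePods(pods []pod) []pod {
	var active []pod
	for _, p := range pods {
		if isAdmittedPodTerminal(p) && !p.terminationRequested {
			continue // treated as inactive
		}
		active = append(active, p)
	}
	return active
}

func main() {
	// A rejected pod after the API round-trip: Failed, but terminatingAt
	// was set when the UPDATE re-added it to pod_workers.
	rejected := pod{uid: "rejected", phase: "Failed", terminationRequested: true}
	running := pod{uid: "running", phase: "Running"}

	active := filterOutInactivePods([]pod{rejected, running})
	fmt.Println(len(active)) // both pods count as active: prints 2
}
```

Once terminatingAt is set, `isAdmittedPodTerminal(p) && !terminationRequested` evaluates to `true && !true = false`, so the containerless rejected pod survives the filter and is charged against node resources.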

Since rejected pods do not hold any resources, they do not require any surveillance or cleanup; therefore they should not be added to pod_workers. From the kubelet's POV:
When receiving an ADD:

  • A pod being added to the kubelet undergoes the admission process. Rejected pods are not included in pod_workers.
  • A pod being added after a kubelet restart (after a restart pod_workers is empty and needs rebuilding):
    • The pod was not yet rejected by a previous ADD operation, so its status has not been updated; it goes through admission, which rejects it again.
    • The pod was rejected by a previous ADD operation but the API server re-sync had not completed, so it has no status and goes through the regular admission process, getting rejected again.
    • The pod was rejected by a previous ADD operation and the API server was synced. The status on the pod signals it was rejected, and the logic can skip it based on that.

When receiving an UPDATE:

  • Any pod that comes through an UPDATE operation was already seen in an ADD operation. This is enforced by the Kubelet's config layer.
  • If a pod is not already present in pod_workers, it was previously rejected by the ADD operation and can be skipped.
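The ADD/UPDATE rules above can be sketched roughly as follows. The rejection-message prefix, pod type, and podWorkers map are simplified assumptions standing in for the real kubelet structures:

```go
package main

import (
	"fmt"
	"strings"
)

type pod struct {
	uid     string
	phase   string
	message string
}

// Hypothetical rejection marker; the real kubelet uses its own
// status message format for rejected pods.
const rejectionPrefix = "Pod was rejected"

// wasRejected detects a pod whose synced status marks a prior rejection.
func wasRejected(p pod) bool {
	return p.phase == "Failed" && strings.HasPrefix(p.message, rejectionPrefix)
}

type kubelet struct {
	podWorkers map[string]bool // models pod_workers membership
}

// handleAdd: run admission unless the synced status already marks the pod
// rejected (the post-restart case); rejected pods never enter pod_workers.
func (k *kubelet) handleAdd(p pod, admit func(pod) bool) {
	if wasRejected(p) || !admit(p) {
		return
	}
	k.podWorkers[p.uid] = true
}

// handleUpdate: an UPDATE always follows an ADD, so a pod absent from
// pod_workers must have been rejected and can be skipped.
func (k *kubelet) handleUpdate(p pod) {
	if !k.podWorkers[p.uid] {
		return
	}
	fmt.Println("syncing", p.uid)
}

func main() {
	k := &kubelet{podWorkers: map[string]bool{}}
	admitAll := func(pod) bool { return true }
	k.handleAdd(pod{uid: "a", phase: "Pending"}, admitAll)
	k.handleUpdate(pod{uid: "a"}) // prints "syncing a"
	k.handleUpdate(pod{uid: "b"}) // never added, skipped silently
}
```

The key invariant is the one stated above: because the config layer guarantees ADD precedes UPDATE for the same pod, membership in pod_workers at UPDATE time is a reliable proxy for "was admitted".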

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Which issue(s) this PR is related to:

Fixes kubernetes#135296

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed a bug that caused rejected pods to be considered in resource accounting when scheduling (#135296)

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@openshift-ci-robot openshift-ci-robot added backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jan 25, 2026
@openshift-ci-robot

openshift-ci-robot commented Jan 25, 2026

@pacevedom: This pull request references Jira Issue OCPBUGS-64847, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references USHIFT-6489 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.


In response to this:

Rejected pods could have their resources incorrectly counted during
admission of subsequent pods, causing spurious rejections.

The race occurred because filterOutInactivePods() used the condition:
isAdmittedPodTerminal(p) && !IsPodTerminationRequested(p.UID)

When a pod is rejected, rejectPod() sets the status in statusManager to
Failed but skips notifying podWorkers, leaving inconsistent state. The
statusManager then syncs to the API server, which sends an UPDATE
back to the kubelet. HandlePodUpdates calls podWorkers.UpdatePod() and
sets terminatingAt. If a new pod arrives after this API round-trip is
complete, IsPodTerminationRequested returns true, defeating the filter
condition (true && !true = false).

Fix by removing the !IsPodTerminationRequested check. The statusManager
is updated synchronously in rejectPod() and is the reliable source of
truth for terminal pod status during admission. The check only filtered
out pods that are terminal and whose termination the podWorker does not
yet know about. If the UPDATE from the statusManager resync is received
before a new pod is scheduled, the offending pod is considered terminal
and has terminatingAt set, failing the check and being counted as if it
were active because it has not finished yet (even though it has no
containers running).

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

@pacevedom: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci openshift-ci bot requested review from rphillips and sjenning January 25, 2026 10:49
@openshift-ci

openshift-ci bot commented Jan 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pacevedom
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pacevedom pacevedom changed the title OCPBUGS-64847 USHIFT-6489: kubelet: Fix race condition when pods are rejected OCPBUGS-64847: USHIFT-6489: UPSTREAM: 136508: kubelet: Fix race condition when pods are rejected Jan 25, 2026

@openshift-ci-robot

@pacevedom: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci-robot

@pacevedom: This pull request references Jira Issue OCPBUGS-64847, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

Rejected pods could have their resources incorrectly counted during admission of subsequent pods, causing spurious rejections.

The Race Condition
The original check in filterOutInactivePods():
isAdmittedPodTerminal(p) && !IsPodTerminationRequested(p.UID)

When a pod is rejected, rejectPod() sets the status in statusManager to Failed but skips notifying podWorkers, leaving inconsistent state (the pod is Failed but not yet cleaned up; in reality it never had any containers). The statusManager then syncs to the API server, which sends an UPDATE back to the kubelet.
HandlePodUpdates calls podWorkers.UpdatePod() and sets terminatingAt to signal cleanup. If a new pod arrives after this API round-trip is complete, IsPodTerminationRequested returns true, defeating the filter condition (true && !true = false).

Why IsPodTerminationRequested Is Wrong for Rejected Pods
IsPodTerminationRequested indicates, via terminatingAt, that the pod worker is actively terminating a pod: stopping containers and cleaning up resources. This makes sense for pods that actually started and need cleanup, but not for rejected pods, which never ran any containers.

Why Simply Removing !IsPodTerminationRequested Doesn't Work
Pods being evicted have Phase=Failed set before their containers finish stopping. This behavior is unique to evictions; all other phase transitions happen after the fact, not before.
Filtering these pods out immediately would exclude pods still consuming CPU/memory, breaking resource accounting and therefore admission.

The fix
Skip IsPodTerminationRequested for rejected pods, since we know it races, and instead check the runtime pod cache for any containers/sandboxes:
isAdmittedPodTerminal(p) && !hasRuntimeContainers(p)

The new hasRuntimeContainers() function queries podCache.Get() for actual runtime state, not the API status (which has placeholder ContainerStatus entries created by generateAPIPodStatus even for rejected pods via UPDATE operations from API server syncs).

This correctly handles both cases:

  • Rejected pods: No runtime containers (never created) -> filtered immediately
  • Evicting pods: Have runtime containers until termination completes -> kept for resource accounting
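The effect of the new condition on both cases can be sketched like this. Here podCache is modeled as a plain map of pod UID to runtime container count; this is an illustration of the idea, not the actual kubelet implementation:

```go
package main

import "fmt"

// podCache models the runtime cache: pod UID -> count of runtime
// containers/sandboxes (0 if the runtime never created any).
type podCache map[string]int

type pod struct {
	uid   string
	phase string
}

func isAdmittedPodTerminal(p pod) bool {
	return p.phase == "Failed" || p.phase == "Succeeded"
}

// hasRuntimeContainers consults runtime state rather than API status, so
// placeholder ContainerStatus entries on rejected pods do not count.
func hasRuntimeContainers(cache podCache, p pod) bool {
	return cache[p.uid] > 0
}

// filterOutInactivePods mirrors the fixed condition: drop a pod only if
// it is terminal AND the runtime holds no containers/sandboxes for it.
func filterOutInactivePods(cache podCache, pods []pod) []pod {
	var active []pod
	for _, p := range pods {
		if isAdmittedPodTerminal(p) && !hasRuntimeContainers(cache, p) {
			continue
		}
		active = append(active, p)
	}
	return active
}

func main() {
	cache := podCache{"evicting": 2} // eviction: Failed, containers still stopping
	pods := []pod{
		{uid: "rejected", phase: "Failed"}, // never created containers
		{uid: "evicting", phase: "Failed"},
	}
	for _, p := range filterOutInactivePods(cache, pods) {
		fmt.Println(p.uid) // only "evicting" remains active
	}
}
```

The rejected pod is filtered immediately (terminal, nothing in the runtime cache), while the evicting pod stays in resource accounting until its containers are actually gone.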

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Which issue(s) this PR is related to:

Fixes kubernetes#135296

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed a bug that caused rejected pods to be considered in resource accounting when scheduling (#135296)

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:



@pacevedom

/test e2e-aws-ovn-runc
/test e2e-metal-ipi-ovn-ipv6

@openshift-ci-robot

openshift-ci-robot commented Jan 30, 2026

@pacevedom: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

  • [1b7b7ff|USHIFT-6489: Fix kubelet race condition in pod admission](1b7b7ff): does not specify an upstream backport in the commit message
  • [2a126dc|USHIFT-6489: Add unit tests for rejected pods new behavior](2a126dc): does not specify an upstream backport in the commit message

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci-robot

@pacevedom: This pull request references Jira Issue OCPBUGS-64847, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@openshift-ci-robot

openshift-ci-robot commented Feb 2, 2026

@pacevedom: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci-robot

openshift-ci-robot commented Feb 3, 2026

@pacevedom: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

kubelet: Prevent rejected pods from entering pod_workers

  When a pod fails admission, it is marked as Failed but may still receive
  UPDATE operations from API server resyncs. If these updates cause the pod
  to enter pod_workers, it gets terminatingAt set and is counted as "active"
  by filterOutInactivePods until termination completes. This causes incorrect
  resource accounting during admission of subsequent pods.

  To keep rejected pods out of resource accounting, they should never enter
  pod_workers: they are in a terminal state and will not run any containers
  that require cleanup.
  There is only one way of identifying a rejected pod: the special status
  that the kubelet sets for it. This is the only distinctive feature the
  kubelet can use to filter such pods out of pod_workers when they arrive
  through an UPDATE operation.
  Since the kubelet config loop always processes a pod's ADD before its
  UPDATE, by the time the UPDATE arrives the pod must already be in
  pod_workers if it was admitted, so it is safe to assume that a pod not
  already in pod_workers was rejected.
  In the event of a kubelet restart, all pods arrive as ADD again, so the
  status message must be checked when the pod is in a terminal state
  (meaning it was rejected before the restart).

Signed-off-by: Pablo Acevedo Montserrat <pacevedo@redhat.com>
@openshift-ci-robot

@pacevedom: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci

openshift-ci bot commented Feb 5, 2026

@pacevedom: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 54fe4fa | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-crun-wasm | 54fe4fa | link | true | /test e2e-aws-crun-wasm |
| ci/prow/e2e-aws-ovn-techpreview | 54fe4fa | link | false | /test e2e-aws-ovn-techpreview |
| ci/prow/integration | 54fe4fa | link | true | /test integration |
| ci/prow/verify-commits | 54fe4fa | link | true | /test verify-commits |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kubelet admission check includes rejected pods in resource accounting, causing subsequent pods to be incorrectly rejected

2 participants