Skip to content

Add --with-heartbeat flag for heartbeat logging during long-running tasks#714

Open
ay901246 wants to merge 1 commit intocloudfoundry:mainfrom
ay901246:task-heartbeat-logger
Open

Add --with-heartbeat flag for heartbeat logging during long-running tasks#714
ay901246 wants to merge 1 commit intocloudfoundry:mainfrom
ay901246:task-heartbeat-logger

Conversation

@ay901246
Copy link

@ay901246 ay901246 commented Mar 11, 2026

Rev 4

  • Rename --poll-state -> --with-heartbeat

Usage update:

bosh run-errand smoke_tests --with-heartbeat
bosh run-errand smoke_tests --with-heartbeat=10

Rev 3

  • Removed description from tests, task response, and no longer passing it through heartbeat reporter

Revision 2 (per reviewer feedback)

Removed --poll-state from bosh task — scoped to bosh run-errand only
Removed task description from heartbeat output to avoid redundancy with the event log
Kept elapsed time for operator visibility
Heartbeat output is now minimal:

Task 185528 | 16:16:23 | Task state: processing (5s elapsed)
Files changed in this revision: cmd/cmd.go, cmd/opts/opts.go, cmd/opts/opts_test.go, ui/task/reporter.go, ui/task/reporter_test.go (net -16 lines)

Summary

Related to issue #715
Errands and other long-running BOSH tasks produce no CLI output while the Agent executes the errand script. This silence causes CI/CD systems (Concourse, GitHub Actions, Jenkins) to kill the process due to inactivity timeouts, and leaves human operators unsure whether the task is still running.

This PR adds an opt-in --poll-state flag to bosh run-errand and bosh task that prints periodic heartbeat status lines while a task is in the processing or queued state. The flag is fully backward-compatible — without it, behavior is identical to today.

Usage

Default 30s interval
bosh run-errand smoke_tests --poll-state

Custom 10s interval
bosh run-errand smoke_tests --poll-state=10

Reconnect to an in-progress task
bosh task 185528 --poll-state

Output format

Heartbeat lines match the existing BOSH event log format:
Task 185528 | 16:16:18 | Task state: queued
Task 185528 | 16:16:23 | Task state: processing (5s elapsed) (run errand hello-errand from deployment my-errand-deployment)

Architecture

  • No Director changes required: Uses existing task API fields (state, description, started_at).
  • Optional interface (HeartbeatReporter): Extends TaskReporter via safe type assertion — does not break any existing code.
  • Timestamp-based throttling: Handled in ReporterImpl — no time.Sleep in the UI layer, the WaitForCompletion loop spins at its normal rate and the reporter only prints when the interval has passed.
  • Thread-safe: Uses sync.Mutex for concurrent access to heartbeat state.
  • Output goes through boshui.UI: Ensures --json mode works correctly.

Files changed

File Change
director/interfaces.go New HeartbeatReporter interface
director/task_client_request.go Emit heartbeat on each poll cycle; added description/started_at to taskShortResp
ui/task/reporter.go EnableHeartbeat() + TaskHeartbeat() with throttling
cmd/opts/opts.go --poll-state flag on RunErrandOpts and TaskOpts
cmd/cmd.go Wire flag to enable heartbeat on reporters
cmd/session.go Expose taskReporter for run-errand path

Test plan

Unit tests (included)

  • Heartbeat not printed when disabled
  • Queued task output without elapsed time
  • Processing task with startedAt shows elapsed time
  • Processing task without startedAt (value 0) — no elapsed, no "queued" confusion
  • Throttling: multiple calls within interval produce only one output
  • WaitForCompletion emits heartbeats when reporter implements HeartbeatReporter
  • WaitForCompletion does not emit when reporter does not implement the interface
  • No heartbeats for tasks that immediately finish
  • Opts struct tags for PollState on both TaskOpts and RunErrandOpts

@aramprice aramprice moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Mar 12, 2026
@aramprice aramprice requested a review from rkoster March 12, 2026 16:06
@ay901246
Copy link
Author

Discussed with the team offline - the general use case is subscribe and watch for users, so there is likely no useful use case here for having to reconnect via tasks where the user wouldn't be a power-user who could do something else that already exists - pushed out revision 2 to reflect the removal of description (kept it in the task response, though), and --poll-state from task.

@github-project-automation github-project-automation bot moved this from Pending Review | Discussion to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Mar 12, 2026
@selzoc
Copy link
Member

selzoc commented Mar 12, 2026

Note - once this is merged, we also need to update https://github.com/cloudfoundry/docs-bosh/blob/master/content/cli-v2.md

@ay901246 ay901246 force-pushed the task-heartbeat-logger branch from 3d887a0 to ae86fc9 Compare March 12, 2026 21:25
selzoc
selzoc previously approved these changes Mar 12, 2026
@github-project-automation github-project-automation bot moved this from Waiting for Changes | Open for Contribution to Pending Merge | Prioritized in Foundational Infrastructure Working Group Mar 12, 2026
@Alphasite
Copy link
Contributor

Can we take a look at the name?

@selzoc selzoc requested a review from Alphasite March 13, 2026 22:05
…rrands

Errands produce no CLI output while the Agent executes the script.
This silence causes CI/CD systems to kill the process due to inactivity
timeouts and leaves operators unsure whether the task is still running.

Add an opt-in --with-heartbeat flag to `bosh run-errand` that prints
periodic heartbeat status lines while a task is processing or queued.

  bosh run-errand smoke_tests --with-heartbeat
  bosh run-errand smoke_tests --with-heartbeat=10

Output:
  Task 185528 | 16:16:23 | Task state: processing (5s elapsed)

No Director changes required — uses existing task API fields (state,
started_at). The HeartbeatReporter interface extends TaskReporter via
safe type assertion so no existing code is affected. Throttling is
handled in the reporter via timestamp comparison.

Made-with: Cursor
@ay901246 ay901246 force-pushed the task-heartbeat-logger branch from ae86fc9 to cab07db Compare March 13, 2026 22:07
@ay901246 ay901246 changed the title Add --poll-state flag for heartbeat logging during long-running tasks Add --with-heartbeat flag for heartbeat logging during long-running tasks Mar 13, 2026
Copy link
Contributor

@Alphasite Alphasite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some structural nits, but nothing awful. If you want to defer this to a seperate pr, feel free.

return c.sessionImpl()
}

func (c Cmd) sessionImpl() *SessionImpl {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this method here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense to change - if it's more "Go" to cast the interface to a concrete *SessionImpl (like sess, ok := c.session().(*SessionImpl)) instead of an explicit method to return the concrete ... struct? class? interface? then I think we can do this.

TaskOutputChunk(int, []byte)
}

// HeartbeatReporter is an optional extension of TaskReporter. Implementations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only have 2 implementations of this interface (the reporter and a noop reporter, and there is an EnableHeartbeat is there any reason to ad an interface here? it feels like an unnecessary abstraction.

(I tend to prefer to fewer interfaces and abstractions where i can)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily agree with this - we just had some churn around removing bosh task <id> --poll-state which shrinks the radius of this change, and if we remove this, we'd have to rely on heartbeatSeconds <= 0 being the only gate as to whether we have heartbeats or not, which kind of feels like a "side-car" driving the logic rather than explicitly saying "you're a heartbeat reporter, you mind heartbeats".


Optionally, here's how AI suggested a reply to this (in case my lack of go-understanding muddies the communication here):

I'm hesitant to merge this into the main TaskReporter interface. Since we narrowed the scope of this feature to just run-errand (dropping bosh task ), heartbeats are highly specific to a single workflow.

If we merge them, we'd have to rely on the internal heartbeatSeconds <= 0 state as the only gatekeeper for whether heartbeats happen or not. That feels a bit like a 'side-car' driving the logic. I prefer using the Go idiom of optional interfaces (like io.WriterTo extending io.Writer), which explicitly checks for structural capability rather than internal state. It also protects any future reporters (like a JSON or webhook reporter) from being forced to implement a dummy heartbeat method just to satisfy the core interface.

Let me know what you think!


Also, it feels like making a separate interface for this adheres more to the "i" in SOLID (interface segregation).

fakedir "github.com/cloudfoundry/bosh-cli/v7/director/directorfakes"
)

type heartbeatCall struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel like this could also be solved with the counterfeiter generated mock rather than building a custom one, although i do see what you're going for.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think(?) this can only be solved that way if we merge the heartbeat reporter into the task reporter?

@ay901246
Copy link
Author

Some structural nits, but nothing awful. If you want to defer this to a seperate pr, feel free.

I think if we can agree on changing point 1 to explicitly cast rather than utilize a method to return the concrete implementation, but also agree on leaving 2 and 3, then I can make these changes and we can approve/push.

If you disagree, let's hop on a quick call Monday to hash it out, in case I'm misunderstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Pending Merge | Prioritized

Development

Successfully merging this pull request may close these issues.

5 participants