Skip to content

Comments

feat: Separate Runtime Statistics Collection from UI Updates#4205

Open
kunwp1 wants to merge 10 commits intoapache:mainfrom
kunwp1:chris-introduce-new-interval-param
Open

feat: Separate Runtime Statistics Collection from UI Updates#4205
kunwp1 wants to merge 10 commits intoapache:mainfrom
kunwp1:chris-introduce-new-interval-param

Conversation

@kunwp1
Copy link
Contributor

@kunwp1 kunwp1 commented Feb 10, 2026

What changes were proposed in this PR?

This PR introduces a new configuration parameter runtime-statistics-persistence-interval to independently control the frequency of runtime statistics persistence, separate from the UI update frequency (status-update-interval). Previously, both UI updates and runtime statistics persistence were controlled by a single parameter status-update-interval. This means frequent UI updates (e.g., 500ms) caused excessive statistics writes to storage. This change allows independent control:

  • status-update-interval: Controls how often the frontend UI refreshes (default: 500ms)
  • runtime-statistics-persistence-interval: Controls how often statistics are persisted to storage (default: 2000ms)

Do two timers mean more frequent worker queries?

No. The controller tracks the timestamp of the last completed full-graph worker query and uses min(status-update-interval, runtime-statistics-persistence-interval) as a freshness threshold. When a timer fires, if the elapsed time since the last query is within this threshold, the controller forwards stats from cache without querying workers — so the faster timer drives all real worker queries and the slower timer always reuses the result. If a query is already in-flight when the second timer fires, the controller serves stats from the previous completed query's cache. Cache reuse applies to timer-triggered queries only; event-triggered queries (e.g., from worker completion events) always proceed to real worker RPCs.

Changes

  • Added runtime-statistics-persistence-interval parameter (default: 2000ms) in application.conf
  • Protobuf: Added StatisticsUpdateTarget enum (UI_ONLY, PERSISTENCE_ONLY, BOTH_UI_AND_PERSISTENCE) to QueryStatisticsRequest
  • Added RuntimeStatisticsPersist event for statistics-only updates; ExecutionStatsUpdate now handles UI-only updates
  • Added separate timer for runtime statistics persistence that runs independently from the UI update timer
  • Query Handling
    • Timer-triggered queries specify target: UI-only or persistence-only
    • Event-triggered queries (port/worker completion, pause, resume) send both UI and persistence updates to preserve original behavior
    • QueryWorkerStatisticsHandler routes to the appropriate event based on StatisticsUpdateTarget
  • Worker query deduplication in QueryWorkerStatisticsHandler: when the second timer fires, the controller checks whether worker stats were already fetched recently (within min(status-update-interval, runtime-statistics-persistence-interval)). If so, it forwards the cached stats to the appropriate sink (UI or persistence) without issuing any worker RPCs. If a query is already in-flight, cached stats from the previous completed query are forwarded.

Any related issues, documentation, discussions?

Closes #4204

How was this PR tested?

Tested with the following workflow and dataset, change the runtime-statistics-persistence-interval parameter to see if the runtime stats size reduces if we increase the parameter value.
Iris Dataset Analysis.json
Iris.csv

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude-4.6

@kunwp1 kunwp1 requested a review from Xiao-zhen-Liu February 10, 2026 20:56
@kunwp1 kunwp1 self-assigned this Feb 10, 2026
Copy link
Contributor

@Xiao-zhen-Liu Xiao-zhen-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left minor comments and some questions. Tested and can verify the size changes of the persisted runtime stats by adjusting this new parameter.

@kunwp1
Copy link
Contributor Author

kunwp1 commented Feb 20, 2026

@Xiao-zhen-Liu Can you do a one more pass of the review?

@chenlica and I discussed the design and decided to keep the two control messages independent (one for UI updates, one for persistence) to avoid coupling their intervals together.

To address the concern about increased worker query frequency, I added an optimization in QueryWorkerStatisticsHandler: when a timer fires, the controller checks whether a full-graph worker query was already completed recently (within min(status-update-interval, runtime-statistics-persistence-interval)). If so, it reuses the cached stats from WorkerExecution and forwards them to the appropriate sink (UI or persistence) without issuing any worker RPCs. This means the number of queryStatistics RPCs sent to workers does not increase compared to before. The faster timer drives all real worker queries and the slower timer always reuses the result.

Copy link
Contributor

@Xiao-zhen-Liu Xiao-zhen-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the new changes. Only have a minor clarification about the new behavior.

if (globalQueryStatsOngoing && msg.filterByWorkers.isEmpty) {
// A query is already in-flight: serve the last completed query's cached data,
// or drop silently if no prior query has finished yet.
if (lastWorkerQueryTimestampNs > 0) forwardStats(msg.updateTarget)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand the behavior of this change. Does this mean only the concurrent requests before the first globalQuery finishes will be dropped, and after the first globalQuery of a workflow finishes, all subsequent concurrent requests of an ongoing globalQuery will be served from cache? (Previously, any concurrent request will be dropped.)

forwardStats(msg.updateTarget)
// Record the completion timestamp before releasing the lock so that any timer
// firing in between sees a valid cache entry rather than triggering a redundant query.
if (globalQueryStatsOngoing) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the completion of filtered requests also trigger this lock-release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Separate Runtime Statistics Collection from UI Updates

3 participants