-
Notifications
You must be signed in to change notification settings - Fork 33
Only increment stats when the worker acknowledged the test #373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ruby/lib/minitest/queue/runner.rb
Outdated
| attributes.merge(type: type) | ||
| end.compact | ||
| File.open(queue_config.warnings_file, 'w') do |f| | ||
| File.open(queue_config.warnings_file, 'a') do |f| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call this multiple times so we need to use append.
a9c8024 to
14cf9e8
Compare
| @queue.increment_test_failed if acknowledged == 1 | ||
| if acknowledged | ||
| # if another worker already acknowledged the test, we don't need to update the global stats or increment the test failed count | ||
| record_stats(stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be enough.
Considering this scenario:
- Worker A runs test T1 (reclaim), fails, calls record_error. acknowledged = 0 → we don’t call record_stats.
- BuildStatusRecorder has already done self.failures += 1 (or self.errors += 1) for T1 in record() before calling build.record_error.
- Worker A then runs test T2, fails, calls record_error. acknowledged = 1 → we do call record_stats(stats).
stats is built from current self.failures / self.errors, which still include T1. So we send e.g. failures: 2 (T1 + T2) for worker A. - Redis ends up with worker A’s failures = 2, but only one of those (T2) was a first ack; T1 was a duplicate and was never supposed to be counted in stats.
So we over-count whenever a worker has both a duplicate ack and a later first ack: the duplicate stays in the in-memory counter and is included in the next record_stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this could be a problem. In that case you need to flip the logging around which makes the diff / refactor a lot more complicated.
Something like this https://github.com/Shopify/ci-queue/compare/cbruckmayer/fix-logging-of-tests?expand=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can continue the work on your branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably keep it simple and remove the ignored state and introduce it in a later dedicated PR.
| end | ||
|
|
||
| def record_error(id, payload, stats: nil) | ||
| acknowledged, _ = redis.pipelined do |pipeline| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code uses one pipeline: acknowledge + record_stats together. The new code goes more round trip which may cause performance regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if you want to keep it pipelined you need to inline it into the lua script which is a lot more complicated as you cannot rely on the result of the previous command in a pipeline. I don't think it's a significant performance regression so I prefer simplicity here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. We can start simple first
… duplicate ack - Redis BuildRecord#record_error: run acknowledge first, then only record_stats and increment_test_failed when SADD indicates we were the first to ack. Return boolean so callers can roll back local state. - Minitest BuildStatusRecorder: when record_error returns false (duplicate ack), decrement local errors/failures so reporter totals stay correct when the same test is run by multiple workers.
| def record_error(id, payload, stats: nil) | ||
| acknowledged, _ = redis.pipelined do |pipeline| | ||
| # Run acknowledge first so we know whether we're the first to ack | ||
| acknowledged = redis.pipelined do |pipeline| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one command, you don't need a pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. This lua script is a single round-trip even though it does several Redis calls inside.
I will remove pipeline in here.
| @queue.acknowledge(id, error: payload, pipeline: pipeline) | ||
| record_stats(stats, pipeline: pipeline) | ||
| end | ||
| end.first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is not necessary. Removing it along with pipeline.
| nil | ||
| if acknowledged | ||
| # We were the first to ack; another worker already ack'd would get falsy from SADD | ||
| redis.pipelined do |pipeline| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one command, we don't need a pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inside record_stats method (this is not lua script), it does multiple commands (hset and expire). So we can get benefit with pipeline. I prefer to keep it in here.
| @queue.increment_test_failed | ||
| end | ||
| # Return so caller can roll back local counter when not acknowledged | ||
| !!acknowledged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update the other implementations of record_error too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think so.
- lib/ci/queue/build_record.rb (base BuildRecord): Static/local queue, single process, no Redis, no distributed workers.
- lib/ci/queue/redis/grind_record.rb (GrindRecord): Grind mode (run tests many times). Different model from the main queue. It just lpushes onto an error list and updates stats. No “processed” set or SADD-based first-ack semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is: Are they used with the reporter because they all return nil now which means we would always decrement the error count.
| @queue.increment_test_failed | ||
| end | ||
| # Return so caller can roll back local counter when not acknowledged | ||
| !!acknowledged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!! is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!! does not change whether the value is “success” or “failure”. It only turns that into a real boolean.
acknowledge.lua already returns a boolean, so the !! is not a must. However, this can make sure the future change to acknowledge.lua will be safe.
- Redis BuildRecord: record_success returns !!acknowledged; record_requeue returns true - Base BuildRecord: record_error, record_success return true; add record_requeue (no-op, returns true) - BuildStatusRecorder: build pending stats, call record_* once, increment local counters only when acknowledged (avoids duplicate count for success/error) - Add tests: duplicate success/error no double-count, build record return booleans, static build returns true
|
|
||
| private | ||
|
|
||
| def stat_delta(counter, test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
Only increment error stats when the worker acknowledged the test otherwise we end up with an incorrect counter.