Add test sharding, proactive clean, and retry logic for self-hosted CI #1171

Merged
sbryngelson merged 17 commits into MFlowCode:master from sbryngelson:ci-test
Feb 28, 2026

@sbryngelson
Member

@sbryngelson sbryngelson commented Feb 19, 2026

Summary

Hardens self-hosted CI with test sharding, retry logic, and script deduplication.

Test sharding & retry

  • Add --shard i/n flag to ./mfc.sh test — splits tests via modular arithmetic for even distribution
  • Frontier GPU matrix now runs 2 shards per interface (acc/omp), halving wall-clock time
  • Zero-test guard on both --only and --shard — empty results raise an error instead of silent green CI
  • GitHub runner tests retry when at most 5 tests fail sporadically, using the UUIDs recorded in tests/failed_uuids.txt
  • Abort path cleans failed_uuids.txt to prevent stale retries
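
The modular-arithmetic split and the zero-test guard described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `parse_shard` and `select_shard` are hypothetical, the shard index is assumed to be 1-based, and a plain `RuntimeError` stands in for whatever exception the toolchain raises.

```python
from typing import List, Sequence, Tuple


def parse_shard(spec: str) -> Tuple[int, int]:
    """Parse an 'i/n' shard spec into (index, count); a 1-based index is assumed."""
    index_text, count_text = spec.split("/")
    index, count = int(index_text), int(count_text)
    if count < 1 or not 1 <= index <= count:
        raise ValueError(f"invalid shard spec: {spec!r}")
    return index, count


def select_shard(cases: Sequence[str], spec: str) -> List[str]:
    """Keep the cases whose position modulo n falls in shard i."""
    index, count = parse_shard(spec)
    selected = [
        case for position, case in enumerate(cases)
        if position % count == index - 1
    ]
    if not selected:
        # Zero-test guard: fail loudly instead of reporting a green run.
        raise RuntimeError(f"--shard {spec} selected zero test cases")
    return selected
```

Because consecutive cases land in different shards, the shard sizes differ by at most one case, and every case appears in exactly one shard.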

--only filter improvements

  • UUIDs use OR logic (match any), labels use AND logic (match all)
  • --only matching zero tests now raises an error instead of silently passing
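
A small sketch of the matching rules above, under stated assumptions: the `Case` type and `filter_only` function are hypothetical, UUIDs and labels are passed as separate arguments, and when both are given a case must satisfy both groups (the description does not say how the two combine).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Case:
    uuid: str
    labels: FrozenSet[str]


def filter_only(
    cases: Iterable[Case],
    uuids: Iterable[str] = (),
    labels: Iterable[str] = (),
) -> List[Case]:
    """UUIDs match with OR logic (any), labels with AND logic (all)."""
    wanted_uuids = set(uuids)
    wanted_labels = set(labels)
    selected = []
    for case in cases:
        # OR: the case's UUID only has to be one of the requested UUIDs.
        uuid_ok = not wanted_uuids or case.uuid in wanted_uuids
        # AND: every requested label must be present on the case.
        labels_ok = wanted_labels <= case.labels
        if uuid_ok and labels_ok:
            selected.append(case)
    if not selected:
        # Assumed behavior: an empty selection is an error, not a pass.
        raise RuntimeError("--only matched zero test cases")
    return selected
```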

CI script consolidation

  • Merge submit-bench.sh into submit.sh for all 3 clusters (frontier, frontier_amd, phoenix) — submit.sh auto-detects bench vs test mode from the submitted script's basename
  • Unify frontier/ and frontier_amd/ scripts via directory-name detection — build.sh, bench.sh, submit.sh, and test.sh are now byte-identical across both directories
  • Net deletion of 3 files and ~120 lines of duplicated shell code
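
The two detection steps above can be illustrated in Python, although the scripts in this PR are shell. The function names, the example paths, and the exact matching rule (a basename of `bench` selects bench mode, anything else selects test mode) are assumptions made for this sketch.

```python
from pathlib import Path

# Cluster directories named in the description above.
KNOWN_CLUSTERS = ("frontier", "frontier_amd", "phoenix")


def detect_job_type(script_path: str) -> str:
    """Pick 'bench' or 'test' mode from the submitted script's basename."""
    return "bench" if Path(script_path).stem == "bench" else "test"


def detect_cluster(script_path: str) -> str:
    """Pick the cluster from the name of the directory holding the script."""
    name = Path(script_path).parent.name
    if name not in KNOWN_CLUSTERS:
        raise ValueError(f"unknown cluster directory: {name!r}")
    return name
```

Keying on the directory name is what allows the same file contents to live in both frontier/ and frontier_amd/: the script itself carries no cluster-specific text.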

Other

  • Frontier test jobs use --qos=normal on batch partition (1h59m, CFD154 account)
  • --requeue on Phoenix SLURM jobs for preemption recovery
  • Build retry wrapper (3 attempts with clean between)
  • Pin nick-fields/retry to commit SHA for security on self-hosted runners
  • Lint-gate must pass before self-hosted tests run
  • Skip benchmark workflow for bot review events
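
The build retry wrapper listed above (3 attempts with a clean between them) could look roughly like this. It is a sketch only: `build` and `clean` are hypothetical callables standing in for the real build and clean commands, and the return value and exception type are choices made for the example.

```python
from typing import Callable


def build_with_retry(
    build: Callable[[], bool],
    clean: Callable[[], None],
    attempts: int = 3,
) -> int:
    """Run build() up to `attempts` times, calling clean() after each failure
    that still has a retry left. Returns the number of the successful attempt."""
    for attempt in range(1, attempts + 1):
        if build():
            return attempt
        if attempt < attempts:
            # Clear possibly stale build state before the next try.
            clean()
    raise RuntimeError(f"build failed after {attempts} attempts")
```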

Depends on: #1170

Test plan

  • Frontier GPU tests run in 2 shards per interface and complete within 2h
  • Phoenix tests pass with --requeue and preemption recovery
  • Lint-gate blocks self-hosted tests on lint failure
  • GitHub runner retry logic fires on ≤5 test failures
  • Benchmark jobs submit correctly via merged submit.sh (bench mode auto-detected)
  • frontier/ and frontier_amd/ scripts are identical and detect cluster correctly
  • --shard with zero resulting tests raises an error (not silent pass)
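
The retry condition in the test plan (retry fires on ≤5 test failures) can be expressed as a small decision function. This sketch assumes tests/failed_uuids.txt holds one UUID per line, which the description does not state; the function name `uuids_to_retry` is hypothetical.

```python
from pathlib import Path
from typing import List

MAX_RETRY_FAILURES = 5  # threshold taken from the description above


def uuids_to_retry(failed_file: Path, limit: int = MAX_RETRY_FAILURES) -> List[str]:
    """Return the UUIDs worth re-running, or [] when no retry should happen.

    No retry happens when the file is missing, when it lists no UUIDs, or
    when more than `limit` tests failed (treated as a real breakage rather
    than sporadic failures).
    """
    if not failed_file.exists():
        return []
    uuids = [
        line.strip()
        for line in failed_file.read_text().splitlines()
        if line.strip()
    ]
    if len(uuids) > limit:
        return []
    return uuids
```

The returned UUIDs could then be handed to --only, which per the description matches UUIDs with OR logic.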

Copilot AI review requested due to automatic review settings February 19, 2026 20:01
@codeant-ai codeant-ai bot added the size:M label (This PR changes 30-99 lines, ignoring generated files) Feb 19, 2026

@codeant-ai codeant-ai bot added the size:L label (This PR changes 100-499 lines, ignoring generated files) and removed the size:M label (This PR changes 30-99 lines, ignoring generated files) Feb 20, 2026
@sbryngelson sbryngelson marked this pull request as draft February 25, 2026 01:04
submit.sh now auto-detects job type (bench vs test) from the submitted
script's basename, selecting the appropriate SBATCH account, time limit,
and partition. This eliminates three submit-bench.sh files and makes
frontier/ and frontier_amd/ scripts byte-identical via directory-name
detection for compiler flags and cluster-specific options.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sbryngelson and others added 2 commits February 26, 2026 09:40
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise MFCException when --shard produces zero cases (prevents
  silent green CI with nothing executed)
- Pin nick-fields/retry to commit SHA for security on self-hosted
  runners with cluster credentials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson marked this pull request as ready for review February 26, 2026 22:07
sbryngelson and others added 2 commits February 26, 2026 18:22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace per-case case-optimized builds with one generic build, reducing
build time from ~34 min to ~5-10 min. Halve benchmark timesteps to
compensate for slower non-optimized runtime. Reduce GPU --mem from 12
to 4 GB. Lower test build retry timeout from 480 to 60 minutes.

Closes MFlowCode#1275

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson merged commit 28fc258 into MFlowCode:master Feb 28, 2026
66 of 81 checks passed

codecov bot commented Feb 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.04%. Comparing base (1412eb2) to head (73fd804).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1171      +/-   ##
==========================================
- Coverage   44.05%   44.04%   -0.02%     
==========================================
  Files          70       70              
  Lines       20496    20499       +3     
  Branches     1991     1993       +2     
==========================================
- Hits         9029     9028       -1     
- Misses      10328    10330       +2     
- Partials     1139     1141       +2     


Labels

size:L (This PR changes 100-499 lines, ignoring generated files)

2 participants