Claude finished @alpe's task.

Comprehensive Code Review: HA Failover with Raft

I've completed a detailed review of the HA failover implementation. Below are my findings organized by focus area.

1. Code Quality & Architecture
// failoverState collects the components to reset when switching modes.
type failoverState struct {
This type contains all the stateful services that are reset when switching from sync to aggregator mode.
// optional Raft node details
if raftNode != nil {
A quick solution to add some maintenance functions for the Raft cluster. Proto probably makes more sense.
@@ -0,0 +1,17 @@
package raft

// todo: refactor to use proto
RaftBlockState is currently serialized and deserialized via JSON for transport over Raft.
} else if !isLeader && isCurrentlyLeader { // lost leadership
	workerCancel()
	n.logger.Info().Msg("lost leadership")
	return ErrLeadershipLost
The node is shut down in the downstream code to ensure a clean state.
logfile, err := os.Create(logfileName)
require.NoError(s.t, err)
errReader = io.NopCloser(io.TeeReader(errReader, logfile))
outReader = io.NopCloser(io.TeeReader(outReader, logfile))
I am storing logs to help track down issues. This works with non-temporary work dirs only.
|
Closing in favour of #2814
🚧 WIP - early version
E2E test: test/e2e/failover_e2e_test.go
Overview