
Simulating Leader Election on top of FoundationDB


People are right to fear LLM-generated code. The models hallucinate APIs, miss edge cases, and produce code that looks correct but fails under pressure. For distributed systems, where bugs hide behind race conditions and network partitions, the stakes are even higher. A subtle leader election bug can cause split-brain, data corruption, or cascading failures across your cluster.

But what if you could give an LLM the right feedback loop? Not just "write me leader election" but "write me leader election, and here's how I'll prove it correct." That feedback loop is simulation. I decided to test this idea by building a leader election recipe for foundationdb-rs, with invariants generated by Claude Code.

🔗The Algorithm

The recipe uses a ballot-based approach, similar to Raft's term concept but simpler. The algorithm is based on "Leader Election Using NewSQL Database Systems" (Ismail et al., DAIS 2015), which uses a database as distributed shared memory. Our implementation differs by leveraging FDB's strictly serializable transactions and versionstamps instead of timestamps. Leaders hold time-bounded leases. Ballots are monotonically increasing fencing tokens that prevent stale leaders from corrupting state. FDB's serializable transactions handle mutual exclusion without explicit locking.

The key insight is storing the leader state at a single key, making all operations O(1) instead of O(N) candidate scanning. Most leader election implementations require scanning all candidates to determine who's in charge. With N candidates, that's N reads per query. Our approach stores the current leader explicitly, so checking who's leader is a single read. Claiming leadership is a single write with a conflict check. No quorum is needed because FDB already provides the coordination guarantees we need through its serializable transactions.

```mermaid
sequenceDiagram
    participant A as Process A
    participant FDB as FoundationDB
    participant B as Process B

    A->>FDB: read leader key
    FDB-->>A: ballot=5, lease expired
    B->>FDB: read leader key
    FDB-->>B: ballot=5, lease expired

    A->>FDB: write ballot=6
    B->>FDB: write ballot=6

    FDB-->>A: COMMIT OK
    FDB-->>B: CONFLICT!

    Note over A: Becomes Leader
    Note over B: Retries, sees ballot=6, backs off
```

The diagram shows what happens when two processes try to claim leadership simultaneously. Both read the same expired leader state. Both try to write ballot=6. FDB's serializable transactions guarantee only one succeeds. The loser gets a conflict error, retries, sees ballot=6 with an active lease, and backs off. No split-brain possible. This is the fundamental safety property that the entire algorithm rests on.
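
To make that concrete, here is a minimal sketch of the claim path in Rust against the foundationdb crate's basic transaction API. The key name, the LeaderState layout, and the encoding are illustrative stand-ins rather than the recipe's actual code, and error handling is reduced to "conflict means back off".

```rust
use foundationdb::{Database, FdbError};

// Illustrative single leader key; the real recipe defines its own subspace layout.
const LEADER_KEY: &[u8] = b"election/leader";

struct LeaderState {
    ballot: u64,             // fencing token, strictly increasing
    lease_expiry_nanos: u64, // wall-clock lease deadline
    holder: Vec<u8>,         // candidate identity (its registration versionstamp)
}

fn encode(state: &LeaderState) -> Vec<u8> {
    let mut out = Vec::with_capacity(16 + state.holder.len());
    out.extend_from_slice(&state.ballot.to_be_bytes());
    out.extend_from_slice(&state.lease_expiry_nanos.to_be_bytes());
    out.extend_from_slice(&state.holder);
    out
}

fn decode(raw: &[u8]) -> LeaderState {
    LeaderState {
        ballot: u64::from_be_bytes(raw[0..8].try_into().unwrap()),
        lease_expiry_nanos: u64::from_be_bytes(raw[8..16].try_into().unwrap()),
        holder: raw[16..].to_vec(),
    }
}

/// One claim attempt in a single transaction. `Ok(Some(ballot))` means we hold
/// the lease with that fencing token; `Ok(None)` means an active lease or a
/// conflicting claim won, so the caller backs off and retries later.
async fn try_become_leader(
    db: &Database,
    my_id: &[u8],
    now_nanos: u64,
    lease_nanos: u64,
) -> Result<Option<u64>, FdbError> {
    let trx = db.create_trx()?;
    // O(1): the leader lives at one key, no candidate scan needed.
    let current = trx.get(LEADER_KEY, false).await?;
    let next_ballot = match current.as_deref().map(decode) {
        // Someone else holds an unexpired lease: back off.
        Some(s) if s.lease_expiry_nanos > now_nanos && s.holder.as_slice() != my_id => {
            return Ok(None)
        }
        // Expired lease (or our own): fence it out with a strictly higher ballot.
        Some(s) => s.ballot + 1,
        // No leader recorded yet.
        None => 1,
    };
    trx.set(
        LEADER_KEY,
        &encode(&LeaderState {
            ballot: next_ballot,
            lease_expiry_nanos: now_nanos + lease_nanos,
            holder: my_id.to_vec(),
        }),
    );
    // If another candidate wrote the leader key after our read, FDB rejects this
    // commit with a conflict: exactly one claimant wins the ballot.
    match trx.commit().await {
        Ok(_) => Ok(Some(next_ballot)),
        Err(_conflict) => Ok(None),
    }
}
```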

When a process wants to become leader, it first registers as a candidate. FDB assigns a versionstamp at commit time using the SetVersionstampedValue atomic operation. This versionstamp is assigned by FDB itself when the transaction commits, providing a globally-ordered identity that no other process can have. The versionstamp never changes throughout the process's lifetime, even as the process sends heartbeats and refreshes its registration. This immutability is important: it means the process's identity is stable and can be used reliably in log replay during verification.
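
Here's what that registration could look like, assuming the foundationdb crate's atomic_op API; the candidate key layout and helper names are made up for illustration.

```rust
use foundationdb::options::MutationType;
use foundationdb::Transaction;

// Illustrative candidate key; the recipe uses its own subspace layout.
fn candidate_key(client_id: u16) -> Vec<u8> {
    let mut key = b"election/candidates/".to_vec();
    key.extend_from_slice(&client_id.to_be_bytes());
    key
}

/// Register a candidate and let FDB stamp its identity at commit time.
/// With SET_VERSIONSTAMPED_VALUE (API version 520+), the parameter ends with a
/// 4-byte little-endian offset pointing at the 10-byte placeholder that FDB
/// overwrites with the commit versionstamp.
fn register_candidate(trx: &Transaction, client_id: u16) {
    let mut value = vec![0u8; 10];                // placeholder for the versionstamp
    value.extend_from_slice(&0u32.to_le_bytes()); // the versionstamp starts at offset 0
    trx.atomic_op(
        &candidate_key(client_id),
        &value,
        MutationType::SetVersionstampedValue,
    );
}
```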

Then the process enters a loop: send heartbeats to prove liveness, try to claim leadership if the current lease expired, and optionally resign if graceful handoff is needed. Each leadership claim increments the ballot number. The new leader stores its versionstamp in the leader state, so followers can verify they're talking to the actual leader. The full implementation lives in the foundationdb-rs recipes module.
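
The loop itself is straightforward. Here's a sketch written against a hypothetical LeaderElection trait rather than the recipe's real types, just to show the shape of the driver:

```rust
use std::error::Error;
use std::time::Duration;

/// Hypothetical interface over the recipe's operations; this trait exists only
/// to sketch the candidate loop.
#[allow(async_fn_in_trait)]
trait LeaderElection {
    async fn heartbeat(&self) -> Result<(), Box<dyn Error>>;
    /// Returns the current ballot if this process holds (or just claimed) the lease.
    async fn try_become_leader(&self) -> Result<Option<u64>, Box<dyn Error>>;
    async fn resign(&self) -> Result<(), Box<dyn Error>>;
}

/// Drive one candidate: heartbeat, try to claim when the lease looks expired,
/// and occasionally resign (the simulation workload does so 10% of the time).
async fn run_candidate<E: LeaderElection>(
    election: &E,
    mut should_resign: impl FnMut() -> bool,
) -> Result<(), Box<dyn Error>> {
    loop {
        election.heartbeat().await?; // refresh registration, prove liveness
        if let Some(ballot) = election.try_become_leader().await? {
            // While leading, every downstream write should carry `ballot` as a
            // fencing token so a stale leader's writes get rejected.
            let _fencing_token = ballot;
            if should_resign() {
                election.resign().await?; // graceful handoff
            }
        }
        tokio::time::sleep(Duration::from_millis(250)).await;
    }
}
```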

🔗The Simulation Workload

Simulation on top of FoundationDB may feel redundant. FDB itself has been validated through a trillion CPU-hours of simulation. But FDB's resiliency doesn't mean your code is resilient. Your layer sits on top of FDB. Your transaction logic, your conflict handling, your retry behavior. How does your code respond when FDB is unhealthy?

Simulation introduces chaos to answer that question. Network partitions, process crashes, clock skew up to ±1 second. Same seed, same execution, same bugs. Deterministic replay.

| Phase | Who         | What                                             |
|-------|-------------|--------------------------------------------------|
| Setup | Client 0    | Initialize election, register all candidates     |
| Chaos | All clients | Loop: heartbeat → try claim → maybe resign (10%) |
| Check | Client 0    | Read logs, snapshot state, run 13 invariants     |

FDB's simulator runs built-in chaos workloads alongside ours: RandomClogging injects network partitions, Attrition kills and reboots processes. Our workload adds clock skew simulation up to ±1 second.
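
The clock skew part can be as simple as giving each simulated client a fixed offset applied to every timestamp it reads. A minimal sketch, assuming the offset is drawn from the simulation's deterministic RNG at startup:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical skewed clock: each simulated client gets a fixed offset in
/// [-1s, +1s], so lease arithmetic is exercised under disagreeing clocks.
struct SkewedClock {
    offset_nanos: i64,
}

impl SkewedClock {
    fn new(offset_nanos: i64) -> Self {
        assert!(offset_nanos.abs() <= 1_000_000_000, "skew capped at ±1 second");
        Self { offset_nanos }
    }

    /// Wall clock as seen by this client, in nanoseconds since the epoch.
    fn now_nanos(&self) -> u64 {
        let real = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock before epoch")
            .as_nanos() as i64;
        (real + self.offset_nanos) as u64
    }
}
```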

Each operation logs its intent atomically, using the pattern from FDB's own AtomicOps workload. The log entry is written with the SetVersionstampedKey atomic operation in the same transaction as the leader election mutation itself. If the transaction commits, both succeed. If it aborts, neither does. This gives us a proper write-ahead log without the complexity of two-phase commit. The log key uses a versionstamp prefix for true FDB commit ordering. No clock skew ambiguity.

🔗How Atomic Logging Works

Each operation writes a log entry in the same transaction as the operation itself. The log key uses a versionstamp prefix, so entries sort by true FDB commit order, not by wall clock (which can drift ±1 second under simulation).

Log Key Structure:
┌─────────────────────┬───────────┬────────┐
│ versionstamp (10B)  │ client_id │ op_num │
└─────────────────────┴───────────┴────────┘
   Assigned by FDB at commit time
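
Here's a sketch of how such a key can be built with the SetVersionstampedKey mutation, assuming the foundationdb crate's atomic_op API; the prefix and field widths are illustrative, not the workload's exact layout.

```rust
use foundationdb::options::MutationType;
use foundationdb::Transaction;

// Illustrative log prefix; the workload uses its own subspace.
const LOG_PREFIX: &[u8] = b"election/log/";

/// Append a log entry in the same transaction as the mutation it describes.
/// Key layout: prefix | versionstamp placeholder (10B) | client_id | op_num,
/// followed by a 4-byte little-endian offset pointing at the placeholder, as
/// SET_VERSIONSTAMPED_KEY expects (API version 520+).
fn log_operation(trx: &Transaction, client_id: u16, op_num: u32, entry: &[u8]) {
    let mut key = LOG_PREFIX.to_vec();
    let stamp_offset = key.len() as u32; // the versionstamp goes right after the prefix
    key.extend_from_slice(&[0u8; 10]);   // placeholder filled in by FDB at commit
    key.extend_from_slice(&client_id.to_be_bytes());
    key.extend_from_slice(&op_num.to_be_bytes());
    key.extend_from_slice(&stamp_offset.to_le_bytes());
    trx.atomic_op(&key, entry, MutationType::SetVersionstampedKey);
    // The leader-election mutation itself goes in the same `trx`, so the log
    // entry and the state change commit (or abort) together.
}
```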

Here's what a typical log might look like after chaos:

| Versionstamp | Client | Op              | Result                    |
|--------------|--------|-----------------|---------------------------|
| 0x0001...    | 0      | Register        |                           |
| 0x0002...    | 1      | Register        |                           |
| 0x0003...    | 2      | Register        |                           |
| 0x0004...    | 0      | TryBecomeLeader | ✓ became leader, ballot=1 |
| 0x0005...    | 1      | TryBecomeLeader | ✗ conflict                |
| 0x0006...    | 0      | Heartbeat       |                           |
| 0x0007...    | 2      | TryBecomeLeader | ✓ became leader, ballot=2 |
| 0x0008...    | 2      | Resign          |                           |

Client 0 became leader with ballot 1. Client 1's claim failed (conflict). Client 2 later claimed ballot 2 when client 0's lease expired, then resigned. The versionstamp ordering is ground truth: no matter how skewed each client's clock was, the log shows exactly what committed and in what order.

During check, client 0 reads all logs and database state in a single snapshot, then runs the invariants. Same seed, same execution, same bugs. Deterministic replay.

🔗The Invariants

Here's where Claude Code enters the story. I asked it to generate invariants for leader election validation. I didn't give it a detailed specification. I pointed it at my previous post about designing workloads that find bugs and said "apply these patterns to leader election."

It generated 13 invariants. Here are the seven most important ones:

DualPathValidation
- What: the expected state from log replay must match the actual FDB snapshot.
- Why: the keystone check. If the logs say process A is leader with ballot 7 but FDB shows process B with ballot 6, something corrupted state.
- How: replay all logged operations in versionstamp order to compute the expected leader and ballot, then compare with the actual database state.

FencingTokenMonotonicity
- What: each successful claim must have a strictly higher ballot than the previous one.
- Why: when a network partition heals, an old leader might try to act with a stale ballot. This catches that stale write.
- How: for each claim in the log, verify ballot > previous_ballot from the same transaction.

OneValuePerBallot
- What: each ballot number maps to exactly one client.
- Why: two clients claiming the same ballot means either conflict detection failed or the ballot increment is broken. A classic split-brain symptom.
- How: scan all claims and verify that no two different clients ever claimed the same ballot number.

LeaderIsCandidate
- What: the current leader must exist in the candidates registry.
- Why: edge cases like crash-during-registration or eviction-while-claiming can leave orphaned leaders.
- How: read the leader state and verify that a matching candidate entry exists.

NoOverlappingLeadership
- What: every leadership claim has a globally unique versionstamp.
- Why: FDB serializes commits globally. Duplicate versionstamps mean either a logging bug or actual split-brain.
- How: collect all claim versionstamps and verify there are no duplicates.

GlobalBallotSuccession
- What: each new leader must have a ballot greater than the previous leader's ballot.
- Why: catches state regression after a partition heals. An old leader can't "go back" to a stale ballot.
- How: track previous_ballot in log entries and verify new_ballot > previous_ballot for every transition.

LeaseExpiryAfterClaim
- What: the lease must expire after the claim timestamp.
- Why: clock skew can cause a leader to claim with an already-expired lease. This catches an incorrect lease calculation or extreme clock drift.
- How: for each claim, verify lease_expiry_nanos > claim_timestamp_nanos.
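
To give a flavor of what these checks look like, here is a sketch of DualPathValidation's replay step and of FencingTokenMonotonicity, written over an already-decoded, versionstamp-ordered log; the types are simplified stand-ins for the workload's real records.

```rust
/// Simplified log entry, already sorted by versionstamp (a range read over the
/// log subspace returns entries in commit order).
enum LoggedOp {
    Claim { client_id: u16, ballot: u64 },
    Resign { client_id: u16 },
}

/// DualPathValidation, replay half: fold the log into the expected leader and
/// ballot, then compare the result with the leader key read from the snapshot.
fn replay_expected(log: &[LoggedOp]) -> (Option<u16>, u64) {
    let (mut leader, mut ballot) = (None, 0u64);
    for op in log {
        match *op {
            LoggedOp::Claim { client_id, ballot: b } => {
                leader = Some(client_id);
                ballot = b;
            }
            // A resign clears the leader but never rolls the ballot back.
            LoggedOp::Resign { client_id } if leader == Some(client_id) => leader = None,
            _ => {}
        }
    }
    (leader, ballot)
}

/// FencingTokenMonotonicity: successful claims must carry strictly increasing
/// ballots in commit order.
fn check_fencing_monotonicity(log: &[LoggedOp]) -> Result<(), String> {
    let mut prev: Option<u64> = None;
    for op in log {
        if let LoggedOp::Claim { client_id, ballot } = *op {
            if prev.map_or(false, |p| ballot <= p) {
                return Err(format!(
                    "ballot regression: client {} claimed {} after {}",
                    client_id,
                    ballot,
                    prev.unwrap()
                ));
            }
            prev = Some(ballot);
        }
    }
    Ok(())
}
```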

🔗What's Next

The leader election recipe has just been merged into foundationdb-rs as experimental. I haven't run it in production yet. The entire development was simulation-driven: write code, run simulation, fix what breaks, repeat. The simulation runs in CI on every PR.

Simulation gives us strong evidence that the code behaves correctly under the failure modes we model, even when FDB is tortured. Network partitions, clock skew, process crashes. The invariants verify that safety properties hold through all of it. Not a proof of correctness, but confidence earned through chaos.

The invariants that Claude Code generated encode patterns from prior simulation work: dual-path validation from AtomicOps, fencing tokens from the literature, lease checks tied to clock skew. What the LLM provided was speed: weeks of invariant development compressed into hours of review. The LLM proposes, simulation disposes.

If you have ideas for new tortures or invariants, the simulation code is open. Merge requests welcome.


Feel free to reach out with questions or to share your experiences with leader election and simulation testing. You can find me on Twitter, Bluesky or through my website.

Tags: #foundationdb #rust #simulation #distributed-systems #consensus #leader-election