Leader Election

Introduction

You're designing a distributed job scheduler, and the interviewer asks: "What happens when the scheduler crashes? Who picks up its work?"

You say: "We'll run two instances." Great. But now both instances wake up at the same second, poll the database, and fire the same job twice. One user gets two password reset emails. Another gets double-charged.

The problem isn't redundancy. It's coordination. When you have multiple instances of a service, someone needs to be in charge. That's what leader election solves.

What Is Leader Election?

Leader election is a protocol where multiple nodes in a distributed system agree on exactly one node to act as the "leader." The leader performs coordination tasks that must happen on a single node: writing to a shared resource, assigning work, or polling a database for due jobs.

The other nodes are followers (or standbys). If the leader dies, the followers detect the failure and elect a new leader. The system continues without manual intervention.

Node A (Leader)  ← does the work
Node B (Follower) ← watches, ready to take over
Node C (Follower) ← watches, ready to take over

The key guarantee: at any point in time, at most one node believes it is the leader. Violating this — called "split brain" — leads to duplicate processing, data corruption, or worse.

Why Leader Election Matters

Leader election shows up in almost every distributed systems interview, even when it's not the main question. Here's where you'll need it:

Job Schedulers & Cron Systems

Only one node should poll the database for due jobs. Without a leader, every node polls and dispatches the same jobs, causing duplicate execution. For more on this, see Cron Architecture.

Database Replication

In primary-replica setups (PostgreSQL, MySQL), the primary accepts writes. If it fails, a replica must be promoted. That's leader election.

Distributed Locks

When multiple services need exclusive access to a resource (a file, a queue partition, a configuration update), the lock holder is effectively the leader.

Consensus Systems

Kafka uses leader election to assign partition leaders. etcd and ZooKeeper use it internally for their own replication. It's leaders all the way down.

Common Algorithms

The Bully Algorithm

The simplest leader election algorithm. Every node has a numeric ID. The node with the highest ID wins.

1. Node detects leader is dead
2. Node sends "election" message to all nodes with higher IDs
3. If no higher node responds → this node becomes leader
4. If a higher node responds → that node takes over the election
5. The highest living node broadcasts "I am leader"

Pros: Simple to understand and implement. Cons: Assumes reliable failure detection. Not partition-tolerant. Rarely used in production.

Raft Consensus

Raft is the most popular consensus algorithm in modern systems. etcd (used by Kubernetes) and Consul both use Raft.

1. Nodes start as "followers"
2. If a follower doesn't hear from the leader within a timeout, it becomes a "candidate"
3. Candidate requests votes from all other nodes
4. If it gets a majority of votes → it becomes the new leader
5. Leader sends periodic heartbeats to maintain authority

Key property: Raft guarantees that only one leader exists per "term" (election round). A leader must have a majority, so even during a network partition, at most one partition can elect a leader.

Why Raft matters in interviews: It's behind etcd, Consul, CockroachDB, and TiKV. When you say "we'll use etcd for leader election," you're implicitly saying "we'll use Raft."

Paxos

Paxos is the original consensus algorithm. It predates Raft and solves the same problem, but it's notoriously hard to understand and implement correctly.

In interviews: You almost never need to explain Paxos. Mentioning that Raft was designed as a more understandable alternative to Paxos is sufficient. If the interviewer asks, say "Paxos solves the same consensus problem but Raft is preferred in practice because it's easier to implement correctly."

ZooKeeper: The Classic Approach

Apache ZooKeeper is a coordination service that makes leader election easy. It's used by Kafka (older versions), Hadoop, and many Java-based distributed systems.

How it works:

1. Each node creates an "ephemeral sequential" znode (a temporary, auto-numbered file)
   /election/node-0001
   /election/node-0002
   /election/node-0003

2. The node with the lowest number is the leader
   node-0001 → Leader

3. Each non-leader node watches the node just before it
   node-0003 watches node-0002
   node-0002 watches node-0001

4. If the leader (node-0001) dies, ZooKeeper deletes its ephemeral znode
   node-0002 gets notified → becomes leader

5. If node-0002 also dies, node-0003 gets notified → becomes leader

Why ephemeral nodes? If a node crashes or loses its connection to ZooKeeper, the ephemeral znode is automatically deleted. No manual cleanup needed. The system self-heals.

Pros: Battle-tested, strong consistency guarantees, built-in failure detection via session timeouts. Cons: Requires running a ZooKeeper cluster (3-5 nodes), adds operational complexity, Java-based.

etcd and Raft: The Modern Approach

etcd is a distributed key-value store that uses Raft internally. It's the backbone of Kubernetes and the modern alternative to ZooKeeper.

How leader election works with etcd:

1. Node acquires a lease (a time-limited lock):
   etcdctl lease grant 10   → returns lease ID (TTL: 10 seconds)

2. Node writes a key with the lease attached:
   etcdctl put /leader "node-A" --lease=<lease-id>

3. Only the first writer succeeds (compare-and-swap).
   Node A writes successfully → Node A is leader
   Node B tries to write → fails (key exists) → Node B is follower

4. Leader must refresh the lease before it expires:
   etcdctl lease keep-alive <lease-id>   (every few seconds)

5. If the leader crashes, the lease expires after 10 seconds
   Key /leader is deleted
   Followers retry the write → one wins → new leader

Pros: Simpler than ZooKeeper, Go-based (easier to deploy), Kubernetes-native, excellent documentation. Cons: Still requires a 3-5 node etcd cluster. Lease TTL adds a delay before failover (e.g., 10 seconds of no leader).

Redis Lease-Based Election

For simpler systems that don't need the full guarantees of Raft, Redis offers a lightweight approach.

How it works:

1. Node tries to set a key with NX (only if not exists) and EX (TTL):
   SET leader "node-A" NX EX 10

2. If SET returns OK → this node is leader
   If SET returns nil → another node is leader

3. Leader refreshes the key before expiry:
   SET leader "node-A" XX EX 10   (XX = only if exists)

4. If leader crashes, key expires after 10 seconds
   Next node to call SET NX wins

Pros: Dead simple, no extra infrastructure if you already run Redis, sub-second operations. Cons: Redis is not a consensus system. In rare edge cases (Redis failover, clock skew), two nodes may briefly believe they're both leader. Not suitable when split-brain would cause data loss.

When to use Redis: When duplicate work is tolerable (idempotent jobs, cache warming) and you want simplicity over correctness guarantees.

Comparison Table

Approach	Consistency	Failover Time	Complexity	Best For
ZooKeeper	Strong (ZAB)	~2-5 seconds	High (Java cluster)	Legacy systems, Kafka
etcd	Strong (Raft)	~5-15 seconds	Medium (Go binary)	Kubernetes, modern stacks
Redis	Weak (no consensus)	~10 seconds	Low (existing Redis)	Simple coordination, idempotent tasks
Raft (direct)	Strong	~1-5 seconds	Very high (custom impl)	Building your own database

Interview default: "We'll use etcd for leader election." It's modern, well-understood, and Kubernetes-native. If the system already uses ZooKeeper (e.g., older Kafka), mention that instead.

Common Interview Mistakes

Mistake 1: Ignoring split brain

"We'll run two instances and one will be the leader."

Problem: How? If they can't communicate (network partition), both think the other is dead. Both become leader. Both process the same jobs.

Better: Explain that leader election requires a quorum (majority agreement). With etcd or ZooKeeper, the leader must maintain a lease. If it can't reach the coordination service, it steps down. This prevents split brain.

Mistake 2: Building your own leader election

"I'll use a database row with a timestamp. Each node tries to UPDATE WHERE timestamp is old."

Problem: Database-based leader election is full of race conditions. Clock skew, long GC pauses, and network delays can all cause two nodes to grab the lock simultaneously. You're reimplementing consensus badly.

Better: Use a purpose-built coordination service (etcd, ZooKeeper). They've solved these edge cases already. In an interview, saying "I'd use etcd" is the right level of abstraction.

Mistake 3: Forgetting the failover delay

"If the leader dies, a new one takes over instantly."

Problem: Failover is never instant. The lease must expire (seconds), followers must detect the failure, and a new election must complete. During this window, no leader exists and work stalls.

Better: Acknowledge the failover gap. Discuss the trade-off: shorter lease TTLs mean faster failover but more network traffic for keep-alives. Typical values are 5-15 seconds.

Mistake 4: Using leader election when you don't need it

"Every service in my design uses leader election."

Problem: Leader election adds complexity and a single point of coordination. If your workers are stateless and idempotent, you don't need a leader. Just let all workers pull from a message queue independently.

Better: Only use leader election when exactly-one semantics matter: database polling, cron scheduling, partition assignment. For parallel task processing, a queue with competing consumers is simpler and scales better.

Summary: What to Remember

Leader election ensures exactly one node performs a coordinated task at any time
etcd (Raft) is the modern default; ZooKeeper is the classic choice; Redis works for simple cases
Split brain (two leaders) is the main failure mode to discuss. Quorum-based systems prevent it
Failover is never instant. Lease TTLs create a gap of seconds between leader failure and recovery
Don't use leader election when a message queue with competing consumers would suffice
Leader election is a building block, not the whole solution. Combine it with queues, databases, and retries

Interview golden rule:

Don't just say "we'll have a leader." Explain how it's elected (etcd lease, ZooKeeper ephemeral node), what happens when it dies (lease expires, new election), and what happens during the gap (work stalls briefly, then resumes).