CAP Theorem

Understanding consistency, availability, and partition tolerance in distributed systems

S
System Design Sandbox··12 min read
Learn the CAP theorem and how to make trade-offs between consistency, availability, and partition tolerance in distributed systems. Essential knowledge for system design interviews.

Introduction

Okay, so you're in your system design interview, and you just finished explaining your beautiful distributed architecture. You've got multiple data centers, load balancers, the whole nine yards. You're feeling pretty good.

Then the interviewer drops this bomb: "So what happens when the network between your data centers goes down? How does your system handle that?"

And now you're sweating because you know there's a trade-off here, but you can't quite articulate it. You've heard of the CAP Theorem, but what does it actually mean in practice?

Here's how it actually works in practice.


The Core Idea: You Can't Have Everything

The CAP Theorem is brutal in its simplicity:

When network failures happen, you MUST choose between:

  • Returning the correct data, or
  • Returning any data at all

You cannot have both. This isn't a tooling problem or an implementation issue. It's physics.

The Three Properties

C = Consistency Every read gets the most recent write. All nodes show the same data at the same time.

A = Availability Every request gets a response (not an error). The system always answers.

P = Partition Tolerance The system keeps working even when network messages are lost or delayed.

The theorem:

You can only guarantee TWO out of three when network partitions occur

Breaking Down Each Property

Consistency (C)

What it means: After you write data, EVERY subsequent read, no matter which server handles it, returns that new value.

In other words:

  • Single source of truth
  • No stale reads allowed
  • Some requests might fail to maintain correctness

When you need it:

  • Banking and payments
  • Inventory management
  • Anything where wrong data = real money lost

Example: You transfer $100 from checking to savings. If consistency is guaranteed, you'll never see a moment where the $100 disappeared from checking but hasn't appeared in savings yet.

Availability (A)

What it means: Every request gets a response. Period. No errors, no blocking.

The catch: That response might not have the latest data. But you get something.

When you need it:

  • Social media feeds
  • Search results
  • News feeds
  • Anywhere downtime pisses off users more than stale data does

Example: You post on social media. Your friend in another region might not see it for a few seconds, but their feed still loads instantly with the posts it knows about.

Partition Tolerance (P)

What it means: The system keeps working even when the network is screwed up: messages dropped, delayed, or nodes can't talk to each other.

Here's the reality: Network failures WILL happen. Cables get cut. Data centers go down. AWS regions have outages.

Because of this: Partition tolerance is basically mandatory for distributed systems. You can't just ignore it.

So the real choice is:

CP or AP
(Not CA, because partitions are inevitable)

Why You Can't Have All Three

Let's make this concrete. You've got two data centers: East and West.

Normal operation (no partition):

  • Both data centers are in sync
  • Reads return correct data
  • All requests are answered

Everything is perfect. CA + P all working together.

Then the network cable between them gets cut (partition happens):

Now a user in the West writes data. But East doesn't know about it yet because the network is down.

You have two choices:

Choice 1: Choose Consistency (CP)

  • Block reads in East until the network recovers
  • Users in East get errors or have to wait
  • When the network comes back, everyone sees the correct data

Choice 2: Choose Availability (AP)

  • East keeps serving reads with the old data
  • Users get instant responses
  • The data is stale, but nobody is blocked

You CANNOT:

  • Serve instant responses (A)
  • With guaranteed correct data (C)
  • When the network is broken (P)

One of these has to give.


The Three System Types

CA Systems (Consistency + Availability)

What they sacrifice: Partition tolerance

Reality check: These basically don't exist in distributed systems.

Why: Because you can't just "not tolerate" network partitions. They happen whether you like it or not.

Where you see them:

  • Single-node databases
  • Tightly coupled systems in one data center
  • Mostly theoretical for interviews

Interview tip: If someone asks about CA, acknowledge they're mostly theoretical since partitions are inevitable.

CP Systems (Consistency + Partition Tolerance)

What they sacrifice: Availability

The behavior:

  • Maintains strict correctness
  • Works during partitions
  • Will reject or delay requests to avoid serving stale data

Examples:

  • Traditional relational databases (PostgreSQL, MySQL in certain configs)
  • Zookeeper
  • HBase
  • MongoDB (with certain settings)

Use cases:

  • Financial systems
  • Inventory management
  • Anything where wrong data causes serious problems

Interview example: "For our payment system, we're going CP. If there's a network partition, we'd rather reject some transactions than risk double-charging customers or showing incorrect balances."

AP Systems (Availability + Partition Tolerance)

What they sacrifice: Consistency

The behavior:

  • Always responds to requests
  • Works during partitions
  • May return stale or conflicting data
  • Eventually becomes consistent when partition heals

Examples:

  • DynamoDB
  • Cassandra
  • Couchbase
  • DNS

Use cases:

  • Social media feeds
  • Shopping cart (usually)
  • Content delivery
  • Anything where uptime > perfect accuracy

Interview example: "For our social feed, we're going AP. If a user posts and there's a partition, some users might not see it immediately, but everyone's feed stays responsive. The post will propagate once the network recovers."


Quick Decision Framework

Ask yourself: What's worse for my use case?

If this is worse...Choose...Example
Showing wrong dataCPBank account balance
System being unavailableAPSocial media feed
Users seeing different data temporarilyCPStock trading platform
Users getting errorsAPVideo streaming service

The Concrete Interview Example

Setup: You're designing Instagram. You have data centers in US and Europe.

Scenario: User in US posts a photo. Network partition happens.

If you choose CP (Consistency + Partition):

  • Users in Europe see an error when trying to view feeds
  • OR their feed is delayed until partition resolves
  • But when they do see data, it's guaranteed correct

If you choose AP (Availability + Partition):

  • Users in Europe see their feed instantly
  • New US post isn't visible yet in Europe
  • Feed might be 30 seconds stale
  • Once network recovers, everything syncs up

What Instagram actually does: AP (with eventual consistency)

Why: Users care way more about the app working than seeing every post instantly. A few seconds of lag is fine. The app being down is not fine.


Common Interview Mistakes

Mistake 1: "We'll just use both"

Wrong. During a partition, you MUST choose. You can't have both consistency and availability when the network is broken.

Mistake 2: "CA is common"

Wrong. CA systems can't handle network partitions, which happen in every distributed system.

Mistake 3: "Eventually consistent = broken"

Wrong. Eventually consistent systems are still correct. They just tolerate temporary divergence for better availability.

Mistake 4: Memorizing labels without understanding trade-offs

Don't just say "We'll use a CP system." Explain WHY and what happens during failures.


How to Talk About CAP in Interviews

Bad answer: "We'll use Cassandra because it's AP."

Good answer: "We'll use Cassandra because availability is more critical than immediate consistency for our use case. During a network partition, we'd rather serve slightly stale data than block user requests. Once the partition heals, the system will converge to a consistent state through its eventual consistency model."

Great answer: "For user profiles, we'll prioritize consistency. If there's a partition, we'll reject writes rather than risk conflicting profile updates. But for the activity feed, we'll prioritize availability. Users can still scroll their feed even if the latest posts haven't replicated yet. We're making different CAP trade-offs for different parts of the system based on their requirements."


The Reality: CAP is a Spectrum

Modern systems don't just pick CP or AP and call it a day. They tune the knobs:

Cassandra:

  • Has tunable consistency levels
  • Can choose strong consistency per-query if needed
  • Usually runs in AP mode

MongoDB:

  • Can be configured as CP or AP
  • Read/write concerns let you tune per-operation

In interviews: Show you understand that CAP isn't binary. Systems make trade-offs based on the specific operation and requirements.


The One Thing to Remember

When the interviewer asks about network partitions, this is the mental model:

Network breaks → Must choose:

Option A (CP): Reject requests, maintain correctness
Option B (AP): Serve requests, accept staleness

Your choice depends on: What hurts your users more?

Summary: What Actually Matters for Interviews

The theorem:

  • During network partitions, choose Consistency OR Availability
  • Partition tolerance is mandatory
  • So the real choice is CP vs AP

Decision framework:

  • Money/inventory/critical data → CP
  • Social/media/content → AP
  • Many systems mix both for different data types

What interviewers want to hear:

  • Acknowledge partitions will happen
  • Explain behavior during failure
  • Justify your choice based on requirements
  • Show you understand trade-offs, not just labels

The golden rule:

CAP isn't about what you choose during normal operation.
It's about what you choose when shit hits the fan.