Introduction
Okay, so you're in your system design interview, and you just finished explaining your beautiful distributed architecture. You've got multiple data centers, load balancers, the whole nine yards. You're feeling pretty good.
Then the interviewer drops this bomb: "So what happens when the network between your data centers goes down? How does your system handle that?"
And now you're sweating because you know there's a trade-off here, but you can't quite articulate it. You've heard of the CAP Theorem, but what does it actually mean in practice?
Here's how it actually works in practice.
The Core Idea: You Can't Have Everything
The CAP Theorem is brutal in its simplicity:
When network failures happen, you MUST choose between:
- Returning the correct data, or
- Returning any data at all
You cannot have both. This isn't a tooling problem or an implementation issue. It's physics.
The Three Properties
C = Consistency Every read gets the most recent write. All nodes show the same data at the same time.
A = Availability Every request gets a response (not an error). The system always answers.
P = Partition Tolerance The system keeps working even when network messages are lost or delayed.
The theorem:
You can only guarantee TWO out of three when network partitions occur
Breaking Down Each Property
Consistency (C)
What it means: After you write data, EVERY subsequent read, no matter which server handles it, returns that new value.
In other words:
- Single source of truth
- No stale reads allowed
- Some requests might fail to maintain correctness
When you need it:
- Banking and payments
- Inventory management
- Anything where wrong data = real money lost
Example: You transfer $100 from checking to savings. If consistency is guaranteed, you'll never see a moment where the $100 disappeared from checking but hasn't appeared in savings yet.
Availability (A)
What it means: Every request gets a response. Period. No errors, no blocking.
The catch: That response might not have the latest data. But you get something.
When you need it:
- Social media feeds
- Search results
- News feeds
- Anywhere downtime pisses off users more than stale data does
Example: You post on social media. Your friend in another region might not see it for a few seconds, but their feed still loads instantly with the posts it knows about.
Partition Tolerance (P)
What it means: The system keeps working even when the network is screwed up: messages dropped, delayed, or nodes can't talk to each other.
Here's the reality: Network failures WILL happen. Cables get cut. Data centers go down. AWS regions have outages.
Because of this: Partition tolerance is basically mandatory for distributed systems. You can't just ignore it.
So the real choice is:
CP or AP
(Not CA, because partitions are inevitable)
Why You Can't Have All Three
Let's make this concrete. You've got two data centers: East and West.
Normal operation (no partition):
- Both data centers are in sync
- Reads return correct data
- All requests are answered
Everything is perfect. CA + P all working together.
Then the network cable between them gets cut (partition happens):
Now a user in the West writes data. But East doesn't know about it yet because the network is down.
You have two choices:
Choice 1: Choose Consistency (CP)
- Block reads in East until the network recovers
- Users in East get errors or have to wait
- When the network comes back, everyone sees the correct data
Choice 2: Choose Availability (AP)
- East keeps serving reads with the old data
- Users get instant responses
- The data is stale, but nobody is blocked
You CANNOT:
- Serve instant responses (A)
- With guaranteed correct data (C)
- When the network is broken (P)
One of these has to give.
The Three System Types
CA Systems (Consistency + Availability)
What they sacrifice: Partition tolerance
Reality check: These basically don't exist in distributed systems.
Why: Because you can't just "not tolerate" network partitions. They happen whether you like it or not.
Where you see them:
- Single-node databases
- Tightly coupled systems in one data center
- Mostly theoretical for interviews
Interview tip: If someone asks about CA, acknowledge they're mostly theoretical since partitions are inevitable.
CP Systems (Consistency + Partition Tolerance)
What they sacrifice: Availability
The behavior:
- Maintains strict correctness
- Works during partitions
- Will reject or delay requests to avoid serving stale data
Examples:
- Traditional relational databases (PostgreSQL, MySQL in certain configs)
- Zookeeper
- HBase
- MongoDB (with certain settings)
Use cases:
- Financial systems
- Inventory management
- Anything where wrong data causes serious problems
Interview example: "For our payment system, we're going CP. If there's a network partition, we'd rather reject some transactions than risk double-charging customers or showing incorrect balances."
AP Systems (Availability + Partition Tolerance)
What they sacrifice: Consistency
The behavior:
- Always responds to requests
- Works during partitions
- May return stale or conflicting data
- Eventually becomes consistent when partition heals
Examples:
- DynamoDB
- Cassandra
- Couchbase
- DNS
Use cases:
- Social media feeds
- Shopping cart (usually)
- Content delivery
- Anything where uptime > perfect accuracy
Interview example: "For our social feed, we're going AP. If a user posts and there's a partition, some users might not see it immediately, but everyone's feed stays responsive. The post will propagate once the network recovers."
Quick Decision Framework
Ask yourself: What's worse for my use case?
| If this is worse... | Choose... | Example |
|---|---|---|
| Showing wrong data | CP | Bank account balance |
| System being unavailable | AP | Social media feed |
| Users seeing different data temporarily | CP | Stock trading platform |
| Users getting errors | AP | Video streaming service |
The Concrete Interview Example
Setup: You're designing Instagram. You have data centers in US and Europe.
Scenario: User in US posts a photo. Network partition happens.
If you choose CP (Consistency + Partition):
- Users in Europe see an error when trying to view feeds
- OR their feed is delayed until partition resolves
- But when they do see data, it's guaranteed correct
If you choose AP (Availability + Partition):
- Users in Europe see their feed instantly
- New US post isn't visible yet in Europe
- Feed might be 30 seconds stale
- Once network recovers, everything syncs up
What Instagram actually does: AP (with eventual consistency)
Why: Users care way more about the app working than seeing every post instantly. A few seconds of lag is fine. The app being down is not fine.
Common Interview Mistakes
Mistake 1: "We'll just use both"
Wrong. During a partition, you MUST choose. You can't have both consistency and availability when the network is broken.
Mistake 2: "CA is common"
Wrong. CA systems can't handle network partitions, which happen in every distributed system.
Mistake 3: "Eventually consistent = broken"
Wrong. Eventually consistent systems are still correct. They just tolerate temporary divergence for better availability.
Mistake 4: Memorizing labels without understanding trade-offs
Don't just say "We'll use a CP system." Explain WHY and what happens during failures.
How to Talk About CAP in Interviews
Bad answer: "We'll use Cassandra because it's AP."
Good answer: "We'll use Cassandra because availability is more critical than immediate consistency for our use case. During a network partition, we'd rather serve slightly stale data than block user requests. Once the partition heals, the system will converge to a consistent state through its eventual consistency model."
Great answer: "For user profiles, we'll prioritize consistency. If there's a partition, we'll reject writes rather than risk conflicting profile updates. But for the activity feed, we'll prioritize availability. Users can still scroll their feed even if the latest posts haven't replicated yet. We're making different CAP trade-offs for different parts of the system based on their requirements."
The Reality: CAP is a Spectrum
Modern systems don't just pick CP or AP and call it a day. They tune the knobs:
Cassandra:
- Has tunable consistency levels
- Can choose strong consistency per-query if needed
- Usually runs in AP mode
MongoDB:
- Can be configured as CP or AP
- Read/write concerns let you tune per-operation
In interviews: Show you understand that CAP isn't binary. Systems make trade-offs based on the specific operation and requirements.
The One Thing to Remember
When the interviewer asks about network partitions, this is the mental model:
Network breaks → Must choose:
Option A (CP): Reject requests, maintain correctness
Option B (AP): Serve requests, accept staleness
Your choice depends on: What hurts your users more?
Summary: What Actually Matters for Interviews
The theorem:
- During network partitions, choose Consistency OR Availability
- Partition tolerance is mandatory
- So the real choice is CP vs AP
Decision framework:
- Money/inventory/critical data → CP
- Social/media/content → AP
- Many systems mix both for different data types
What interviewers want to hear:
- Acknowledge partitions will happen
- Explain behavior during failure
- Justify your choice based on requirements
- Show you understand trade-offs, not just labels
The golden rule:
CAP isn't about what you choose during normal operation.
It's about what you choose when shit hits the fan.