Wide-Column Databases

Understanding Cassandra, partition keys, and write-heavy workloads

S
System Design Sandbox··12 min read
Learn how wide-column databases like Cassandra organize data by partition key and clustering key, why LSM trees make writes fast, and how to design schemas query-first. Covers Cassandra vs DynamoDB vs MongoDB and when to choose each.

Introduction

You're designing a chat application and you say: "I'll use Cassandra for message storage."

The interviewer leans in. "Why Cassandra and not PostgreSQL? What's your partition key? What happens when a conversation has 500,000 messages? How does your query perform? How do you handle hot partitions?"

And now you realize that saying "Cassandra" isn't enough. You need to understand why wide-column databases exist, how they organize data differently from SQL, and when they're the right choice.


What Is a Wide-Column Database?

A wide-column store organizes data into rows and columns, but not like a relational database. Think of it as a two-level key-value store: you have a partition key that determines where the data lives, and a clustering key that determines how data is sorted within that partition.

The most common wide-column databases are Apache Cassandra and Apache HBase. DynamoDB follows similar principles with a different interface.

Here's the mental model:

SQL Table (relational):
  Every row has the same columns, fixed schema
  Rows are scattered across pages based on primary key
  Joins are the primary way to connect data

Wide-Column Store:
  Rows are grouped by partition key
  Within a partition, rows are sorted by clustering key
  No joins. Ever. You denormalize everything.

The fundamental difference: SQL databases are optimized for flexible querying across normalized data. Wide-column stores are optimized for fast reads and writes on specific access patterns that you define upfront.


How Data Is Organized

Partition Key: Where Data Lives

The partition key determines which node in the cluster stores the data. Cassandra hashes the partition key and uses consistent hashing to assign it to a node.

Table: messages
Partition Key: chat_id

chat_id = "chat-A"  ->  hashes to Node 2
chat_id = "chat-B"  ->  hashes to Node 5
chat_id = "chat-C"  ->  hashes to Node 1

All messages for chat-A live on the same node (plus replicas). This means reading all messages for a conversation is a single-node operation with no scatter-gather across the cluster.

Choosing the partition key is the most important design decision. It determines:

  • Data locality (what gets stored together)
  • Query efficiency (single-partition reads are fast)
  • Data distribution (even distribution prevents hot spots)

Clustering Key: How Data Is Sorted

Within a partition, rows are sorted by the clustering key. This gives you ordered data for free, with no sorting at query time.

Table: messages
Partition Key: chat_id
Clustering Key: message_timestamp (DESC)

Partition: chat_id = "chat-A"
  ┌──────────────────────────────────────────────┐
  │ message_timestamp    │ sender  │ content      │
  │ 2024-01-15 12:05:00  │ Alice   │ "See you!"   │
  │ 2024-01-15 12:04:30  │ Bob     │ "Sounds good" │
  │ 2024-01-15 12:04:00  │ Alice   │ "Lunch at 1?" │
  │ ...                  │ ...     │ ...          │
  └──────────────────────────────────────────────┘

Reading the last 20 messages of a conversation is now: find the partition (one hash lookup), read the first 20 rows (already sorted). Done. Constant time regardless of how many messages exist in total.

Compare that to SQL: SELECT * FROM messages WHERE chat_id = 'chat-A' ORDER BY timestamp DESC LIMIT 20. The SQL database needs an index, traverses it, and if the table is large enough, this gets slow.


Why Wide-Column Stores Excel at Write-Heavy Workloads

Chat apps, IoT telemetry, time-series data, and activity logs all share one property: they generate an enormous volume of small writes. Wide-column stores are purpose-built for this.

LSM Trees vs B-Trees

Traditional SQL databases use B-trees for indexing. Every write updates the tree in-place, which means random disk I/O. Under heavy write load, this becomes the bottleneck.

Cassandra uses LSM trees (Log-Structured Merge trees). Writes go to an in-memory buffer (memtable) first, which is extremely fast. When the buffer fills up, it's flushed to disk as an immutable sorted file (SSTable). Reads merge results from the memtable and SSTables.

Write path (B-tree / SQL):
  Write -> find page on disk -> update in-place -> random I/O
  Speed: limited by disk seeks

Write path (LSM tree / Cassandra):
  Write -> append to memtable (in-memory) -> sequential flush to disk
  Speed: limited by memory and sequential I/O (much faster)

This is why Cassandra handles hundreds of thousands of writes per second per node. The writes are sequential, not random. Sequential disk writes are 100x faster than random writes.

Distributed Writes with No Single Bottleneck

In a SQL database (even with replicas), writes typically go through a single primary node. That primary is the bottleneck. With leader-based replication, you scale reads but not writes. For more on replication trade-offs, see CAP Theorem.

Cassandra has no primary node. Every node can accept writes for any partition. Data is replicated to N nodes (typically 3) using consistent hashing. This means write throughput scales linearly with cluster size: double the nodes, double the write capacity.

SQL (leader-based):
  All writes -> Primary node -> replicate to replicas
  Write bottleneck: single primary

Cassandra (leaderless):
  Write for partition A -> Node 2, Node 5, Node 7 (replicas)
  Write for partition B -> Node 1, Node 3, Node 6 (replicas)
  No single bottleneck. Add nodes to add capacity.

Schema Design: Queries First

This is where most people coming from SQL get it wrong. In SQL, you model your data first and write queries later. In Cassandra, you design your queries first and model your schema around them.

The SQL Way (Data-First)

-- Normalize the data
Users table: user_id, name, email
Messages table: message_id, chat_id, sender_id, content, timestamp
Chats table: chat_id, name, created_at
Chat_members table: chat_id, user_id

-- Query: get messages for a chat
SELECT m.*, u.name FROM messages m
JOIN users u ON m.sender_id = u.user_id
WHERE m.chat_id = 'chat-A'
ORDER BY m.timestamp DESC
LIMIT 20;

Flexible. You can query messages by chat, by user, by time range, with joins. The schema doesn't dictate your access patterns.

The Cassandra Way (Query-First)

-- Start with the query: "Get the last 20 messages for a conversation"
-- Design the table to serve exactly that query

CREATE TABLE messages_by_chat (
  chat_id       TEXT,
  message_id    TIMEUUID,
  sender_name   TEXT,      // denormalized, no join needed
  content       TEXT,
  PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Query: one partition read, already sorted
SELECT * FROM messages_by_chat
WHERE chat_id = 'chat-A'
LIMIT 20;

Notice: sender_name is stored directly in the messages table, duplicated across every message. In SQL, this would be a normalization violation. In Cassandra, it's standard practice. You trade storage for read performance.

One Table Per Query Pattern

If you have a second access pattern ("Get all chats for a user"), you create a second table:

CREATE TABLE chats_by_user (
  user_id     TEXT,
  chat_id     TEXT,
  chat_name   TEXT,
  last_message_at TIMESTAMP,
  PRIMARY KEY (user_id, last_message_at)
) WITH CLUSTERING ORDER BY (last_message_at DESC);

Same data, different organization. When a message is sent, you write to both tables. Data duplication is the price of fast reads without joins.


Cassandra vs DynamoDB vs MongoDB

These are the three NoSQL options that come up most in interviews. Know when to reach for which.

AspectCassandraDynamoDBMongoDB
Data modelWide-column (partition + cluster)Key-value / wide-columnDocument (JSON-like)
Write performanceExcellent (LSM trees)Excellent (managed)Good
Read performanceFast for partition queriesFast for key lookupsFlexible queries
ScalingLinear, add nodesAutomatic (managed by AWS)Sharding (manual config)
ConsistencyTunable (ONE to ALL)Eventually consistent or strongTunable
OperationsSelf-managed (complex)Fully managed (simple)Moderate
Best forWrite-heavy, time-series, chatKey-value at scale, serverlessFlexible schemas, rapid dev
JoinsNoneNoneLimited ($lookup)

Interview quick picks:

  • Chat messages, IoT data, activity logs (high write volume, time-sorted reads) -> Cassandra
  • Session storage, user profiles, serverless backends (key-value with managed scaling) -> DynamoDB
  • Product catalogs, content management, prototyping (flexible schemas, some querying) -> MongoDB
  • Complex queries, transactions, relationships -> Stick with PostgreSQL. For more on SQL vs NoSQL decision-making, see Databases & Caching.

When NOT to Use Wide-Column Databases

Wide-column stores are powerful but they're not general-purpose. Here's when they're the wrong choice.

Complex queries and joins: If your access patterns require joining across entities, you need SQL. Wide-column stores have no joins. You can denormalize, but if you have 10 different query patterns that all need different data relationships, the denormalization overhead becomes unmanageable.

Small datasets: If your data fits on a single PostgreSQL instance (say under 100GB), the operational complexity of Cassandra isn't worth it. You get a distributed system's headaches without needing distributed system's benefits.

Strong consistency requirements: Cassandra favors availability and partition tolerance (AP in CAP Theorem terms). You can tune consistency with quorum reads/writes, but if you need ACID transactions across multiple rows, use a SQL database.

Frequently changing access patterns: Every new query pattern in Cassandra may require a new table with different partition and clustering keys. If your queries evolve frequently during early development, the rigid schema-per-query model slows you down. Use something flexible first, migrate to Cassandra when your access patterns stabilize.

Ad-hoc analytics: Cassandra is terrible for exploratory queries. Full table scans are expensive. If you need analytics, stream your Cassandra data to a data warehouse (BigQuery, Redshift) and run analytics there.


Common Interview Mistakes

Mistake 1: Saying "Cassandra" without knowing why

"I'll use Cassandra for the database."

Problem: The interviewer immediately asks "Why not PostgreSQL?" If you can't explain the specific trade-offs (write throughput, partition-based access, leaderless replication), you look like you're name-dropping.

Better: Explain the access pattern first. "Chat messages are write-heavy, time-sorted, and always queried by conversation. Cassandra's partition model lets us co-locate messages by chat ID and read them in sorted order without joins."

Mistake 2: Choosing a bad partition key

"I'll partition messages by user_id."

Problem: A power user in 500 group chats generates data across all of them. You can't efficiently read "all messages in chat-A" if they're partitioned by sender. You'd need to scan every user's partition.

Better: Partition by the access pattern. Messages are always read per-conversation, so chat_id is the partition key. message_timestamp or a TIMEUUID is the clustering key for sort order.

Mistake 3: Trying to normalize data like SQL

"I'll create a users table and join it with messages to get the sender name."

Problem: There are no joins in Cassandra. A query that reads from two tables means two network round trips and client-side joining. That defeats the purpose.

Better: Denormalize. Store sender_name directly in the messages table. Yes, it's duplicated. That's the trade-off. Storage is cheap. Latency is not.

Mistake 4: Ignoring hot partitions

"All messages go into one big partition."

Problem: If a viral group chat has 10 million messages in one partition, that partition lives on one node. That node becomes a hotspot. Reads and writes for that partition saturate a single node while others sit idle.

Better: Use time-bucketed partition keys. Instead of just chat_id, use chat_id + date_bucket (e.g., chat-A:2024-01-15). Each day gets its own partition. Recent messages are spread across time-bounded partitions. You query the current bucket first, then page back to older buckets if needed.

Partition Key: (chat_id, date_bucket)
Clustering Key: message_id

"chat-A:2024-01-15"  ->  today's messages
"chat-A:2024-01-14"  ->  yesterday's messages
"chat-A:2024-01-13"  ->  ...

Summary: What to Remember

  • Wide-column stores organize data by partition key (where it lives) and clustering key (how it's sorted within a partition)
  • Cassandra uses LSM trees for fast writes and leaderless replication for linear write scaling
  • Schema design is query-first: design your table around the query, not the data model. One table per access pattern.
  • Denormalization is expected. No joins exist. Store redundant data to avoid multi-table reads.
  • Partition key is the most critical decision. It determines data distribution, query efficiency, and hot spot risk.
  • Use time-bucketed partitions for high-volume data to prevent unbounded partition growth
  • Choose Cassandra for write-heavy, time-sorted, partition-queryable workloads (chat, IoT, logs)
  • Choose PostgreSQL when you need joins, transactions, or flexible querying

Key numbers:

  • Cassandra write throughput: 100K+ writes/sec per node
  • Typical replication factor: 3 (data stored on 3 nodes)
  • Sequential disk writes are ~100x faster than random writes (why LSM trees win for write-heavy loads)

Interview golden rule:

Don't just say "I'll use Cassandra." Explain your partition key
choice, how the clustering key serves your query pattern, why
writes are fast (LSM trees, leaderless), and acknowledge the
trade-offs (no joins, denormalization, no ad-hoc queries).