Time-Series Databases

Handling high-write metrics, rollups, retention, and label cardinality

System Design Sandbox · 9 min read
Learn how time-series databases power monitoring systems. Covers push vs pull ingestion, rollups, retention tiers, cardinality control, and time-range query patterns.

#Introduction

Metrics are not normal rows.

A request record says, "this request happened." A metric says, "this value changed over time, with these labels, and you will probably ask for it as a range." That difference drives the storage design.

Time-series databases are built for high write throughput, time-range scans, compression, downsampling, and label-based filtering. They show up in monitoring systems, IoT telemetry, financial tick data, and any product where the main question is "what changed over the last N minutes?"


#What Makes Time-Series Data Different

A metric sample usually looks like this:

{
  "name": "cpu_usage",
  "labels": {
    "host": "api-17",
    "region": "us-east-1",
    "service": "checkout"
  },
  "timestamp": "2026-04-20T12:00:00Z",
  "value": 0.82
}

The primary access pattern is not "find row by ID." It is:

  • read cpu_usage for the last hour
  • group by host, region, or service
  • compute average, p95, max, or rate
  • compare the latest window against a threshold

That is why TSDBs organize data by metric, label set, and time. They optimize writes by appending points and optimize reads by scanning compressed blocks for a time range.
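As a concrete sketch of that organization, a toy in-memory store might key each series by metric name plus sorted labels and append points to it. The names below are illustrative, not any real database's API:

```python
from collections import defaultdict

def series_key(name, labels):
    # Sort labels so {"host": ..., "region": ...} and the reverse order
    # map to the same series identity.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}}"

# series key -> append-only list of (timestamp, value) points
storage = defaultdict(list)

def append(name, labels, ts, value):
    storage[series_key(name, labels)].append((ts, value))

append("cpu_usage", {"host": "api-17", "region": "us-east-1"}, 1713614400, 0.82)
print(series_key("cpu_usage", {"region": "us-east-1", "host": "api-17"}))
# cpu_usage{host="api-17",region="us-east-1"}
```

A real engine replaces the list with compressed, time-partitioned blocks and adds an inverted index from labels to series, but the series-identity idea is the same.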


#Push vs Pull Ingestion

There are two common ingestion models.

Pull-based collection: A monitoring system scrapes /metrics endpoints on a schedule. Prometheus made this model popular.

  • Good for service discovery and central control
  • Easy to know when a target stops responding
  • Harder when hosts sit behind firewalls, or when short-lived jobs exit before the next scrape
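A scrape in the pull model returns plain text in a format like Prometheus's exposition format. A minimal parser for one such line, as a sketch that ignores escaping, comments, and histogram types, might look like:

```python
import re

# Matches lines of the form: metric_name{k="v",k2="v2"} 0.82
LINE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+([0-9.eE+-]+)$')

def parse_sample(line):
    m = LINE.match(line.strip())
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = {}
    if raw_labels:
        for pair in raw_labels.split(","):
            k, v = pair.split("=", 1)
            labels[k] = v.strip('"')
    return name, labels, float(value)

print(parse_sample('cpu_usage{host="api-17",region="us-east-1"} 0.82'))
```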

Push-based collection: Agents or services send metrics to collectors.

  • Good for edge devices, mobile clients, batch jobs, and locked-down networks
  • Easier to buffer locally during network issues
  • Needs stronger backpressure and authentication because clients can flood collectors

At scale, both models usually converge on a collector tier. Collectors normalize samples, reject bad labels, compress batches, and write to a durable buffer before storage.
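A collector's core loop can be sketched as validate-then-batch. The banned labels, batch size, and sink interface here are illustrative assumptions:

```python
BANNED_LABELS = {"user_id", "request_id"}  # obvious per-entity IDs
BATCH_SIZE = 3                             # tiny for demonstration

class Collector:
    def __init__(self, sink):
        self.sink = sink   # durable buffer writer, e.g. a queue producer
        self.batch = []

    def ingest(self, sample):
        name, labels, ts, value = sample
        if BANNED_LABELS & labels.keys():
            return False   # reject high-cardinality labels at the edge
        self.batch.append(sample)
        if len(self.batch) >= BATCH_SIZE:
            self.flush()
        return True

    def flush(self):
        if self.batch:
            self.sink(list(self.batch))
            self.batch.clear()

batches = []
c = Collector(batches.append)
c.ingest(("cpu_usage", {"host": "a"}, 1, 0.5))                 # accepted
c.ingest(("cpu_usage", {"user_id": "u_928374923"}, 2, 0.9))    # rejected
```

Batching amortizes the per-write cost, and rejecting bad labels here protects the storage tier before the damage is done.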


#Rollups and Retention

Raw metrics are expensive. If 500,000 nodes emit 200 samples every 10 seconds, the system receives 10 million samples per second.

Dashboards rarely need all raw samples forever. They need useful summaries:

raw samples:    keep 7 days
10s rollups:    keep 14 days
1m rollups:     keep 30 days
1h rollups:     keep 1 year

Rollups are precomputed aggregates over time buckets. A 1-minute bucket might store count, sum, min, max, and p95 sketches. The dashboard can render a month of CPU data without scanning every raw point.

This is the main interview point: do not send every dashboard query to raw storage. Pre-aggregate into time buckets.
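Downsampling raw points into rollup buckets can be sketched in a few lines. This version keeps count, sum, min, and max per bucket; a production system would also maintain percentile sketches:

```python
from collections import defaultdict

def rollup(points, bucket_s=60):
    # points: iterable of (unix_ts, value); returns bucket start -> summary
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % bucket_s].append(v)
    return {
        start: {"count": len(vs), "sum": sum(vs), "min": min(vs), "max": max(vs)}
        for start, vs in buckets.items()
    }

raw = [(0, 0.5), (30, 0.7), (60, 0.9)]
# two points land in the bucket starting at 0, one in the bucket at 60
print(rollup(raw))
```

Because count and sum are stored (not just the average), coarser rollups can be built from finer ones without touching raw data again.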


#Cardinality Control

Cardinality is the number of unique time series.

This metric has low cardinality:

http_requests_total{service="checkout", status="200"}

This one can destroy the system:

http_requests_total{service="checkout", user_id="u_928374923"}

Adding user_id creates a separate time series for every user. Add request_id and it becomes almost one series per request. Indexes grow, memory grows, compaction gets slower, and queries become expensive.

Production controls:

  • per-tenant active series quotas
  • label allow-lists or deny-lists
  • rejection of obvious unique IDs
  • separate traces/logs for high-cardinality debugging data
  • downsampling and retention tiers
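The first control above, a per-tenant active-series quota, can be sketched as a set of seen series keys per tenant. The limit here is deliberately tiny for illustration:

```python
MAX_SERIES_PER_TENANT = 2  # illustrative; real quotas are in the millions

class SeriesLimiter:
    def __init__(self, limit=MAX_SERIES_PER_TENANT):
        self.limit = limit
        self.active = {}   # tenant -> set of active series keys

    def admit(self, tenant, series_key):
        seen = self.active.setdefault(tenant, set())
        if series_key in seen:
            return True    # existing series: always accepted
        if len(seen) >= self.limit:
            return False   # new series over quota: rejected at ingest
        seen.add(series_key)
        return True

limiter = SeriesLimiter()
print(limiter.admit("tenant-a", 'cpu_usage{host="h1"}'))  # True
```

A real implementation also expires idle series so the quota measures *active* cardinality, not all series ever seen.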

In interviews, call this "cardinality explosion." It is one of the most important monitoring-system failure modes.


#Query Patterns

TSDB queries are range queries over metric names and labels:

avg(cpu_usage{service="checkout", region="us-east-1"}) by host over 1h step 60s
rate(http_requests_total{status="500"}[5m])
p95(request_latency_ms{service="api"}) over 30m

The storage engine should make these operations cheap:

  • locate matching series by label index
  • scan only blocks for the requested time range
  • use rollups when the query step is coarse
  • cache common dashboard panels

The API should expose time range, step size, metric name, filters, and aggregation. A dashboard does not want one data point. It wants a time-aligned vector.
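A step-aligned range read can be sketched as bucketing samples by step and keeping the latest sample per bucket. The "last sample wins" rule is one common choice, not the only one:

```python
def range_query(points, start, end, step):
    # points: sorted (ts, value) pairs; returns [(bucket_ts, value), ...]
    slots = {}
    for ts, v in points:
        if start <= ts < end:
            slots[ts - (ts - start) % step] = v  # latest sample in each step
    return sorted(slots.items())

pts = [(0, 1.0), (30, 2.0), (60, 3.0), (90, 4.0)]
print(range_query(pts, start=0, end=120, step=60))
# [(0, 2.0), (60, 4.0)]
```

This is exactly the "time-aligned vector" a dashboard wants: one value per step, ready to plot.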


#Common Interview Mistakes

Storing raw points forever. This fails on cost and query speed. Mention retention and rollups.

Ignoring cardinality. Labels are powerful, but uncontrolled labels can take the system down.

Writing directly to the TSDB from every agent. A collector tier and durable buffer are safer during bursts.

Treating alerting as an afterthought. Monitoring exists so humans can react. Alert evaluation needs windows, deduplication, and routing.
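Window-based alert evaluation with deduplication can be sketched as a threshold check plus an active-alert set, so a still-firing alert does not re-notify. The threshold and states here are illustrative:

```python
def evaluate(window_values, threshold, active_alerts, name):
    # Fire on window-average breach; dedupe while the alert stays active.
    breached = sum(window_values) / len(window_values) > threshold
    if breached and name not in active_alerts:
        active_alerts.add(name)
        return "fire"
    if not breached and name in active_alerts:
        active_alerts.discard(name)
        return "resolve"
    return "noop"  # still firing or still healthy: no duplicate notification

active = set()
print(evaluate([0.90, 0.95], 0.8, active, "cpu_high"))  # fire
print(evaluate([0.92, 0.90], 0.8, active, "cpu_high"))  # noop (deduplicated)
print(evaluate([0.20, 0.30], 0.8, active, "cpu_high"))  # resolve
```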


#Summary: What to Remember

  • Time-series databases optimize append-heavy samples and time-range queries.
  • Push and pull ingestion both need collector and backpressure strategy at scale.
  • Rollups make dashboards cheap; retention tiers keep storage cost bounded.
  • Cardinality control is mandatory for label-heavy metrics.
  • Query APIs should be designed around metric, labels, time range, step, and aggregation.