Time-Series Databases

Handling high-write metrics, rollups, retention, and label cardinality

System Design Sandbox · 9 min read
Learn how time-series databases power monitoring systems. Covers push vs pull ingestion, rollups, retention tiers, cardinality control, and time-range query patterns.

#Introduction

Metrics are not normal rows.

A request record says, "this request happened." A metric says, "this value changed over time, with these labels, and you will probably ask for it as a range." That difference drives the storage design.

Time-series databases are built for high write throughput, time-range scans, compression, downsampling, and label-based filtering. They show up in monitoring systems, IoT telemetry, financial tick data, and any product where the main question is "what changed over the last N minutes?"


#What Makes Time-Series Data Different

A metric sample usually looks like this:

{
  "name": "cpu_usage",
  "labels": {
    "host": "api-17",
    "region": "us-east-1",
    "service": "checkout"
  },
  "timestamp": "2026-04-20T12:00:00Z",
  "value": 0.82
}

The primary access pattern is not "find row by ID." It is:

  • read cpu_usage for the last hour
  • group by host, region, or service
  • compute average, p95, max, or rate
  • compare the latest window against a threshold

That is why TSDBs organize data by metric, label set, and time. They optimize writes by appending points and optimize reads by scanning compressed blocks for a time range.
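As a concrete sketch of that organization, a toy in-memory store might key each series by metric name plus sorted labels and append points to it. The names below are illustrative, not any real database's API:

```python
from collections import defaultdict

def series_key(name, labels):
    # Sort labels so {"host": ..., "region": ...} and the reverse order
    # map to the same series identity.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}}"

# series key -> append-only list of (timestamp, value) points
storage = defaultdict(list)

def append(name, labels, ts, value):
    storage[series_key(name, labels)].append((ts, value))

append("cpu_usage", {"host": "api-17", "region": "us-east-1"}, 1713614400, 0.82)
print(series_key("cpu_usage", {"region": "us-east-1", "host": "api-17"}))
# cpu_usage{host="api-17",region="us-east-1"}
```

A real engine replaces the list with compressed, time-partitioned blocks and adds an inverted index from labels to series, but the series-identity idea is the same.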


#Push vs Pull Ingestion

There are two common ingestion models.

Pull-based collection: A monitoring system scrapes /metrics endpoints on a schedule. Prometheus made this model popular.

  • Good for service discovery and central control
  • Easy to know when a target stops responding
  • Harder when hosts sit behind firewalls, or when short-lived jobs exit before the next scrape
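A scrape in the pull model returns plain text in a format like Prometheus's exposition format. A minimal parser for one such line, as a sketch that ignores escaping, comments, and histogram types, might look like:

```python
import re

# Matches lines of the form: metric_name{k="v",k2="v2"} 0.82
LINE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+([0-9.eE+-]+)$')

def parse_sample(line):
    m = LINE.match(line.strip())
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = {}
    if raw_labels:
        for pair in raw_labels.split(","):
            k, v = pair.split("=", 1)
            labels[k] = v.strip('"')
    return name, labels, float(value)

print(parse_sample('cpu_usage{host="api-17",region="us-east-1"} 0.82'))
```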

Push-based collection: Agents or services send metrics to collectors.

  • Good for edge devices, mobile clients, batch jobs, and locked-down networks
  • Easier to buffer locally during network issues
  • Needs stronger backpressure and authentication because clients can flood collectors

At scale, both models usually converge on a collector tier. Collectors normalize samples, reject bad labels, compress batches, and write to a durable buffer before storage.
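A collector's core loop can be sketched as validate-then-batch. The banned labels, batch size, and sink interface here are illustrative assumptions:

```python
BANNED_LABELS = {"user_id", "request_id"}  # obvious per-entity IDs
BATCH_SIZE = 3                             # tiny for demonstration

class Collector:
    def __init__(self, sink):
        self.sink = sink   # durable buffer writer, e.g. a queue producer
        self.batch = []

    def ingest(self, sample):
        name, labels, ts, value = sample
        if BANNED_LABELS & labels.keys():
            return False   # reject high-cardinality labels at the edge
        self.batch.append(sample)
        if len(self.batch) >= BATCH_SIZE:
            self.flush()
        return True

    def flush(self):
        if self.batch:
            self.sink(list(self.batch))
            self.batch.clear()

batches = []
c = Collector(batches.append)
c.ingest(("cpu_usage", {"host": "a"}, 1, 0.5))                 # accepted
c.ingest(("cpu_usage", {"user_id": "u_928374923"}, 2, 0.9))    # rejected
```

Batching amortizes the per-write cost, and rejecting bad labels here protects the storage tier before the damage is done.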


#Rollups and Retention

Raw metrics are expensive. If 500,000 nodes emit 200 samples every 10 seconds, the system receives 10 million samples per second.

Dashboards rarely need all raw samples forever. They need useful summaries:

raw samples:    keep 7 days
10s rollups:    keep 14 days
1m rollups:     keep 30 days
1h rollups:     keep 1 year

Rollups are precomputed aggregates over time buckets. A 1-minute bucket might store count, sum, min, max, and p95 sketches. The dashboard can render a month of CPU data without scanning every raw point.

This is the main interview point: do not send every dashboard query to raw storage. Pre-aggregate into time buckets.
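Downsampling raw points into rollup buckets can be sketched in a few lines. This version keeps count, sum, min, and max per bucket; a production system would also maintain percentile sketches:

```python
from collections import defaultdict

def rollup(points, bucket_s=60):
    # points: iterable of (unix_ts, value); returns bucket start -> summary
    buckets = defaultdict(list)
    for ts, v in points:
        buckets[ts - ts % bucket_s].append(v)
    return {
        start: {"count": len(vs), "sum": sum(vs), "min": min(vs), "max": max(vs)}
        for start, vs in buckets.items()
    }

raw = [(0, 0.5), (30, 0.7), (60, 0.9)]
# two points land in the bucket starting at 0, one in the bucket at 60
print(rollup(raw))
```

Because count and sum are stored (not just the average), coarser rollups can be built from finer ones without touching raw data again.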


#Cardinality Control

Cardinality is the number of unique time series.

This metric has low cardinality:

http_requests_total{service="checkout", status="200"}

This one can destroy the system:

http_requests_total{service="checkout", user_id="u_928374923"}

Adding user_id creates a separate time series for every user. Add request_id and it becomes almost one series per request. Indexes grow, memory grows, compaction gets slower, and queries become expensive.

Production controls:

  • per-tenant active series quotas
  • label allow-lists or deny-lists
  • rejection of obvious unique IDs
  • separate traces/logs for high-cardinality debugging data
  • downsampling and retention tiers
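The first control above, a per-tenant active-series quota, can be sketched as a set of seen series keys per tenant. The limit here is deliberately tiny for illustration:

```python
MAX_SERIES_PER_TENANT = 2  # illustrative; real quotas are in the millions

class SeriesLimiter:
    def __init__(self, limit=MAX_SERIES_PER_TENANT):
        self.limit = limit
        self.active = {}   # tenant -> set of active series keys

    def admit(self, tenant, series_key):
        seen = self.active.setdefault(tenant, set())
        if series_key in seen:
            return True    # existing series: always accepted
        if len(seen) >= self.limit:
            return False   # new series over quota: rejected at ingest
        seen.add(series_key)
        return True

limiter = SeriesLimiter()
print(limiter.admit("tenant-a", 'cpu_usage{host="h1"}'))  # True
```

A real implementation also expires idle series so the quota measures *active* cardinality, not all series ever seen.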

In interviews, call this "cardinality explosion." It is one of the most important monitoring-system failure modes.


#Query Patterns

TSDB queries are range queries over metric names and labels:

avg(cpu_usage{service="checkout", region="us-east-1"}) by host over 1h step 60s
rate(http_requests_total{status="500"}[5m])
p95(request_latency_ms{service="api"}) over 30m

The storage engine should make these operations cheap:

  • locate matching series by label index
  • scan only blocks for the requested time range
  • use rollups when the query step is coarse
  • cache common dashboard panels

The API should expose time range, step size, metric name, filters, and aggregation. A dashboard does not want one data point. It wants a time-aligned vector.
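A step-aligned range read can be sketched as bucketing samples by step and keeping the latest sample per bucket. The "last sample wins" rule is one common choice, not the only one:

```python
def range_query(points, start, end, step):
    # points: sorted (ts, value) pairs; returns [(bucket_ts, value), ...]
    slots = {}
    for ts, v in points:
        if start <= ts < end:
            slots[ts - (ts - start) % step] = v  # latest sample in each step
    return sorted(slots.items())

pts = [(0, 1.0), (30, 2.0), (60, 3.0), (90, 4.0)]
print(range_query(pts, start=0, end=120, step=60))
# [(0, 2.0), (60, 4.0)]
```

This is exactly the "time-aligned vector" a dashboard wants: one value per step, ready to plot.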


#Common Interview Mistakes

Storing raw points forever. This fails on cost and query speed. Mention retention and rollups.

Ignoring cardinality. Labels are powerful, but uncontrolled labels can take the system down.

Writing directly to the TSDB from every agent. A collector tier and durable buffer are safer during bursts.

Treating alerting as an afterthought. Monitoring exists so humans can react. Alert evaluation needs windows, deduplication, and routing.
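Window-based alert evaluation with deduplication can be sketched as a threshold check plus an active-alert set, so a still-firing alert does not re-notify. The threshold and states here are illustrative:

```python
def evaluate(window_values, threshold, active_alerts, name):
    # Fire on window-average breach; dedupe while the alert stays active.
    breached = sum(window_values) / len(window_values) > threshold
    if breached and name not in active_alerts:
        active_alerts.add(name)
        return "fire"
    if not breached and name in active_alerts:
        active_alerts.discard(name)
        return "resolve"
    return "noop"  # still firing or still healthy: no duplicate notification

active = set()
print(evaluate([0.90, 0.95], 0.8, active, "cpu_high"))  # fire
print(evaluate([0.92, 0.90], 0.8, active, "cpu_high"))  # noop (deduplicated)
print(evaluate([0.20, 0.30], 0.8, active, "cpu_high"))  # resolve
```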


#Summary: What to Remember

  • Time-series databases optimize append-heavy samples and time-range queries.
  • Push and pull ingestion both need collector and backpressure strategy at scale.
  • Rollups make dashboards cheap; retention tiers keep storage cost bounded.
  • Cardinality control is mandatory for label-heavy metrics.
  • Query APIs should be designed around metric, labels, time range, step, and aggregation.