Metrics Monitoring & Alerting

Building a high-write telemetry pipeline with rollups and alert evaluation

System Design Sandbox · 14 min read
Learn how to design a metrics monitoring and alerting platform. Covers push vs pull ingestion, Kafka buffering, Flink rollups, TSDB storage, cardinality control, dashboard queries, and alert quality.

#Introduction

The interviewer says: "Design a metrics monitoring system."

You start with agents sending CPU and memory every few seconds. Then the follow-up comes: "What if 500,000 nodes emit 200 metrics every 10 seconds? What if a developer adds request_id as a label? What if alerts flap every time CPU spikes for one second?"

Now this is not a simple dashboard. It is a high-write telemetry pipeline with time-series storage, Kafka for burst buffering, Flink for windowed rollups, and alert evaluation that has to be useful instead of noisy.


#Functional Requirements

1. Metric ingestion

  • Agents or exporters submit samples with metric name, labels, timestamp, and value
  • The system supports push-based agents and can also support pull-based scraping through collectors
  • Collectors validate labels, batch samples, compress payloads, and write to a durable buffer

2. Rollup visualization

  • Users can build dashboards for CPU, memory, latency, error rate, and custom metrics
  • Dashboards query time ranges with a step size such as 10 seconds, 1 minute, or 1 hour
  • The system stores raw samples briefly and precomputed rollups for longer periods

3. Threshold alerting

  • Users define rules such as "CPU > 90% for 5 minutes"
  • Alert evaluation runs on recent windows, not on one isolated sample
  • Active alerts are deduplicated and routed to channels such as email, Slack, or PagerDuty

#Non-Functional Requirements

High write throughput

At 10 million samples per second, direct writes from every agent to storage will not hold up. Use collectors and Kafka to absorb bursts, then scale stream processors and TSDB writers horizontally.

Cardinality control

Labels create time series. A metric with service, region, and status is manageable. A metric with request_id or user_id can create millions of unique series. Enforce quotas, label allow-lists, and retention tiers.
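To make the blowup concrete, here is a minimal sketch of how a collector might estimate series cardinality before accepting a metric: the series count is the product of distinct values per label, so one unbounded label dominates everything else. The function names and the quota value are illustrative, not part of any real collector API.

```python
# Sketch: estimate how many unique series a label set can create.
# Series count is the product of distinct values per label, so a single
# per-request label (request_id) explodes the total on its own.

def estimated_series(label_value_counts):
    total = 1
    for count in label_value_counts.values():
        total *= count
    return total

def within_quota(label_value_counts, max_series=100_000):
    # Hypothetical per-tenant quota check a collector could enforce.
    return estimated_series(label_value_counts) <= max_series

# Bounded labels stay manageable:
bounded = {"service": 50, "region": 6, "status": 5}      # 1,500 series
# One per-request label dwarfs the rest:
unbounded = {"service": 50, "region": 6, "request_id": 1_000_000}
```
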

Low-latency queries

Dashboards should read precomputed rollups. A graph refresh should not scan billions of raw samples.

High availability

Telemetry bursts should not take down ingestion. Collectors should be stateless, Kafka should retain data during downstream outages, and query APIs should degrade gracefully if raw data is delayed.


#API Design

Write metrics

POST /api/v1/metrics
Content-Encoding: gzip
{
  "samples": [
    {
      "name": "cpu_usage",
      "labels": {
        "host": "api-17",
        "service": "checkout",
        "region": "us-east-1"
      },
      "timestamp": "2026-04-20T12:00:00Z",
      "value": 0.82
    }
  ]
}
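The agent side of this call can be sketched with the standard library: batch samples, serialize to JSON, and gzip the body before POSTing, matching the `Content-Encoding: gzip` header above. The function name is an assumption; actually sending the bytes is left to whatever HTTP client the agent uses.

```python
import gzip
import json

# Sketch of the agent-side encoding for POST /api/v1/metrics:
# wrap samples in the payload shape shown above and gzip the body.

def encode_batch(samples):
    body = json.dumps({"samples": samples}).encode("utf-8")
    return gzip.compress(body)

sample = {
    "name": "cpu_usage",
    "labels": {"host": "api-17", "service": "checkout", "region": "us-east-1"},
    "timestamp": "2026-04-20T12:00:00Z",
    "value": 0.82,
}
payload = encode_batch([sample])
```
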

Query metrics

GET /api/v1/metrics/query?metric=cpu_usage&from=2026-04-20T11:00:00Z&to=2026-04-20T12:00:00Z&step=60s

Create alert rule

POST /api/v1/alerts
{
  "metric": "cpu_usage",
  "filter": {
    "service": "checkout"
  },
  "threshold": 0.9,
  "operator": ">",
  "window": "5m",
  "channels": ["pagerduty-primary"]
}

#High Level Design

Agents / Exporters → Collectors → Kafka → Flink Windowing → TSDB
TSDB → Query API → Grafana / UI
TSDB → Alert Evaluator

#Key Components

Agents / exporters

Run on hosts or inside services. They collect CPU, memory, request latency, queue depth, and custom application metrics.

Collectors

Stateless ingestion services. They authenticate agents, validate label cardinality, normalize samples, and write batches to Kafka.

Kafka

Durable buffer between ingestion and processing. It absorbs telemetry bursts and allows stream processors to replay data after deploys or incidents.

Flink windowing

Consumes raw samples, computes rollups, handles late data, and writes compact aggregates to the TSDB.
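The core of that job can be sketched in plain Python (the real thing would be a Flink windowed aggregation): bucket raw samples into tumbling windows keyed by series, and emit compact aggregates per window. The 10-second window and the count/sum/max aggregate set are assumptions for illustration.

```python
from collections import defaultdict

# Sketch of tumbling-window rollup logic: group samples by
# (metric name, label set, window start) and keep count/sum/max.

WINDOW_SEC = 10

def rollup(samples):
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for s in samples:
        window_start = (s["ts"] // WINDOW_SEC) * WINDOW_SEC
        key = (s["name"], frozenset(s["labels"].items()), window_start)
        b = buckets[key]
        b["count"] += 1
        b["sum"] += s["value"]
        b["max"] = max(b["max"], s["value"])
    return dict(buckets)
```

A production job would also hold windows open briefly for late samples before flushing aggregates to the TSDB.
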

TSDB

Stores raw samples for short retention and rollups for long retention. It indexes metric names and labels for time-range queries.

Query API and dashboards

The query API reads rollups from the TSDB and powers dashboard graphs. Grafana or an internal UI sits on top of it.

Alert evaluator

Reads recent windows, evaluates rules, tracks active incidents, and sends notifications.


#Detailed Design

#Query Path

The dashboard sends metric, filters, time range, and step. The query service chooses the cheapest resolution:

last 15 minutes, step 10s -> 10s rollup
last 7 days, step 1h      -> 1h rollup

Common dashboard panels can be cached for a few seconds. That short cache is enough to protect the TSDB when many engineers open the same incident dashboard.
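The resolution choice above can be sketched as a small function: pick the coarsest stored rollup that does not exceed the requested step. The 10s/1m/1h tiers are the article's example resolutions, not a fixed product decision.

```python
# Sketch: choose the cheapest rollup resolution for a query.
# Stored tiers (seconds), coarsest last.
RESOLUTIONS = [10, 60, 3600]

def choose_resolution(step_sec):
    # Use the coarsest rollup that is still <= the requested step,
    # so each returned point covers at most one step.
    best = RESOLUTIONS[0]
    for r in RESOLUTIONS:
        if r <= step_sec:
            best = r
    return best
```
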

#Alert Evaluation

Alert rules should use sustained windows:

cpu_usage > 0.90 for 5 minutes
error_rate > 0.05 for 3 consecutive windows
p95_latency_ms > 500 for 10 minutes

The evaluator keeps state for active alerts. It should not page repeatedly for the same incident. It should also support recovery notifications when the metric returns to normal.
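A minimal sketch of that state machine, assuming per-rule state kept in memory: the rule fires only after the condition holds for the full window, stays silent while already active, and emits a recovery event when the metric returns to normal. The event names ("fire", "resolve") are illustrative.

```python
# Sketch of a sustained-window alert rule with dedup and recovery.

class AlertRule:
    def __init__(self, threshold, for_sec):
        self.threshold = threshold
        self.for_sec = for_sec
        self.breach_started = None  # when the condition first held
        self.active = False         # already notified for this incident?

    def evaluate(self, ts, value):
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = ts
            # Fire once, only after the condition has held for for_sec.
            if not self.active and ts - self.breach_started >= self.for_sec:
                self.active = True
                return "fire"
        else:
            self.breach_started = None
            if self.active:
                self.active = False
                return "resolve"
        return None
```
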

#Cardinality Guardrails

Collectors reject or rewrite labels that violate policy. Examples:

  • reject labels named request_id, trace_id, or session_id
  • cap active series per tenant
  • sample or drop abusive custom metrics
  • move high-cardinality debugging data to logs or traces
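The first two guardrails can be sketched as a collector-side policy check: reject a sample outright if it carries a forbidden per-request label, and otherwise strip labels outside an allow-list. The specific label lists here are assumptions.

```python
# Sketch of collector-side label policy enforcement.

FORBIDDEN = {"request_id", "trace_id", "session_id"}
ALLOWED = {"host", "service", "region", "status"}

def sanitize(labels):
    # Reject the sample outright if it smuggles a per-request label.
    if FORBIDDEN & labels.keys():
        return None
    # Otherwise keep only allow-listed labels.
    return {k: v for k, v in labels.items() if k in ALLOWED}
```
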

#Common Interview Mistakes

Skipping rollups. Raw samples are too expensive to scan for every dashboard query.

Ignoring cardinality. This is the classic monitoring-system failure mode.

Writing directly to the TSDB. A queue or log protects the system during bursts.

Alerting on single points. Good alerts use windows, deduplication, and recovery states.