# Introduction
Kafka stores the stream. Flink computes on it.
Apache Flink is a stream processing engine for stateful, windowed, low-latency computation. It is useful when raw events need to become aggregates, alerts, joins, or enriched facts before they land in storage.
In system design interviews, Flink usually appears in analytics pipelines: metrics rollups, click aggregation, fraud detection, real-time dashboards, and alert evaluation.
# Stream Processing Model
Flink jobs are pipelines of operators:
source -> parse -> filter -> keyBy -> window -> aggregate -> sink
A source reads from Kafka or another stream. Operators transform events. A sink writes results to storage, such as a TSDB, ClickHouse, Druid, or object storage.
The important step is often keyBy. It partitions events by a key so related events go to the same operator instance:
keyBy(metric_name + label_set)
keyBy(ad_id)
keyBy(account_id)
Once events are keyed, Flink can maintain state per key and compute rolling results.
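A plain-Python sketch (not the Flink API) of what keyBy plus keyed state amounts to. Events with the same key always update the same state slot, just as they always reach the same operator instance in Flink; the `ad_id` key and event shape here are illustrative assumptions.

```python
from collections import defaultdict

counts = defaultdict(int)  # per-key state: one slot per distinct key

def process(event):
    key = event["ad_id"]   # keyBy(ad_id): route by key
    counts[key] += 1       # rolling per-key count survives across events
    return key, counts[key]

events = [{"ad_id": "a1"}, {"ad_id": "a2"}, {"ad_id": "a1"}]
results = [process(e) for e in events]
# results == [("a1", 1), ("a2", 1), ("a1", 2)]
```

In real Flink, the dictionary would be managed keyed state (e.g. backed by RocksDB) so it can be checkpointed and scaled across operator instances.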
# Windowing
Windows group events by time.
Tumbling windows are fixed, non-overlapping buckets:
12:00:00-12:00:59
12:01:00-12:01:59
12:02:00-12:02:59
Use them for minute-level click counts or metrics rollups.
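Tumbling-window assignment is just bucketing a timestamp by the window size. A minimal sketch, with timestamps as seconds since midnight for readability:

```python
def tumbling_window(ts_seconds, size=60):
    """Map an event timestamp to its fixed, non-overlapping [start, end) bucket."""
    start = ts_seconds - (ts_seconds % size)
    return (start, start + size)

# 12:00:30 and 12:00:59 fall in the same minute bucket; 12:01:00 starts the next.
assert tumbling_window(43230) == tumbling_window(43259)  # both -> (43200, 43260)
assert tumbling_window(43260) == (43260, 43320)
```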
Sliding windows overlap:
last 5 minutes, updated every 30 seconds
Use them for smoother alert evaluation or rolling averages.
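The key property of sliding windows is that one event belongs to several windows at once. A sketch under the same seconds-since-midnight assumption; the size/slide values are illustrative:

```python
def sliding_windows(ts, size=300, slide=30):
    """Return every (start, end) window that contains timestamp ts."""
    last_start = ts - (ts % slide)                 # latest window starting at/before ts
    starts = range(last_start, ts - size, -slide)  # walk back while ts < start + size
    return [(s, s + size) for s in starts if s >= 0]

# A 5-minute window sliding every 30s: each event lands in size // slide = 10 windows,
# so each event contributes to 10 overlapping aggregates.
windows = sliding_windows(600)
```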
Session windows group events separated by gaps. They are common in user analytics.
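Session windows have no fixed boundaries; a new session starts whenever the gap since the previous event exceeds a threshold. A simplified single-key sketch (the 30-minute gap is an illustrative choice):

```python
def session_windows(timestamps, gap=1800):
    """Group sorted event timestamps into sessions split by gaps > `gap` seconds."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] > gap:   # inactivity gap exceeded: close the session
            sessions.append(current)
            current = [ts]
        else:
            current.append(ts)
    sessions.append(current)
    return sessions

clicks = [0, 60, 120, 4000, 4030]  # a long pause splits two browsing sessions
# session_windows(clicks) == [[0, 60, 120], [4000, 4030]]
```

In Flink this grouping happens per key (e.g. per user), and sessions can merge as late events fill gaps, which this sketch ignores.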
The interview point: do not update a database counter for every event. Compute windowed aggregates in the stream processor, then write compact results.
# State and Checkpointing
Flink is stateful. It can remember counts, deduplication sets, last-seen values, joins, and rolling aggregates.
State is only useful if it survives failures. Checkpointing periodically snapshots operator state and source offsets. If a worker crashes, Flink restores from the latest checkpoint and resumes from a known position.
This matters for correctness:
- a metrics rollup should not skip a minute after a crash
- a click aggregator should not double-count already processed clicks
- an alert evaluator should not forget which incidents are already active
Exactly-once results require checkpointed source reads and idempotent or transactional sink writes.
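The recovery mechanics can be sketched in plain Python: a checkpoint atomically pairs the source offset with a snapshot of operator state, and recovery rolls both back together. The event values and crash point are illustrative.

```python
import copy

events = [1, 2, 3, 4, 5]   # stand-in for a Kafka partition
state = {"sum": 0}          # operator state: a running aggregate
checkpoint = None

for offset, value in enumerate(events):
    state["sum"] += value
    if offset == 2:  # periodic checkpoint: snapshot state + next offset together
        checkpoint = (offset + 1, copy.deepcopy(state))
    if offset == 3:
        break        # simulated crash: work since the checkpoint is lost

# Recovery: restore state from the snapshot and re-read from the saved offset.
resume_offset, state = checkpoint
for value in events[resume_offset:]:
    state["sum"] += value

# state["sum"] == 15, identical to a failure-free run: nothing lost, nothing double-counted
```

Note that any results already emitted to a sink before the crash would be emitted again on replay, which is exactly why end-to-end exactly-once also needs idempotent or transactional sink writes.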
# Late Events and Watermarks
Events do not always arrive in timestamp order.
A host can buffer metrics during a network issue. A mobile click can arrive late. A Kafka partition can lag. Flink uses watermarks to estimate event-time progress and decide when a window is complete.
Example:
window: 12:00-12:01
allowed lateness: 30 seconds
finalize after watermark passes 12:01:30
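The example above can be expressed as a small decision function. This is a sketch of the bounded-lateness idea, not Flink's watermark API; timestamps are seconds since midnight.

```python
def finalize(watermark_ts, window_end, allowed_lateness=30):
    """A window is finalized once the watermark passes its end plus allowed lateness."""
    return watermark_ts > window_end + allowed_lateness

window_end = 43260  # the 12:00-12:01 window ends at 12:01:00

assert not finalize(43260, window_end)  # watermark at 12:01:00: still accept late events
assert not finalize(43290, window_end)  # watermark at 12:01:30: last moment to wait
assert finalize(43291, window_end)      # watermark past 12:01:30: finalize, drop stragglers
```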
Late events force a business decision. You can drop them, update old windows, or emit corrections downstream. Monitoring systems and ad billing systems should define this policy explicitly.
# Common Interview Mistakes
- Saying "Kafka will aggregate it." Kafka stores events; a stream processor computes aggregates.
- Ignoring late data. Event time and processing time are different.
- Skipping checkpointing. Without checkpointing, worker crashes can lose state or double-count results.
- Writing every event to an analytics database. Windowed aggregation reduces write load and query cost.
# Summary: What to Remember
- Flink performs stateful stream processing over event logs.
- Windowing turns raw events into time buckets.
- Checkpointing protects state and offsets across failures.
- Watermarks handle out-of-order and late events.
- End-to-end exactly-once still depends on sink behavior and idempotency.