Async Job Worker Pools

Queueing work, leasing jobs, and scaling workers by resource profile

System Design Sandbox · 10 min read
Learn how asynchronous worker pools process expensive jobs without blocking API requests. Covers submission lifecycle, durable queues, worker leases, heartbeats, retries, timeouts, dead-letter queues, backpressure, and pool isolation.

#Introduction

The user clicks Submit. The code might compile for three seconds, run 40 test cases, time out, or crash.

If the API server waits synchronously for all of that, every in-flight submission holds a request thread for its full run, and the system falls over during a contest.

Online judges, video transcoders, PDF generators, and ML batch inference services all use the same shape: accept work quickly, put it on a message queue, process it with workers, and publish the result.

This is the execution backbone for the Online Judge solution, but the same pattern appears in video processing, notification delivery, and other asynchronous processing systems.


#Submission Lifecycle

A solid async lifecycle looks like this:

POST /submissions
  -> validate request
  -> persist submission
  -> enqueue execution job
  -> return 202 Accepted

Then a worker:

claim job
  -> create sandbox or runtime
  -> run compile step
  -> run tests
  -> persist verdict
  -> publish result event

[Diagram: API Server, Durable Queue, Worker Pool, Lease + Heartbeat, Result Store, Result Pub/Sub]

The important detail is that the request thread's work ends as soon as the job is durably enqueued. The user receives the final verdict later through WebSockets or server-sent events, with polling as a fallback. The API pattern is similar to the 202 Accepted flow in Message Queues.
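The API half of the lifecycle can be sketched in a few lines. This is a minimal illustration, not a production handler: `SubmissionService`, its field names, and the set of supported languages are hypothetical, and an in-memory dict and deque stand in for the database and the durable queue.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SubmissionService:
    # In-memory stand-ins (assumptions) for a database table and a durable queue.
    submissions: dict = field(default_factory=dict)
    queue: deque = field(default_factory=deque)

    def submit(self, user_id: str, language: str, source: str) -> tuple[int, dict]:
        # 1. Validate the request before doing any work.
        if not source.strip():
            return 400, {"error": "empty source"}
        if language not in {"python", "java", "cpp"}:
            return 400, {"error": "unsupported language"}

        # 2. Persist the submission so it survives a crash after this point.
        submission_id = str(uuid.uuid4())
        self.submissions[submission_id] = {
            "user_id": user_id,
            "language": language,
            "source": source,
            "status": "queued",
        }

        # 3. Enqueue a small job that references the submission by id.
        self.queue.append({"submission_id": submission_id, "language": language})

        # 4. Return right away; the verdict is delivered later.
        return 202, {"submission_id": submission_id, "status": "queued"}
```

The handler never compiles or runs anything. Its only jobs are to reject bad input, make the submission durable, and hand the client an id it can use to receive the verdict.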


#Worker Pool Design

Worker pools should be split by resource profile.

Examples:

  • Python workers
  • Java workers
  • C++ compile-heavy workers
  • GPU workers
  • contest-priority workers

This keeps one workload from starving another. When each profile has its own queue and autoscaling policy, a C++ compile storm cannot block lightweight Python submissions.
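One way to express the split is a routing function that picks a queue per job. The sketch below is illustrative only: the queue names and the `contest` and `needs_gpu` job fields are assumptions, not part of any particular queueing product.

```python
# Hypothetical queue names, one per resource profile.
LANGUAGE_QUEUES = {
    "python": "python",
    "java": "java",
    "cpp": "cpp-compile",
}


def queue_for(job: dict) -> str:
    """Pick the queue (and therefore the worker pool) for a job."""
    # Contest traffic gets reserved capacity regardless of language.
    if job.get("contest"):
        return "contest-priority"
    # GPU jobs must land on hosts that actually have a GPU.
    if job.get("needs_gpu"):
        return "gpu"
    # Everything else is routed by language, with a catch-all pool.
    return LANGUAGE_QUEUES.get(job["language"], "default")
```

Because each queue is consumed by its own pool, a backlog on `cpp-compile` only delays C++ jobs, and each pool can scale on its own queue depth.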

Each worker should have:

  • a maximum number of concurrent jobs
  • health checks
  • heartbeat while running a job
  • graceful shutdown behavior
  • language image cache
  • per-job timeout
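Three of these properties (bounded concurrency, a heartbeat while a job runs, and a per-job timeout) fit in a short asyncio sketch. The names `run_job`, `worker`, and `on_heartbeat` are hypothetical, and the jobs here are plain coroutines; a real judge would also have to kill the sandbox process when the timeout fires.

```python
import asyncio


async def run_job(job_id, work, *, timeout_s, heartbeat_s, on_heartbeat):
    """Run one job with a heartbeat alongside it and a hard time limit."""

    async def beat():
        # Report liveness until the job finishes and this task is cancelled.
        while True:
            on_heartbeat(job_id)
            await asyncio.sleep(heartbeat_s)

    heartbeat = asyncio.create_task(beat())
    try:
        return await asyncio.wait_for(work(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Exceeding the limit is recorded as a result, not raised as an error.
        return "timed_out"
    finally:
        heartbeat.cancel()


async def worker(jobs, *, max_concurrent, timeout_s=1.0, heartbeat_s=0.05,
                 on_heartbeat=lambda job_id: None):
    """Process a dict of job_id -> coroutine function with bounded concurrency."""
    slots = asyncio.Semaphore(max_concurrent)
    results = {}

    async def guarded(job_id, work):
        # At most `max_concurrent` jobs hold a slot at any moment.
        async with slots:
            results[job_id] = await run_job(
                job_id, work, timeout_s=timeout_s,
                heartbeat_s=heartbeat_s, on_heartbeat=on_heartbeat,
            )

    await asyncio.gather(*(guarded(job_id, work) for job_id, work in jobs.items()))
    return results
```

The semaphore is what keeps a burst of jobs from oversubscribing the host: extra jobs wait for a slot instead of all starting at once.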

The queue message should be small. Store source code and test cases in durable storage, then pass references in the job payload. For judge systems, that durable storage decision connects to Sandboxed Code Execution; for media systems, it often connects to Object Storage & CDN.
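A small payload might look like the sketch below. The `ExecutionJob` fields and the storage key format are assumptions for illustration; the point is that the message carries references and limits, never the source code or test data itself.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExecutionJob:
    submission_id: str
    language: str
    source_ref: str       # hypothetical storage key for the source code
    tests_ref: str        # hypothetical storage key for the test bundle
    time_limit_ms: int
    memory_limit_mb: int

    def to_message(self) -> bytes:
        # Serialize to the bytes that would be placed on the queue.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_message(cls, raw: bytes) -> "ExecutionJob":
        # Rebuild the job on the worker side.
        return cls(**json.loads(raw.decode("utf-8")))


job = ExecutionJob(
    submission_id="sub-123",
    language="cpp",
    source_ref="submissions/sub-123/main.cpp",
    tests_ref="problems/two-sum/tests-v4",
    time_limit_ms=2000,
    memory_limit_mb=256,
)
message = job.to_message()
```

The message stays a few hundred bytes no matter how large the submission or test suite is, and the worker fetches the referenced objects only when it actually claims the job.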


#Retries, Timeouts, and Backpressure

Retries are for infrastructure failures, not wrong answers.

Retry when:

  • worker crashes before recording a verdict
  • sandbox cannot start because the host is unhealthy
  • result persistence fails because of a transient database issue

Do not retry when:

  • code fails to compile
  • test output is wrong
  • submission exceeds time or memory limits
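The two lists above translate into a simple classification. This is a sketch with hypothetical outcome names and an assumed cap of three attempts; the split between retryable and terminal outcomes follows the lists directly.

```python
from enum import Enum


class Outcome(Enum):
    # Verdicts about the user's code: valid, final results.
    ACCEPTED = "accepted"
    COMPILE_ERROR = "compile_error"
    WRONG_ANSWER = "wrong_answer"
    TIME_LIMIT_EXCEEDED = "time_limit_exceeded"
    MEMORY_LIMIT_EXCEEDED = "memory_limit_exceeded"
    # Infrastructure failures: the job never got a fair run.
    WORKER_CRASHED = "worker_crashed"
    SANDBOX_START_FAILED = "sandbox_start_failed"
    RESULT_WRITE_FAILED = "result_write_failed"


RETRYABLE = {
    Outcome.WORKER_CRASHED,
    Outcome.SANDBOX_START_FAILED,
    Outcome.RESULT_WRITE_FAILED,
}


def should_retry(outcome: Outcome, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only infrastructure failures, and only up to max_attempts."""
    return outcome in RETRYABLE and attempt < max_attempts
```

Keeping the retryable set explicit means a new verdict type defaults to "do not retry", which is the safer failure mode for user-caused outcomes.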

Use visibility timeouts or leases so stuck jobs can be reclaimed. Use dead-letter queues for repeated infrastructure failures. Apply backpressure when queues grow: rate-limit submissions, reserve contest capacity, or show delayed feedback instead of letting the whole platform degrade. For duplicate deliveries, pair this with Idempotency & Deduplication.
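The lease and dead-letter mechanics can be sketched as a toy in-memory queue. `LeaseQueue` and its method names are hypothetical, and the caller passes the current time as `now` so the behavior is easy to follow; managed queues provide the same idea through visibility timeouts and redrive settings.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseQueue:
    lease_s: float = 30.0
    max_attempts: int = 3
    ready: list = field(default_factory=list)        # job ids waiting to be claimed
    leases: dict = field(default_factory=dict)       # job_id -> lease expiry time
    attempts: dict = field(default_factory=dict)     # job_id -> number of claims
    dead_letter: list = field(default_factory=list)  # jobs that kept failing

    def enqueue(self, job_id):
        self.ready.append(job_id)

    def claim(self, now):
        """Hand out the next job under a lease, or None if nothing is ready."""
        self.reclaim_expired(now)
        if not self.ready:
            return None
        job_id = self.ready.pop(0)
        self.attempts[job_id] = self.attempts.get(job_id, 0) + 1
        self.leases[job_id] = now + self.lease_s
        return job_id

    def heartbeat(self, job_id, now):
        """Extend a live lease; returns False if the lease is gone or expired."""
        if self.leases.get(job_id, now) <= now:
            return False
        self.leases[job_id] = now + self.lease_s
        return True

    def ack(self, job_id):
        """The worker recorded a verdict, so the job is done."""
        self.leases.pop(job_id, None)

    def reclaim_expired(self, now):
        """Requeue jobs whose worker went silent; dead-letter repeat offenders."""
        for job_id, expiry in list(self.leases.items()):
            if expiry <= now:
                del self.leases[job_id]
                if self.attempts[job_id] >= self.max_attempts:
                    self.dead_letter.append(job_id)
                else:
                    self.ready.append(job_id)
```

A worker that crashes simply stops sending heartbeats, so its lease expires and the job becomes claimable again. A job that exhausts its attempts moves to the dead-letter list for inspection instead of cycling forever.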


#Common Interview Mistakes

Mistake 1: Blocking the API request.

Return 202 Accepted and process asynchronously.

Mistake 2: One global worker pool.

Different languages and job sizes need different pools and limits.

Mistake 3: Retrying user failures.

A wrong answer is a valid result, not a retryable failure.

Mistake 4: Ignoring job leases.

Without leases or heartbeats, a crashed worker can leave a submission stuck forever.


#Summary: What to Remember

Async worker pools separate admission from execution.

Persist the submission, enqueue a durable job, process it with bounded workers, and publish the result. Design retries around infrastructure failures, use leases for stuck jobs, and split worker pools by resource profile when one workload can starve another.

Related articles: Message Queues, Async Processing, Sandboxed Code Execution, WebSockets & Real-Time Communication, and Design an Online Judge.