Async Job Worker Pools

#Introduction

The user clicks Submit. The code might compile for three seconds, run 40 test cases, time out, or crash.

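If the API server holds the request open for all of that, every in-flight submission ties up a connection and a request thread, and the system falls over when a contest sends a burst of submissions.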

Online judges, video transcoders, PDF generators, and ML batch inference services all use the same shape: accept work quickly, put it on a message queue, process it with workers, and publish the result.

This is the execution backbone for the Online Judge solution, but the same pattern appears in video processing, notification delivery, and other asynchronous processing systems.

#Submission Lifecycle

A solid async lifecycle looks like this:

POST /submissions
  -> validate request
  -> persist submission
  -> enqueue execution job
  -> return 202 Accepted
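
The admission path above can be sketched as a single handler. This is a minimal illustration, not a production endpoint: the in-memory `SUBMISSIONS` dict and `JOB_QUEUE` stand in for a real database and a durable message queue, and the names `post_submission` and `SUPPORTED_LANGUAGES` are assumptions made for the example.

```python
import queue
import uuid

# Hypothetical in-memory stand-ins for a database and a durable message queue.
SUBMISSIONS = {}
JOB_QUEUE = queue.Queue()

SUPPORTED_LANGUAGES = {"python", "java", "cpp"}  # assumed set, for illustration


def post_submission(body):
    """Handle POST /submissions and return (status_code, response_body)."""
    # 1. Validate the request before doing any work.
    language = body.get("language")
    source = body.get("source")
    if language not in SUPPORTED_LANGUAGES or not source:
        return 400, {"error": "invalid submission"}

    # 2. Persist the submission so it survives a crash.
    submission_id = str(uuid.uuid4())
    SUBMISSIONS[submission_id] = {
        "language": language,
        "source": source,
        "status": "queued",
        "verdict": None,
    }

    # 3. Enqueue a small job that references the stored submission.
    JOB_QUEUE.put({"submission_id": submission_id, "language": language})

    # 4. Return immediately; execution happens in a worker.
    return 202, {"submission_id": submission_id, "status": "queued"}
```

The handler never compiles or runs anything. Its cost is one write and one enqueue, which is what keeps response times flat under load.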

Then a worker:

claim job
  -> create sandbox or runtime
  -> run compile step
  -> run tests
  -> persist verdict
  -> publish result event
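
The worker steps above might look like the following sketch. Sandbox creation is not modeled here: the `compile_fn` and `run_test_fn` parameters are hypothetical hooks where a real sandbox or runtime would plug in, and `store` and `events` are in-memory stand-ins for the database and the result channel.

```python
import queue


def process_job(job, store, events, compile_fn, run_test_fn):
    """Run one claimed job through compile, test, persist, and publish."""
    # The job carries only an id; source and tests are fetched by reference.
    submission = store[job["submission_id"]]

    # Compile step: a compile error is a final verdict, not a crash.
    ok, artifact = compile_fn(submission["source"])
    if not ok:
        verdict = "compile_error"
    else:
        verdict = "accepted"
        for test in submission["tests"]:
            if run_test_fn(artifact, test["input"]) != test["expected"]:
                verdict = "wrong_answer"
                break

    # Persist the verdict before announcing it.
    submission["status"] = "done"
    submission["verdict"] = verdict

    # Publish a result event for WebSocket, SSE, or polling consumers.
    events.put({"submission_id": job["submission_id"], "verdict": verdict})
    return verdict


# Toy stand-ins so the sketch runs without a real sandbox.
def fake_compile(source):
    # "Compiles" any non-empty source into a function that doubles its input.
    return (bool(source), lambda x: x * 2)


def fake_run(artifact, test_input):
    return artifact(test_input)


store = {
    "sub-1": {
        "source": "double(x)",
        "tests": [{"input": 2, "expected": 4}],
        "status": "queued",
        "verdict": None,
    }
}
events = queue.Queue()
verdict = process_job({"submission_id": "sub-1"}, store, events, fake_compile, fake_run)
```

Persisting before publishing means a consumer that reacts to the event can always read the stored verdict.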

The important detail is that the request handler's work ends once the job is durably enqueued; it never waits for execution. The user can receive the final verdict through WebSockets or server-sent events, with polling as a fallback. The API pattern is similar to the 202 Accepted flow in Message Queues.

#Worker Pool Design

Worker pools should be split by resource profile.

Examples:

Python workers
Java workers
C++ compile-heavy workers
GPU workers
contest-priority workers

This prevents one workload from starving another. With separate queues and autoscaling policies, a C++ compile storm cannot block lightweight Python submissions.
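
One way to express this split is a routing function that picks a queue per job. The queue names and the `contest` and `needs_gpu` job fields below are hypothetical, chosen only to mirror the example pools listed above.

```python
# Hypothetical queue names, one per resource profile.
QUEUE_BY_LANGUAGE = {
    "python": "jobs.python",
    "java": "jobs.java",
    "cpp": "jobs.cpp-compile",
}


def route_job(job):
    """Pick the queue for a job based on its resource profile."""
    # Contest traffic gets reserved capacity regardless of language.
    if job.get("contest"):
        return "jobs.contest-priority"
    if job.get("needs_gpu"):
        return "jobs.gpu"
    language = job.get("language")
    if language not in QUEUE_BY_LANGUAGE:
        raise ValueError(f"no worker pool for language {language!r}")
    return QUEUE_BY_LANGUAGE[language]
```

Because each queue has its own consumers, a backlog on `jobs.cpp-compile` leaves `jobs.python` untouched, and each pool can scale on its own queue depth.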

Each worker should have:

a maximum number of concurrent jobs
health checks
heartbeat while running a job
graceful shutdown behavior
language image cache
per-job timeout
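
A rough sketch of three of these properties (bounded concurrency, heartbeats, and graceful shutdown) using only the Python standard library is shown below. The `Worker` class and its method names are assumptions for illustration. Per-job timeouts and the image cache are left out because they depend on the sandbox, and `healthy()` is only a placeholder for a real health check.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Illustrative worker with a concurrency cap and per-job heartbeats."""

    def __init__(self, max_concurrent_jobs=2, heartbeat_interval=0.05):
        # The pool size is the maximum number of concurrent jobs.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent_jobs)
        self._heartbeat_interval = heartbeat_interval
        self._accepting = True
        self._lock = threading.Lock()
        self.heartbeats = {}  # job_id -> time of the last heartbeat

    def healthy(self):
        # Placeholder health check: healthy while still accepting work.
        return self._accepting

    def submit(self, job_id, fn):
        if not self._accepting:
            raise RuntimeError("worker is shutting down")
        return self._pool.submit(self._run_with_heartbeat, job_id, fn)

    def _beat(self, job_id):
        with self._lock:
            self.heartbeats[job_id] = time.monotonic()

    def _run_with_heartbeat(self, job_id, fn):
        done = threading.Event()

        def beat_loop():
            # Event.wait returns False on timeout, so this loops until done.
            while not done.wait(self._heartbeat_interval):
                self._beat(job_id)

        self._beat(job_id)
        beater = threading.Thread(target=beat_loop, daemon=True)
        beater.start()
        try:
            return fn()
        finally:
            done.set()
            beater.join()

    def shutdown(self):
        # Graceful shutdown: refuse new jobs, let running jobs finish.
        self._accepting = False
        self._pool.shutdown(wait=True)
```

In a real deployment the heartbeat would extend the job's lease in the queue rather than update a local dict, and shutdown would typically be triggered by a termination signal.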

The queue message should be small. Store source code and test cases in durable storage, then pass references in the job payload. For judge systems, that durable storage decision connects to Sandboxed Code Execution; for media systems, it often connects to Object Storage & CDN.
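
As an illustration of a reference-only payload, the sketch below serializes a job to JSON. The `ExecutionJob` fields, the storage key layout, and the limit fields are all hypothetical; the point is only that the message carries pointers and limits, not source code or test data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExecutionJob:
    submission_id: str
    language: str
    source_ref: str  # hypothetical storage key for the source code
    tests_ref: str   # hypothetical storage key for the test-case bundle
    time_limit_ms: int
    memory_limit_mb: int

    def to_message(self) -> bytes:
        """Serialize the job for the queue."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_message(cls, raw: bytes) -> "ExecutionJob":
        """Rebuild the job on the worker side."""
        return cls(**json.loads(raw.decode("utf-8")))


job = ExecutionJob(
    submission_id="sub-123",
    language="python",
    source_ref="submissions/sub-123/main.py",
    tests_ref="problems/two-sum/tests-v3",
    time_limit_ms=2000,
    memory_limit_mb=256,
)
message = job.to_message()
```

The message stays a few hundred bytes no matter how large the submission or test bundle is, which keeps the queue fast and within typical broker size limits.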

#Retries, Timeouts, and Backpressure

Retries are for infrastructure failures, not wrong answers.

Retry when:

worker crashes before recording a verdict
sandbox cannot start because the host is unhealthy
result persistence fails because of a transient database issue

Do not retry when:

code fails to compile
test output is wrong
submission exceeds time or memory limits
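
The two lists above can be captured in a small classifier. This is a sketch under assumptions: the `Outcome` names and the default of three attempts are illustrative, not taken from any particular judge.

```python
from enum import Enum


class Outcome(Enum):
    # Verdicts about the user's code: final, never retried.
    ACCEPTED = "accepted"
    COMPILE_ERROR = "compile_error"
    WRONG_ANSWER = "wrong_answer"
    TIME_LIMIT_EXCEEDED = "time_limit_exceeded"
    MEMORY_LIMIT_EXCEEDED = "memory_limit_exceeded"
    # Infrastructure failures: say nothing about the user's code.
    WORKER_CRASHED = "worker_crashed"
    SANDBOX_START_FAILED = "sandbox_start_failed"
    RESULT_PERSIST_FAILED = "result_persist_failed"


RETRYABLE = frozenset({
    Outcome.WORKER_CRASHED,
    Outcome.SANDBOX_START_FAILED,
    Outcome.RESULT_PERSIST_FAILED,
})


def should_retry(outcome, attempt, max_attempts=3):
    """Retry only infrastructure failures, and only up to max_attempts."""
    return outcome in RETRYABLE and attempt < max_attempts
```

Keeping the retryable set explicit makes it harder for a new verdict type to become retryable by accident.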

Use visibility timeouts or leases so stuck jobs can be reclaimed. Use dead-letter queues for jobs that keep failing for infrastructure reasons. Apply backpressure when queues grow: rate-limit submissions, reserve contest capacity, or tell users that results are delayed instead of letting the whole platform degrade. Retries and lease reclaims can deliver the same job more than once, so pair this with Idempotency & Deduplication.
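
The lease and dead-letter behavior can be sketched with an in-memory queue. The `LeaseQueue` class is hypothetical and single-threaded; real brokers implement these semantics themselves, and the 30-second lease and three-attempt limit are arbitrary defaults. The injectable `clock` exists only so the behavior can be exercised without sleeping.

```python
import time
from collections import deque


class LeaseQueue:
    """In-memory sketch of a queue with leases and a dead-letter list."""

    def __init__(self, lease_seconds=30.0, max_attempts=3, clock=time.monotonic):
        self._ready = deque()
        self._leased = {}    # job_id -> (job, lease_expiry)
        self._attempts = {}  # job_id -> number of times claimed
        self.dead_letters = []
        self._lease_seconds = lease_seconds
        self._max_attempts = max_attempts
        self._clock = clock

    def enqueue(self, job_id, job):
        self._ready.append((job_id, job))

    def claim(self):
        """Lease the next ready job, or return None if nothing is ready."""
        self._reclaim_expired()
        if not self._ready:
            return None
        job_id, job = self._ready.popleft()
        self._attempts[job_id] = self._attempts.get(job_id, 0) + 1
        self._leased[job_id] = (job, self._clock() + self._lease_seconds)
        return job_id, job

    def heartbeat(self, job_id):
        """Extend the lease; raises KeyError if the lease was already lost."""
        job, _ = self._leased[job_id]
        self._leased[job_id] = (job, self._clock() + self._lease_seconds)

    def ack(self, job_id):
        """Mark the job finished so it is never redelivered."""
        self._leased.pop(job_id, None)
        self._attempts.pop(job_id, None)

    def _reclaim_expired(self):
        now = self._clock()
        for job_id, (job, expiry) in list(self._leased.items()):
            if expiry <= now:
                del self._leased[job_id]
                if self._attempts[job_id] >= self._max_attempts:
                    # Repeated failures: park the job for inspection.
                    self.dead_letters.append((job_id, job))
                else:
                    self._ready.append((job_id, job))
```

A worker that crashes simply stops sending heartbeats. Its lease expires, the job returns to the ready list, and after `max_attempts` claims the job lands in `dead_letters` instead of looping forever.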

#Common Interview Mistakes

Mistake 1: Blocking the API request.

Return 202 Accepted and process asynchronously.

Mistake 2: One global worker pool.

Different languages and job sizes need different pools and limits.

Mistake 3: Retrying user failures.

A wrong answer is a valid result, not a retryable failure.

Mistake 4: Ignoring job leases.

Without leases or heartbeats, a crashed worker can leave a submission stuck forever.

#Summary: What to Remember

Async worker pools separate admission from execution.

Persist the submission, enqueue a durable job, process it with bounded workers, and publish the result. Design retries around infrastructure failures, use leases for stuck jobs, and split worker pools by resource profile when one workload can starve another.