#Introduction
The user clicks Submit. The code might take three seconds to compile, run 40 test cases, time out, or crash.
If the API server waits synchronously for all of that, the system falls over during a contest.
Online judges, video transcoders, PDF generators, and ML batch inference services all use the same shape: accept work quickly, put it on a message queue, process it with workers, and publish the result.
This is the execution backbone for the Online Judge solution, but the same pattern appears in video processing, notification delivery, and other asynchronous processing systems.
#Submission Lifecycle
A solid async lifecycle looks like this:
```
POST /submissions
-> validate request
-> persist submission
-> enqueue execution job
-> return 202 Accepted
```
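The admission path above can be sketched in a few lines. Everything here is illustrative: the `db` dict and `queue` deque are stand-ins for a real database and message queue, and the supported-language set is an assumption.

```python
import json
import uuid
from collections import deque

# In-memory stand-ins for the database and message queue
# (assumptions for illustration, not a real persistence layer).
db: dict[str, dict] = {}
queue: deque[str] = deque()

def post_submission(source_ref: str, language: str) -> tuple[int, dict]:
    """Admission path: validate, persist, enqueue, return 202."""
    # Validate the request before doing any work.
    if language not in {"python", "java", "cpp"}:
        return 400, {"error": "unsupported language"}
    # Persist the submission before enqueueing, so the job can be
    # reconstructed even if the queue message is lost.
    submission_id = str(uuid.uuid4())
    db[submission_id] = {
        "status": "queued",
        "language": language,
        "source_ref": source_ref,
    }
    # Enqueue a small job payload that references durable storage.
    queue.append(json.dumps({"submission_id": submission_id}))
    # Return immediately; the verdict arrives asynchronously.
    return 202, {"submission_id": submission_id, "status": "queued"}
```

The order matters: persist first, enqueue second, so a queue failure leaves a record that can be re-enqueued rather than a message pointing at nothing.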
Then a worker:
```
claim job
-> create sandbox or runtime
-> run compile step
-> run tests
-> persist verdict
-> publish result event
```
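A single iteration of that worker loop can be sketched with the compile and test steps stubbed out. The `compile_source` and `run_tests` helpers, and the seeded job, are hypothetical placeholders for the real sandbox:

```python
import json
from collections import deque

# Stub infrastructure (illustrative assumptions, not a real judge).
queue = deque([json.dumps({"submission_id": "s1"})])
db = {"s1": {"status": "queued"}}
events: list[dict] = []

def compile_source(submission_id: str) -> bool:
    return True  # stub: assume compilation succeeds

def run_tests(submission_id: str) -> str:
    return "accepted"  # stub verdict

def worker_step() -> None:
    """One iteration of the worker loop sketched above."""
    if not queue:
        return
    job = json.loads(queue.popleft())   # claim job
    sid = job["submission_id"]
    db[sid]["status"] = "running"       # sandbox/runtime would start here
    if not compile_source(sid):         # compile step
        verdict = "compile_error"
    else:
        verdict = run_tests(sid)        # run tests
    db[sid]["status"] = verdict         # persist verdict
    events.append({"submission_id": sid, "verdict": verdict})  # publish
```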
The important detail is that the request thread stops after durable enqueue. The user can receive the final verdict through WebSockets, server-sent events, or polling as a fallback. The API pattern is similar to the 202 Accepted flow in Message Queues.
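The polling fallback is the simplest of the three delivery channels to sketch. The `db` dict below is an assumed stand-in for the submissions table, seeded with one illustrative record:

```python
# Assumed stand-in for the submissions table.
db = {"sub-123": {"status": "running", "verdict": None}}

def get_submission(submission_id: str) -> tuple[int, dict]:
    """GET /submissions/{id}: the client polls until a verdict appears."""
    record = db.get(submission_id)
    if record is None:
        return 404, {"error": "not found"}
    # Echo the current state; the client keeps polling while status
    # is "queued" or "running".
    return 200, {"submission_id": submission_id, **record}
```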
#Worker Pool Design
Worker pools should be split by resource profile.
Examples:
- Python workers
- Java workers
- C++ compile-heavy workers
- GPU workers
- contest-priority workers
This avoids one workload starving another. A C++ compile storm should not block lightweight Python submissions if they use different queues and autoscaling policies.
Each worker should have:
- a maximum number of concurrent jobs
- health checks
- heartbeat while running a job
- graceful shutdown behavior
- language image cache
- per-job timeout
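The per-job timeout from the list above can be approximated with a future and a deadline. This is only a sketch: a real judge kills the sandboxed process, since Python threads cannot be forcibly stopped, and the config field names are assumptions.

```python
import concurrent.futures as cf
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkerConfig:
    # Field names are illustrative, mirroring the checklist above.
    max_concurrent_jobs: int = 4
    job_timeout_s: float = 2.0
    heartbeat_interval_s: float = 0.5

def run_job_with_timeout(fn, cfg: WorkerConfig):
    """Run one job, returning a timeout verdict if the deadline passes."""
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=cfg.job_timeout_s)
        except cf.TimeoutError:
            # A real worker would kill the sandbox process here;
            # a thread can only be abandoned, not terminated.
            return "time_limit_exceeded"
```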
The queue message should be small. Store source code and test cases in durable storage, then pass references in the job payload. For judge systems, that durable storage decision connects to Sandboxed Code Execution; for media systems, it often connects to Object Storage & CDN.
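A queue message along those lines might look like this. The bucket names, field names, and values are illustrative assumptions, not a fixed schema:

```python
import json

# The queue message carries references, not the artifacts themselves.
# All identifiers and paths below are made up for illustration.
job_payload = {
    "submission_id": "sub-123",
    "source_ref": "s3://judge-artifacts/sub-123/source.tar.gz",
    "testcases_ref": "s3://judge-artifacts/problems/p42/tests/",
    "language": "cpp",
    "priority": "contest",
}
message = json.dumps(job_payload)
# A references-only payload stays far under any queue's size limit.
assert len(message) < 1024
```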
#Retries, Timeouts, and Backpressure
Retries are for infrastructure failures, not wrong answers.
Retry when:
- worker crashes before recording a verdict
- sandbox cannot start because the host is unhealthy
- result persistence fails after a transient database issue
Do not retry when:
- code fails to compile
- test output is wrong
- submission exceeds time or memory limits
Use visibility timeouts or leases so stuck jobs can be reclaimed. Use dead-letter queues for repeated infrastructure failures. Apply backpressure when queues grow: rate-limit submissions, reserve contest capacity, or show delayed feedback instead of letting the whole platform degrade. For duplicate deliveries, pair this with Idempotency & Deduplication.
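The retry/no-retry split above reduces to a small decision function. The outcome names and the `MAX_ATTEMPTS` value here are illustrative assumptions:

```python
# Infrastructure failures: retry these.
RETRYABLE = {"worker_crash", "sandbox_start_failed", "db_transient_error"}
# User-visible results: never retry, just record them.
TERMINAL = {"compile_error", "wrong_answer",
            "time_limit_exceeded", "memory_limit_exceeded"}

MAX_ATTEMPTS = 3  # after this, send the job to the dead-letter queue

def next_action(outcome: str, attempt: int) -> str:
    """Decide what to do with a job after one processing attempt."""
    if outcome in TERMINAL:
        return "record_verdict"   # user failures are valid results
    if outcome in RETRYABLE and attempt < MAX_ATTEMPTS:
        return "retry"            # requeue after the lease expires
    return "dead_letter"          # repeated or unknown infrastructure failure
```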
#Common Interview Mistakes
Mistake 1: Blocking the API request.
Return 202 Accepted and process asynchronously.
Mistake 2: One global worker pool.
Different languages and job sizes need different pools and limits.
Mistake 3: Retrying user failures.
A wrong answer is a valid result, not a retryable failure.
Mistake 4: Ignoring job leases.
Without leases or heartbeats, a crashed worker can leave a submission stuck forever.
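A minimal lease-and-heartbeat sketch shows why this works. Timestamps are passed in by the caller for clarity, and the lease length is an assumed value:

```python
LEASE_SECONDS = 30.0  # illustrative lease length

# job_id -> deadline by which the worker must heartbeat again
leases: dict[str, float] = {}

def claim(job_id: str, now: float) -> None:
    """A worker claims the job and takes a lease on it."""
    leases[job_id] = now + LEASE_SECONDS

def heartbeat(job_id: str, now: float) -> None:
    """Extend the lease while the job is still making progress."""
    leases[job_id] = now + LEASE_SECONDS

def reclaimable(now: float) -> list[str]:
    """Jobs whose worker stopped heartbeating: safe to hand elsewhere."""
    return [job for job, deadline in leases.items() if now >= deadline]
```

If the worker crashes, heartbeats stop, the lease expires, and another worker can reclaim the job instead of it sitting stuck forever.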
#Summary: What to Remember
Async worker pools separate admission from execution.
Persist the submission, enqueue a durable job, process it with bounded workers, and publish the result. Design retries around infrastructure failures, use leases for stuck jobs, and split worker pools by resource profile when one workload can starve another.
Related articles: Message Queues, Async Processing, Sandboxed Code Execution, WebSockets & Real-Time Communication, and Design an Online Judge.