Sandboxed Code Execution

Running untrusted user code with isolation, limits, and safe teardown

System Design · Sandbox · 11 min read
Learn how online judges and execution platforms run untrusted code safely. Covers threat modeling, containers versus MicroVMs, cgroups, seccomp, filesystem and network isolation, timeouts, output limits, and warm pools.

#Introduction

You are designing an online judge. A user submits a classic shell fork bomb:

:(){ :|:& };:

The product requirement says "run user code." The security requirement says "do not let random internet code take over the host."

That is the real problem. The judge is not just a queue and a worker. It is a controlled execution environment for untrusted code, with strict limits on CPU, memory, filesystem access, network access, output size, and runtime.

This article pairs with Async Job Worker Pools and the Online Judge solution. If you want to practice the full design, start with the Online Judge practice problem.


#Threat Model

Assume submitted code is hostile.

It may try to:

  • read environment variables
  • scan the internal network
  • write huge files
  • fork too many processes
  • use too much memory
  • run forever
  • exploit the kernel or runtime
  • leak test cases through logs or side channels

The system should fail closed. If the sandbox cannot be created safely, reject or delay the submission instead of running it on a shared worker host. That failure path should be modeled as part of the broader async execution pipeline, not hidden inside a best-effort worker script.


#Sandbox Layers

A practical design uses multiple layers:

  • container or microVM per submission
  • read-only base image
  • mounted working directory with size limits
  • no outbound network by default
  • CPU and memory cgroups
  • process count limits
  • seccomp profile or runtime syscall filters
  • wall-clock timeout
  • output byte limit

(Diagram: Execution Job → Judge Worker → Runtime Limits → MicroVM Sandbox → Verdict Store)
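Several of these layers map directly onto kernel resource limits. A minimal sketch, assuming a POSIX worker host: the limits are applied in the child via `setrlimit` between fork and exec, and the wall-clock timeout is enforced separately by the parent. Function name and limit values are illustrative, not tuned recommendations.

```python
import resource
import subprocess

def run_limited(cmd, cpu_seconds=2, memory_bytes=256 * 1024 * 1024,
                max_procs=32, max_file_bytes=1024 * 1024,
                wall_clock_seconds=5):
    """Run cmd in a child process with kernel-enforced resource limits."""
    def apply_limits():
        # CPU time in seconds of CPU, not wall clock
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        # Virtual memory ceiling
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
        # Process count cap: blunts fork bombs
        resource.setrlimit(resource.RLIMIT_NPROC, (max_procs, max_procs))
        # File size cap: blunts "write huge files"
        resource.setrlimit(resource.RLIMIT_FSIZE, (max_file_bytes, max_file_bytes))

    try:
        return subprocess.run(
            cmd,
            preexec_fn=apply_limits,     # runs in the child, after fork, before exec
            capture_output=True,
            timeout=wall_clock_seconds,  # wall-clock limit, independent of CPU limit
        )
    except subprocess.TimeoutExpired:
        return None  # caller classifies this as "time limit exceeded"
```

Note that `RLIMIT_CPU` and the wall-clock timeout are different limits: a program that sleeps forever burns no CPU, so the parent-side timeout is still required.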

Containers are fast and convenient, but they share the host kernel. MicroVMs such as Firecracker add a stronger boundary at the cost of more startup overhead. In an interview, make the tradeoff explicit:

| Isolation model | Benefit | Cost |
| --- | --- | --- |
| Process sandbox | fastest startup | weakest isolation |
| Container | mature tooling, fast pools | shared kernel risk |
| MicroVM | stronger isolation | higher startup and orchestration cost |

For a serious online judge, warm microVM or container pools are a strong default. Warm pools are also where scaling, capacity reservations, and per-language isolation become concrete instead of abstract.
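A warm pool can be sketched as a bounded queue of pre-booted, single-use sandboxes. The `Sandbox` class below is a hypothetical stand-in for a real container or microVM handle, and replenishment is done synchronously to keep the sketch short; a production pool would boot replacements in the background.

```python
import queue

class Sandbox:
    """Hypothetical handle to a pre-booted container or microVM."""
    def __init__(self, language):
        self.language = language  # pools are typically kept per language

class WarmPool:
    """Keep a bounded set of pre-booted sandboxes so submissions skip cold starts."""
    def __init__(self, language, size):
        self.language = language
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(Sandbox(language))  # boot ahead of demand

    def acquire(self, timeout=1.0):
        # Fail closed: no warm capacity means delay or reject the submission,
        # never run it directly on the shared worker host.
        try:
            return self._idle.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("no warm sandbox; delay or reject the submission")

    def retire_and_replenish(self, sandbox):
        # Sandboxes are single-use: destroy after the run, then boot a
        # replacement so the pool stays full.
        del sandbox
        self._idle.put(Sandbox(self.language))
```

The single-use rule is the important design choice: reusing a sandbox across submissions risks leaking one user's test data or filesystem state to the next.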


#Runtime Controls

Each run should have a signed execution spec:

{
  "submissionId": "sub_123",
  "language": "python3",
  "image": "judge-python:3.12",
  "cpuMillis": 2000,
  "memoryMb": 256,
  "wallClockMillis": 5000,
  "network": "disabled",
  "maxOutputBytes": 65536
}
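One way to make the spec tamper-evident is an HMAC over a canonical JSON encoding, so the worker can reject specs that were not issued by the control plane. The key constant and helper names below are assumptions for the sketch, not part of any particular judge.

```python
import hashlib
import hmac
import json

# Assumption: a secret shared between the control plane and judge workers
SIGNING_KEY = b"replace-with-judge-control-plane-key"

def sign_spec(spec: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so signer and verifier
    # hash byte-identical payloads
    payload = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

def verify_spec(spec: dict, signature: str) -> bool:
    # Constant-time comparison avoids a timing side channel
    return hmac.compare_digest(sign_spec(spec), signature)
```

Without this, a compromised queue could rewrite `memoryMb` or re-enable the network before the worker ever sees the job.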

The worker should compile and execute inside the sandbox, collect stdout/stderr, classify the result, and destroy the environment. The verdict should be persisted durably, then pushed to the user over WebSockets or a similar realtime channel.

Common verdicts include:

  • accepted
  • wrong answer
  • compile error
  • runtime error
  • time limit exceeded
  • memory limit exceeded
  • output limit exceeded
  • internal error
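The classification step can be sketched as a priority-ordered mapping. The `RunResult` fields are illustrative names for what the worker collects from the sandbox, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    # What a judge worker might collect from the sandbox (names illustrative)
    compile_error: bool
    timed_out: bool
    memory_exceeded: bool
    output_truncated: bool
    exit_code: int
    stdout: str

def classify(run: RunResult, expected_output: str) -> str:
    # Order matters: resource violations outrank wrong answers, because a
    # killed process produces garbage output anyway.
    if run.compile_error:
        return "compile error"
    if run.timed_out:
        return "time limit exceeded"
    if run.memory_exceeded:
        return "memory limit exceeded"
    if run.output_truncated:
        return "output limit exceeded"
    if run.exit_code != 0:
        return "runtime error"
    if run.stdout.strip() != expected_output.strip():
        return "wrong answer"
    return "accepted"
```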

Do not store raw logs without size limits. A malicious submission can generate gigabytes of output.
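Capping output is simplest at read time, before anything is buffered or persisted. A sketch that stops reading a child's pipe at the `maxOutputBytes` budget (the function name is hypothetical):

```python
def read_capped(stream, max_output_bytes=65536, chunk_size=4096):
    """Read at most max_output_bytes from a child's stdout pipe.

    Returns (data, truncated). Reading stops at the cap, so a submission
    printing gigabytes cannot exhaust worker memory or log storage.
    """
    chunks, total = [], 0
    while total < max_output_bytes:
        chunk = stream.read(min(chunk_size, max_output_bytes - total))
        if not chunk:
            break
        chunks.append(chunk)
        total += len(chunk)
    # Peek one more byte: if anything remains, the output was over budget
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated
```

The `truncated` flag is what lets the worker emit an "output limit exceeded" verdict instead of silently judging a clipped answer as wrong.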


#Common Interview Mistakes

Mistake 1: Running code directly on the worker.

Workers are orchestration processes. The submitted program should run in a sandboxed child environment.

Mistake 2: Saying "Docker" and stopping there.

Docker is not a complete security answer. Mention cgroups, namespaces, seccomp, read-only filesystems, network isolation, and timeouts.
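As a concrete sketch, those controls translate into `docker run` flags roughly like this. The values and the seccomp profile filename are illustrative; a production judge would tune them per language.

```python
def docker_run_args(image, cmd, *, memory_mb=256, cpu_millis=2000,
                    max_pids=64, seccomp_profile="judge-seccomp.json"):
    """Build a hardened `docker run` invocation (values are illustrative)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no outbound network
        "--read-only",                          # read-only base image
        "--tmpfs", "/tmp:rw,size=16m",          # bounded scratch space
        "--memory", f"{memory_mb}m",            # memory cgroup
        "--cpus", f"{cpu_millis / 1000}",       # CPU cgroup
        "--pids-limit", str(max_pids),          # fork-bomb defense
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",  # block setuid escalation
        "--security-opt", f"seccomp={seccomp_profile}",  # syscall filter
        image, *cmd,
    ]
```

Even with all of these flags, the container still shares the host kernel, which is the argument for the microVM boundary above.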

Mistake 3: Forgetting warm pools.

Creating a fresh VM per test case can blow the latency budget. Keep warm capacity for common languages.

Mistake 4: Trusting language runtimes.

Python, JavaScript, Java, and C++ all need external resource limits. Language-level timeouts are not enough.


#Summary: What to Remember

Sandboxed code execution is defense in depth.

For an online judge, model submitted code as hostile. Run it in isolated containers or microVMs, disable network access, enforce CPU and memory limits, cap output, and destroy the environment after execution. Use warm pools when latency matters, but do not trade away the isolation boundary.

Related articles: Async Job Worker Pools, Message Queues, WebSockets & Real-Time Communication, and Design an Online Judge.