Atlas Engine: Sub-2-Minute Cold Start for Multi-Model Orchestration on DGX Spark | Thoughts & Talks

A pure Rust inference engine achieves 96 tok/s on Qwen3.5-35B-A3B, roughly a 3× speedup over vLLM's 31 tok/s, with a 15-second cold start versus the 5 minutes vLLM spends on start-up and torch.compile. Atlas Engine brings this performance to NVIDIA's DGX Spark with custom SM12.1 kernels and Multi-Token Prediction (MTP) speculative decoding. The result: production-ready multi-model orchestration where cybersecurity, coding, and orchestration models run simultaneously on a single GB10 workstation.

The Multi-Model Production Problem

Running three specialised models on a single machine maps cleanly to production workloads:

Model	Purpose	Parameters (active/total)	MTP acceptance / notes
Qwen3.6-35B-A3B-FP8	Cybersecurity queries	3B / 35B	80% at k=1, 56% at k=2
OpenCode Agent	Code generation	4B / 27B dense	specialised tool-calling
Qwen Orchestrator	Agentic coordination	3B / 35B	stateful reasoning

The challenge is threefold: fast cold start for model switching, efficient memory sharing across concurrent models, and deterministic request routing that respects each model's strengths. The vLLM + IronClaw + LiteLLM stack handles all three, but at the cost of a 5-minute cold start and added orchestration overhead. Atlas achieves sub-2-minute cold start and eliminates runtime middleware by embedding routing directly in the inference engine.

Atlas Architecture: No PyTorch, No Python Dispatch Overhead

Atlas is written entirely in Rust and CUDA: no PyTorch, no Python interpreter, no JIT compilation. The 2.5 GB Docker image ships with custom SM12.1 kernels for the Blackwell GB10, and a single-model launch looks like this:

docker pull avarok/atlas-gb10:latest

sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8888 \
    --max-seq-len 65536 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.90 \
    --scheduling-policy slai \
    --quantization fp8 \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --speculative

The --speculative flag enables MTP, here with K=2: the model drafts two tokens per forward pass and the base model verifies them in parallel. Acceptance rates determine actual throughput, and on typical workloads they sit at about 80% for the first draft position and about 56% for the second. With this configuration, Atlas achieves 96 tok/s versus vLLM's 31 tok/s on the same hardware.

Multi-Token Prediction Explained

MTP adds a transformer layer to the base model that outputs K token predictions simultaneously, where K is the speculative depth. Unlike approaches that need a separately trained drafter or extra decoding heads bolted onto the model (EAGLE, Medusa), Atlas uses the target model's own multi-token capability:

K=1: Single-token speculation, baseline acceptance ~80%
K=2: Dual-token speculation, acceptance drops to ~56% at position 2
K=3: Triple-token speculation, acceptance falls to ~36% at position 3

The shared unembedding matrix saves 1.8B parameters compared to Medusa's per-head unembeddings. For Qwen3.6-35B-A3B, the 11.5B unique MTP parameters add minimal overhead but enable 3× throughput when acceptance rates hold. The trade-off: at 56% acceptance, roughly 44% of second-position draft tokens are rejected, so nearly half of that speculative compute is spent on low-confidence positions.
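To make the acceptance figures concrete, here is a minimal sketch of the usual expected-tokens-per-pass estimate for speculative decoding. It assumes the percentages above are cumulative (the probability that every draft up to that position is accepted) and ignores the cost of running the MTP layer, so it is an upper bound on speedup, not a prediction of the measured 96 tok/s. The function name is illustrative.

```python
def expected_tokens_per_pass(cumulative_acceptance):
    """Expected tokens emitted per verification pass.

    cumulative_acceptance[i] is ASSUMED to be the probability that draft
    positions 0..i are all accepted. The verifier always yields one token
    of its own (a correction or a bonus token), hence the leading 1.0.
    """
    return 1.0 + sum(cumulative_acceptance)


# Acceptance figures quoted in the list above.
depths = {1: [0.80], 2: [0.80, 0.56], 3: [0.80, 0.56, 0.36]}

for k, rates in depths.items():
    print(f"K={k}: {expected_tokens_per_pass(rates):.2f} tokens per pass")
```

Under these assumptions each extra draft position adds less than the one before it (0.80, then 0.56, then 0.36 of a token), which is one way to read the choice of K=2 as the default.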

Production Orchestration Patterns

Running three specialised models on one DGX Spark requires careful memory allocation and request routing. The key patterns:

Memory Budgeting

GB10 Unified Memory: 128 GB
├── Qwen3.6-35B-FP8 model weights: 35 GB (FP8 quantisation)
├── Qwen Orchestrator shared KV cache: 12 GB
├── Cybersecurity model KV cache: 8 GB
├── Code model KV cache: 7 GB
├── MTP extra parameters: 14 GB (shared across models)
├── OS + Atlas overhead: ~12 GB
└── Remaining for expansion: ~40 GB
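
As a quick sanity check, the budget above can be summed in a few lines. The dictionary keys are just labels for the rows of the breakdown, and the figures are copied from it.

```python
TOTAL_GB = 128  # GB10 unified memory, per the breakdown above

budget_gb = {
    "model weights (FP8)": 35,
    "orchestrator KV cache": 12,
    "cybersecurity KV cache": 8,
    "code KV cache": 7,
    "MTP extra parameters": 14,
    "OS + Atlas overhead": 12,
}

allocated = sum(budget_gb.values())
headroom = TOTAL_GB - allocated

print(f"allocated: {allocated} GB, headroom: {headroom} GB")
```

The rows add up to 88 GB, which leaves the 40 GB of headroom listed on the last line.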

Prefix caching mitigates KV cache blowout. Identical prefixes across concurrent sessions share memory. A 128K-token prompt loaded once costs 18 GB; subsequent requests reusing the prefix allocate only the delta.
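
The "allocate only the delta" behaviour can be illustrated with a toy block-level prefix cache. This is a generic sketch of the technique, not Atlas's implementation: the block size, the class name and the hashing scheme are all assumptions, and partial tail blocks are ignored for simplicity.

```python
import hashlib

BLOCK_TOKENS = 16  # hypothetical block size, chosen for illustration


class PrefixCache:
    """Toy cache: a block is shared when everything before it is identical."""

    def __init__(self):
        self.blocks = set()

    def admit(self, token_ids):
        """Register a prompt and return how many new blocks it needed."""
        new_blocks = 0
        running = hashlib.sha256()  # the hash covers the whole prefix so far
        full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
        for start in range(0, full, BLOCK_TOKENS):
            chunk = token_ids[start:start + BLOCK_TOKENS]
            running.update((",".join(map(str, chunk)) + ";").encode())
            key = running.hexdigest()
            if key not in self.blocks:
                self.blocks.add(key)
                new_blocks += 1
        return new_blocks


cache = PrefixCache()
shared_prefix = list(range(64))                      # 4 blocks of shared prompt
request_a = shared_prefix + list(range(1000, 1016))  # plus 1 unique block
request_b = shared_prefix + list(range(2000, 2016))  # plus 1 different block

print(cache.admit(request_a))  # first request pays for the prefix and its tail
print(cache.admit(request_b))  # second request pays only for its own tail
```

Keying each block on the hash of the entire prefix, not just the block's own tokens, is what keeps two prompts from sharing a block unless everything before it also matches.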

Request Routing Strategy

LiteLLM or a custom dispatcher routes requests by keyword. An example dispatcher configuration:

models:
  cybersecurity:
    provider: atlas
    model: qwen36-cyber
    routing_trigger: [malware, exploit, CVE, threat]
    max_tokens: 4096

  coding:
    provider: atlas
    model: qwen36-coder
    routing_trigger: [python, rust, function, class, impl]
    max_tokens: 8192
    tool_call_parser: qwen3_coder

  orchestration:
    provider: atlas
    model: qwen36-orch
    routing_trigger: [orchestrate, coordinate, agent, task]
    max_tokens: 32768
    stateful: true

Simple keyword matching works for production routing because model specialisation is deliberate. More complex semantic routing (vLLM Semantic Router's Athena release) adds marginal latency for multi-agent scenarios but handles nuance better.
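
A keyword dispatcher of this kind fits in a few lines. The sketch below mirrors the trigger lists from the configuration above; the function name, the scoring rule (count of matching keywords) and the choice of the orchestrator as the default are assumptions made for illustration.

```python
import re

# Trigger keywords copied from the routing configuration above (lower-cased).
ROUTES = {
    "cybersecurity": ["malware", "exploit", "cve", "threat"],
    "coding": ["python", "rust", "function", "class", "impl"],
    "orchestration": ["orchestrate", "coordinate", "agent", "task"],
}


def route(prompt, default="orchestration"):
    """Pick the route whose keywords appear most often in the prompt."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    scores = {
        name: sum(keyword in words for keyword in keywords)
        for name, keywords in ROUTES.items()
    }
    best = max(scores, key=scores.get)
    # With no keyword hit at all, fall back to the default route.
    return best if scores[best] > 0 else default


print(route("Analyse this malware sample for CVE-2024 indicators"))
print(route("Write a Rust function that parses TOML"))
print(route("Good morning"))
```

Matching on whole words avoids false hits such as "class" inside "classified", at the cost of missing plurals like "exploits"; a real dispatcher would likely need stemming or a small synonym list.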

Cold Start Optimisation

Atlas achieves sub-2-minute cold start through five techniques:

Fixed kernel compilation: SM12.1 kernels are compiled ahead of time and shipped in the image, not JIT-compiled on first request
Lazy model loading: Weights download on demand via the Hugging Face CLI; bundling weights is optional
Prefetch optimisation: Pipeline pre-warms KV cache slots while model initialises
Zero-copy tensor transfer: Shared memory IPC between inference stages
Simplified scheduler: slai policy bypasses complex fair-share heuristics

In practice, the first model loads in 15 seconds from local cache; subsequent models load in 30-45 seconds due to shared KV cache infrastructure. The total cold start rarely exceeds 90 seconds when models are cached locally.
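
The arithmetic behind the three-model figure is easy to reproduce. The sketch assumes the models load strictly one after another, which the text does not state; any overlap between loads would shorten the total.

```python
FIRST_MODEL_S = 15            # first model, from local cache
SUBSEQUENT_RANGE_S = (30, 45) # each later model, from local cache
EXTRA_MODELS = 2              # three models in total

# ASSUMPTION: loads are sequential, with no overlap between models.
best_case_s = FIRST_MODEL_S + EXTRA_MODELS * SUBSEQUENT_RANGE_S[0]
worst_case_s = FIRST_MODEL_S + EXTRA_MODELS * SUBSEQUENT_RANGE_S[1]

print(f"three-model cold start: {best_case_s}-{worst_case_s} s")
```

Both ends of the range stay under the 2-minute headline. The sequential worst case comes out above 90 seconds, so the "rarely exceeds 90 seconds" figure implies loads near the fast end of the range or some overlap between them.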

Performance Benchmarks Against vLLM Stack

The vLLM + IronClaw + LiteLLM production stack takes 5 minutes to initialise. Atlas reaches steady state in under 2 minutes. Throughput on single-stream generation:

Engine	Model	tok/s	Memory	Cold Start
Atlas	Qwen3.5-35B-A3B (NVFP4)	95.9	10.3 GB	15 s
vLLM	Qwen3.5-35B-A3B (NVFP4)	~31	~18 GB	5 min
Ollama	Nemotron 3 Super 120B	~5.8	89.7 GB	45 s

DFlash speculative decoding (vLLM with K=15) achieves ~50 tok/s on Qwen3.6-35B-FP8, but requires a separate 0.5B drafter model and 30% more memory. Atlas's MTP requires no extra model, only the 11.5B MTP parameters noted earlier.
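
The headline ratios follow directly from the Atlas and vLLM rows of the table. The snippet below only restates those rows; the vLLM figures are the approximate values shown there, and 5 minutes is taken as 300 seconds.

```python
benchmarks = {
    "Atlas": {"tok_s": 95.9, "memory_gb": 10.3, "cold_start_s": 15},
    "vLLM": {"tok_s": 31.0, "memory_gb": 18.0, "cold_start_s": 300},
}

atlas, vllm = benchmarks["Atlas"], benchmarks["vLLM"]

throughput_ratio = atlas["tok_s"] / vllm["tok_s"]
memory_ratio = vllm["memory_gb"] / atlas["memory_gb"]
cold_start_ratio = vllm["cold_start_s"] / atlas["cold_start_s"]

print(f"throughput: {throughput_ratio:.2f}x faster")
print(f"memory:     {memory_ratio:.2f}x smaller")
print(f"cold start: {cold_start_ratio:.0f}x faster")
```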

Trade-offs and Limitations

Atlas is not a universal replacement. Consider these constraints:

AGPL-3.0 licensing forces source sharing if you distribute a modified version
Alpha quality: Closed-source Rust codebase, limited public documentation
Hardware lock-in: Optimised for Blackwell GB10, ASUS GX10 compatibility confirmed for some builds, Hopper support in development
Quantisation locked to FP8/NVFP4: INT4 or other formats not supported
No fine-tuning pipeline: Use SGLang or vLLM offline for training, Atlas only for inference

The community notes that "Atlas is AGPL-3.0 licensed, closed source, and in alpha" as a caveat. Local control is the selling point—no per-token costs, no API rate limits, complete data privacy—but the licensing model changes deployment economics for commercial products.

Getting Started

Deploy a three-model stack:

# Cybersecurity model
docker run -d --name atlas-cyber \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8100 --max-seq-len 262144 --speculative --num-speculative-tokens 2

# Code model
docker run -d --name atlas-code \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8200 --max-seq-len 262144 \
    --tool-call-parser qwen3_coder --num-speculative-tokens 1

# Orchestrator
docker run -d --name atlas-orch \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8300 --max-seq-len 262144 --speculative

Configure LiteLLM proxy to route:

model_list:
  - model_name: cyber
    litellm_params:
      model: openai/qwen36-cyber
      api_base: http://localhost:8100/v1
      api_key: not-needed

  - model_name: code
    litellm_params:
      model: openai/qwen36-code
      api_base: http://localhost:8200/v1
      api_key: not-needed

  - model_name: orchestrator
    litellm_params:
      model: openai/qwen36-orch
      api_base: http://localhost:8300/v1
      api_key: not-needed

router_settings:
  routing_strategy: simple-shuffle
  model_group_alias:
    cybersecurity: cyber
    coding: code
    orchestration: orchestrator

All three models share the same ~/.cache/huggingface directory via volume mount, eliminating redundant downloads. The first run downloads 35 GB of FP8 weights; subsequent runs load straight from the local cache.
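
Before pointing traffic at the stack, it helps to confirm that each container is answering. The sketch below assumes Atlas exposes the OpenAI-compatible GET /v1/models route, which the LiteLLM configuration above implies but the source does not confirm; the function names are illustrative, and the probe is injectable so the logic can be exercised without running containers.

```python
import urllib.error
import urllib.request

# Ports taken from the docker commands above.
ENDPOINTS = {
    "cyber": "http://localhost:8100/v1",
    "code": "http://localhost:8200/v1",
    "orch": "http://localhost:8300/v1",
}


def http_probe(base_url, timeout=5.0):
    """Return True if GET {base_url}/models answers with HTTP 200.

    ASSUMPTION: the server implements the OpenAI-style /models route.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def stack_status(endpoints, probe=http_probe):
    """Map each endpoint name to True (reachable) or False."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling stack_status(ENDPOINTS) once the containers are up returns a dictionary such as {"cyber": True, "code": True, "orch": True}; any False entry points at the container to inspect with docker logs.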

Monitoring and Failover

IronClaw provides watchdog auto-recovery. If an Atlas container crashes, systemd restarts it within 60 seconds. Multi-model routing must handle transient failures:

import litellm

response = litellm.completion(
    model="openai/qwen36-code",
    messages=[{"role": "user", "content": "Write a Python class for HTTP retries"}],
    api_base="http://localhost:8200/v1",
    api_key="not-needed",  # placeholder, matching the proxy config above
    fallbacks=["openai/qwen36-cyber"],  # the keyword argument is `fallbacks`, not `fallback`
    timeout=30,
    num_retries=3,
)

The fallback list provides graceful degradation if the specialised model is down. This pattern is production-tested in the vLLM stack but relies on LiteLLM's translation layer, which sometimes obscures root causes in failures.
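
Where root-cause visibility matters more than convenience, the same degradation can be written out by hand. This is a minimal sketch, not a replacement for LiteLLM: it assumes the containers expose the OpenAI-compatible /v1/chat/completions route, the helper names are hypothetical, and it simply tries each (endpoint, model) pair in order while keeping every error it sees.

```python
import json
import urllib.request


def chat(base_url, model, prompt, timeout=30.0):
    """POST one chat completion to an OpenAI-compatible endpoint (assumed)."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode()
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def complete_with_fallback(prompt, targets, send=chat):
    """Try each (base_url, model) in order; return (model, text) on success."""
    errors = []
    for base_url, model in targets:
        try:
            return model, send(base_url, model, prompt)
        except Exception as exc:  # keep the cause instead of hiding it
            errors.append(f"{model} @ {base_url}: {exc!r}")
    raise RuntimeError("all targets failed:\n" + "\n".join(errors))


# Code model first, cybersecurity model as the degraded fallback.
TARGETS = [
    ("http://localhost:8200/v1", "qwen36-code"),
    ("http://localhost:8100/v1", "qwen36-cyber"),
]
```

Because each target carries its own base URL, the fallback request goes to a different container instead of retrying the one that just failed, and the final error lists what went wrong at every hop.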

The Road Ahead

Atlas's open-source release is "coming soon," according to the developers. Until then, production deployment means accepting the AGPL license and alpha quality. The Rust codebase compiles cleanly on GB10, and community builds exist for ASUS GX10.

The broader question: when does local inference beat cloud APIs? At roughly 96 tok/s on a 35B MoE model, local throughput rivals commercial endpoints including GPT-5.5. The break-even point depends on query volume and hardware amortisation. For high-throughput production workloads, the economics increasingly favour on-premise deployments, especially when you can run three specialised models on one machine without API costs.
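
The break-even point can be estimated with a back-of-the-envelope calculation. Every monetary figure in the example call is a hypothetical placeholder, not a real hardware or API price; only the 96 tok/s rate comes from the benchmarks above, and it is a single-stream ceiling that assumes the machine generates around the clock.

```python
SECONDS_PER_MONTH = 86_400 * 30


def max_tokens_per_month(tok_per_s):
    """Upper bound on monthly output for one continuously busy stream."""
    return tok_per_s * SECONDS_PER_MONTH


def break_even_months(hardware_cost, monthly_running_cost,
                      tokens_per_month, api_price_per_million):
    """Months until the hardware pays for itself, or None if it never does."""
    monthly_api_cost = tokens_per_month / 1e6 * api_price_per_million
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None  # the API is cheaper at this volume
    return hardware_cost / monthly_saving


print(max_tokens_per_month(96))  # single-stream ceiling at 96 tok/s

# HYPOTHETICAL inputs, for illustration only.
months = break_even_months(
    hardware_cost=3000,
    monthly_running_cost=50,
    tokens_per_month=100e6,
    api_price_per_million=2.0,
)
print(months)
```

The useful output is the shape of the relationship: break-even time falls as monthly volume rises, and below a certain volume the function returns None because the API never costs more than running the box.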