Atlas Engine: Sub-2-Minute Cold Start for Multi-Model Orchestration on DGX Spark
A pure Rust inference engine achieves 96 tok/s on Qwen3.5-35B-A3B, a 3× speedup over vLLM's 31 tok/s, with cold startup in 15 seconds versus vLLM's 10-minute torch.compile cycle. Atlas Engine brings this performance to NVIDIA's DGX Spark with custom SM12.1 kernels and Multi-Token Prediction (MTP) speculative decoding. The result: production-ready multi-model orchestration where cybersecurity, coding, and orchestration models run simultaneously on a single GB10 workstation.
The Multi-Model Production Problem
Running three specialised models on a single machine maps cleanly to production workloads:
| Model | Purpose | Parameters (active/total) | MTP acceptance / notes |
|---|---|---|---|
| Qwen3.6-35B-A3B-FP8 | Cybersecurity queries | 3B / 35B | 80% at k=1, 56% at k=2 |
| OpenCode Agent | Code generation | 4B / 27B dense | specialised tool-calling |
| Qwen Orchestrator | Agentic coordination | 3B / 35B | stateful reasoning |
The challenge is threefold: fast cold start for model switching, efficient memory sharing across concurrent models, and deterministic request routing that respects each model's strengths. The vLLM + IronClaw + LiteLLM stack addresses this with a 5-minute cold start penalty and orchestration overhead. Atlas achieves sub-2-minute cold start and eliminates runtime middleware by embedding routing directly in the inference engine.
Atlas Architecture: No PyTorch, No Python Dispatch Overhead
Atlas is written entirely in Rust and CUDA: no PyTorch, no Python interpreter, no JIT compilation. The 2.5 GB Docker image ships with custom SM12.1 kernels for the Blackwell GB10, and a single-model launch looks like this:
docker pull avarok/atlas-gb10:latest
sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8888 \
  --max-seq-len 65536 \
  --kv-cache-dtype fp8 \
  --kv-high-precision-layers auto \
  --gpu-memory-utilization 0.90 \
  --scheduling-policy slai \
  --quantization fp8 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --speculative
The --speculative flag enables MTP, here in a K=2 configuration: the MTP layer drafts two tokens per step and the base model verifies them in a single parallel pass. Acceptance rates determine actual throughput: roughly 80% at the first draft position and 56% at the second on typical workloads. With K=2, Atlas achieves 96 tok/s versus vLLM's 31 tok/s on the same hardware.
Multi-Token Prediction Explained
MTP adds a transformer layer to the base model that predicts K future tokens at once, where K is the speculative depth. Unlike EAGLE, which trains a separate lightweight draft module, or MEDUSA, which attaches one decoding head per speculated position, Atlas uses the target model's own multi-token capability:
- K=1: Single-token speculation, baseline acceptance ~80%
- K=2: Dual-token speculation, acceptance drops to ~56% at position 2
- K=3: Triple-token speculation, acceptance falls to ~36% at position 3
The shared unembedding matrix saves 1.8B parameters compared to MEDUSA's per-head unembeddings. For Qwen3.6-35B-A3B, the 11.5B unique MTP parameters add minimal overhead but enable 3× throughput when acceptance rates hold. The trade-off: at 56% acceptance, nearly half of second-position draft tokens are rejected, so the compute spent drafting them is wasted.
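To see how acceptance rates translate into tokens per verification pass, here is a minimal sketch. It assumes each quoted figure is the probability that a draft position is accepted given that every earlier position was accepted; if the figures are instead cumulative, the sum changes accordingly. The function name `expected_tokens_per_step` is hypothetical, and the estimate ignores the cost of drafting.

```python
from itertools import accumulate
from operator import mul


def expected_tokens_per_step(acceptance):
    """Expected tokens emitted per verification pass.

    acceptance[i] is the ASSUMED probability that draft position i+1 is
    accepted, given that all earlier draft positions were accepted.
    The verifier always contributes one token, so the result is >= 1.
    """
    # Running product = probability that the first i drafts all survive.
    survive = accumulate(acceptance, mul)
    return 1.0 + sum(survive)


rates = [0.80, 0.56, 0.36]  # per-position figures quoted in the list above
for k in range(1, len(rates) + 1):
    print(f"K={k}: {expected_tokens_per_step(rates[:k]):.2f} tokens/step")
```

Under this reading, each extra draft position adds less than the one before it, which is why deeper speculation pays off only while acceptance stays high.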
Production Orchestration Patterns
Running three specialised models on one DGX Spark requires careful memory allocation and request routing. The key patterns:
Memory Budgeting
GB10 Unified Memory: 128 GB
├── Qwen3.6-35B-FP8 model weights: 35 GB (FP8 quantisation)
├── Qwen Orchestrator shared KV cache: 12 GB
├── Cybersecurity model KV cache: 8 GB
├── Code model KV cache: 7 GB
├── MTP extra parameters: 14 GB (shared across models)
├── OS + Atlas overhead: ~12 GB
└── Remaining for expansion: ~40 GB
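The budget above can be checked with a few lines of arithmetic. This sketch simply sums the listed allocations against the 128 GB pool; the dictionary keys and the `headroom` helper are illustrative names, not part of Atlas.

```python
TOTAL_GB = 128  # GB10 unified memory

# Allocations copied from the budget tree above (all values in GB).
budget_gb = {
    "model weights (FP8)": 35,
    "orchestrator KV cache": 12,
    "cybersecurity KV cache": 8,
    "code KV cache": 7,
    "MTP extra parameters": 14,
    "OS + Atlas overhead": 12,
}


def headroom(total_gb, allocations):
    """Return unallocated memory in GB, or raise if the plan overcommits."""
    used = sum(allocations.values())
    if used > total_gb:
        raise ValueError(f"overcommitted by {used - total_gb} GB")
    return total_gb - used


print(headroom(TOTAL_GB, budget_gb))  # 40
```

Keeping the plan in one place like this makes it easy to see how much of the remaining headroom a larger KV cache or longer context would consume.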
Prefix caching mitigates KV cache blowout: identical prefixes across concurrent sessions share memory. A 128K-token prompt loaded once costs 18 GB; subsequent requests that reuse the prefix allocate only the delta.
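A minimal sketch of how block-level prefix sharing can work, assuming fixed-size KV blocks identified by a hash chained over all preceding tokens. This is a generic scheme, not a description of Atlas's internals; `BLOCK` and `PrefixCache` are hypothetical names.

```python
import hashlib

BLOCK = 16  # tokens per KV block (assumed value for illustration)


class PrefixCache:
    """Tracks which full token blocks already have KV entries."""

    def __init__(self):
        self.blocks = {}  # chained hash -> block id

    def allocate(self, tokens):
        """Return (reused, new) block counts for a token sequence.

        Only full blocks are cached; a trailing partial block is ignored.
        """
        reused = new = 0
        parent = b""
        full = len(tokens) - len(tokens) % BLOCK
        for i in range(0, full, BLOCK):
            chunk = tokens[i:i + BLOCK]
            # Chaining the parent hash makes a block match only when the
            # entire prefix before it matches too.
            key = hashlib.sha256(parent + repr(chunk).encode()).digest()
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = len(self.blocks)
                new += 1
            parent = key
        return reused, new


cache = PrefixCache()
system_prompt = list(range(48))                 # three full blocks
print(cache.allocate(system_prompt + [7] * 16))  # first request: all new
print(cache.allocate(system_prompt + [9] * 16))  # second: prefix reused
```

The second request pays only for the block that differs, which is the "delta" behaviour described above.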
Request Routing Strategy
LiteLLM or a custom dispatcher routes requests. The keys below are illustrative rather than a specific tool's schema:
models:
  cybersecurity:
    provider: atlas
    model: qwen36-cyber
    routing_trigger: [malware, exploit, CVE, threat]
    max_tokens: 4096
  coding:
    provider: atlas
    model: qwen36-coder
    routing_trigger: [python, rust, function, class, impl]
    max_tokens: 8192
    tool_call_parser: qwen3_coder
  orchestration:
    provider: atlas
    model: qwen36-orch
    routing_trigger: [orchestrate, coordinate, agent, task]
    max_tokens: 32768
    stateful: true
Simple keyword matching works for production routing because model specialisation is deliberate. More complex semantic routing (vLLM Semantic Router's Athena release) adds marginal latency for multi-agent scenarios but handles nuance better.
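A keyword dispatcher of this kind fits in a few lines. The sketch below is one possible implementation, not Atlas or LiteLLM code: it scores each route by how many trigger words appear in the prompt and falls back to a default when nothing matches. The `ROUTES` table mirrors the triggers listed above; `route` and its `default` parameter are hypothetical.

```python
import re

# Trigger words per route, lower-cased copies of the config above.
ROUTES = {
    "cybersecurity": ["malware", "exploit", "cve", "threat"],
    "coding": ["python", "rust", "function", "class", "impl"],
    "orchestration": ["orchestrate", "coordinate", "agent", "task"],
}


def route(prompt, default="orchestration"):
    """Pick the route whose triggers overlap most with the prompt's words."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    scores = {name: len(words & set(triggers)) for name, triggers in ROUTES.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first route listed
    return best if scores[best] > 0 else default


print(route("Is CVE-2024-1234 an active exploit?"))
print(route("Write a Rust function that parses JSON"))
```

Whole-word matching avoids false hits such as "classic" triggering the coding route, at the cost of missing inflected forms like "exploits".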
Cold Start Optimisation
Atlas achieves sub-2-minute cold start through five techniques:
- Fixed kernel compilation: SM12.1 kernels are compiled upstream, not JIT on first request
- Lazy model loading: Download happens on-demand via Hugging Face CLI, bundled weights optional
- Prefetch optimisation: Pipeline pre-warms KV cache slots while model initialises
- Zero-copy tensor transfer: Shared memory IPC between inference stages
- Simplified scheduler: the slai policy bypasses complex fair-share heuristics
In practice, the first model loads in 15 seconds from local cache; subsequent models load in 30-45 seconds due to shared KV cache infrastructure. The total cold start rarely exceeds 90 seconds when models are cached locally.
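As a rough check on those figures, the sketch below adds up the quoted load times for a multi-model stack. It assumes the models load one after another, which is an assumption; overlapping loads would shorten the total. The constant and function names are illustrative.

```python
FIRST_LOAD_S = 15             # first model, from local cache
SUBSEQUENT_LOAD_S = (30, 45)  # quoted range for each later model


def cold_start_range(n_models):
    """Return (best, worst) total seconds, assuming sequential loading."""
    if n_models < 1:
        raise ValueError("need at least one model")
    extra = n_models - 1
    low, high = SUBSEQUENT_LOAD_S
    return FIRST_LOAD_S + extra * low, FIRST_LOAD_S + extra * high


print(cold_start_range(3))  # (75, 105)
```

Under this sequential assumption, a three-model stack stays within 90 seconds only if the two later loads average 37.5 seconds or less.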
Performance Benchmarks Against vLLM Stack
The vLLM + IronClaw + LiteLLM production stack takes 5 minutes to initialise. Atlas reaches steady state in under 2 minutes. Throughput on single-stream generation:
| Engine | Model | tok/s | Memory | Cold Start |
|---|---|---|---|---|
| Atlas | Qwen3.5-35B-A3B (NVFP4) | 95.9 | 10.3 GB | 15 s |
| vLLM | Qwen3.5-35B-A3B (NVFP4) | ~31 | ~18 GB | 5 min |
| Ollama | Nemotron 3 Super 120B | ~5.8 | 89.7 GB | 45 s |
DFlash speculative decoding (vLLM with K=15) achieves ~50 tok/s on Qwen3.6-35B-FP8, but requires a separate 0.5B drafter model and 30% more memory. Atlas's MTP requires no extra model, only the overhead of the 11.5B MTP parameters.
Trade-offs and Limitations
Atlas is not a universal replacement. Consider these constraints:
- AGPL-3.0 licensing forces source sharing if you distribute a modified version or let users interact with it over a network
- Alpha quality: Closed-source Rust codebase, limited public documentation
- Hardware lock-in: Optimised for Blackwell GB10, ASUS GX10 compatibility confirmed for some builds, Hopper support in development
- Quantisation locked to FP8/NVFP4: Int4 or other formats not supported
- No fine-tuning pipeline: Use SGLang or vLLM offline for training, Atlas only for inference
The community notes that "Atlas is AGPL-3.0 licensed, closed source, and in alpha" as a caveat. Local control is the selling point—no per-token costs, no API rate limits, complete data privacy—but the licensing model changes deployment economics for commercial products.
Getting Started
Deploy a three-model stack:
# Host networking exposes each port; the shared mount reuses one weight cache.
# Cybersecurity model
docker run -d --name atlas-cyber \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8100 --max-seq-len 262144 --speculative --num-speculative-tokens 2
# Code model
docker run -d --name atlas-code \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8200 --max-seq-len 262144 \
  --tool-call-parser qwen3_coder --num-speculative-tokens 1
# Orchestrator
docker run -d --name atlas-orch \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8300 --max-seq-len 262144 --speculative
Configure LiteLLM proxy to route:
model_list:
  - model_name: cyber
    litellm_params:
      model: openai/qwen36-cyber
      api_base: http://localhost:8100/v1
      api_key: not-needed
  - model_name: code
    litellm_params:
      model: openai/qwen36-code
      api_base: http://localhost:8200/v1
      api_key: not-needed
  - model_name: orchestrator
    litellm_params:
      model: openai/qwen36-orch
      api_base: http://localhost:8300/v1
      api_key: not-needed
router_settings:
  routing_strategy: simple-shuffle
  model_group_alias:
    cybersecurity: cyber
    coding: code
    orchestration: orchestrator
All three models share the same ~/.cache/huggingface directory via volume mount, eliminating redundant downloads. The first run downloads 35 GB of FP8 weights; subsequent runs skip the download and load from the local cache.
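Before pointing a router at the stack, it helps to wait until all three servers answer. The sketch below polls each endpoint using only the standard library. It assumes each server exposes an OpenAI-style `GET /v1/models` route, which is an assumption about Atlas rather than a documented fact; `is_up` and `wait_until_ready` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request

ENDPOINTS = {
    "cyber": "http://localhost:8100/v1",
    "code": "http://localhost:8200/v1",
    "orchestrator": "http://localhost:8300/v1",
}


def is_up(api_base, timeout=2.0):
    """True if GET {api_base}/models returns HTTP 200 (assumed route)."""
    try:
        with urllib.request.urlopen(f"{api_base}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(endpoints, deadline_s=120.0, poll_s=2.0, probe=is_up):
    """Poll until every endpoint answers or the deadline passes.

    Returns the sorted names of endpoints that are still down.
    """
    pending = dict(endpoints)
    stop = time.monotonic() + deadline_s
    while pending and time.monotonic() < stop:
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if pending:
            time.sleep(poll_s)
    return sorted(pending)
```

Calling `wait_until_ready(ENDPOINTS)` returns an empty list once every server responds; any names it returns identify the containers worth inspecting with `docker logs`.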
Monitoring and Failover
IronClaw provides watchdog auto-recovery. If an Atlas container crashes, systemd restarts it within 60 seconds. Multi-model routing must handle transient failures:
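import litellm

# Try the code model first; fall back to the cybersecurity endpoint if it stays down.
response = litellm.completion(
    model="openai/qwen36-code",
    messages=[{"role": "user", "content": "Write a Python class for HTTP retries"}],
    api_base="http://localhost:8200/v1",
    api_key="not-needed",
    # The fallback entry carries its own endpoint so it does not reuse port 8200.
    fallbacks=[{"model": "openai/qwen36-cyber", "api_base": "http://localhost:8100/v1", "api_key": "not-needed"}],
    timeout=30,
    num_retries=3,
)
print(response.choices[0].message.content)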
The fallback list provides graceful degradation if the specialised model is down. This pattern is production-tested in the vLLM stack but relies on LiteLLM's translation layer, which sometimes obscures root causes in failures.
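Because the translation layer can hide which endpoint actually failed, it can help to see the retry-then-fallback logic spelled out. The sketch below is a library-free illustration of the pattern, not LiteLLM's implementation: `complete_with_fallback` and its parameters are hypothetical, and `call` stands in for whatever function sends the request.

```python
import time


def complete_with_fallback(call, targets, retries=3, backoff_s=0.0):
    """Try each target in order, retrying each up to `retries` times.

    Returns (target, result) for the first success. If every attempt
    fails, raises RuntimeError listing each (target, attempt, error) so
    the root cause stays visible.
    """
    errors = []
    for target in targets:
        for attempt in range(1, retries + 1):
            try:
                return target, call(target)
            except Exception as exc:  # record the failure and keep going
                errors.append((target, attempt, repr(exc)))
                if backoff_s:
                    time.sleep(backoff_s * attempt)
    raise RuntimeError(f"all targets failed: {errors}")


def fake_call(target):
    """Stand-in for a real request: the code model is 'down'."""
    if target == "code":
        raise ConnectionError("connection refused")
    return f"answer from {target}"


print(complete_with_fallback(fake_call, ["code", "cyber"]))
```

Recording every failed attempt, rather than only the last one, is the design choice that addresses the obscured-root-cause problem noted above.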
The Road Ahead
Atlas's open-source release is "coming soon," according to the developers. Until then, production deployment means accepting the AGPL license and alpha quality. The Rust codebase compiles cleanly on GB10, and community builds exist for ASUS GX10.
The broader question: when does local inference beat cloud APIs? At 100 tok/s on a 35B MoE model, local throughput rivals commercial endpoints including GPT-5.5. The break-even point depends on query volume and hardware amortisation. For high-throughput production workloads, the economics increasingly favour on-premise deployments—especially when you can run three specialised models on one machine without API costs.