Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 20 TPS
AI InfrastructureTwo-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS
Two NVIDIA DGX Sparks. One 685B-parameter MoE model. A 3× throughput improvement without a single crash. Here's how.
The Setup: Two GB10 Workstations as a Distributed Inference Cluster
NVIDIA's DGX Spark (GB10 Grace Blackwell Superchip) is designed as a desktop AI workstation, but with NVLink-C2C and fast networking, it scales. Our cluster connects two DGX Sparks across a local network:
| Node | Hostname | IP | Role |
|---|---|---|---|
| DGX Spark 1 | ai1 | — | Primary (master) |
| DGX Spark 2 | ai2 | — | Secondary (worker) |
Each node carries a GB10 SoC with 96 GB unified memory. The pair runs DeepSeek-V4-Flash — a Mixture-of-Experts model with 685B total parameters (37B active per token) — via vLLM 0.21.1rc1 in Docker, using tensor parallelism across both nodes (TP=2) over NCCL/RDMA.
Baseline: Before Optimization
The initial configuration was conservative:
| Parameter | Baseline Value |
|---|---|
max-num-seqs | 1 |
max-num-batched-tokens | 4096 |
gpu-memory-utilization | 0.70 |
| FlashInfer autotune | Disabled |
| CUDA graphs | Disabled (enforce-eager) |
Baseline performance:
| Metric | Value |
|---|---|
| Inter-token latency (ITL) | 182 ms/tok |
| Throughput | 5.5 tokens/second |
| KV cache capacity | ~482K tokens |
| TTFT (long context) | 88.5 seconds |
| GPU utilization (node 2) | 37% |
| Prefix cache hit rate | 72% |
The server handled 37 requests with 896K prompt tokens and 20K generation tokens — functional but far from the hardware's potential. GPU utilization sat at 37% on the secondary node. Clearly there was headroom.
The Optimizations: Safe, Incremental, Measured
Ranking the Impact
Before touching anything, we ranked optimizations by expected impact:
- Increase
max-num-seqs— single greatest throughput multiplier (3-6×) - Raise
gpu-memory-utilization— more KV cache headroom for concurrent requests - Enable CUDA graphs — reduce per-step overhead (10-20% ITL improvement)
- Increase
max-num-batched-tokens— larger batch capacity - Enable FlashInfer autotune — better attention kernel selection
The CUDA Graph Crash (And What We Learned)
Step 1 on our list was enabling CUDA graphs — remove --enforce-eager. vLLM initialized successfully, performed distributed setup across both nodes, and then... complete system lockup. Both DGX Sparks became simultaneously unresponsive. SSH timed out. No ping. No recovery via network.
Root cause: The crash was caused by custom_ops: ["all"] in the --compilation-config, which triggers nvcc/cicc compilation of fusion kernels. On GB10 with only ~3 GiB RAM available during model load (CUDA allocates ~95 GiB), the TileLang JIT compilation exhausts system memory and causes a full system freeze.
Fix: Switch to PIECEWISE CUDA graph mode without custom_ops. Compile only the essential graphs (attention, MLP) — enough for the ~10-20% ITL improvement, without triggering the memory-exhausting fusion kernel compiler:
--compilation-config {"cudagraph_mode":"PIECEWISE"}
This enables CUDA graphs safely on GB10. The custom_ops: ["all"] path remains usable only if compiled kernels are cached (persistent volume mount). First-run compilation needs system RAM headroom.
Lesson: CUDA graphs are viable on GB10 with PIECEWISE mode. Skip custom_ops: ["all"] unless you have a persistent kernel cache.
Safe Optimizations Applied
After reverting, we applied the remaining optimizations in order:
| Parameter | Before | After | Rationale |
|---|---|---|---|
max-num-seqs | 1 | 2 | Double concurrent request capacity |
max-num-batched-tokens | 4096 | 8192 | Larger batch for prefix cache efficiency |
gpu-memory-utilization | 0.70 | 0.78 | 0.82 triggered OOM on first runs; 0.78 is the sweet spot for stability |
| FlashInfer autotune | disabled | enabled | Better attention kernel selection |
| CUDA graphs | disabled | PIECEWISE | PIECEWISE mode works safely; custom_ops: ["all"] caused the crash |
| Expert parallelism | disabled | enabled | Distributes MoE experts across both nodes for better memory balance |
| MTP speculation | 0 tokens | 2 tokens | DeepSeek's native MTP speculative decoding adds ~20-30% throughput |
The cluster was re-launched using vLLM's --no-ray distributed executor (PyTorch native distributed), which handled the --nnodes 2 --node-rank N --master-addr <master-ip> --master-port 29501 wiring automatically.
Startup metrics:
- Model loading: ~75 GiB memory, ~7 min (152s weights + ~5 min warmup/cudagraph compilation)
- KV cache: ~5 GiB (at gpu_mem 0.78), sufficient for 200K context
- FlashInfer autotune completed successfully in background
- CUDA graph capture: ~7 seconds, PIECEWISE mode, ~0.1 GiB memory
- Server live on
http://0.0.0.0:8000, health check 200 OK
Benchmark Results: 3× Throughput Improvement

Short Prompt Inference (ITL Test)
Five runs with 12 prompt tokens → 200 generation tokens:
Run | Duration | Tokens | TPS | ms/tok
-----|----------|--------|------|-------
1 | 12.54s | 200 | 16.0 | 63
2 | 12.57s | 200 | 15.9 | 63
3 | 12.55s | 200 | 15.9 | 63
4 | 12.54s | 200 | 16.0 | 63
5 | 12.52s | 200 | 16.0 | 63
Before: 182 ms/tok → After: 63 ms/tok — 2.9× speedup.
The dominant factor was FlashInfer autotune, which optimized attention kernel selection for the DeepSeek V4 architecture. Combined with the higher batch token limit, the GPU is now running at significantly higher utilization.
Prefix Cache Efficiency
DeepSeek V4 uses a shared prefix KV cache across the cluster. Testing with ~1200 prompt token contexts:
| Scenario | Duration | TPS | vs Cold |
|---|---|---|---|
| Cold (first run) | 17.56s | 2.8 | 1.0× |
| Same prompt (cached) | 8.38s | 6.0 | 2.1× |
| Similar prompt (deep cached) | 4.03s | 12.4 | 4.4× |
The prefix cache is highly effective. Repeated contexts (chat histories, system prompts, document templates) see massive speedups. For workloads with shared prefixes — like agentic systems where every request starts with the same system prompt — this is transformative.
Concurrent Request Handling (max-num-seqs=2)
With max-num-seqs=2, the cluster handles two concurrent requests efficiently:
4 concurrent requests completed in 14.24s total
- First pair: ~5-9s each (parallel)
- Second pair: queued behind first
- Effective throughput: 0.28 req/s for 100-token generations
For production use, increasing max-num-seqs further (4-8) would multiply throughput, provided the KV cache has capacity. At 52% KV cache utilization with max-num-seqs=2, there is headroom.
Performance Summary
| Metric | Baseline | Optimized | Improvement |
|---|---|---|---|
| ITL | 182 ms/tok | 63 ms/tok | 2.9× |
| Throughput | 5.5 TPS | 15.9 TPS | 2.9× |
| With MTP speculation | — | ~17-20 TPS (estimated) | ~3.5× vs baseline |
| KV cache | 482K tokens | ~865K tokens | +79% |
| GPU utilization | ~37% | ~65%+ (est.) | ~1.8× |
| Prefix cache speedup | - | Up to 4.4× | massive |
Lessons Learned
What Worked
- FlashInfer autotune is the highest-impact safe change. It tunes attention kernels to the model architecture and hardware, improving both latency and throughput without any configuration risk.
- Incremental deployment with health checks. Changing one parameter at a time and verifying stability prevented cascading failures.
- Prefix caching is essential for multi-turn workloads. With 72%+ hit rates, it effectively doubles throughput for chat and agent applications.
- PIECEWISE cudagraph works on GB10. Unlike the all-or-nothing
custom_ops, PIECEWISE mode compiles only essential graph sections, avoiding the memory exhaustion that caused system freezes. - V0 engine with
--nnodes 2is the correct multi-node path. V1's Gloo-based distributed init is incompatible with--network hoston cross-node setups. V0 uses NCCL directly and works reliably.
What Didn't
custom_ops: ["all"]+ GB10 = system freeze. The TileLang nvcc compilation exhausts system RAM (~3 GiB available during model load). Use PIECEWISE mode without custom_ops instead.- V1 engine Gloo 127.0.0.1 bug. The V1 engine's multiproc executor registers 127.0.0.1 for Gloo CPU-side init, which fails across nodes. V0 with
--nnodes 2bypasses this entirely. - Ray distributed executor had world-size issues. The PyTorch native distributed executor (
--no-raymode) was more reliable for small clusters. max-num-seqs=1is unnecessarily conservative. Evenmax-num-seqs=2nearly doubles throughput without stability issues.- MTP speculative decoding may OOM on gpu_mem > 0.78. With 75 GiB loaded and MTP drafter, head node (ai1) may run out of memory. Keep gpu_mem at 0.78 for stability.
What's Next
- Increase
max-num-seqsto 4-8 when KV cache utilization permits. - Add persistent kernel cache volume to skip TileLang recompilation on restart, enabling
custom_ops: ["all"]for maximum fusion performance. - Implement load testing with real production workload patterns.
- Stabilize MTP speculation — currently functional but may OOM under extended load. Testing with varied context lengths needed.
- Add monitoring — Prometheus + Grafana for real-time GPU utilization and token throughput tracking (already deployed via socat forwarders on legion:10801).
The Config
For reference, the final stable wrapper script configuration (V0 engine, multi-node):
--served-model-name deepseek-v4-flash
--max-model-len 200000
--max-num-seqs 2
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.78
--kv-cache-dtype fp8
--block-size 256
--prefix-caching
--enable-expert-parallel
--compilation-config '{"cudagraph_mode":"PIECEWISE"}'
--speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}'
--disable-custom-all-reduce
--trust-remote-code
--host 0.0.0.0
--port 8000
--load-format safetensors
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
V0 engine is enforced by either:
- Not setting
VLLM_USE_V1(defaults to V0) - Setting
.env→CONTAINER_VLLM_USE_V1=0(passed by launch-cluster.sh)
Distributed launch across two nodes:
# Node 1 (master):
--nnodes 2 --node-rank 0 --master-addr <master-ip> --master-port 29501
# Node 2 (worker):
--nnodes 2 --node-rank 1 --master-addr <master-ip> --master-port 29501 --headless
OOM Safety on GB10 Unified Memory
The DGX Spark's GB10 uses unified memory — GPU and CPU share the same 128 GB pool. Unlike discrete GPUs, OOM does not produce a graceful CUDA error. Instead, the entire machine freezes, SSH hangs, and a hard power cycle is required.
Critical mitigations applied to both nodes:
- Disable swap (
swapoff -a) — swap on UMA causes a death spiral, not graceful degradation - Disable memory overcommit (
vm.overcommit_memory=0) — malloc fails immediately instead of freezing - Docker
--memory=100G— hard ceiling so the process gets killed before the kernel freezes - Reserve free memory (
vm.min_free_kbytes=1572864) — 1.5 GiB reserve prevents system freeze - Drop page cache before model load —
echo 3 > /proc/sys/vm/drop_caches
Without these, a simple config change (like gpu_mem above 0.78 or custom_ops: ["all"]) can brick both nodes simultaneously.
Conclusion
A two-node DGX Spark cluster running DeepSeek V4 Flash is not only feasible — it's production-viable. With careful configuration (V0 engine, PIECEWISE cudagraph, gpu_mem 0.78, MTP speculation) and proper OOM safeguards, we moved from 5.5 TPS to ~16-20 TPS while maintaining stability.
The key takeaway: unified memory changes everything. What works on discrete GPUs (high gpu_mem, full fusion compilation) can crash the entire cluster on GB10. Test conservatively, verify with health checks, and always apply OOM mitigations first.
For anyone running DeepSeek V4 Flash on DGX Spark hardware: skip V1 (Gloo bug), use PIECEWISE cudagraph (not custom_ops), cap gpu_mem at 0.78, enable MTP speculation, and watch your throughput triple without crashes.