Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS

Two NVIDIA DGX Sparks. One 685B-parameter MoE model. A 3× throughput improvement without a single crash. Here's how.

The Setup: Two GB10 Workstations as a Distributed Inference Cluster

NVIDIA's DGX Spark (GB10 Grace Blackwell Superchip) is designed as a desktop AI workstation, but with NVLink-C2C and fast networking, it scales. Our cluster connects two DGX Sparks across a local network:

Node	Hostname	IP	Role
DGX Spark 1	`ai1`	—	Primary (master)
DGX Spark 2	`ai2`	—	Secondary (worker)

Each node carries a GB10 SoC with 96 GB unified memory. The pair runs DeepSeek-V4-Flash — a Mixture-of-Experts model with 685B total parameters (37B active per token) — via vLLM 0.21.1rc1 in Docker, using tensor parallelism across both nodes (TP=2) over NCCL/RDMA.

Baseline: Before Optimization

The initial configuration was conservative:

Parameter	Baseline Value
`max-num-seqs`	1
`max-num-batched-tokens`	4096
`gpu-memory-utilization`	0.70
FlashInfer autotune	Disabled
CUDA graphs	Disabled (`enforce-eager`)

Baseline performance:

Metric	Value
Inter-token latency (ITL)	182 ms/tok
Throughput	5.5 tokens/second
KV cache capacity	~482K tokens
TTFT (long context)	88.5 seconds
GPU utilization (node 2)	37%
Prefix cache hit rate	72%

The server handled 37 requests with 896K prompt tokens and 20K generation tokens — functional but far from the hardware's potential. GPU utilization sat at 37% on the secondary node. Clearly there was headroom.

The Optimizations: Safe, Incremental, Measured

Ranking the Impact

Before touching anything, we ranked optimizations by expected impact:

Increase max-num-seqs — single greatest throughput multiplier (3-6×)
Raise gpu-memory-utilization — more KV cache headroom for concurrent requests
Enable CUDA graphs — reduce per-step overhead (10-20% ITL improvement)
Increase max-num-batched-tokens — larger batch capacity
Enable FlashInfer autotune — better attention kernel selection

The CUDA Graph Crash (And What We Learned)

Step 1 on our list was enabling CUDA graphs — remove --enforce-eager. vLLM initialized successfully, performed distributed setup across both nodes, and then... complete system lockup. Both DGX Sparks became simultaneously unresponsive. SSH timed out. No ping. No recovery via network.

Root cause: The crash was caused by custom_ops: ["all"] in the --compilation-config, which triggers nvcc/cicc compilation of fusion kernels. On GB10 with only ~3 GiB RAM available during model load (CUDA allocates ~95 GiB), the TileLang JIT compilation exhausts system memory and causes a full system freeze.

Fix: Switch to PIECEWISE CUDA graph mode without custom_ops. Compile only the essential graphs (attention, MLP) — enough for the ~10-20% ITL improvement, without triggering the memory-exhausting fusion kernel compiler:

--compilation-config {"cudagraph_mode":"PIECEWISE"}

This enables CUDA graphs safely on GB10. The custom_ops: ["all"] path remains usable only if compiled kernels are cached (persistent volume mount). First-run compilation needs system RAM headroom.

Lesson: CUDA graphs are viable on GB10 with PIECEWISE mode. Skip custom_ops: ["all"] unless you have a persistent kernel cache.

Safe Optimizations Applied

After reverting, we applied the remaining optimizations in order:

Parameter	Before	After	Rationale
`max-num-seqs`	1	2	Double concurrent request capacity
`max-num-batched-tokens`	4096	8192	Larger batch for prefix cache efficiency
`gpu-memory-utilization`	0.70	0.78	0.82 triggered OOM on first runs; 0.78 is the sweet spot for stability
FlashInfer autotune	disabled	enabled	Better attention kernel selection
CUDA graphs	disabled	PIECEWISE	PIECEWISE mode works safely; `custom_ops: ["all"]` caused the crash
Expert parallelism	disabled	enabled	Distributes MoE experts across both nodes for better memory balance
MTP speculation	0 tokens	2 tokens	DeepSeek's native MTP speculative decoding adds ~20-30% throughput

The cluster was re-launched using vLLM's --no-ray distributed executor (PyTorch native distributed), which handled the --nnodes 2 --node-rank N --master-addr <master-ip> --master-port 29501 wiring automatically.

Startup metrics:

Model loading: ~75 GiB memory, ~7 min (152s weights + ~5 min warmup/cudagraph compilation)
KV cache: ~5 GiB (at gpu_mem 0.78), sufficient for 200K context
FlashInfer autotune completed successfully in background
CUDA graph capture: ~7 seconds, PIECEWISE mode, ~0.1 GiB memory
Server live on http://0.0.0.0:8000, health check 200 OK

Benchmark Results: 3× Throughput Improvement

DGX Spark Benchmark Comparison: Baseline vs Optimized TPS

Short Prompt Inference (ITL Test)

Five runs with 12 prompt tokens → 200 generation tokens:

Run  | Duration | Tokens | TPS  | ms/tok
-----|----------|--------|------|-------
1    | 12.54s   | 200    | 16.0 | 63
2    | 12.57s   | 200    | 15.9 | 63
3    | 12.55s   | 200    | 15.9 | 63
4    | 12.54s   | 200    | 16.0 | 63
5    | 12.52s   | 200    | 16.0 | 63

Before: 182 ms/tok → After: 63 ms/tok — 2.9× speedup.

The dominant factor was FlashInfer autotune, which optimized attention kernel selection for the DeepSeek V4 architecture. Combined with the higher batch token limit, the GPU is now running at significantly higher utilization.

Prefix Cache Efficiency

DeepSeek V4 uses a shared prefix KV cache across the cluster. Testing with ~1200 prompt token contexts:

Scenario	Duration	TPS	vs Cold
Cold (first run)	17.56s	2.8	1.0×
Same prompt (cached)	8.38s	6.0	2.1×
Similar prompt (deep cached)	4.03s	12.4	4.4×

The prefix cache is highly effective. Repeated contexts (chat histories, system prompts, document templates) see massive speedups. For workloads with shared prefixes — like agentic systems where every request starts with the same system prompt — this is transformative.

Concurrent Request Handling (max-num-seqs=2)

With max-num-seqs=2, the cluster handles two concurrent requests efficiently:

4 concurrent requests completed in 14.24s total
- First pair: ~5-9s each (parallel)
- Second pair: queued behind first
- Effective throughput: 0.28 req/s for 100-token generations

For production use, increasing max-num-seqs further (4-8) would multiply throughput, provided the KV cache has capacity. At 52% KV cache utilization with max-num-seqs=2, there is headroom.

Performance Summary

Metric	Baseline	Optimized	Improvement
ITL	182 ms/tok	63 ms/tok	2.9×
Throughput	5.5 TPS	15.9 TPS	2.9×
With MTP speculation	—	~17-20 TPS (estimated)	~3.5× vs baseline
KV cache	482K tokens	~865K tokens	+79%
GPU utilization	~37%	~65%+ (est.)	~1.8×
Prefix cache speedup	-	Up to 4.4×	massive

Lessons Learned

What Worked

FlashInfer autotune is the highest-impact safe change. It tunes attention kernels to the model architecture and hardware, improving both latency and throughput without any configuration risk.
Incremental deployment with health checks. Changing one parameter at a time and verifying stability prevented cascading failures.
Prefix caching is essential for multi-turn workloads. With 72%+ hit rates, it effectively doubles throughput for chat and agent applications.
PIECEWISE cudagraph works on GB10. Unlike the all-or-nothing custom_ops, PIECEWISE mode compiles only essential graph sections, avoiding the memory exhaustion that caused system freezes.
V0 engine with --nnodes 2 is the correct multi-node path. V1's Gloo-based distributed init is incompatible with --network host on cross-node setups. V0 uses NCCL directly and works reliably.

What Didn't

custom_ops: ["all"] + GB10 = system freeze. The TileLang nvcc compilation exhausts system RAM (~3 GiB available during model load). Use PIECEWISE mode without custom_ops instead.
V1 engine Gloo 127.0.0.1 bug. The V1 engine's multiproc executor registers 127.0.0.1 for Gloo CPU-side init, which fails across nodes. V0 with --nnodes 2 bypasses this entirely.
Ray distributed executor had world-size issues. The PyTorch native distributed executor (--no-ray mode) was more reliable for small clusters.
max-num-seqs=1 is unnecessarily conservative. Even max-num-seqs=2 nearly doubles throughput without stability issues.
MTP speculative decoding may OOM on gpu_mem > 0.78. With 75 GiB loaded and MTP drafter, head node (ai1) may run out of memory. Keep gpu_mem at 0.78 for stability.

What's Next

Increase max-num-seqs to 4-8 when KV cache utilization permits.
Add persistent kernel cache volume to skip TileLang recompilation on restart, enabling custom_ops: ["all"] for maximum fusion performance.
Implement load testing with real production workload patterns.
Stabilize MTP speculation — currently functional but may OOM under extended load. Testing with varied context lengths needed.
Add monitoring — Prometheus + Grafana for real-time GPU utilization and token throughput tracking (already deployed via socat forwarders on legion:10801).

The Config

For reference, the final stable wrapper script configuration (V0 engine, multi-node):

--served-model-name deepseek-v4-flash
--max-model-len 200000
--max-num-seqs 2
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.78
--kv-cache-dtype fp8
--block-size 256
--prefix-caching
--enable-expert-parallel
--compilation-config '{"cudagraph_mode":"PIECEWISE"}'
--speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}'
--disable-custom-all-reduce
--trust-remote-code
--host 0.0.0.0
--port 8000
--load-format safetensors
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4

V0 engine is enforced by either:

Not setting VLLM_USE_V1 (defaults to V0)
Setting .env → CONTAINER_VLLM_USE_V1=0 (passed by launch-cluster.sh)

Distributed launch across two nodes:

# Node 1 (master):
--nnodes 2 --node-rank 0 --master-addr <master-ip> --master-port 29501

# Node 2 (worker):
--nnodes 2 --node-rank 1 --master-addr <master-ip> --master-port 29501 --headless

OOM Safety on GB10 Unified Memory

The DGX Spark's GB10 uses unified memory — GPU and CPU share the same 128 GB pool. Unlike discrete GPUs, OOM does not produce a graceful CUDA error. Instead, the entire machine freezes, SSH hangs, and a hard power cycle is required.

Critical mitigations applied to both nodes:

Disable swap (swapoff -a) — swap on UMA causes a death spiral, not graceful degradation
Disable memory overcommit (vm.overcommit_memory=0) — malloc fails immediately instead of freezing
Docker --memory=100G — hard ceiling so the process gets killed before the kernel freezes
Reserve free memory (vm.min_free_kbytes=1572864) — 1.5 GiB reserve prevents system freeze
Drop page cache before model load — echo 3 > /proc/sys/vm/drop_caches

Without these, a simple config change (like gpu_mem above 0.78 or custom_ops: ["all"]) can brick both nodes simultaneously.

Conclusion

A two-node DGX Spark cluster running DeepSeek V4 Flash is not only feasible — it's production-viable. With careful configuration (V0 engine, PIECEWISE cudagraph, gpu_mem 0.78, MTP speculation) and proper OOM safeguards, we moved from 5.5 TPS to ~16-20 TPS while maintaining stability.

The key takeaway: unified memory changes everything. What works on discrete GPUs (high gpu_mem, full fusion compilation) can crash the entire cluster on GB10. Test conservatively, verify with health checks, and always apply OOM mitigations first.

For anyone running DeepSeek V4 Flash on DGX Spark hardware: skip V1 (Gloo bug), use PIECEWISE cudagraph (not custom_ops), cap gpu_mem at 0.78, enable MTP speculation, and watch your throughput triple without crashes.

Enjoyed this deep-dive? Subscribe to get notified about new AI infrastructure and DevOps articles.

Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS

Two NVIDIA DGX Sparks. One 685B-parameter MoE model. A 3× throughput improvement without a single crash. Here's how.

The Setup: Two GB10 Workstations as a Distributed Inference Cluster

Node	Hostname	IP	Role
DGX Spark 1	`ai1`	—	Primary (master)
DGX Spark 2	`ai2`	—	Secondary (worker)

Baseline: Before Optimization

The initial configuration was conservative:

Parameter	Baseline Value
`max-num-seqs`	1
`max-num-batched-tokens`	4096
`gpu-memory-utilization`	0.70
FlashInfer autotune	Disabled
CUDA graphs	Disabled (`enforce-eager`)

Baseline performance:

Metric	Value
Inter-token latency (ITL)	182 ms/tok
Throughput	5.5 tokens/second
KV cache capacity	~482K tokens
TTFT (long context)	88.5 seconds
GPU utilization (node 2)	37%
Prefix cache hit rate	72%

The Optimizations: Safe, Incremental, Measured

Ranking the Impact

Before touching anything, we ranked optimizations by expected impact:

Increase max-num-seqs — single greatest throughput multiplier (3-6×)
Raise gpu-memory-utilization — more KV cache headroom for concurrent requests
Enable CUDA graphs — reduce per-step overhead (10-20% ITL improvement)
Increase max-num-batched-tokens — larger batch capacity
Enable FlashInfer autotune — better attention kernel selection

The CUDA Graph Crash (And What We Learned)

--compilation-config {"cudagraph_mode":"PIECEWISE"}

This enables CUDA graphs safely on GB10. The custom_ops: ["all"] path remains usable only if compiled kernels are cached (persistent volume mount). First-run compilation needs system RAM headroom.

Lesson: CUDA graphs are viable on GB10 with PIECEWISE mode. Skip custom_ops: ["all"] unless you have a persistent kernel cache.

Safe Optimizations Applied

After reverting, we applied the remaining optimizations in order:

Parameter	Before	After	Rationale
`max-num-seqs`	1	2	Double concurrent request capacity
`max-num-batched-tokens`	4096	8192	Larger batch for prefix cache efficiency
`gpu-memory-utilization`	0.70	0.78	0.82 triggered OOM on first runs; 0.78 is the sweet spot for stability
FlashInfer autotune	disabled	enabled	Better attention kernel selection
CUDA graphs	disabled	PIECEWISE	PIECEWISE mode works safely; `custom_ops: ["all"]` caused the crash
Expert parallelism	disabled	enabled	Distributes MoE experts across both nodes for better memory balance
MTP speculation	0 tokens	2 tokens	DeepSeek's native MTP speculative decoding adds ~20-30% throughput

Startup metrics:

Model loading: ~75 GiB memory, ~7 min (152s weights + ~5 min warmup/cudagraph compilation)
KV cache: ~5 GiB (at gpu_mem 0.78), sufficient for 200K context
FlashInfer autotune completed successfully in background
CUDA graph capture: ~7 seconds, PIECEWISE mode, ~0.1 GiB memory
Server live on http://0.0.0.0:8000, health check 200 OK

Benchmark Results: 3× Throughput Improvement

Short Prompt Inference (ITL Test)

Five runs with 12 prompt tokens → 200 generation tokens:

Run  | Duration | Tokens | TPS  | ms/tok
-----|----------|--------|------|-------
1    | 12.54s   | 200    | 16.0 | 63
2    | 12.57s   | 200    | 15.9 | 63
3    | 12.55s   | 200    | 15.9 | 63
4    | 12.54s   | 200    | 16.0 | 63
5    | 12.52s   | 200    | 16.0 | 63

Before: 182 ms/tok → After: 63 ms/tok — 2.9× speedup.

Prefix Cache Efficiency

DeepSeek V4 uses a shared prefix KV cache across the cluster. Testing with ~1200 prompt token contexts:

Scenario	Duration	TPS	vs Cold
Cold (first run)	17.56s	2.8	1.0×
Same prompt (cached)	8.38s	6.0	2.1×
Similar prompt (deep cached)	4.03s	12.4	4.4×

Concurrent Request Handling (max-num-seqs=2)

With max-num-seqs=2, the cluster handles two concurrent requests efficiently:

4 concurrent requests completed in 14.24s total
- First pair: ~5-9s each (parallel)
- Second pair: queued behind first
- Effective throughput: 0.28 req/s for 100-token generations

For production use, increasing max-num-seqs further (4-8) would multiply throughput, provided the KV cache has capacity. At 52% KV cache utilization with max-num-seqs=2, there is headroom.

Performance Summary

Metric	Baseline	Optimized	Improvement
ITL	182 ms/tok	63 ms/tok	2.9×
Throughput	5.5 TPS	15.9 TPS	2.9×
With MTP speculation	—	~17-20 TPS (estimated)	~3.5× vs baseline
KV cache	482K tokens	~865K tokens	+79%
GPU utilization	~37%	~65%+ (est.)	~1.8×
Prefix cache speedup	-	Up to 4.4×	massive

Lessons Learned

What Worked

FlashInfer autotune is the highest-impact safe change. It tunes attention kernels to the model architecture and hardware, improving both latency and throughput without any configuration risk.
Incremental deployment with health checks. Changing one parameter at a time and verifying stability prevented cascading failures.
Prefix caching is essential for multi-turn workloads. With 72%+ hit rates, it effectively doubles throughput for chat and agent applications.
PIECEWISE cudagraph works on GB10. Unlike the all-or-nothing custom_ops, PIECEWISE mode compiles only essential graph sections, avoiding the memory exhaustion that caused system freezes.
V0 engine with --nnodes 2 is the correct multi-node path. V1's Gloo-based distributed init is incompatible with --network host on cross-node setups. V0 uses NCCL directly and works reliably.

What Didn't

custom_ops: ["all"] + GB10 = system freeze. The TileLang nvcc compilation exhausts system RAM (~3 GiB available during model load). Use PIECEWISE mode without custom_ops instead.
V1 engine Gloo 127.0.0.1 bug. The V1 engine's multiproc executor registers 127.0.0.1 for Gloo CPU-side init, which fails across nodes. V0 with --nnodes 2 bypasses this entirely.
Ray distributed executor had world-size issues. The PyTorch native distributed executor (--no-ray mode) was more reliable for small clusters.
max-num-seqs=1 is unnecessarily conservative. Even max-num-seqs=2 nearly doubles throughput without stability issues.
MTP speculative decoding may OOM on gpu_mem > 0.78. With 75 GiB loaded and MTP drafter, head node (ai1) may run out of memory. Keep gpu_mem at 0.78 for stability.

What's Next

Increase max-num-seqs to 4-8 when KV cache utilization permits.
Add persistent kernel cache volume to skip TileLang recompilation on restart, enabling custom_ops: ["all"] for maximum fusion performance.
Implement load testing with real production workload patterns.
Stabilize MTP speculation — currently functional but may OOM under extended load. Testing with varied context lengths needed.
Add monitoring — Prometheus + Grafana for real-time GPU utilization and token throughput tracking (already deployed via socat forwarders on legion:10801).

The Config

For reference, the final stable wrapper script configuration (V0 engine, multi-node):

--served-model-name deepseek-v4-flash
--max-model-len 200000
--max-num-seqs 2
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.78
--kv-cache-dtype fp8
--block-size 256
--prefix-caching
--enable-expert-parallel
--compilation-config '{"cudagraph_mode":"PIECEWISE"}'
--speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}'
--disable-custom-all-reduce
--trust-remote-code
--host 0.0.0.0
--port 8000
--load-format safetensors
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4

V0 engine is enforced by either:

Not setting VLLM_USE_V1 (defaults to V0)
Setting .env → CONTAINER_VLLM_USE_V1=0 (passed by launch-cluster.sh)

Distributed launch across two nodes:

# Node 1 (master):
--nnodes 2 --node-rank 0 --master-addr <master-ip> --master-port 29501

# Node 2 (worker):
--nnodes 2 --node-rank 1 --master-addr <master-ip> --master-port 29501 --headless

OOM Safety on GB10 Unified Memory

Critical mitigations applied to both nodes:

Disable swap (swapoff -a) — swap on UMA causes a death spiral, not graceful degradation
Disable memory overcommit (vm.overcommit_memory=0) — malloc fails immediately instead of freezing
Docker --memory=100G — hard ceiling so the process gets killed before the kernel freezes
Reserve free memory (vm.min_free_kbytes=1572864) — 1.5 GiB reserve prevents system freeze
Drop page cache before model load — echo 3 > /proc/sys/vm/drop_caches

Without these, a simple config change (like gpu_mem above 0.78 or custom_ops: ["all"]) can brick both nodes simultaneously.

Conclusion

Enjoyed this deep-dive? Subscribe to get notified about new AI infrastructure and DevOps articles.

Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS

The Setup: Two GB10 Workstations as a Distributed Inference Cluster

Baseline: Before Optimization

The Optimizations: Safe, Incremental, Measured

Ranking the Impact

The CUDA Graph Crash (And What We Learned)

Safe Optimizations Applied

Benchmark Results: 3× Throughput Improvement

Short Prompt Inference (ITL Test)

Prefix Cache Efficiency

Concurrent Request Handling (max-num-seqs=2)

Performance Summary

Lessons Learned

What Worked

What Didn't

What's Next

The Config

OOM Safety on GB10 Unified Memory

Conclusion

Never miss a deep-dive

Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS

The Setup: Two GB10 Workstations as a Distributed Inference Cluster

Baseline: Before Optimization

The Optimizations: Safe, Incremental, Measured

Ranking the Impact

The CUDA Graph Crash (And What We Learned)

Safe Optimizations Applied

Benchmark Results: 3× Throughput Improvement

Short Prompt Inference (ITL Test)

Prefix Cache Efficiency

Concurrent Request Handling (max-num-seqs=2)

Performance Summary

Lessons Learned

What Worked

What Didn't

What's Next

The Config

OOM Safety on GB10 Unified Memory

Conclusion

Never miss a deep-dive