DeepSeek V4: 1.6T Parameters, FP4 Precision, and the Huawei NPU Question | Thoughts & Talks

DeepSeek V4 arrives as two models: a 1.6-trillion-parameter Pro variant and a 284-billion-parameter Flash variant. The headline number is the Pro model's 49 billion active parameters per token, but the real story is the inference architecture. DeepSeek has introduced a hybrid sparse attention mechanism that cuts KV cache memory by roughly an order of magnitude, applied FP4 quantisation-aware training to the expert weights, and validated its Expert Parallel serving scheme on Huawei Ascend NPUs alongside Nvidia GPUs. A one-million-token context is now the default, not an upsell.

The two models

Specification	V4-Pro	V4-Flash
Total parameters	1.6T	284B
Active parameters per token	49B	13B
Training tokens	33T	—
Context length	1M	1M
Precision	FP8 + FP4 (QAT)	FP8 + FP4 (QAT)
Modes	Thinking + Non-Thinking	Thinking + Non-Thinking

V4-Pro targets frontier-level performance. DeepSeek claims it leads all open-weight models and rivals proprietary models across coding, mathematics, STEM reasoning, and world knowledge benchmarks. V4-Flash is the workhorse: smaller, faster, and cheap enough to route every incoming request through without thinking twice about cost. Both support thinking mode (chain-of-thought) and non-thinking mode (standard completion) via the same API call.
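To put the sparsity in perspective, the share of weights that is active for each token follows directly from the table. A quick sketch, with illustrative names that are not part of any DeepSeek tooling:

```python
# Parameter counts taken from the specification table above.
MODELS = {
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active / total


for name, spec in MODELS.items():
    share = active_fraction(spec["total"], spec["active"])
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models touch only a few percent of their weights per token, which is why per-token compute tracks the active count rather than the total.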

What actually changed under the hood

Sparse attention and KV cache compression

The most consequential architectural change is the attention mechanism. V4 uses a hybrid of two techniques: Compressed Sparse Attention (CSA) and Heavy Compressed Attention (HCA). Together they form DeepSeek Sparse Attention (DSA), which performs token-wise compression to reduce both compute and memory during inference.

The KV cache savings are the headline metric: V4 uses 9.5× to 13.7× less memory for KV caches than DeepSeek V3.2. For anyone running inference at scale, where offloading KV caches to system RAM or NVMe is a common workaround for limited GPU memory, this can be the difference between a long-context workload that fits in GPU memory and one that doesn't.
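A rough way to translate that range into capacity planning is to divide an existing V3.2 cache footprint by both ends of the quoted reduction. The sketch below assumes a hypothetical 100 GB baseline; the article does not say which workloads land at which end of the range.

```python
def v4_kv_cache_range_gb(v32_cache_gb: float,
                         min_reduction: float = 9.5,
                         max_reduction: float = 13.7) -> tuple[float, float]:
    """Return (best case, worst case) V4 cache size in GB for a V3.2 cache size."""
    return v32_cache_gb / max_reduction, v32_cache_gb / min_reduction


# Hypothetical example: a workload whose KV cache needs 100 GB on V3.2.
best, worst = v4_kv_cache_range_gb(100.0)
print(f"Estimated V4 KV cache: {best:.1f} to {worst:.1f} GB")
```

Treat the result as a first-order estimate only; measured cache sizes on your own traffic are the number that matters.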

FP4 quantisation-aware training

V3 was among the first open-weight models trained at FP8. V4 goes further: MoE expert weights use a mixture of FP8 and FP4, with quantisation-aware training (QAT) applied to the expert weights specifically. A weight stored in FP4 takes half the memory of the same weight in FP8, so the saving scales with the share of weights held in FP4.

This is a weights-only optimisation on Hopper GPUs, which lack FP4 hardware acceleration. You don't get faster matrix multiplication, but you do get reduced memory footprint and bandwidth — a worthwhile trade-off when the bottleneck is memory capacity, not compute throughput. On hardware with native FP4 support, the gains compound.
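To make the FP4 discussion concrete, here is a minimal sketch of two things: rounding a block of weights to a 4-bit floating-point grid, and estimating weight storage for a given FP8/FP4 split. It assumes the E2M1 flavour of FP4 with one shared scale per block, and the FP4 shares in the loop are hypothetical; the article does not state which FP4 format, block size, or split V4 uses. In QAT, the forward pass typically sees weights rounded in this way while the optimiser keeps updating a higher-precision copy.

```python
# Magnitudes representable in the E2M1 flavour of FP4 (an assumption; the
# article does not say which FP4 format V4 uses).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantise_fp4(block: list[float]) -> list[float]:
    """Round a block of weights to the FP4 grid using one shared scale."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / FP4_E2M1_GRID[-1]  # map the largest weight to 6.0
    quantised = []
    for v in block:
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantised.append((nearest if v >= 0 else -nearest) * scale)
    return quantised


def weight_memory_gb(total_params: float, fp4_fraction: float) -> float:
    """Weight storage in GB when `fp4_fraction` of weights are FP4, the rest FP8.

    Ignores scale factors and other metadata, so real checkpoints are larger.
    """
    bytes_per_param = fp4_fraction * 0.5 + (1.0 - fp4_fraction) * 1.0
    return total_params * bytes_per_param / 1e9


print(fake_quantise_fp4([6.0, -3.1, 0.2, 1.4]))  # [6.0, -3.0, 0.0, 1.5]
for fraction in (0.0, 0.5, 1.0):  # hypothetical FP4 shares; the real split is not given
    print(f"{fraction:.0%} FP4: {weight_memory_gb(1.6e12, fraction):,.0f} GB")
```

The small value 0.2 collapsing to zero is the kind of rounding error QAT is meant to teach the model to tolerate.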

Muon optimiser

Training switches to the Muon optimiser, replacing the optimiser used previously. DeepSeek reports faster convergence and improved training stability. This is a training-time change that doesn't directly affect inference, but it helps explain how a 1.6T-parameter model was trained on 33 trillion tokens without spiralling costs.
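The article does not describe Muon's internals. As publicly described, Muon takes the momentum-averaged gradient of a weight matrix and approximately orthogonalises it with a Newton-Schulz iteration before applying it as the update. The sketch below is a simplification under several assumptions: it uses the textbook cubic iteration rather than the tuned polynomial used in practice, plain Python lists rather than tensors, and hypothetical hyperparameters. Whether DeepSeek's variant differs is not stated here.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def orthogonalise(m: Matrix, steps: int = 20) -> Matrix:
    """Approximately orthogonalise a matrix with cubic Newton-Schulz steps."""
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # which is inside the range where the iteration pushes them towards 1.
    norm = sum(v * v for row in m for v in row) ** 0.5 + 1e-12
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, cubic)]
    return x


def muon_like_step(weights: Matrix, grad: Matrix, momentum: Matrix,
                   lr: float = 0.02, beta: float = 0.95) -> tuple[Matrix, Matrix]:
    """One simplified update: accumulate momentum, orthogonalise, then step."""
    momentum = [[beta * mv + gv for mv, gv in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    update = orthogonalise(momentum)
    weights = [[wv - lr * uv for wv, uv in zip(wr, ur)]
               for wr, ur in zip(weights, update)]
    return weights, momentum
```

Because the update is orthogonalised, every direction in the gradient contributes at a similar scale, which is the usual intuition given for the stability gains.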

The Huawei Ascend NPU angle

The technical paper mentions in passing that DeepSeek "validated its fine-grained Expert Parallel scheme on both Nvidia GPUs and Ascend NPU platforms." That validation is for inference serving, not training. The paper does not state that V4 was trained on Huawei hardware.

This distinction matters. DeepSeek previously attempted to train models on Huawei's Ascend chips but abandoned the effort, reportedly due to defective chips, slow interconnects, and an immature software stack. Validating inference on Ascend NPUs is a meaningful step — it means organisations outside the Nvidia ecosystem can serve these models — but it does not represent the same breakthrough as training on non-Nvidia hardware.

For inference providers building multi-vendor GPU clusters, or organisations in markets where Nvidia hardware is restricted, the Ascend validation removes a real deployment barrier.
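For readers unfamiliar with expert parallelism, the general idea is that each device hosts a subset of the MoE experts and every token is sent to the devices holding the experts its router selected. The toy sketch below shows that dispatch step only. It is a generic illustration with hypothetical function names and scores, not DeepSeek's fine-grained scheme, whose details are in the technical paper.

```python
def top_k_experts(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def dispatch(token_scores: list[list[float]], k: int,
             experts_per_device: int) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by the device that hosts each expert."""
    per_device: dict[int, list[tuple[int, int]]] = {}
    for token_id, scores in enumerate(token_scores):
        for expert in top_k_experts(scores, k):
            device = expert // experts_per_device  # experts sharded contiguously
            per_device.setdefault(device, []).append((token_id, expert))
    return per_device


# Two tokens, four experts, two experts per device (all values hypothetical).
router_scores = [
    [0.10, 0.70, 0.05, 0.15],
    [0.50, 0.05, 0.30, 0.15],
]
print(dispatch(router_scores, k=2, experts_per_device=2))
```

Every token here ends up on both devices, which is why interconnect speed matters so much for this style of serving.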

API pricing and migration

V4 is served from the same base URL as DeepSeek's earlier models. Change the model name and you're running V4:

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

# Flash model — high throughput, low cost
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain MoE expert routing"}],
    max_tokens=512
)

# Pro model — frontier performance
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Solve this step by step"}],
    max_tokens=2048
)

Model	Input (per 1M tokens)	Output (per 1M tokens)
deepseek-v4-flash	$0.14	$0.28
deepseek-v4-pro	$1.74	$3.48
GPT-5.5 (comparison)	$5.00	$30.00
Claude Opus 4.7 (comparison)	$15.00	$75.00

V4-Flash undercuts GPT-5.5 by roughly 107× on output tokens and about 36× on input tokens. V4-Pro is still 8.6× cheaper than GPT-5.5 on output. For high-volume inference workloads such as batch processing, RAG pipelines, and agentic loops, the cost differential is structural, not marginal.
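The ratios are easy to recompute from the table, and the same prices can be used to estimate a bill. The sketch below copies the list prices above; the workload size is hypothetical, and it ignores context-caching discounts and any other pricing tiers.

```python
# (input, output) prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (1.74, 3.48),
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (15.00, 75.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given token counts at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def output_price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model charges per output token."""
    return PRICES[expensive][1] / PRICES[cheap][1]


# Hypothetical monthly workload: 2B input tokens, 500M output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000_000_000, 500_000_000):,.2f}")

ratio = output_price_ratio("GPT-5.5", "deepseek-v4-flash")
print(f"GPT-5.5 vs V4-Flash on output: {ratio:.0f}x")
```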

Both models support the OpenAI ChatCompletions API and the Anthropic API format. Both support thinking and non-thinking modes. Context caching works across both.

Migration deadline: deepseek-chat and deepseek-reasoner will be fully retired on 24 July 2026 at 15:59 UTC. Currently they route to deepseek-v4-flash. Any integration referencing the old model IDs will break after that date.
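One low-risk way to handle the migration is a small shim that rewrites legacy model IDs before requests go out and reports how much time is left. The sketch below is illustrative: the helper names are hypothetical, and mapping both legacy IDs to deepseek-v4-flash simply mirrors the current routing described above. Whether former deepseek-reasoner traffic also needs thinking mode enabled is something to confirm against DeepSeek's own documentation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retirement time stated above: 24 July 2026, 15:59 UTC.
RETIREMENT = datetime(2026, 7, 24, 15, 59, tzinfo=timezone.utc)

# Both legacy IDs currently route to deepseek-v4-flash.
LEGACY_IDS = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}


def migrate_model_id(model_id: str) -> str:
    """Return the V4 model ID for a legacy ID, or the ID unchanged."""
    return LEGACY_IDS.get(model_id, model_id)


def time_until_retirement(now: Optional[datetime] = None) -> timedelta:
    """Time remaining before the legacy IDs stop working (negative once passed)."""
    now = now or datetime.now(timezone.utc)
    return RETIREMENT - now


print(migrate_model_id("deepseek-chat"))
print(f"Days until retirement: {time_until_retirement().days}")
```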

Agentic integration

DeepSeek V4 has been validated with Claude Code, OpenClaw, and OpenCode, three widely used AI agent tools. DeepSeek reports using V4 internally for its own agentic coding workflows. The model targets state-of-the-art results among open-weight models on agentic coding benchmarks, though independent verification of these claims is still pending.

The thinking mode integration is relevant here: agentic frameworks can toggle between chain-of-thought reasoning (for complex planning) and standard completion (for straightforward edits) within the same model, simplifying routing logic.
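As a sketch of what that simplified routing could look like, an agent framework might pick a mode per task with a cheap heuristic. Everything below is hypothetical: the keyword list, the file-count threshold, and the function name are illustrative, and the function only returns a label because the article does not specify how the mode is selected on the API.

```python
# Words that hint at multi-step planning work (an illustrative list).
COMPLEX_HINTS = ("plan", "design", "refactor", "debug", "architecture")


def choose_mode(task_description: str, files_touched: int) -> str:
    """Return "thinking" for planning-heavy tasks, "non-thinking" otherwise."""
    text = task_description.lower()
    if files_touched > 1 or any(hint in text for hint in COMPLEX_HINTS):
        return "thinking"
    return "non-thinking"


print(choose_mode("Rename variable foo to bar", files_touched=1))
print(choose_mode("Plan a migration of the auth module", files_touched=3))
```

With a single model behind both modes, a misclassified task costs some latency or quality on one request rather than requiring a second model deployment.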

Caveats

DeepSeek's benchmarks are self-reported. The V3 and R1 families established a strong track record, but independent evaluations of V4 are not yet available. The FP4 precision trade-off may degrade output quality on tasks that are sensitive to numerical precision — scientific computation, low-resource language processing, or fine-grained classification. The Huawei Ascend validation covers inference serving, not training, and performance on Ascend hardware relative to Nvidia GPUs has not been disclosed.

What to do next

Test V4-Flash against your current model — at $0.28/M output tokens, the cost of switching is near zero. Route a fraction of traffic and compare quality.
Plan the migration — if you're using deepseek-chat or deepseek-reasoner, update model IDs before the 24 July retirement deadline.
Evaluate V4-Pro for agentic workflows — if you're running Claude Code or OpenCode with a LiteLLM proxy, add deepseek-v4-pro to your model pool.
Check the technical report — DeepSeek V4 PDF on HuggingFace for full architectural details.
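For the first step, routing a fraction of traffic can be done with a deterministic hash so that a given request ID always lands on the same model, which keeps quality comparisons reproducible. The sketch below is a minimal illustration; the function name, the 10,000-bucket granularity, and the placeholder name for the current model are assumptions.

```python
import hashlib


def pick_model(request_id: str, canary_fraction: float, current_model: str,
               canary_model: str = "deepseek-v4-flash") -> str:
    """Send roughly `canary_fraction` of requests to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return canary_model if bucket < canary_fraction * 10_000 else current_model


# Hypothetical usage: send about 5% of requests to V4-Flash.
for request_id in ("req-001", "req-002", "req-003"):
    print(request_id, pick_model(request_id, 0.05, "my-current-model"))
```

Log the chosen model alongside each response so the two arms can be compared on the same quality metrics.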