# DeepSeek V4 (1.6T MoE, Multimodal)

{% hint style="info" %}
**Status (April 29, 2026):** DeepSeek V4 dropped on **April 22, 2026** with **full open weights under MIT license**. Two checkpoints are live: [deepseek-ai/DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) (1.6T total / \~49B active, 1M context) and [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) (284B total / \~13B active). The Pro model has already crossed **174K downloads in its first week**, with day-0 support in vLLM and SGLang.
{% endhint %}

DeepSeek V4 is the first open-weight frontier model of 2026 to ship as a **two-tier release**. **V4-Pro** is the flagship — a **1.6 trillion parameter Mixture-of-Experts** with roughly **49B active parameters per token**, a **1M token context window**, and a hybrid attention design that combines Compressed Sparse Attention with a new Heavily Compressed Attention head for cheap long-context prefill. **V4-Flash** is the practical sibling — **284B total / 13B active**, the same architecture, fits on a single 80GB GPU when quantized, and runs comfortably on a 2×48GB box with Unsloth GGUF builds.

The architecture is the headline. DeepSeek's hybrid attention drops KV-cache memory dramatically at long context, and the MoE router has been retrained for sharper expert selection — early independent runs report Pro hitting V3-level coding scores at roughly half the active-parameter compute. For Clore.ai users this matters because **V4-Flash is the first sub-15B-active frontier-class model to ship with full weights**, putting serious open inference within reach of a single H100 or a cheap multi-4090 box.

For most teams the realistic Clore deployment is **V4-Flash on 1× A100 80GB or 2× RTX 4090** — that's where the price-performance lives. V4-Pro is reserved for serious infra: 8× H100, 4× H200, or 8× B200, ideally with NVLink. If you've been running [DeepSeek V3](/guides/language-models/deepseek-v3.md) or [DeepSeek-R1](/guides/language-models/deepseek-r1.md), the migration path is straightforward — same model family, same chat template, drop-in replacement on vLLM.

### Key Specs

| Property          | DeepSeek V4-Pro                                                                   | DeepSeek V4-Flash                                                                     |
| ----------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| Total Parameters  | 1.6T (MoE)                                                                        | 284B (MoE)                                                                            |
| Active Parameters | \~49B per token                                                                   | \~13B per token                                                                       |
| Context Window    | 1,000,000 tokens                                                                  | 256,000 tokens                                                                        |
| Attention         | Compressed Sparse + Heavily Compressed Attention                                  | Compressed Sparse + HCA                                                               |
| License           | MIT                                                                               | MIT                                                                                   |
| Release Date      | April 22, 2026                                                                    | April 22, 2026                                                                        |
| HuggingFace       | [deepseek-ai/DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
| Primary Tooling   | vLLM, SGLang (day-0)                                                              | vLLM, SGLang, llama.cpp (Unsloth GGUF)                                                |

### Why DeepSeek V4?

* **Truly open frontier weights** — MIT license, no usage restrictions, full commercial use
* **1M context on Pro, 256K on Flash** — handles entire codebases, books, or hour-long transcripts in one pass
* **Hybrid sparse attention** — KV cache scales sub-linearly at long context, prefill is cheap
* **Two-tier release** — Flash is the first 13B-active MoE good enough to replace V3 for most workflows
* **Day-0 vLLM and SGLang support** — no waiting for community patches, just `pip install -U` and go
* **MoE efficiency** — you pay 13B/49B inference cost, not 284B/1.6T

***

## Requirements

{% hint style="warning" %}
**V4-Pro is a frontier model.** Full BF16 weights are \~3.2TB and require a multi-node H100/H200/B200 cluster. There is no single-node BF16 path. If you don't have multi-node infra, run V4-Flash — it's 80% of the quality at 5% of the hardware cost.
{% endhint %}

| Component | Min (V4-Flash, GGUF Q4) | Recommended (V4-Flash FP8)   | Full V4-Pro (BF16)               |
| --------- | ----------------------- | ---------------------------- | -------------------------------- |
| GPU VRAM  | 1× 80GB or 2× 48GB      | 1× H100 80GB or 1× A100 80GB | 8× H100 80GB or 4× H200 141GB    |
| RAM       | 64GB                    | 128GB                        | 1TB+                             |
| Disk      | 200GB NVMe              | 600GB NVMe                   | 4TB NVMe                         |
| CUDA      | 12.4+                   | 12.6+                        | 12.6+                            |
| Network   | —                       | —                            | NVLink / 400Gb IB for multi-node |

**Clore.ai pick:** For 95% of users, **V4-Flash on a single A100 80GB at FP8** is the sweet spot — full 256K context, no quantization loss, \~$5–7/day on the marketplace. Reach for [H100](https://clore.ai/rent-h100.html) or [H200](https://clore.ai/rent-h200.html) tensor-parallel setups only when you actually need the V4-Pro 1M context or the extra reasoning headroom.
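
Before installing anything, it's worth confirming that a freshly rented instance actually exposes the VRAM, driver, RAM, and disk the table above calls for. A minimal sanity check using standard tools:

```bash
# GPU model, total VRAM, and driver version as the container sees them
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv

# System RAM and free disk space for the weight download
free -h
df -h .
```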

***

## Option A — Ollama / GGUF (Quantized, V4-Flash only)

Unsloth published GGUF quants for V4-Flash within 48 hours of release. Q4\_K\_M is the practical pick — it fits on 1× 80GB or 2× 48GB and stays close to FP8 quality.

```bash
# Pull the Unsloth Q4_K_M build
docker exec ollama ollama pull hf.co/unsloth/DeepSeek-V4-Flash-GGUF:Q4_K_M
docker exec ollama ollama run hf.co/unsloth/DeepSeek-V4-Flash-GGUF:Q4_K_M

# Or with llama.cpp directly on a downloaded GGUF
docker run --gpus all -it --rm -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/DeepSeek-V4-Flash-Q4_K_M.gguf \
  --n-gpu-layers 99 --ctx-size 65536 \
  --port 8080 --host 0.0.0.0
```
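
The llama.cpp server exposes an OpenAI-compatible API, so once it's up you can smoke-test the quantized build with a plain curl call. This assumes the default port 8080 from the command above:

```bash
# Minimal chat completion against the llama.cpp server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "max_tokens": 16
  }'
```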

{% hint style="info" %}
GGUF quants for V4-**Pro** exist but are not practical — even Q2\_K is \~400GB and offload performance is unusable for chat. Stick to Flash for quantized deployments.
{% endhint %}

***

## Option B — vLLM (Production API, recommended)

vLLM 0.7.x added day-0 support for both V4 checkpoints. The hybrid attention kernels need `--trust-remote-code` and Hopper or Blackwell hardware for full speed.

**V4-Flash on a single H100 / A100 80GB:**

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model deepseek-ai/DeepSeek-V4-Flash
      --tensor-parallel-size 1
      --max-model-len 131072
      --dtype bfloat16
      --gpu-memory-utilization 0.92
      --enable-chunked-prefill
      --served-model-name deepseek-v4-flash
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```
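
Save the file as `docker-compose.yml`, bring the stack up, and confirm the served model is registered before pointing clients at it. The first launch pulls the weights into the `hf_cache` volume, so allow plenty of time:

```bash
# Start the service and watch the logs until the engine reports it is ready
docker compose up -d
docker compose logs -f vllm

# The served model name should be listed once the API is live
curl http://localhost:8000/v1/models
```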

**V4-Pro on 8× H100 80GB:** swap the command for:

```yaml
    command: >
      --model deepseek-ai/DeepSeek-V4-Pro
      --tensor-parallel-size 8
      --max-model-len 262144
      --dtype bfloat16
      --gpu-memory-utilization 0.90
      --enable-chunked-prefill
      --enable-prefix-caching
      --served-model-name deepseek-v4-pro
      --trust-remote-code
```

```bash
# Test the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Write a Rust async TCP echo server with graceful shutdown."}],
    "max_tokens": 2048,
    "temperature": 0.6
  }'
```

{% hint style="info" %}
Start with `--max-model-len 131072` even if you ultimately want the full 1M ctx — long contexts dramatically increase prefill time and KV memory. Bump it up only after the baseline is stable.
{% endhint %}

***

## Option C — SGLang (alternative, often faster on Hopper)

SGLang's RadixAttention and prefix caching pair well with V4's hybrid attention — for agentic workloads with shared prompts, expect noticeably better tok/s than vLLM.

```bash
docker pull lmsysorg/sglang:latest

# V4-Flash on 1× H100/A100 (launched inside the SGLang container)
docker run --gpus all --rm --shm-size 16g -p 30000:30000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Flash \
    --tp-size 1 \
    --context-length 131072 \
    --mem-fraction-static 0.90 \
    --enable-torch-compile \
    --served-model-name deepseek-v4-flash \
    --trust-remote-code \
    --host 0.0.0.0 --port 30000

# V4-Pro on 8× H100
docker run --gpus all --rm --shm-size 32g -p 30000:30000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Pro \
    --tp-size 8 \
    --context-length 262144 \
    --mem-fraction-static 0.88 \
    --enable-torch-compile \
    --served-model-name deepseek-v4-pro \
    --trust-remote-code \
    --host 0.0.0.0 --port 30000
```
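
SGLang serves the same OpenAI-compatible API. With the `--port 30000` mapping above, a quick smoke test looks like this:

```bash
# Minimal chat completion against the SGLang server
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "One sentence on what prefix caching buys me."}],
    "max_tokens": 128
  }'
```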

SGLang's `--enable-torch-compile` typically adds another 10–20% throughput on Hopper after the initial warmup.

***

## Clore.ai GPU Recommendations

| Setup                                                      | Model                                | VRAM       | Expected Throughput                   | Clore.ai Cost   |
| ---------------------------------------------------------- | ------------------------------------ | ---------- | ------------------------------------- | --------------- |
| 2× [RTX 4090](https://clore.ai/rent-4090.html) (Q4 GGUF)   | V4-Flash                             | 48GB       | Hobby use, single-stream              | \~$2–3/day      |
| 1× [A100 80GB](https://clore.ai/rent-a100-80gb.html) (FP8) | V4-Flash                             | 80GB       | Solid production single-tenant        | \~$5–7/day      |
| 1× RTX 5090 32GB (Q4 GGUF, partial offload)                | V4-Flash                             | 32GB + RAM | Constrained, dev only                 | \~$3.94/hr peak |
| 4× [H100 80GB](https://clore.ai/rent-h100.html)            | V4-Flash FP8 (overkill) or V4-Pro Q4 | 320GB      | Multi-tenant Flash, single-stream Pro | \~$24–32/day    |
| 8× [H100 80GB](https://clore.ai/rent-h100.html)            | V4-Pro BF16                          | 640GB      | Production frontier inference         | \~$48–64/day    |
| 4× [H200 141GB](https://clore.ai/rent-h200.html)           | V4-Pro BF16 + 1M ctx                 | 564GB      | Full 1M context, max throughput       | \~$32–48/day    |

{% hint style="success" %}
**Best value on Clore.ai:** 1× A100 80GB running V4-Flash FP8. You get 256K context, \~13B active inference cost, no quantization loss, and the bill is roughly the price of a Claude Sonnet API subscription — with weights that stay on your box.
{% endhint %}

***

## Use Cases

* **Whole-codebase reasoning** — V4-Pro's 1M context holds a mid-sized repo (on the order of 100K LOC) plus its tests in one prompt
* **Long-form RAG** — drop entire books, court filings, or annual reports into context, skip the chunking pipeline
* **Agentic coding** — V4-Flash matches V3 on SWE-Bench at a fraction of the inference cost; pair with SWE-agent or OpenHands
* **Multi-document synthesis** — research workflows that previously needed Gemini 2.5 Pro now run on your own hardware
* **Self-hosted Cursor / Copilot replacement** — V4-Flash on a single A100 saturates a 5-developer team; see the endpoint snippet after this list
* **Fine-tuning base** — MIT license + clean MoE architecture makes it a strong starting point for domain fine-tunes
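
For the self-hosted assistant use case, most tools built on the OpenAI SDK only need a base URL and a placeholder API key pointed at your vLLM or SGLang endpoint. Exact settings vary by tool, and the hostname below is a stand-in for your own Clore instance:

```bash
# Point OpenAI-SDK-based tools at the self-hosted endpoint
# (replace your-clore-host with the address of your rented instance)
export OPENAI_BASE_URL="http://your-clore-host:8000/v1"
export OPENAI_API_KEY="local-placeholder"

# Verify the tool will see the served model
curl "$OPENAI_BASE_URL/models"
```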

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed — verify independently.** Numbers below come from DeepSeek's April 22, 2026 announcement and the model card. Independent reproductions are still being published; treat as directional, not gospel.
{% endhint %}

| Benchmark                            | V4-Pro | V4-Flash | DeepSeek V3 | GLM-5.1 |
| ------------------------------------ | ------ | -------- | ----------- | ------- |
| MMLU-Pro                             | \~84%  | \~78%    | \~76%       | \~80%   |
| SWE-Bench Verified                   | \~82%  | \~74%    | \~70%       | \~79%   |
| HumanEval                            | \~96%  | \~92%    | \~91%       | \~94%   |
| MATH-500                             | \~94%  | \~88%    | \~85%       | \~90%   |
| LiveCodeBench                        | \~76%  | \~68%    | \~62%       | \~72%   |
| Long-context (1M needle-in-haystack) | \~98%  | n/a      | n/a         | n/a     |

For an apples-to-apples open-weight comparison see the [GLM-5.1 guide](/guides/language-models/glm-5-1.md) — V4-Pro and GLM-5.1 trade blows depending on the benchmark.

***

## Troubleshooting

| Issue                                       | Solution                                                                                                                                                                   |
| ------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` loading V4-Pro on 8×H100 | BF16 needs \~3.2TB of weights — a single 8× H100 node can't hold them. Go multi-node for BF16, or serve a quantized Pro build.                                              |
| `unsupported attention backend`             | V4 needs vLLM ≥ 0.7.0 or SGLang ≥ 0.4.4. Run `pip install -U vllm` (or pull `:latest` Docker image).                                                                       |
| Slow HuggingFace download                   | Use `huggingface-cli download deepseek-ai/DeepSeek-V4-Flash --local-dir ./weights --resume-download`. Pro is \~3.2TB; Flash is \~570GB.                                    |
| `--trust-remote-code` rejected              | The hybrid attention modules ship as custom code in the repo — `--trust-remote-code` is required for both engines until the kernels land in upstream Transformers.         |
| GGUF Q4 outputs gibberish                   | Make sure you're on the Unsloth build (`unsloth/DeepSeek-V4-Flash-GGUF`), not an early community quant. The MoE router needs special handling that early quants got wrong. |
| 1M context OOM on V4-Pro                    | Drop to `--max-model-len 262144` and add `--enable-prefix-caching`. Real 1M serving needs H200 or B200.                                                                    |
| Slow prefill at long context                | This is expected — even with hybrid attention, 500K+ prefill is minutes, not seconds. Use `--enable-chunked-prefill` and prefix caching to amortize.                       |

***

## Next Steps

* **Predecessor:** [DeepSeek V3](/guides/language-models/deepseek-v3.md) — the model V4-Flash effectively replaces
* **Reasoning sibling:** [DeepSeek-R1](/guides/language-models/deepseek-r1.md) — chain-of-thought tuned, still useful for math-heavy workflows
* **Open-weight alternative:** [GLM-5.1](/guides/language-models/glm-5-1.md) — 744B MoE, top of SWE-Bench Pro, comparable price-performance
* **Multimodal alternative:** [Qwen3.5-Omni](/guides/language-models/qwen35-omni.md) — if you need vision/audio in the same model
* **Rent the hardware:** [Clore.ai Marketplace](https://clore.ai/marketplace) — H100/H200/A100/RTX 4090 from $0.50/day

### Links

* [DeepSeek-V4-Pro on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)
* [DeepSeek-V4-Flash on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
* [Unsloth V4-Flash GGUF quants](https://huggingface.co/unsloth/DeepSeek-V4-Flash-GGUF)
* [DeepSeek GitHub](https://github.com/deepseek-ai)
* [vLLM docs](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/deepseek-v4.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
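
For example, a question about minimum hardware can be sent with `curl -G`, which URL-encodes the query parameter:

```bash
curl -G "https://docs.clore.ai/guides/language-models/deepseek-v4.md" \
  --data-urlencode "ask=What is the smallest GPU setup that can serve DeepSeek V4-Flash?"
```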
