# DeepSeek V4 (1.6T MoE, Multimodal)

{% hint style="info" %}
**Status (April 29, 2026):** DeepSeek V4 was released on **April 22, 2026** with **fully open weights under the MIT license**. Two checkpoints are live: [deepseek-ai/DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) (1.6T total parameters / about 49B active, 1M context) and [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) (284B total parameters / about 13B active). The Pro model passed **174,000 downloads** in its first week and had day-one support in vLLM and SGLang.
{% endhint %}

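DeepSeek V4 is the first 2026 release to ship as a **two-tier release**. **V4-Pro** is the flagship: a **1.6-trillion-parameter mixture-of-experts model** with about **49B active parameters** per token and a **1M-token context window**. It uses a hybrid attention design that combines compressed sparse attention with new high-compression attention (HCA) heads for low-cost long-context prefill. **V4-Flash** is the practical sibling: **284B total / 13B active parameters**, the same architecture, able to run quantized on a single 80GB GPU, and comfortable on 2×48GB devices with the Unsloth GGUF builds.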

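The architecture is the highlight. DeepSeek's hybrid attention sharply reduces KV-cache memory at long context, and the MoE router has been retrained for more precise expert selection. Early independent reports show Pro reaching V3-level coding scores with roughly half the active-parameter compute. This matters for Clore.ai users because **V4-Flash is the first frontier-class model released with full weights and fewer than 15B active parameters**, which brings genuinely open-source inference within reach of a single H100 or an inexpensive multi-4090 server for the first time.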

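For most teams, the realistic Clore deployment is **V4-Flash on 1× A100 80GB or 2× RTX 4090**; that is where the best price-performance sits. V4-Pro is for more serious infrastructure: 8× H100, 4× H200, or 8× B200, preferably with NVLink. If you have been running [DeepSeek V3](/guides/guides_v2-zh/yu-yan-mo-xing/deepseek-v3.md) or [DeepSeek-R1](/guides/guides_v2-zh/yu-yan-mo-xing/deepseek-r1.md), the migration path is direct: same model family, same chat template, and a drop-in replacement on vLLM.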

### Key Specifications

| Property          | DeepSeek V4-Pro                                                                   | DeepSeek V4-Flash                                                                     |
| ----------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| Total parameters  | 1.6T (MoE)                                                                        | 284B (MoE)                                                                            |
| Active parameters | About 49B per token                                                               | About 13B per token                                                                   |
| Context window    | 1,000,000 tokens                                                                  | 256,000 tokens                                                                        |
| Attention         | Compressed sparse + high-compression attention (HCA)                              | Compressed sparse + HCA                                                               |
| License           | MIT                                                                               | MIT                                                                                   |
| Release date      | April 22, 2026                                                                    | April 22, 2026                                                                        |
| HuggingFace       | [deepseek-ai/DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) | [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
| Main toolchains   | vLLM, SGLang (day one)                                                            | vLLM, SGLang, llama.cpp (Unsloth GGUF)                                                |

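### Why DeepSeek V4?

* **Truly open frontier weights:** MIT-licensed, no usage restrictions, fully available for commercial use
* **1M context on Pro, 256K on Flash:** process an entire codebase, a book, or hour-long transcripts in one pass
* **Hybrid sparse attention:** the KV cache grows sublinearly with context length, and prefill is cheaper
* **Two-tier release:** Flash is the first 13B-active MoE good enough to replace V3 for most workflows
* **Day-one vLLM and SGLang support:** no waiting for community patches; `pip install -U` and start
* **MoE efficiency:** per-token compute is that of 13B/49B active parameters, not 284B/1.6T (the full set of expert weights still has to fit in memory)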
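
The MoE efficiency point can be made concrete with a little arithmetic. The sketch below uses the parameter counts from the specification table above. The figure of roughly 2 FLOPs per active parameter per token is a common rule of thumb for a forward pass, not a number published by DeepSeek, and it ignores attention cost.

```python
# Parameter counts taken from the specification table above.
MODELS = {
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    # Rule of thumb (assumption): ~2 FLOPs per active parameter per token.
    return 2.0 * active


for name, p in MODELS.items():
    frac = active_fraction(p["total"], p["active"])
    gflops = approx_forward_flops_per_token(p["active"]) / 1e9
    print(f"{name}: {frac:.1%} of weights active, ~{gflops:.0f} GFLOPs/token")
```

Compute scales with the active count, but memory does not: every expert must be resident, so the GPU memory you rent is set by the total parameter count.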

***

## Requirements

{% hint style="warning" %}
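**V4-Pro is a frontier model.** The full BF16 weights are about 3.2TB (1.6T parameters × 2 bytes each) and need multi-node H100/H200 or 8× B200 with NVLink. There is no single-server BF16 path. If you do not have multi-node infrastructure, run V4-Flash: it delivers about 80% of the quality at 5% of the hardware cost.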
{% endhint %}
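
The 3.2TB figure follows directly from parameter count times bytes per parameter. The helper below is a minimal sketch of that estimate. It covers weights only (no KV cache or activations), and the 0.90 usable fraction is an illustrative assumption rather than a measured value.

```python
def weight_footprint_gb(total_params: float, bits_per_param: float) -> float:
    """Size of the weights alone in decimal GB (no KV cache, no activations)."""
    return total_params * bits_per_param / 8 / 1e9


def fits_in_vram(weights_gb: float, gpu_count: int, gb_per_gpu: float,
                 usable_fraction: float = 0.90) -> bool:
    """True if the weights fit in the usable share of aggregate GPU memory."""
    return weights_gb <= gpu_count * gb_per_gpu * usable_fraction


pro_bf16 = weight_footprint_gb(1.6e12, 16)    # 1.6T parameters at 16 bits
flash_bf16 = weight_footprint_gb(284e9, 16)   # 284B parameters at 16 bits

print(f"V4-Pro BF16:   {pro_bf16:,.0f} GB")
print(f"V4-Flash BF16: {flash_bf16:,.0f} GB")
print("Pro BF16 on one 8x H100 80GB node:", fits_in_vram(pro_bf16, 8, 80))
```

The same function can be used to sanity-check any quantized build against the GPU memory you plan to rent: pass the effective bits per weight of the format and compare the result with the aggregate VRAM.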

| Component | Minimum (V4-Flash, GGUF Q4) | Recommended (V4-Flash FP8)   | Full V4-Pro (BF16)                      |
| --------- | --------------------------- | ---------------------------- | --------------------------------------- |
| GPU VRAM  | 1× 80GB or 2× 48GB          | 1× H100 80GB or 1× A100 80GB | 8× H100 80GB or 4× H200 141GB per node  |
| RAM       | 64GB                        | 128GB                        | 1TB or more                             |
| Disk      | 200GB NVMe                  | 600GB NVMe                   | 4TB NVMe                                |
| CUDA      | 12.4+                       | 12.6+                        | 12.6+                                   |
| Network   | N/A                         | N/A                          | NVLink / 400Gb IB required for multi-node |

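**Clore.ai pick:** for 95% of users, **V4-Flash in FP8 on a single A100 80GB** is the best choice: full 256K context, no quantization loss, and roughly $5–7 per day on the marketplace. Consider an [H100](https://clore.ai/rent-h100.html) or [H200](https://clore.ai/rent-h200.html) tensor-parallel setup only when you genuinely need V4-Pro's 1M context or the extra reasoning headroom.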

***

## Option A: Ollama / GGUF (quantized, V4-Flash only)

Unsloth published GGUF quantizations of V4-Flash within 48 hours of release. Q4\_K\_M is the best balance: it runs on 1×80GB or 2×48GB and stays close to FP8 quality.

```bash
# Pull and run Unsloth's Q4_K_M build (assumes a running container named "ollama")
docker exec ollama ollama pull hf.co/unsloth/DeepSeek-V4-Flash-GGUF:Q4_K_M
docker exec ollama ollama run hf.co/unsloth/DeepSeek-V4-Flash-GGUF:Q4_K_M

# Or serve an already-downloaded GGUF directly with llama.cpp
docker run --gpus all -it --rm -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/DeepSeek-V4-Flash-Q4_K_M.gguf \
  --n-gpu-layers 99 --ctx-size 65536 \
  --port 8080 --host 0.0.0.0
```
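
Loading a model of this size takes a while, so scripts that call the server right after `docker run` tend to fail. Below is a minimal readiness-polling sketch. It assumes the llama.cpp server from the command above is listening on port 8080 and exposes its `/health` endpoint; the attempt count and delay are arbitrary illustrative defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while loading.
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 5.0) -> bool:
    """Call `check` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A typical call would be `wait_until_ready(lambda: http_ok("http://localhost:8080/health"))`, which returns `False` after about five minutes if the server never comes up.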

{% hint style="info" %}
GGUF quantizations of V4-**Pro** exist but are not practical: even Q2\_K is about 400GB, and offloaded performance is unusable for chat. Stick with Flash for quantized deployments.
{% endhint %}

***

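## Option B: vLLM (production API, recommended)

vLLM 0.7.x added day-one support for both V4 checkpoints. The hybrid attention kernels require `--trust-remote-code`, and they need Hopper or Blackwell hardware to reach full speed.

**V4-Flash on a single H100 / A100 80GB:**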

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model deepseek-ai/DeepSeek-V4-Flash
      --tensor-parallel-size 1
      --max-model-len 131072
      --dtype bfloat16
      --gpu-memory-utilization 0.92
      --enable-chunked-prefill
      --served-model-name deepseek-v4-flash
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

**V4-Pro on 8× H100:** replace the command with:

```yaml
    command: >
      --model deepseek-ai/DeepSeek-V4-Pro
      --tensor-parallel-size 8
      --max-model-len 262144
      --dtype bfloat16
      --gpu-memory-utilization 0.90
      --enable-chunked-prefill
      --enable-prefix-caching
      --served-model-name deepseek-v4-pro
      --trust-remote-code
```

```bash
# Test the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Write an async TCP echo server in Rust with graceful shutdown."}],
    "max_tokens": 2048,
    "temperature": 0.6
  }'
```
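
The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library. It assumes the vLLM server above is reachable at `http://localhost:8000/v1` and serves the model under the name `deepseek-v4-flash`; the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "deepseek-v4-flash",
                       base_url: str = "http://localhost:8000/v1",
                       max_tokens: int = 2048,
                       temperature: float = 0.6) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 600.0, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Write an async TCP echo server in Rust with graceful shutdown.")` mirrors the `curl` test. Pass `model="deepseek-v4-pro"` to target the Pro deployment.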

{% hint style="info" %}
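Start with `--max-model-len 131072` even if you eventually want the full 1M context; long contexts significantly increase prefill time and KV-cache memory. Raise the limit only after the baseline is stable.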
{% endhint %}
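
Two small helpers can support that advice. The first checks whether a request's prompt and requested output fit under the configured `--max-model-len`; requests over the limit are typically rejected by the server. The second lists one possible ramp-up schedule; doubling at each step is an assumption made for illustration, not a recommendation from DeepSeek or vLLM.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int,
                 max_model_len: int = 131072) -> bool:
    """True if prompt plus requested output stays within the context limit."""
    return prompt_tokens + max_new_tokens <= max_model_len


def ramp_schedule(start: int = 131072, target: int = 1_000_000) -> list[int]:
    """Doubling steps from `start` up to `target` (the last step is clamped)."""
    steps = [start]
    while steps[-1] < target:
        steps.append(min(steps[-1] * 2, target))
    return steps


print(ramp_schedule())
print(fits_context(prompt_tokens=130_000, max_new_tokens=2048))
```

Validate latency and memory at each step before moving to the next value.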

***

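## Option C: SGLang (alternative, often faster on Hopper)

SGLang's RadixAttention and prefix caching pair well with V4's hybrid attention. For agentic workloads that share prompts, expect noticeably better tok/s than vLLM.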

```bash
docker pull lmsysorg/sglang:latest

# V4-Flash on 1× H100/A100
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp-size 1 \
  --context-length 131072 \
  --mem-fraction-static 0.90 \
  --enable-torch-compile \
  --served-model-name deepseek-v4-flash \
  --trust-remote-code

# V4-Pro on 8× H100
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Pro \
  --tp-size 8 \
  --context-length 262144 \
  --mem-fraction-static 0.88 \
  --enable-torch-compile \
  --served-model-name deepseek-v4-pro \
  --trust-remote-code
```

SGLang's `--enable-torch-compile` typically adds another 10–20% throughput on Hopper after the initial warm-up.
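
Throughput claims like these are worth checking on your own workload. The sketch below shows one way to turn a timed request into tokens per second and to compare two runs. It assumes you read `completion_tokens` from the `usage` field of an OpenAI-compatible response and time the request yourself; the function names are illustrative.

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput for a single timed request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def relative_gain(baseline_tps: float, candidate_tps: float) -> float:
    """Fractional change of candidate over baseline (0.15 means +15%)."""
    return (candidate_tps - baseline_tps) / baseline_tps


# Hypothetical measurements: the same prompt with and without torch.compile.
baseline = tokens_per_second(completion_tokens=512, elapsed_s=8.0)
compiled = tokens_per_second(completion_tokens=512, elapsed_s=7.0)
print(f"gain: {relative_gain(baseline, compiled):+.1%}")
```

Run several warm-up requests before measuring, since the first requests after startup include compilation time.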

***

## Clore.ai GPU Recommendations

| Configuration                                              | Model                                  | VRAM       | Expected throughput                | Clore.ai cost              |
| ---------------------------------------------------------- | -------------------------------------- | ---------- | ---------------------------------- | -------------------------- |
| 2× [RTX 4090](https://clore.ai/rent-4090.html) (Q4 GGUF)   | V4-Flash                               | 48GB       | Hobbyist use, single stream        | About $2–3/day             |
| 1× [A100 80GB](https://clore.ai/rent-a100-80gb.html) (FP8) | V4-Flash                               | 80GB       | Stable single-tenant production    | About $5–7/day             |
| 1× RTX 5090 32GB (Q4 GGUF, partial offload)                | V4-Flash                               | 32GB + RAM | Limited, development only          | Peaks at about $3.94/hour  |
| 4× [H100 80GB](https://clore.ai/rent-h100.html)            | V4-Flash FP8 (overkill) or V4-Pro Q4   | 320GB      | Multi-tenant Flash, single-stream Pro | About $24–32/day        |
| 8× [H100 80GB](https://clore.ai/rent-h100.html)            | V4-Pro BF16                            | 640GB      | Production-grade frontier inference | About $48–64/day          |
| 4× [H200 141GB](https://clore.ai/rent-h200.html)           | V4-Pro BF16 + 1M context               | 564GB      | Full 1M context, highest throughput | About $32–48/day          |

{% hint style="success" %}
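**Best value on Clore.ai:** 1× A100 80GB running V4-Flash FP8. You get 256K context, the inference cost of about 13B active parameters, and no quantization loss, for a bill roughly comparable to a Claude Sonnet API subscription, while the weights stay on your own machine.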
{% endhint %}
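
To compare a daily rental price with per-token API pricing, convert it to cost per million generated tokens. The sketch below does that arithmetic. The $6/day figure sits inside the A100 range quoted above, while the 500 tok/s sustained throughput is a purely hypothetical input that you should replace with your own measurement.

```python
SECONDS_PER_DAY = 86_400


def monthly_cost_usd(usd_per_day: float, days: int = 30) -> float:
    """Rental cost over a billing period."""
    return usd_per_day * days


def cost_per_million_tokens(usd_per_day: float, tokens_per_second: float) -> float:
    """USD per million tokens at a sustained aggregate throughput."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return usd_per_day / tokens_per_day * 1_000_000


# Hypothetical inputs: $6/day rental, 500 tok/s sustained across all requests.
print(f"monthly: ${monthly_cost_usd(6.0):.2f}")
print(f"per 1M tokens: ${cost_per_million_tokens(6.0, 500.0):.3f}")
```

The per-token figure only holds if the GPU stays busy; an idle server still costs the full daily rate.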

***

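## Use Cases

* **Whole-codebase reasoning:** V4-Pro's 1M context fits a typical 500K-LOC monorepo and its tests in one pass
* **Long-document RAG:** put entire books, court filings, or annual reports directly into context and skip the chunking pipeline
* **Agentic coding:** V4-Flash performs close to V3 on SWE-Bench at a fraction of the inference cost; pair it with SWE-agent or OpenHands
* **Multi-document synthesis:** research workflows that previously needed Gemini 2.5 Pro can now run on your own hardware
* **Self-hosted Cursor / Copilot alternative:** V4-Flash on a single A100 is enough for a five-person development team
* **Fine-tuning base model:** the MIT license plus a clean MoE architecture make it a strong starting point for domain fine-tuning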
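
Before relying on whole-codebase prompts, it helps to estimate whether a repository fits the window at all. The sketch below walks a directory and applies a rough four-characters-per-token heuristic. That ratio, the extension list, and the output reserve are all assumptions made for illustration; the real count depends on the model's tokenizer and on the languages in the repository.

```python
import os

CHARS_PER_TOKEN = 4.0  # assumption: rough heuristic, not the DeepSeek tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return int(len(text) / CHARS_PER_TOKEN)


def estimate_repo_tokens(root: str,
                         extensions=(".py", ".rs", ".go", ".ts", ".md")) -> int:
    """Sum the estimated tokens of matching source files under `root`."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git in place.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.endswith(extensions):
                continue
            try:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total += estimate_tokens(fh.read())
            except OSError:
                continue  # unreadable file: leave it out of the estimate
    return total


def fits_window(tokens: int, window: int = 1_000_000,
                reserve_for_output: int = 8192) -> bool:
    """True if the prompt leaves room for the reserved output tokens."""
    return tokens + reserve_for_output <= window
```

For example, `fits_window(estimate_repo_tokens("."))` gives a quick yes or no for the current directory. For an exact count, tokenize with the tokenizer shipped in the model repository.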

***

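## Benchmarks

{% hint style="warning" %}
**Vendor claims: verify independently.** The numbers below come from DeepSeek's April 22, 2026 announcement and model cards. Independent reproductions are still being published; treat these as directional, not definitive.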
{% endhint %}

| Benchmark                              | V4-Pro | V4-Flash | DeepSeek V3 | GLM-5.1 |
| -------------------------------------- | ------ | -------- | ----------- | ------- |
| MMLU-Pro                               | \~84%  | \~78%    | \~76%       | \~80%   |
| SWE-Bench Verified                     | \~82%  | \~74%    | \~70%       | \~79%   |
| HumanEval                              | \~96%  | \~92%    | \~91%       | \~94%   |
| MATH-500                               | \~94%  | \~88%    | \~85%       | \~90%   |
| LiveCodeBench                          | \~76%  | \~68%    | \~62%       | \~72%   |
| Long context (1M needle-in-a-haystack) | \~98%  | N/A      | N/A         | N/A     |

For an apples-to-apples comparison with other open-weight models, see the [GLM-5.1 guide](/guides/guides_v2-zh/yu-yan-mo-xing/glm-5-1.md). V4-Pro and GLM-5.1 trade wins depending on the benchmark.

***

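## Troubleshooting

| Problem                                           | Solution                                                                                                                                                                     |
| ------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` when loading V4-Pro on 8× H100 | BF16 needs about 3.2TB, far more than the 640GB on a single 8× H100 node (or the 564GB on 4× H200). Use a multi-node deployment or a quantized Pro build.                    |
| Error reporting an unsupported attention backend  | V4 requires vLLM ≥ 0.7.0 or SGLang ≥ 0.4.4. Run `pip install -U vllm` (or pull the `:latest` Docker image).                                                                  |
| Slow HuggingFace downloads                        | Use `huggingface-cli download deepseek-ai/DeepSeek-V4-Flash --local-dir ./weights --resume-download`. Pro is about 3.2TB; Flash is about 570GB.                              |
| `--trust-remote-code` rejected                    | The hybrid attention module ships as custom code in the model repository, so both engines need `--trust-remote-code` until the kernels land in upstream Transformers.        |
| GGUF Q4 outputs gibberish                         | Make sure you are using the Unsloth build (`unsloth/DeepSeek-V4-Flash-GGUF`), not an early community quantization. The MoE router needs special handling that early quantizations got wrong. |
| OOM at V4-Pro's 1M context                        | Drop to `--max-model-len 262144` and add `--enable-prefix-caching`. Serving a true 1M context needs H200 or B200.                                                            |
| Slow prefill at long context                      | This is expected. Even with hybrid attention, prefill beyond 500K tokens takes minutes, not seconds. Use `--enable-chunked-prefill` and prefix caching to amortize the cost. |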
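
The version requirement in the table can be checked before launching a server. The sketch below reads installed package versions with the standard library and compares them with the minimums listed above. The parser is deliberately simple: it keeps only the leading numeric components, so a release candidate such as `0.7.0rc1` is treated like `0.7.0`.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Leading numeric components only: '0.7.3.post1' -> (0, 7, 3)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions after padding both to the same length with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_engine(package: str, minimum: str) -> str:
    """Report whether an installed inference engine is new enough."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    return f"{package} {installed}: {status}"


for pkg, minimum in (("vllm", "0.7.0"), ("sglang", "0.4.4")):
    print(check_engine(pkg, minimum))
```

Run it inside the same environment or container that launches the server, since the host Python may have different packages installed.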

***

## Next Steps

* **Predecessor:** [DeepSeek V3](/guides/guides_v2-zh/yu-yan-mo-xing/deepseek-v3.md), the model V4-Flash effectively replaces
* **Reasoning sibling:** [DeepSeek-R1](/guides/guides_v2-zh/yu-yan-mo-xing/deepseek-r1.md), tuned for chain-of-thought and still useful for math-heavy workflows
* **Open-weight alternative:** [GLM-5.1](/guides/guides_v2-zh/yu-yan-mo-xing/glm-5-1.md), a 744B MoE that is top-tier on SWE-Bench Pro with comparable price-performance
* **Multimodal alternative:** [Qwen3.5-Omni](/guides/guides_v2-zh/yu-yan-mo-xing/qwen35-omni.md), if you need vision and audio in the same model
* **Rent hardware:** [Clore.ai marketplace](https://clore.ai/marketplace), with H100/H200/A100/RTX 4090 from $0.50/day

### Links

* [DeepSeek-V4-Pro on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)
* [DeepSeek-V4-Flash on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
* [Unsloth V4-Flash GGUF quantizations](https://huggingface.co/unsloth/DeepSeek-V4-Flash-GGUF)
* [DeepSeek GitHub](https://github.com/deepseek-ai)
* [vLLM documentation](https://docs.vllm.ai)
* [SGLang repository](https://github.com/sgl-project/sglang)

