# Hy3 Preview (Tencent Hunyuan 3, 295B MoE)

{% hint style="info" %}
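**Status (April 2026):** Hy3 Preview is the first public release from **Tencent Hunyuan's rebuilt training infrastructure**. It was released on **April 13, 2026** and last updated on **April 23, 2026**. Weights: [huggingface.co/tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview), under the **Tencent Hy Community License**. vLLM and SGLang provided Day-0 support.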
{% endhint %}

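Hy3 Preview is a **295B-parameter Mixture-of-Experts (MoE)** language model that activates only **about 21B parameters** per token (192 experts, top-8 routing). It targets two workloads where Tencent is visibly catching up: **long-horizon reasoning** (FrontierScience-Olympiad, IMOAnswerBench, math PhD exams) and **agentic coding** (SWE-bench Verified 74.4%, Terminal-Bench 2.0 54.4%, both vendor-reported). The 256K context window plus an MTP (Multi-Token Prediction) speculative decoding layer make it a fit for IDE-grade coding agents and document-heavy RAG.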

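For Clore.ai users, the number that matters most is the **21B active parameters**. You do not need a full 8×H200 rack. A tensor-parallel deployment across **4×A100 80GB** or **2×H100 80GB** (BF16 with part of the weights offloaded to system RAM, since neither configuration holds the full ~590GB checkpoint in VRAM) is enough to serve at usable throughput. That gets you frontier-class agentic coding for roughly $10–20/day on the marketplace, with the weights staying on the machine you rent.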

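### Key Specifications

| Property | Value |
| ----- | ----------------------------------- |
| Total parameters | 295B (MoE) |
| Active parameters | 21B per forward pass |
| Experts | 192 total, top-8 routing |
| Layers | 80 Transformer layers + 1 MTP layer |
| Attention | 64 heads, GQA, 8 KV heads, head dimension 128 |
| Hidden size | 4096 |
| Intermediate size | 13,312 |
| Vocabulary | 120,832 |
| Context window | 256,000 tokens |
| Native precision | BF16 |
| License | Tencent Hy Community License |
| Release date | April 13, 2026 |
| Organization | Tencent Hunyuan |
| Main toolchain | vLLM, SGLang, AngelSlim, LLaMA-Factory |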
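
The attention figures in the table are enough for a rough KV-cache estimate, which is what limits context length once the weights are loaded. The sketch below assumes a standard GQA cache (one K and one V vector per KV head per layer), BF16 cache entries, and ignores the MTP layer and any framework overhead; the function names are illustrative, not part of any library.

```python
def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    Assumption: plain GQA cache, K and V stored separately, BF16 (2 bytes).
    """
    # Factor 2 covers the K tensor and the V tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens, **spec):
    """KV cache size in GiB for a single sequence of the given length."""
    return kv_cache_bytes_per_token(**spec) * context_tokens / 2**30


if __name__ == "__main__":
    print(kv_cache_bytes_per_token())   # 327680 bytes, i.e. 320 KiB per token
    print(kv_cache_gib(65_536))         # 20.0 GiB for one 64K sequence
    print(kv_cache_gib(256_000))        # 78.125 GiB for one full 256K sequence
```

Under these assumptions a single full-length 256K sequence needs about 78 GiB of cache on top of the weights, which is why the examples in this guide default to `--max-model-len 65536` or lower.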

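### Why Hy3 Preview?

* **First release from Tencent's rebuilt RL stack:** Tencent rewrote its training infrastructure for this release; expect rapid iteration through 2026
* **MoE with 21B active parameters:** per-token compute is roughly that of a 21B dense model, not a 295B one (memory footprint is still 295B-class)
* **256K context:** enough to take in a full code repository, long agent trajectories, or multi-document RAG in one pass
* **MTP speculative layer:** built-in multi-token prediction gives roughly 1.5–2× faster decoding on Hopper-class GPUs
* **Two reasoning modes:** `reasoning_effort: "high"` for chain-of-thought, `"no_think"` for fast direct answers
* **Built for agentic coding:** explicitly tuned for SWE-bench-style multi-turn tool use and terminal agents
* **Open-source-friendly license:** the Tencent Hy Community License is described as Apache-like for most uses; check the LICENSE file against your specific use case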

***

## Requirements

{% hint style="warning" %}
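**It is still a 295B-class model.** "21B active" describes inference compute, not memory footprint. The full BF16 weights are about 590GB and must sit in VRAM (or be offloaded). For unconstrained throughput, plan for 8×H100/H200; 4×A100 80GB also works with offloading and a shorter context.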
{% endhint %}
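
The ~590GB figure follows directly from the parameter count, and the same arithmetic shows how far a given GPU set is from holding the weights. This is a back-of-envelope sketch that counts raw weights only (no KV cache, activations, or framework overhead) and treats 4-bit quantization as exactly 0.5 bytes per parameter, which is an approximation; the helper names are made up for illustration.

```python
def weight_gb(params_billion=295, bytes_per_param=2.0):
    """Raw weight size in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def headroom_gb(total_vram_gb, params_billion=295, bytes_per_param=2.0):
    """VRAM left after weights (negative means that much must be offloaded)."""
    return total_vram_gb - weight_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    print(weight_gb())                                 # 590.0 GB in BF16
    print(headroom_gb(4 * 80))                         # -270.0: 4x A100 80GB must offload ~270 GB
    print(headroom_gb(8 * 80))                         # 50.0: 8x H100 80GB leaves ~50 GB
    print(weight_gb(bytes_per_param=0.5))              # 147.5 GB at ~4 bits per weight
    print(headroom_gb(4 * 80, bytes_per_param=0.5))    # 172.5 GB free on 4x A100 80GB
```

The negative result for 4× A100 80GB is the reason the BF16 configurations on smaller GPU sets in this guide rely on offloading or a quantized checkpoint.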

| Component | Minimum (Q4 GGUF, offload) | Recommended (BF16, TP) | Full BF16 (production) |
| ------ | --------------------- | ------------------- | ------------------------ |
| GPU memory | ~80GB + 256GB RAM offload | 4× A100 80GB (320GB), with weight offload | 8× H100 80GB or 8× H20-3e |
| System RAM | 256GB | 384GB | 512GB |
| Disk | 700GB NVMe | 1TB NVMe | 1.5TB NVMe |
| CUDA | 12.4+ | 12.4+ | 12.6+ |
| NVIDIA driver | 550+ | 550+ | 560+ |

**Clore.ai pick:** for most teams, **4× A100 80GB** with BF16 tensor parallelism and `--max-model-len 65536` is the sweet spot (roughly $10–16/day). If you need the full 256K context with concurrent users, move up to 8× H100, or to 8× H200 for more KV-cache headroom.

***

## Option A: Ollama / GGUF (quantized, community builds)

{% hint style="warning" %}
**Note:** Hy3 Preview is a brand-new release (April 13, 2026) with a custom MoE architecture. Community llama.cpp / GGUF support typically arrives **2–4 weeks** later. If you need it today, use vLLM (Option B). Before pulling, check [huggingface.co/models?search=hy3-preview+gguf](https://huggingface.co/models?search=hy3-preview+gguf) for community quantizations.
{% endhint %}

```bash
# Once a Q4_K_M build has been published
docker exec ollama ollama pull hy3-preview:q4_K_M
docker exec ollama ollama run hy3-preview:q4_K_M

# Or run llama.cpp directly on a community GGUF
docker run --gpus all -it --rm -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/hy3-preview-q4_k_m.gguf \
  --n-gpu-layers 80 --ctx-size 32768 \
  --port 8080 --host 0.0.0.0
```

Until GGUF builds are published, AngelSlim (Tencent's own quantization toolkit) can produce W4A16 / W8A8 weights directly from the BF16 checkpoint.

***

## Option B: vLLM (production API, recommended)

vLLM is Tencent's preferred serving target for Hy3 Preview. The MTP speculative layer is enabled with `--speculative-config.method mtp`.

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model tencent/Hy3-preview
      --tensor-parallel-size 8
      --max-model-len 65536
      --gpu-memory-utilization 0.90
      --speculative-config.method mtp
      --speculative-config.num_speculative_tokens 1
      --tool-call-parser hy_v3
      --reasoning-parser hy_v3
      --enable-auto-tool-choice
      --served-model-name hy3-preview
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

```bash
# Test the API with high reasoning effort
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hy3-preview",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer."},
      {"role": "user", "content": "Refactor this Python function to use async/await and add proper error handling."}
    ],
    "max_tokens": 4096,
    "temperature": 0.9,
    "top_p": 1.0,
    "reasoning_effort": "high"
  }'
```

{% hint style="info" %}
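**Reasoning modes.** In vLLM, set `reasoning_effort: "high"` to enable chain-of-thought traces (slower, but much better on math, coding, and agent tasks), or `"no_think"` for fast direct answers. The vendor-recommended sampling parameters are `temperature=0.9, top_p=1.0`; sampling at temperature 0 can break the reasoning trace.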
{% endhint %}
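
For scripted use, the same request can be built in Python with only the standard library. This is a minimal sketch that mirrors the curl call in this section: the endpoint, model name, and sampling values come from this guide, while `build_chat_request` and its `think` flag are illustrative names rather than part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, think=True, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint.

    `think` switches between the two reasoning modes described in this guide.
    """
    payload = {
        "model": "hy3-preview",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4096,
        # Vendor-recommended sampling parameters from this guide.
        "temperature": 0.9,
        "top_p": 1.0,
        "reasoning_effort": "high" if think else "no_think",
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires a running server, so it is left as a comment:
#   req = build_chat_request("Explain GQA in two sentences.", think=False)
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Keeping the mode behind a single boolean makes it easy to route quick lookups to `"no_think"` and reserve `"high"` for multi-step coding or math prompts.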

{% hint style="info" %}
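**Short on GPUs?** Drop to `--tensor-parallel-size 4` on 4× A100 80GB. Keep `--max-model-len 32768` and add `--enable-chunked-prefill` to keep prefill latency reasonable. Because 320GB of VRAM cannot hold the ~590GB BF16 checkpoint, pair this with weight offloading (for example vLLM's `--cpu-offload-gb`) or an AngelSlim-quantized checkpoint.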
{% endhint %}

***

## Option C: SGLang

SGLang offers Day-0 support and combines the MTP layer with EAGLE speculative decoding for extra throughput on Hopper.

```bash
docker pull lmsysorg/sglang:latest

python3 -m sglang.launch_server \
  --model tencent/Hy3-preview \
  --tp 8 \
  --tool-call-parser hunyuan \
  --reasoning-parser hunyuan \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --mem-fraction-static 0.88 \
  --context-length 65536 \
  --served-model-name hy3-preview
```

Expect a 1.5–2× throughput gain over plain decoding in long agent loops.
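
To see where a 1.5–2× range can come from, a simplified model of speculative decoding helps: each verification step always yields one token, and each drafted token is kept only if all earlier drafts were accepted. The sketch below assumes an independent, constant acceptance probability per draft token and zero drafting overhead; `accept_prob` is a hypothetical input, not a measured figure for Hy3 Preview.

```python
def expected_tokens_per_step(accept_prob, draft_tokens=1):
    """Expected tokens emitted per verification step.

    Simplified model: draft token i is kept only if drafts 0..i-1 were all
    accepted, each with independent probability `accept_prob`.
    Equals 1 + p + p^2 + ... + p^k for k draft tokens.
    """
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


if __name__ == "__main__":
    for p in (0.5, 0.75, 1.0):
        # With one draft token the ideal speedup is simply 1 + p.
        print(p, expected_tokens_per_step(p))
```

With a single draft token, an acceptance rate between 0.5 and 1.0 gives 1.5 to 2.0 tokens per step under this model; real speedups are lower once drafting and verification overhead are counted.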

***

## Clore.ai GPU Recommendations

| Configuration | VRAM | Expected performance | Clore.ai cost | Rent |
| ------------- | ------- | -------------------------------------- | ------------ | ---------------------------------------------------- |
| 4× A100 80GB | 320GB | BF16 sharded with offload, 64K context, ~15–25 token/s | ~$10–16/day | [Rent A100 80GB](https://clore.ai/rent-a100-80gb.html) |
| 2× H100 80GB | 160GB | BF16 with offload, smaller context, ~12–20 token/s | ~$12–18/day | [Rent H100](https://clore.ai/rent-h100.html) |
| 8× H100 80GB | 640GB | Full BF16 (about 50GB left after the ~590GB of weights, so long contexts are tight), 60+ token/s with MTP | ~$48–64/day | [Rent H100](https://clore.ai/rent-h100.html) |
| 8× H200 141GB | 1,128GB | Full BF16, 256K context, highest concurrency | ~$64–96/day | [Rent H200](https://clore.ai/rent-h200.html) |
| 1× RTX 5090 | 32GB | Q4 GGUF, RAM offload, single user | ~$3.94/hour | [Marketplace](https://clore.ai/marketplace) |

{% hint style="success" %}
**Best value:** 4× A100 80GB with BF16 tensor parallelism and a 64K context window. At roughly $10–16/day you get an open-weight, 295B-class agentic coding model, and the weights never leave the machine you rent.
{% endhint %}
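
The daily prices and token rates in the table can be combined into a rough cost per million generated tokens. The sketch below assumes the listed token/s figure is sustained single-stream output and that the rental is busy around the clock; batched serving produces far more aggregate tokens, so treat the result as a pessimistic bound rather than a benchmark.

```python
SECONDS_PER_DAY = 86_400


def usd_per_million_tokens(usd_per_day, tokens_per_second):
    """Rental cost per one million generated tokens at a steady decode rate."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return usd_per_day / tokens_per_day * 1_000_000


if __name__ == "__main__":
    # 4x A100 80GB row: best case (cheap rental, fast decode) and worst case.
    print(round(usd_per_million_tokens(10, 25), 2))   # 4.63
    print(round(usd_per_million_tokens(16, 15), 2))   # 12.35
```

For the 4× A100 80GB row this gives roughly $4.63 to $12.35 per million tokens for a single stream, which is the number to compare against hosted API pricing for your own workload.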

***

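## Use Cases

* **Autonomous SWE agents:** SWE-bench Verified 74.4% (vendor-reported) and explicitly tuned for long tool-calling loops; pairs with OpenHands, SWE-agent, or Aider
* **Terminal-driven agents:** 54.4% on Terminal-Bench 2.0 puts it in the top tier for shell/CLI workflows
* **Long-horizon reasoning:** olympiad-level math (IMOAnswerBench, FrontierScience-Olympiad) and PhD-level STEM
* **Repository-scale RAG:** the 256K context holds a complete mid-sized repository plus its tests in a single prompt
* **Search and browsing agents:** BrowseComp / WideSearch tuning makes it a strong planner for multi-step web research
* **Planner for other agents:** use Hy3 Preview as the planner and lighter open models ([Qwen3.5](/guides/guides_v2-zh/yu-yan-mo-xing/qwen35.md), [GLM-4.7 Flash](/guides/guides_v2-zh/yu-yan-mo-xing/glm-47-flash.md)) as executors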

***

## Benchmarks

{% hint style="warning" %}
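**Vendor claims: verify independently.** All numbers below come from Tencent's April 13, 2026 model card. Independent reproductions (SWE-bench Verified in particular) are still coming in. Treat the figures as upper bounds until LMSYS / OpenCompass confirm them.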
{% endhint %}

| Benchmark | Hy3 Preview | GLM-5.1 | DeepSeek R1 | GPT-5.4 |
| ------------------ | ----------- | ------- | ----------- | ------- |
| SWE-bench Verified | **74.4%** | \~79% | \~71% | \~78% |
| Terminal-Bench 2.0 | **54.4%** | n/a | n/a | n/a |
| GPQA Diamond | **87.2%** | n/a | \~84% | \~88% |
| SuperGPQA | 51.6% | n/a | n/a | n/a |
| HLE | \~30 | n/a | n/a | n/a |

Tencent also reports strong results on its proprietary CL-bench / CL-bench-Life in-context learning benchmarks and on the Tsinghua "Qiuzhen" math PhD exam (Spring 2026).

***

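## Troubleshooting

| Problem | Solution |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `OutOfMemoryError` at load time | BF16 needs about 590GB of memory in total. On 4×A100, lower `--max-model-len` to 32768 and offload part of the weights, or use an AngelSlim W4A16 quantized checkpoint. |
| Slow HuggingFace download | Use `huggingface-cli download tencent/Hy3-preview --local-dir ./weights --resume-download`. Expect 590GB or more. |
| Tool calls silently dropped | Make sure `--tool-call-parser hy_v3` (vLLM) or `--tool-call-parser hunyuan` (SGLang) is set and `--enable-auto-tool-choice` is enabled. |
| Reasoning trace empty / malformed | Use `temperature=0.9, top_p=1.0`. Greedy decoding at temperature 0 can break the chain of thought. Confirm `reasoning_effort: "high"`. |
| MTP speculative decoding errors | Requires a recent vLLM (builds from April 2026 onward). Run `pip install -U vllm --pre`, or pin to a tagged release whose release notes list `mtp`. |
| OOM at 256K context | Start at `--max-model-len 32768`, enable `--enable-chunked-prefill`, and raise the limit gradually. Full 256K realistically needs 8× H200. |
| Custom architecture rejected | Always pass `--trust-remote-code`. Hy3 ships custom modeling code with the checkpoint. |
| Ollama / GGUF unavailable | Community quantizations typically arrive 2–4 weeks after release. Use vLLM or AngelSlim in the meantime. |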

***

## Next Steps

* **Closest open peer:** [GLM-5.1](/guides/guides_v2-zh/yu-yan-mo-xing/glm-5-1.md), a 744B-parameter / 40B-active MoE under the MIT license with top-tier SWE-bench Pro results
* **Multimodal alternative:** [Qwen3.5-Omni](/guides/guides_v2-zh/yu-yan-mo-xing/qwen35-omni.md), covering text + audio + image + video and able to run on a single RTX 4090
* **Reasoning-only alternative:** [DeepSeek R1](/guides/guides_v2-zh/yu-yan-mo-xing/deepseek-r1.md), a pure long-form reasoning specialist
* **Rent hardware:** [Rent A100 80GB on Clore.ai](https://clore.ai/rent-a100-80gb.html), with 4× A100 80GB instances from about $10/day
* **Full marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace), listing H100, H200, A100, and RTX 5090 from as low as $0.50/day

### Links

* [Hy3 Preview on HuggingFace](https://huggingface.co/tencent/Hy3-preview)
* [Hy3 Preview GitHub repository](https://github.com/Tencent-Hunyuan/Hy3-preview)
* [Tencent Hunyuan organization](https://huggingface.co/tencent)
* [vLLM documentation](https://docs.vllm.ai)
* [SGLang repository](https://github.com/sgl-project/sglang)
* [AngelSlim, Tencent's quantization toolkit](https://github.com/Tencent/AngelSlim)

