# Ling-2.6-flash (Ant Group 104B MoE)

{% hint style="info" %}
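**Status (April 29, 2026):** Ling-2.6-flash was released by Ant Group's **inclusionAI** team on **April 28, 2026** (one day before this was written). It is the lightweight, fast, agent-tuned sibling of [Ling-2.5-1T](/guides/guides_v2-zh/yu-yan-mo-xing/ling25.md): same lineage and the same hybrid linear-attention architecture, but with only **7.4B active parameters** drawn from a 104B-parameter sparse MoE. Weights are at [huggingface.co/inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) under the **MIT license**.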
{% endhint %}

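Unlike [Ling-2.5-1T](/guides/guides_v2-zh/yu-yan-mo-xing/ling25.md), which needs an 8-GPU rack just to boot, Ling-2.6-flash is **the first inclusionAI release that runs on a single consumer GPU**. The 7.4B active path means you tap a 104B-parameter pool at roughly the per-token compute cost of an 8B dense model, and Ant Group tuned that pool specifically for **agent workflows**: tool calling, multi-step planning, and structured function dispatch.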

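Vendor-published numbers put Ling-2.6-flash at SOTA for its size class on **BFCL-V4** and **TAU2-bench**, with throughput of about **340 tok/s on 4× H20** in the official benchmark configuration. For Clore.ai users, the smaller configurations are more interesting: **INT4 fits comfortably on a single RTX 4090 (24GB)** with headroom for 32K+ context, and **FP8 fits on a single H100 80GB**. That puts a fresh, agent-tuned, frontier-class small model within reach on the [Clore.ai marketplace](https://clore.ai/marketplace).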

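### Key specs

| Property          | Value                                                         |
| ----------------- | ------------------------------------------------------------- |
| Total parameters  | 104B (MoE)                                                    |
| Active parameters | 7.4B per forward pass                                         |
| Architecture      | 1:7 MLA + Lightning Linear hybrid attention                   |
| Context window    | 262,144 tokens                                                |
| Quantization      | BF16, FP8, INT4                                               |
| License           | MIT                                                           |
| Release date      | April 28, 2026                                                |
| Organization      | Ant Group (inclusionAI)                                       |
| Main toolchain    | SGLang (recommended), vLLM, llama.cpp/Ollama (community GGUF) |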

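### Why Ling-2.6-flash?

* **Deployable on a single GPU**: INT4 on a single [RTX 4090](https://clore.ai/rent-4090.html) or [RTX 3090](https://clore.ai/rent-3090.html), FP8 on a single H100. No multi-GPU setup and no wrestling with NVLink.
* **Agent-tuned**: trained explicitly for BFCL-V4 / TAU2-bench style tool-calling loops, rather than merely benchmarked on them after the fact.
* **Sparse MoE quality at 7.4B active cost**: a 7.4B inference path gives you access to a 104B-parameter knowledge pool.
* **256K context out of the box**: native 262,144 tokens, so long agent trajectories need no YaRN tricks.
* **MIT license**: full commercial use, fine-tuning, and redistribution.
* **Lineage**: descends directly from [Ling-2.5-1T](/guides/guides_v2-zh/yu-yan-mo-xing/ling25.md) and Ring-2.5; the architecture is battle-tested.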

***

## Requirements

{% hint style="success" %}
**Clore-friendly.** This is the first model in the inclusionAI family that runs on a single consumer GPU. If the cost of [Ling-2.5-1T](/guides/guides_v2-zh/yu-yan-mo-xing/ling25.md) or [GLM-5.1](/guides/guides_v2-zh/yu-yan-mo-xing/glm-5-1.md) has kept you out, this is the entry point.
{% endhint %}

| Component         | INT4 (single 24GB)        | FP8 (single 80GB)   | BF16 (full quality)           |
| ----------------- | ------------------------- | ------------------- | ----------------------------- |
| GPU VRAM          | 1× RTX 4090 / 3090 (24GB) | 1× H100 / A100 80GB | 2× A100 80GB or 1× H200 141GB |
| System RAM        | 32GB                      | 64GB                | 128GB                         |
| Disk              | 60GB NVMe                 | 120GB NVMe          | 220GB NVMe                    |
| CUDA              | 12.0+                     | 12.4+               | 12.4+                         |
| Practical context | 32K–64K                   | 128K                | 256K                          |

**Clore.ai pick:** For most agent workloads, an [RTX 4090 (about $0.70–2.50/hour)](https://clore.ai/rent-4090.html) running the INT4 GGUF is hard to beat on price/performance. If you need FP8 quality or 128K+ context, step up to a single H100.

***

## Option A: Ollama / GGUF (quantized, single GPU)

This is the path most Clore.ai users will take. Community GGUFs typically appear on HuggingFace within days of an inclusionAI release.

{% hint style="warning" %}
**Day-one note:** Ling-2.6-flash was released on April 28, 2026. At the time of writing, community GGUF quantizations may still be rolling out. Watch [huggingface.co/models?search=ling-2.6-flash+gguf](https://huggingface.co/models?search=ling-2.6-flash+gguf) and [unsloth](https://huggingface.co/unsloth) for the first builds. If `ollama pull` returns a 404, point llama.cpp directly at the GGUF file.
{% endhint %}

```bash
# Once a community Q4_K_M build is published (the tag name may differ)
docker exec ollama ollama pull ling-2.6-flash:q4_K_M
docker exec ollama ollama run ling-2.6-flash:q4_K_M

# Or run a downloaded GGUF directly with llama.cpp
docker run --gpus all -it --rm -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/ling-2.6-flash-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 32768 \
  --port 8080 --host 0.0.0.0
```

A single RTX 4090 should reach **about 80–120 tok/s** with Q4\_K\_M at 32K context, which is enough for interactive agent work.

***

## Option B: vLLM (production API)

vLLM is the first choice for serving Ling-2.6-flash to many concurrent agents. Use the FP8 checkpoint on a single H100 / A100 80GB:

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model inclusionAI/Ling-2.6-flash-FP8
      --tensor-parallel-size 1
      --max-model-len 65536
      --gpu-memory-utilization 0.90
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --served-model-name ling-2.6-flash
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

```bash
# Test the agent path
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ling-2.6-flash",
    "messages": [
      {"role": "system", "content": "You are an agent with access to tools. Plan first, call tools, then answer."},
      {"role": "user", "content": "Find the cheapest RTX 4090 currently available on Clore.ai."}
    ],
    "tools": [{"type": "function", "function": {"name": "search_marketplace", "parameters": {"type":"object","properties":{"gpu":{"type":"string"}}}}}],
    "tool_choice": "auto",
    "max_tokens": 2048
  }'
```
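
The same request can be issued from an agent loop in code. The sketch below is a minimal, standard-library-only client, assuming the vLLM server above is reachable at `http://localhost:8000` and returns OpenAI-style responses. The `search_marketplace` tool is the illustrative one from the curl example, and the helper names (`build_request`, `extract_tool_calls`, `chat`) are hypothetical.

```python
import json
import urllib.request

# Assumption: the vLLM server from the compose file above is listening here.
BASE_URL = "http://localhost:8000/v1"

# Same illustrative tool as in the curl example; "search_marketplace" is not a real API.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_marketplace",
        "parameters": {
            "type": "object",
            "properties": {"gpu": {"type": "string"}},
        },
    },
}]


def build_request(user_message: str) -> dict:
    """Build a chat-completions payload that lets the model decide whether to call a tool."""
    return {
        "model": "ling-2.6-flash",
        "messages": [
            {"role": "system", "content": "You are an agent with access to tools."},
            {"role": "user", "content": user_message},
        ],
        "tools": TOOLS,
        "tool_choice": "auto",
        "max_tokens": 2048,
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style chat completion response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI schema, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


def chat(payload: dict) -> dict:
    """POST the payload to the server and decode the JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `extract_tool_calls(chat(build_request("Find the cheapest RTX 4090")))` returns the parsed calls. An empty list means the model answered in plain text instead of calling a tool.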

{% hint style="info" %}
For full BF16 quality at long context (200K+), scale out to `--tensor-parallel-size 2` on 2× A100 80GB, or pin to a single H200 141GB.
{% endhint %}

***

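## Option C: SGLang (recommended for maximum throughput)

SGLang is what Ant Group used for the official 340 tok/s benchmark; the hybrid linear-attention path runs fastest under SGLang's runtime.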

```bash
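# Pull the image, then run the launch commands below inside that container
# (or in a host environment where SGLang is installed)
docker pull lmsysorg/sglang:latest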

python3 -m sglang.launch_server \
  --model-path inclusionAI/Ling-2.6-flash-FP8 \
  --tp-size 1 \
  --tool-call-parser hermes \
  --mem-fraction-static 0.90 \
  --context-length 65536 \
  --served-model-name ling-2.6-flash \
  --host 0.0.0.0 --port 30000

# Reproduce the vendor's 340 tok/s figure (requires 4× H20 / H100-class GPUs)
python3 -m sglang.launch_server \
  --model-path inclusionAI/Ling-2.6-flash \
  --tp-size 4 \
  --mem-fraction-static 0.92 \
  --context-length 32768 \
  --served-model-name ling-2.6-flash
```
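
To compare your own deployment against the quoted figures, it helps to measure tokens per second directly. The sketch below is one rough way to do that, assuming the single-GPU SGLang server above is reachable on port 30000 through its OpenAI-compatible `/v1/chat/completions` route. The function names are hypothetical, and the result includes prefill time, so it will read lower than pure decode speed.

```python
import json
import time
import urllib.request

# Assumption: the single-GPU SGLang server above is reachable on port 30000.
URL = "http://localhost:30000/v1/chat/completions"


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Throughput for one request: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure_once(prompt: str, max_tokens: int = 256) -> float:
    """Send one non-streaming request and return its end-to-end tok/s."""
    payload = {
        "model": "ling-2.6-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    elapsed = time.perf_counter() - start
    # "usage.completion_tokens" is part of the OpenAI response schema.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

A single call such as `measure_once("Summarize the MIT license.")` measures one request at batch 1, so it is not comparable to the vendor's batch-32 number; sum the throughput of concurrent requests to approximate an aggregate figure.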

***

## Clore.ai GPU recommendations

| Configuration                                        | VRAM  | Quantization | Expected throughput                          | Clore.ai cost           |
| ---------------------------------------------------- | ----- | ------------ | -------------------------------------------- | ----------------------- |
| 1× [RTX 3090](https://clore.ai/rent-3090.html)       | 24GB  | INT4 GGUF    | \~60–90 tok/s                                | **\~$0.33–1.24/hour**   |
| 1× [RTX 4090](https://clore.ai/rent-4090.html)       | 24GB  | INT4 GGUF    | \~80–120 tok/s                               | **\~$0.70–2.50/hour**   |
| 1× [A100 80GB](https://clore.ai/rent-a100-80gb.html) | 80GB  | FP8          | \~120–180 tok/s                              | \~$2–4/hour             |
| 1× H100 80GB                                         | 80GB  | FP8          | \~150–220 tok/s                              | \~$6–8/hour             |
| 4× H100 80GB                                         | 320GB | BF16 + TP=4  | \~340 tok/s (vendor figure, measured on 4× H20) | \~$24–32/hour           |

{% hint style="success" %}
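**Best value:** Rent an RTX 4090 from $0.70/hour and run the Q4\_K\_M GGUF. For less per hour than a cup of coffee, you get an agent-tuned, MIT-licensed, 104B MoE model with 32K context. This is exactly the deployment shape Clore.ai's consumer GPU marketplace was built for.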
{% endhint %}
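
The hourly prices and throughput ranges in the table can be turned into a cost per million generated tokens, which is often easier to compare across configurations. The sketch below does that arithmetic, assuming the table's figures hold as sustained single-stream rates; the function and dictionary names are illustrative.

```python
def cost_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    """USD needed to generate one million tokens at a sustained rate."""
    if usd_per_hour < 0 or tokens_per_second <= 0:
        raise ValueError("price must be non-negative and throughput positive")
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


# ((low price, high price) in USD/hour, (low, high) throughput in tok/s),
# copied from the single-GPU rows of the table above.
CONFIGS = {
    "1x RTX 3090 INT4": ((0.33, 1.24), (60, 90)),
    "1x RTX 4090 INT4": ((0.70, 2.50), (80, 120)),
    "1x A100 80GB FP8": ((2.00, 4.00), (120, 180)),
    "1x H100 80GB FP8": ((6.00, 8.00), (150, 220)),
}


def cost_range(name: str) -> tuple:
    """Best case pairs the lowest price with the highest throughput, worst case the reverse."""
    (low_price, high_price), (low_tps, high_tps) = CONFIGS[name]
    best = cost_per_million_tokens(low_price, high_tps)
    worst = cost_per_million_tokens(high_price, low_tps)
    return best, worst


for name in CONFIGS:
    best, worst = cost_range(name)
    print(f"{name}: ${best:.2f} to ${worst:.2f} per 1M tokens")
```

These are single-stream estimates. Serving several requests concurrently raises aggregate throughput and lowers the per-token cost.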

***

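## Use cases

* **Tool-calling agents**: BFCL-V4 and TAU2-bench tuning makes structured function dispatch a strength rather than an afterthought.
* **Multi-step planning loops**: sustains tool-call chains without the drift common in small models.
* **Local Claude Code / OpenHands alternative**: a drop-in OpenAI-compatible API on your own RTX 4090.
* **High-throughput agent batch jobs**: the vendor's 340 tok/s figure (4× H20, TP=4, batch 32) works out to about 1.2M generated tokens per hour on a 4-GPU node, enough for thousands of short agent transcripts.
* **Long-context RAG**: the native 256K context covers most enterprise document sets in a single prompt.
* **Cheap dev sandbox for** [**Ling-2.5-1T**](/guides/guides_v2-zh/yu-yan-mo-xing/ling25.md) **workflows**: prototype on flash, then deploy to the 1T model.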

***

## Benchmarks

{% hint style="warning" %}
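**Vendor claims: verify independently.** All numbers below come from the model card inclusionAI published on April 28, 2026. The model is one day old, and no community reproductions of BFCL-V4 or TAU2-bench have been published yet. Treat these figures as directional, not as ground truth.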
{% endhint %}

| Benchmark                     | Ling-2.6-flash (vendor) | Notes                                    |
| ----------------------------- | ----------------------- | ---------------------------------------- |
| BFCL-V4                       | SOTA in its size class  | Berkeley Function Calling Leaderboard v4 |
| TAU2-bench                    | SOTA in its size class  | Tool-agent benchmark v2                  |
| SWE-bench Verified / Resolved | \~61.2%                 | Resolve rate on the Verified split       |
| MathArena AIME 2026           | 73.85                   |                                          |
| MathArena HMMT Feb 2026       | 49.29                   |                                          |
| Throughput                    | \~340 tok/s             | 4× H20-3e, TP=4, batch 32                |

***

## Troubleshooting

| Problem                            | Solution                                                                                                                                                                                                      |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on RTX 4090     | Drop to Q4\_K\_S or Q3\_K\_M; reduce `--ctx-size` to 16384; close other GPU processes                                                                                                                          |
| GGUF not on HuggingFace yet        | The model is one day old. Check the [unsloth](https://huggingface.co/unsloth), [bartowski](https://huggingface.co/bartowski), and [TheBloke](https://huggingface.co/TheBloke) mirrors, or quantize it yourself with `llama-quantize` |
| vLLM rejects the architecture      | Make sure vLLM is ≥ 0.7.x and pass `--trust-remote-code`; the hybrid linear-attention layers are custom                                                                                                        |
| Tool calls come back as plain text | In vLLM, set `--enable-auto-tool-choice --tool-call-parser hermes`; in SGLang, keep the `--tool-call-parser hermes` flag from Option C                                                                         |
| Slow prefill at long context       | Linear attention has a warm-up cost, so the first request is always the slowest. In vLLM, add `--enable-chunked-prefill`                                                                                       |
| Throughput far below 340 tok/s     | The vendor figure was measured on 4× H20 with TP=4 and batch 32. A single GPU at batch 1 will naturally be much slower; this is expected, not a bug                                                            |
| Garbled output at high temperature | Lower to `temperature=0.7` for chat and `0.1` for tool calls                                                                                                                                                   |
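
The temperature advice in the last row can be applied automatically on the client side. The helper below is a small sketch, not part of any library: it assumes OpenAI-style request payloads and picks `0.1` when the request carries tools and `0.7` otherwise, leaving any temperature the caller already set untouched. The name `with_default_temperature` is hypothetical.

```python
def with_default_temperature(payload: dict) -> dict:
    """Return a copy of the payload with a temperature chosen by request type."""
    has_tools = bool(payload.get("tools"))
    tuned = dict(payload)  # shallow copy so the caller's dict is not mutated
    # Respect an explicit temperature; otherwise use 0.1 for tool calls, 0.7 for chat.
    tuned.setdefault("temperature", 0.1 if has_tools else 0.7)
    return tuned
```

Calling it on every outgoing payload keeps tool-calling requests near-deterministic while leaving ordinary chat some variety.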

***

## Next steps

* **Bigger sibling:** [Ling-2.5-1T](/guides/guides_v2-zh/yu-yan-mo-xing/ling25.md) (same family, 1T total / 63B active parameters, frontier reasoning at multi-GPU cost)
* **Similar single-GPU agent:** [MiMo-V2-Flash](/guides/guides_v2-zh/yu-yan-mo-xing/mimo-v2-flash.md) (309B total / 15B active parameters, built-in speculative decoding)
* **Open-weight coding alternative:** [GLM-5.1](/guides/guides_v2-zh/yu-yan-mo-xing/glm-5-1.md) (744B total / 40B active parameters, SWE-Bench Pro leader)
* **Cheap GPU rental:** [RTX 4090 from $0.70/hour](https://clore.ai/rent-4090.html) or [RTX 3090 from $0.33/hour](https://clore.ai/rent-3090.html)
* **Clore.ai Marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace) (full GPU catalog with on-demand and spot pricing)

### Links

* [Ling-2.6-flash on HuggingFace](https://huggingface.co/inclusionAI/Ling-2.6-flash)
* [inclusionAI organization](https://huggingface.co/inclusionAI): Ant Group's open-source AI lab
* [SGLang repository](https://github.com/sgl-project/sglang): the recommended serving framework
* [vLLM documentation](https://docs.vllm.ai)
* [BFCL-V4 leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html): Berkeley Function Calling

