# Ling-2.6-flash (Ant Group 104B MoE)

{% hint style="info" %}
**Status (April 29, 2026):** Ling-2.6-flash was released by Ant Group's **inclusionAI** team on **April 28, 2026** (one day ago at the time of writing). It is the small, fast, agent-tuned sibling of [Ling-2.5-1T](/guides/language-models/ling25.md) — same lineage, same hybrid linear attention DNA, but with only **7.4B active parameters** out of a 104B sparse MoE. Weights live at [huggingface.co/inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) under the **MIT license**.
{% endhint %}

Where [Ling-2.5-1T](/guides/language-models/ling25.md) needed an 8-GPU rack to even boot, Ling-2.6-flash is the **first inclusionAI release that fits on a single consumer GPU**. With 7.4B active parameters, each token costs roughly the compute of a 7–8B dense model while drawing on a 104B parameter pool; the full 104B still has to be stored, so the saving is in speed rather than footprint. Ant Group has tuned that pool specifically for **agentic workflows**: tool calling, multi-step planning, and structured function dispatch.

Vendor-published numbers put Ling-2.6-flash at SOTA on **BFCL-V4** and **TAU2-bench** for its size class, with throughput of roughly **340 tok/s on 4× H20** in the official benchmark configuration. For Clore.ai users the more interesting figures are the small-deployment ones: **INT4 fits comfortably on one RTX 4090 (24GB)** with headroom for a 32K+ context, and **FP8 fits on a single H100 80GB**. On the RTX 4090 tier, that puts a fresh agent-tuned frontier-class small model at roughly $0.70–2.50/hr on the [Clore.ai marketplace](https://clore.ai/marketplace).

### Key Specs

| Property          | Value                                                         |
| ----------------- | ------------------------------------------------------------- |
| Total Parameters  | 104B (MoE)                                                    |
| Active Parameters | 7.4B per forward pass                                         |
| Architecture      | 1:7 MLA + Lightning Linear hybrid attention                   |
| Context Window    | 262,144 tokens                                                |
| Quantizations     | BF16, FP8, INT4                                               |
| License           | MIT                                                           |
| Release Date      | April 28, 2026                                                |
| Organization      | Ant Group — inclusionAI                                       |
| Primary Tooling   | SGLang (recommended), vLLM, llama.cpp/Ollama (community GGUF) |

### Why Ling-2.6-flash?

* **Single-GPU deployable** — INT4 on one [RTX 4090](https://clore.ai/rent-4090.html) or [RTX 3090](https://clore.ai/rent-3090.html), FP8 on one H100. No multi-GPU drama, no NVLink wrangling.
* **Agent-tuned** — explicitly trained for BFCL-V4 / TAU2-bench style tool-calling loops, not just benchmarked on them post-hoc.
* **Sparse MoE quality at 7.4B active cost** — you get a 104B parameter knowledge pool through a 7.4B inference path.
* **256K context out of the box** — 262,144 tokens natively, no YaRN tricks needed for long agent traces.
* **MIT license** — fully commercial, fine-tunable, redistributable.
* **Lineage** — direct descendant of [Ling-2.5-1T](/guides/language-models/ling25.md) and Ring-2.5; the architecture is battle-tested.

***

## Requirements

{% hint style="success" %}
**Clore-friendly.** This is the first model in the inclusionAI lineup that runs on a single consumer GPU. If you've been priced out of [Ling-2.5-1T](/guides/language-models/ling25.md) or [GLM-5.1](/guides/language-models/glm-5-1.md), this is the entry point.
{% endhint %}

| Component         | INT4 (single 24GB)        | FP8 (single 80GB)   | BF16 (full quality)           |
| ----------------- | ------------------------- | ------------------- | ----------------------------- |
| GPU VRAM          | 1× RTX 4090 / 3090 (24GB) | 1× H100 / A100 80GB | 2× A100 80GB or 1× H200 141GB |
| RAM               | 32GB                      | 64GB                | 128GB                         |
| Disk              | 60GB NVMe                 | 120GB NVMe          | 220GB NVMe                    |
| CUDA              | 12.0+                     | 12.4+               | 12.4+                         |
| Practical Context | 32K–64K                   | 128K                | 256K                          |

**Clore.ai pick:** For most agent workloads, a single [RTX 4090 (\~$0.70–2.50/hr)](https://clore.ai/rent-4090.html) running an INT4 GGUF is unbeatable on price. Step up to a single H100 if you need FP8 quality or 128K+ context.
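
As a rough sanity check on the table above, weight memory can be estimated as parameters × bits per weight ÷ 8. The sketch below is a back-of-envelope helper (the name `weights_gb` is ours, not part of any tool) and counts weights only; KV cache, activations, and runtime overhead come on top.

```bash
# weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Prints the approximate weight footprint in GB (decimal), weights only.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

weights_gb 104 4    # INT4
weights_gb 104 8    # FP8
weights_gb 104 16   # BF16
```

For the 104B total parameter count this gives about 52 GB, 104 GB, and 208 GB, which lines up with the Disk row. Whatever part of that does not fit in VRAM has to stay in system RAM or on disk (for example by lowering llama.cpp's `--n-gpu-layers`), at a cost in throughput.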

***

## Option A — Ollama / GGUF (Quantized, single GPU)

This is the path most Clore.ai users will want. Community GGUFs typically appear on HuggingFace within a few days of an inclusionAI release.

{% hint style="warning" %}
**Day-one heads-up:** Ling-2.6-flash dropped on April 28, 2026. As of this writing the GGUF community quants may still be landing. Watch [huggingface.co/models?search=ling-2.6-flash+gguf](https://huggingface.co/models?search=ling-2.6-flash+gguf) and [unsloth](https://huggingface.co/unsloth) for first builds. If `ollama pull` 404s, point llama.cpp at the GGUF file directly.
{% endhint %}

```bash
# Once a community Q4_K_M build is published
# (assumes an Ollama container named "ollama" is already running)
docker exec ollama ollama pull ling-2.6-flash:q4_K_M
# -it is needed for the interactive chat prompt
docker exec -it ollama ollama run ling-2.6-flash:q4_K_M

# Or with llama.cpp directly on a downloaded GGUF
# (the image now lives under the ggml-org namespace)
docker run --gpus all -it --rm -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/ling-2.6-flash-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 32768 \
  --port 8080 --host 0.0.0.0
```

A single RTX 4090 should hit **\~80–120 tok/s** on Q4\_K\_M with a 32K context — plenty for interactive agent work.
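
Once the llama.cpp server is up, it can be smoke-tested through its OpenAI-compatible `/v1/chat/completions` route on the port mapped above. A minimal sketch, assuming `jq` is installed so the prompt is escaped safely; `chat_payload` is an illustrative helper name, and the `max_tokens` value is arbitrary:

```bash
# chat_payload MODEL PROMPT
# Builds a compact chat-completions request body; jq handles JSON escaping.
chat_payload() {
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}], max_tokens: 256}'
}

# Usage:
#   chat_payload ling-2.6-flash 'Name three uses for a rented GPU.' |
#     curl -s http://localhost:8080/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

Building the body with `jq` rather than string concatenation keeps quotes and newlines in the prompt from breaking the JSON.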

***

## Option B — vLLM (Production API)

vLLM is the go-to for serving Ling-2.6-flash to multiple concurrent agents. Use the FP8 checkpoint on a single H100 / A100 80GB:

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model inclusionAI/Ling-2.6-flash-FP8
      --tensor-parallel-size 1
      --max-model-len 65536
      --gpu-memory-utilization 0.90
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --served-model-name ling-2.6-flash
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```
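
The first start downloads the checkpoint and loads it onto the GPU, which can take several minutes, so it helps to wait for the server before sending requests. A minimal sketch, assuming the compose file above is saved as `docker-compose.yml` and relying on vLLM's `/health` endpoint; `wait_for_http` is a hypothetical helper name:

```bash
# wait_for_http URL [MAX_TRIES]
# Polls URL once per second until it answers with a 2xx status.
# Returns 0 when the server responds, 1 if MAX_TRIES is exhausted.
wait_for_http() {
  local url="$1" tries="${2:-600}" i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fs -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   docker compose up -d
#   wait_for_http http://localhost:8000/health && echo "vLLM is ready"
```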

```bash
# Test the agent path
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ling-2.6-flash",
    "messages": [
      {"role": "system", "content": "You are an agent with access to tools. Plan, call tools, then answer."},
      {"role": "user", "content": "Find me the cheapest RTX 4090 on Clore.ai right now."}
    ],
    "tools": [{"type": "function", "function": {"name": "search_marketplace", "parameters": {"type":"object","properties":{"gpu":{"type":"string"}}}}}],
    "tool_choice": "auto",
    "max_tokens": 2048
  }'
```
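
If the model decides to call the tool, the reply carries it in `choices[0].message.tool_calls` (the OpenAI chat-completions format that vLLM mirrors), with `arguments` encoded as a JSON string. A small sketch for pulling the calls out on the command line, assuming `jq` is installed; `extract_tool_calls` is an illustrative helper name:

```bash
# Reads a chat-completions response on stdin and prints one compact JSON
# object per tool call: {"name": ..., "args": {...}}.
# Prints nothing when the model answered in plain text.
extract_tool_calls() {
  jq -c '(.choices[0].message.tool_calls // [])[]
         | {name: .function.name, args: (.function.arguments | fromjson)}'
}

# Usage: pipe the curl command from above into the helper
#   curl -s http://localhost:8000/v1/chat/completions ... | extract_tool_calls
```

An agent loop would then run the named function, append its result as a `tool` role message, and send the conversation back for the final answer.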

{% hint style="info" %}
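For BF16 full quality on long contexts (200K+), switch `--model` to the BF16 checkpoint (`inclusionAI/Ling-2.6-flash`) and raise `--max-model-len` accordingly. Then either set `--tensor-parallel-size 2` (with `count: 2` in the device reservation) across 2× A100 80GB, or keep `--tensor-parallel-size 1` on a single H200 141GB.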
{% endhint %}

***

## Option C — SGLang (recommended for max throughput)

SGLang is what Ant Group uses for the official 340 tok/s benchmark — the hybrid linear attention path is fastest under SGLang's runtime.

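```bash
docker pull lmsysorg/sglang:latest

# Single-GPU FP8 server, launched inside the SGLang image
docker run --gpus all --rm --shm-size 32g --ipc=host -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path inclusionAI/Ling-2.6-flash-FP8 \
    --tp-size 1 \
    --tool-call-parser hermes \
    --mem-fraction-static 0.90 \
    --context-length 65536 \
    --served-model-name ling-2.6-flash \
    --host 0.0.0.0 --port 30000

# To reproduce the vendor 340 tok/s number (requires 4× H20 or comparable);
# run this inside the same image, as above
python3 -m sglang.launch_server \
  --model-path inclusionAI/Ling-2.6-flash \
  --tp-size 4 \
  --mem-fraction-static 0.92 \
  --context-length 32768 \
  --served-model-name ling-2.6-flash \
  --host 0.0.0.0 --port 30000
```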

***

## Clore.ai GPU Recommendations

| Setup                                                | VRAM  | Quant       | Expected Throughput          | Clore.ai Cost       |
| ---------------------------------------------------- | ----- | ----------- | ---------------------------- | ------------------- |
| 1× [RTX 3090](https://clore.ai/rent-3090.html)       | 24GB  | INT4 GGUF   | \~60–90 tok/s                | **\~$0.33–1.24/hr** |
| 1× [RTX 4090](https://clore.ai/rent-4090.html)       | 24GB  | INT4 GGUF   | \~80–120 tok/s               | **\~$0.70–2.50/hr** |
| 1× [A100 80GB](https://clore.ai/rent-a100-80gb.html) | 80GB  | FP8         | \~120–180 tok/s              | \~$2–4/hr           |
| 1× H100 80GB                                         | 80GB  | FP8         | \~150–220 tok/s              | \~$6–8/hr           |
| 4× H100 80GB                                         | 320GB | BF16 + TP=4 | \~340 tok/s (vendor, 4× H20) | \~$24–32/hr         |

{% hint style="success" %}
**Best value:** A single RTX 4090 from $0.70/hr running the Q4\_K\_M GGUF. You get an agent-tuned, MIT-licensed, 104B-MoE model with 32K context for less than the price of a coffee per hour. This is exactly the deployment shape Clore.ai's consumer-GPU marketplace was built for.
{% endhint %}
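
To compare rows on a per-token basis, the hourly price can be converted to dollars per million generated tokens: price ÷ (tok/s × 3600) × 1,000,000. A minimal sketch using the table's single-stream estimates (the helper name `usd_per_mtok` is ours; real cost depends on batching and how busy the GPU is kept):

```bash
# usd_per_mtok PRICE_PER_HOUR TOKENS_PER_SECOND
# Prints the cost in USD per one million generated tokens.
usd_per_mtok() {
  awk -v price="$1" -v tps="$2" \
    'BEGIN { if (tps <= 0) exit 1; printf "%.2f\n", price / (tps * 3600) * 1000000 }'
}

usd_per_mtok 0.70 120   # RTX 4090, cheapest price at the high throughput estimate
usd_per_mtok 2.50 80    # RTX 4090, highest price at the low throughput estimate
```

With the table's estimates, a single RTX 4090 lands between roughly $1.62 and $8.68 per million output tokens.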

***

## Use Cases

* **Tool-calling agents** — BFCL-V4 and TAU2-bench tuning means structured function dispatch is a strength, not an afterthought.
* **Multi-step planning loops** — sustained chain-of-tool-call traces without the drift typical of small models.
* **Local Claude Code / OpenHands replacement** — drop-in OpenAI-compatible API on your own RTX 4090.
* **High-volume agentic batch jobs** — the vendor-reported \~340 tok/s (4× H20, TP=4, batch 32) makes this viable for processing thousands of agent transcripts per hour.
* **Long-context RAG** — the 256K native context covers most enterprise document sets in a single prompt.
* **Cheap dev sandbox for** [**Ling-2.5-1T**](/guides/language-models/ling25.md) **workflows** — prototype on flash, deploy on the 1T variant.

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed — verify independently.** All numbers below come from inclusionAI's April 28, 2026 model card. The model is one day old; community reproductions on BFCL-V4 and TAU2-bench have not been published yet. Treat these as directional, not gospel.
{% endhint %}

| Benchmark                     | Ling-2.6-flash (vendor) | Notes                                    |
| ----------------------------- | ----------------------- | ---------------------------------------- |
| BFCL-V4                       | SOTA for size class     | Berkeley Function Calling Leaderboard v4 |
| TAU2-bench                    | SOTA for size class     | Tool agent benchmark v2                  |
| SWE-bench Verified / Resolved | \~61.2%                 | Resolved rate on verified split          |
| MathArena AIME 2026           | 73.85                   |                                          |
| MathArena HMMT Feb 2026       | 49.29                   |                                          |
| Throughput                    | \~340 tok/s             | 4× H20-3e, TP=4, batch 32                |

***

## Troubleshooting

| Issue                              | Solution                                                                                                                                                                                                                            |
| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on RTX 4090     | Drop to Q4\_K\_S or Q3\_K\_M; reduce `--ctx-size` to 16384; lower `--n-gpu-layers` so some layers stay in system RAM; close other GPU processes                                                                                     |
| GGUF not yet on HuggingFace        | Model is one day old. Check [unsloth](https://huggingface.co/unsloth) and [bartowski](https://huggingface.co/bartowski); or convert the BF16 weights with `convert_hf_to_gguf.py` and quantize them yourself with `llama-quantize` (both need llama.cpp support for the architecture) |
| vLLM rejects the architecture      | Ensure vLLM ≥ 0.7.x with `--trust-remote-code`; the hybrid linear attention layers are custom                                                                                                                                       |
| Tool calls returned as plain text  | Set `--enable-auto-tool-choice --tool-call-parser hermes` in vLLM; in SGLang, pass `--tool-call-parser hermes` as in Option C                                                                                                       |
| Slow prefill on long contexts      | Linear attention has warmup overhead; first request is always slowest. Use `--enable-chunked-prefill` in vLLM                                                                                                                       |
| Throughput well below 340 tok/s    | The vendor number is 4× H20 with TP=4 and batch 32. Single-GPU + batch 1 is naturally much slower — that's expected, not a bug                                                                                                      |
| Garbled output at high temperature | Drop to `temperature=0.7` for chat, `0.1` for tool calling                                                                                                                                                                          |
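
To see where your own deployment sits relative to the throughput figures, time one request and divide the reported completion tokens by the elapsed seconds. A rough sketch, assuming GNU `date` (for `%N`), `jq`, and an OpenAI-compatible server such as the vLLM setup above; `tok_per_s` and `bench_once` are hypothetical helper names. The timing includes prefill and HTTP overhead, so it slightly understates pure decode speed.

```bash
# tok_per_s COMPLETION_TOKENS ELAPSED_SECONDS
tok_per_s() {
  awk -v n="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", n / s }'
}

# bench_once BASE_URL MODEL
# Sends one chat request and prints end-to-end tokens per second.
bench_once() {
  local base_url="$1" model="$2" start end resp tokens
  start=$(date +%s.%N)
  resp=$(curl -fsS "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a 300-word overview of GPU rental.\"}], \"max_tokens\": 512}") || return 1
  end=$(date +%s.%N)
  tokens=$(printf '%s' "$resp" | jq '.usage.completion_tokens')
  tok_per_s "$tokens" "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}

# Usage:
#   bench_once http://localhost:8000 ling-2.6-flash
```

A single request like this measures batch-1 latency; the vendor figure was taken at batch 32, so the two are not directly comparable.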

***

## Next Steps

* **Bigger sibling:** [Ling-2.5-1T](/guides/language-models/ling25.md) — same family, 1T total / 63B active, frontier reasoning at multi-GPU cost
* **Similar single-GPU agent:** [MiMo-V2-Flash](/guides/language-models/mimo-v2-flash.md) — 309B/15B active with built-in speculative decoding
* **Open-weight coding alternative:** [GLM-5.1](/guides/language-models/glm-5-1.md) — 744B/40B active, SWE-Bench Pro leader
* **Cheap GPU rentals:** [Rent RTX 4090 from $0.70/hr](https://clore.ai/rent-4090.html) or [RTX 3090 from $0.33/hr](https://clore.ai/rent-3090.html)
* **Clore.ai Marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace) — full GPU catalog with on-demand and spot pricing

### Links

* [Ling-2.6-flash on HuggingFace](https://huggingface.co/inclusionAI/Ling-2.6-flash)
* [inclusionAI organization](https://huggingface.co/inclusionAI) — Ant Group's open-source AI lab
* [SGLang repo](https://github.com/sgl-project/sglang) — recommended serving framework
* [vLLM docs](https://docs.vllm.ai)
* [BFCL-V4 leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) — Berkeley Function Calling


