For the complete documentation index, see llms.txt. This page is also available as Markdown.

Ling-2.6-flash (Ant Group 104B MoE)

Deploy Ling-2.6-flash (104B MoE, 7.4B active) by Ant Group on Clore.ai — the agent-tuned flash sibling that fits on a single RTX 4090

Status (April 29, 2026): Ling-2.6-flash was released by Ant Group's inclusionAI team on April 28, 2026 (one day ago at the time of writing). It is the small, fast, agent-tuned sibling of Ling-2.5-1T — same lineage, same hybrid linear attention DNA, but with only 7.4B active parameters out of a 104B sparse MoE. Weights live at huggingface.co/inclusionAI/Ling-2.6-flash under the MIT license.

Where Ling-2.5-1T needed an 8-GPU rack to even boot, Ling-2.6-flash is the first inclusionAI release that fits on a single consumer GPU. The 7.4B active path means you pay the inference cost of an 8B dense model while drawing on a 104B parameter pool — and Ant Group has tuned that pool specifically for agentic workflows: tool calling, multi-step planning, and structured function dispatch.

Vendor-published numbers put Ling-2.6-flash at SOTA on BFCL-V4 and TAU2-bench for its size class, with throughput of roughly 340 tok/s on 4× H20 in the official benchmark configuration. For Clore.ai users the more interesting line is much smaller: INT4 fits comfortably on one RTX 4090 (24GB) with headroom for a 32K+ context, and FP8 fits on a single H100 80GB. That puts a fresh agent-tuned frontier-class small model at roughly $0.70–2.50/hr on the Clore.ai marketplace.

Key Specs

Property
Value

Total Parameters

104B (MoE)

Active Parameters

7.4B per forward pass

Architecture

1:7 MLA + Lightning Linear hybrid attention

Context Window

262,144 tokens

Quantizations

BF16, FP8, INT4

License

MIT

Release Date

April 28, 2026

Organization

Ant Group — inclusionAI

Primary Tooling

SGLang (recommended), vLLM, llama.cpp/Ollama (community GGUF)

Why Ling-2.6-flash?

  • Single-GPU deployable — INT4 on one RTX 4090 or RTX 3090, FP8 on one H100. No multi-GPU drama, no NVLink wrangling.

  • Agent-tuned — explicitly trained for BFCL-V4 / TAU2-bench style tool-calling loops, not just benchmarked on them post-hoc.

  • Sparse MoE quality at 7.4B active cost — you get a 104B parameter knowledge pool through a 7.4B inference path.

  • 256K context out of the box — 262K native tokens, no YaRN tricks needed for long agent traces.

  • MIT license — fully commercial, fine-tunable, redistributable.

  • Lineage — direct descendant of Ling-2.5-1T and Ring-2.5; the architecture is battle-tested.


Requirements

Component
INT4 (single 24GB)
FP8 (single 80GB)
BF16 (full quality)

GPU VRAM

1× RTX 4090 / 3090 (24GB)

1× H100 / A100 80GB

2× A100 80GB or 1× H200 141GB

RAM

32GB

64GB

128GB

Disk

60GB NVMe

120GB NVMe

220GB NVMe

CUDA

12.0+

12.4+

12.4+

Practical Context

32K–64K

128K

256K

Clore.ai pick: For most agent workloads, a single RTX 4090 (~$0.70–2.50/hr) running an INT4 GGUF is unbeatable on price. Step up to a single H100 if you need FP8 quality or 128K+ context.


Option A — Ollama / GGUF (Quantized, single GPU)

This is the path most Clore.ai users will want. Community GGUFs typically appear on HuggingFace within a few days of an inclusionAI release.

A single RTX 4090 should hit ~80–120 tok/s on Q4_K_M with a 32K context — plenty for interactive agent work.


Option B — vLLM (Production API)

vLLM is the go-to for serving Ling-2.6-flash to multiple concurrent agents. Use the FP8 checkpoint on a single H100 / A100 80GB:

For BF16 full quality on long contexts (200K+), bump --tensor-parallel-size 2 across 2× A100 80GB or pin to a single H200 141GB.


SGLang is what Ant Group uses for the official 340 tok/s benchmark — the hybrid linear attention path is fastest under SGLang's runtime.


Clore.ai GPU Recommendations

Setup
VRAM
Quant
Expected Throughput
Clore.ai Cost

24GB

INT4 GGUF

~60–90 tok/s

~$0.33–1.24/hr

24GB

INT4 GGUF

~80–120 tok/s

~$0.70–2.50/hr

80GB

FP8

~120–180 tok/s

~$2–4/hr

1× H100 80GB

80GB

FP8

~150–220 tok/s

~$6–8/hr

4× H100 80GB

320GB

BF16 + TP=4

~340 tok/s (vendor)

~$24–32/hr


Use Cases

  • Tool-calling agents — BFCL-V4 and TAU2-bench tuning means structured function dispatch is a strength, not an afterthought.

  • Multi-step planning loops — sustained chain-of-tool-call traces without the drift typical of small models.

  • Local Claude Code / OpenHands replacement — drop-in OpenAI-compatible API on your own RTX 4090.

  • High-volume agentic batch jobs — 340 tok/s on 4×H100 makes this viable for processing thousands of agent transcripts per hour.

  • Long-context RAG — 256K native ctx covers most enterprise document sets in a single prompt.

  • Cheap dev sandbox for Ling-2.5-1T workflows — prototype on flash, deploy on the 1T variant.


Benchmarks

Benchmark
Ling-2.6-flash (vendor)
Notes

BFCL-V4

SOTA for size class

Berkeley Function Calling Leaderboard v4

TAU2-bench

SOTA for size class

Tool agent benchmark v2

SWE-bench Verified / Resolved

~61.2%

Resolved rate on verified split

MathArena AIME 2026

73.85

MathArena HMMT Feb 2026

49.29

Throughput

~340 tok/s

4× H20-3e, TP=4, batch 32


Troubleshooting

Issue
Solution

OutOfMemoryError on RTX 4090

Drop to Q4_K_S or Q3_K_M; reduce --ctx-size to 16384; close other GPU processes

GGUF not yet on HuggingFace

Model is one day old. Check unsloth, bartowski, and TheBloke mirrors; or quantize from BF16 yourself with llama-quantize

vLLM rejects the architecture

Ensure vLLM ≥ 0.7.x with --trust-remote-code; the hybrid linear attention layers are custom

Tool calls returned as plain text

Set --enable-auto-tool-choice --tool-call-parser hermes in vLLM; SGLang handles this automatically

Slow prefill on long contexts

Linear attention has warmup overhead; first request is always slowest. Use --enable-chunked-prefill in vLLM

Throughput well below 340 tok/s

The vendor number is 4× H20 with TP=4 and batch 32. Single-GPU + batch 1 is naturally much slower — that's expected, not a bug

Garbled output at high temperature

Drop to temperature=0.7 for chat, 0.1 for tool calling


Next Steps

Last updated

Was this helpful?