Ling-2.6-flash (Ant Group 104B MoE)
Deploy Ling-2.6-flash (104B MoE, 7.4B active) by Ant Group on Clore.ai — the agent-tuned flash sibling that fits on a single RTX 4090
Status (April 29, 2026): Ling-2.6-flash was released by Ant Group's inclusionAI team on April 28, 2026 (one day ago at the time of writing). It is the small, fast, agent-tuned sibling of Ling-2.5-1T — same lineage, same hybrid linear attention DNA, but with only 7.4B active parameters out of a 104B sparse MoE. Weights live at huggingface.co/inclusionAI/Ling-2.6-flash under the MIT license.
Where Ling-2.5-1T needed an 8-GPU rack to even boot, Ling-2.6-flash is the first inclusionAI release that fits on a single consumer GPU. The 7.4B active path means you pay the inference cost of an 8B dense model while drawing on a 104B parameter pool — and Ant Group has tuned that pool specifically for agentic workflows: tool calling, multi-step planning, and structured function dispatch.
Vendor-published numbers put Ling-2.6-flash at SOTA on BFCL-V4 and TAU2-bench for its size class, with throughput of roughly 340 tok/s on 4× H20 in the official benchmark configuration. For Clore.ai users the more interesting line is much smaller: INT4 fits comfortably on one RTX 4090 (24GB) with headroom for a 32K+ context, and FP8 fits on a single H100 80GB. That puts a fresh agent-tuned frontier-class small model at roughly $0.70–2.50/hr on the Clore.ai marketplace.
Key Specs
Total Parameters
104B (MoE)
Active Parameters
7.4B per forward pass
Architecture
1:7 MLA + Lightning Linear hybrid attention
Context Window
262,144 tokens
Quantizations
BF16, FP8, INT4
License
MIT
Release Date
April 28, 2026
Organization
Ant Group — inclusionAI
Primary Tooling
SGLang (recommended), vLLM, llama.cpp/Ollama (community GGUF)
Why Ling-2.6-flash?
Agent-tuned — explicitly trained for BFCL-V4 / TAU2-bench style tool-calling loops, not just benchmarked on them post-hoc.
Sparse MoE quality at 7.4B active cost — you get a 104B parameter knowledge pool through a 7.4B inference path.
256K context out of the box — 262K native tokens, no YaRN tricks needed for long agent traces.
MIT license — fully commercial, fine-tunable, redistributable.
Lineage — direct descendant of Ling-2.5-1T and Ring-2.5; the architecture is battle-tested.
Requirements
Clore-friendly. This is the first model in the inclusionAI lineup that runs on a single consumer GPU. If you've been priced out of Ling-2.5-1T or GLM-5.1, this is the entry point.
GPU VRAM
1× RTX 4090 / 3090 (24GB)
1× H100 / A100 80GB
2× A100 80GB or 1× H200 141GB
RAM
32GB
64GB
128GB
Disk
60GB NVMe
120GB NVMe
220GB NVMe
CUDA
12.0+
12.4+
12.4+
Practical Context
32K–64K
128K
256K
Clore.ai pick: For most agent workloads, a single RTX 4090 (~$0.70–2.50/hr) running an INT4 GGUF is unbeatable on price. Step up to a single H100 if you need FP8 quality or 128K+ context.
Option A — Ollama / GGUF (Quantized, single GPU)
This is the path most Clore.ai users will want. Community GGUFs typically appear on HuggingFace within a few days of an inclusionAI release.
Day-one heads-up: Ling-2.6-flash dropped on April 28, 2026. As of this writing the GGUF community quants may still be landing. Watch huggingface.co/models?search=ling-2.6-flash+gguf and unsloth for first builds. If ollama pull 404s, point llama.cpp at the GGUF file directly.
A single RTX 4090 should hit ~80–120 tok/s on Q4_K_M with a 32K context — plenty for interactive agent work.
Option B — vLLM (Production API)
vLLM is the go-to for serving Ling-2.6-flash to multiple concurrent agents. Use the FP8 checkpoint on a single H100 / A100 80GB:
For BF16 full quality on long contexts (200K+), bump --tensor-parallel-size 2 across 2× A100 80GB or pin to a single H200 141GB.
Option C — SGLang (recommended for max throughput)
SGLang is what Ant Group uses for the official 340 tok/s benchmark — the hybrid linear attention path is fastest under SGLang's runtime.
Clore.ai GPU Recommendations
Best value: A single RTX 4090 from $0.70/hr running the Q4_K_M GGUF. You get an agent-tuned, MIT-licensed, 104B-MoE model with 32K context for less than the price of a coffee per hour. This is exactly the deployment shape Clore.ai's consumer-GPU marketplace was built for.
Use Cases
Tool-calling agents — BFCL-V4 and TAU2-bench tuning means structured function dispatch is a strength, not an afterthought.
Multi-step planning loops — sustained chain-of-tool-call traces without the drift typical of small models.
Local Claude Code / OpenHands replacement — drop-in OpenAI-compatible API on your own RTX 4090.
High-volume agentic batch jobs — 340 tok/s on 4×H100 makes this viable for processing thousands of agent transcripts per hour.
Long-context RAG — 256K native ctx covers most enterprise document sets in a single prompt.
Cheap dev sandbox for Ling-2.5-1T workflows — prototype on flash, deploy on the 1T variant.
Benchmarks
Vendor-claimed — verify independently. All numbers below come from inclusionAI's April 28, 2026 model card. The model is one day old; community reproductions on BFCL-V4 and TAU2-bench have not been published yet. Treat these as directional, not gospel.
BFCL-V4
SOTA for size class
Berkeley Function Calling Leaderboard v4
TAU2-bench
SOTA for size class
Tool agent benchmark v2
SWE-bench Verified / Resolved
~61.2%
Resolved rate on verified split
MathArena AIME 2026
73.85
MathArena HMMT Feb 2026
49.29
Throughput
~340 tok/s
4× H20-3e, TP=4, batch 32
Troubleshooting
OutOfMemoryError on RTX 4090
Drop to Q4_K_S or Q3_K_M; reduce --ctx-size to 16384; close other GPU processes
GGUF not yet on HuggingFace
vLLM rejects the architecture
Ensure vLLM ≥ 0.7.x with --trust-remote-code; the hybrid linear attention layers are custom
Tool calls returned as plain text
Set --enable-auto-tool-choice --tool-call-parser hermes in vLLM; SGLang handles this automatically
Slow prefill on long contexts
Linear attention has warmup overhead; first request is always slowest. Use --enable-chunked-prefill in vLLM
Throughput well below 340 tok/s
The vendor number is 4× H20 with TP=4 and batch 32. Single-GPU + batch 1 is naturally much slower — that's expected, not a bug
Garbled output at high temperature
Drop to temperature=0.7 for chat, 0.1 for tool calling
Next Steps
Bigger sibling: Ling-2.5-1T — same family, 1T total / 63B active, frontier reasoning at multi-GPU cost
Similar single-GPU agent: MiMo-V2-Flash — 309B/15B active with built-in speculative decoding
Open-weight coding alternative: GLM-5.1 — 744B/40B active, SWE-Bench Pro leader
Cheap GPU rentals: Rent RTX 4090 from $0.70/hr or RTX 3090 from $0.33/hr
Clore.ai Marketplace: clore.ai/marketplace — full GPU catalog with on-demand and spot pricing
Links
inclusionAI organization — Ant Group's open-source AI lab
SGLang repo — recommended serving framework
BFCL-V4 leaderboard — Berkeley Function Calling
Last updated
Was this helpful?