Ling-2.5-1T (1 Trillion Parameters)

Run Ling-2.5-1T — Ant Group's 1 trillion parameter open-source LLM with hybrid linear attention on Clore.ai GPUs

Ling-2.5-1T by Ant Group (released February 16, 2026) is one of the largest open-source language models ever released — 1 trillion total parameters with 63B active. It introduces a hybrid linear attention architecture that enables efficient inference on context lengths up to 1 million tokens. Alongside it, Ant Group released Ring-2.5-1T, the world's first hybrid linear-architecture thinking model. Together, they represent a new frontier in open-source AI — competitive with GPT-5.2, DeepSeek V3.2, and Kimi K2.5 on reasoning and agentic benchmarks.

HuggingFace: inclusionAI/Ling-2.5-1T
Companion model: inclusionAI/Ring-2.5-1T (thinking/reasoning variant)
License: Open source (Ant Group InclusionAI License)

Key Features

  • 1 trillion total parameters, 63B active — massive scale with efficient MoE-style activation

  • Hybrid linear attention — combines MLA (Multi-head Latent Attention) with Lightning Linear Attention for exceptional throughput on long sequences

  • 1M token context window — via YaRN extension from native 256K, handles entire codebases and book-length documents

  • Frontier reasoning — approaches thinking-model performance while using ~4× fewer output tokens

  • Agentic capabilities — trained with Agentic RL, compatible with Claude Code, OpenCode, and OpenClaw

  • Ring-2.5-1T companion — dedicated reasoning variant achieves IMO 2025 and CMO 2025 gold medal level

Architecture Details

| Component | Details |
| --- | --- |
| Total Parameters | 1T (1,000B) |
| Active Parameters | 63B |
| Architecture | Hybrid Linear Attention (MLA + Lightning Linear) |
| Pre-training Data | 29T tokens |
| Native Context | 256K tokens |
| Extended Context | 1M tokens (YaRN) |
| Release Date | February 16, 2026 |

Requirements

Running Ling-2.5-1T at full precision requires substantial resources. Quantized versions make it more accessible.

| Configuration | Quantized (Q4 GGUF) | FP8 | BF16 (Full) |
| --- | --- | --- | --- |
| GPU | 8× RTX 4090 | 8× H100 80GB | 16× H100 80GB |
| VRAM | 8×24GB (192GB) | 8×80GB (640GB) | 16×80GB (1.28TB) |
| RAM | 256GB | 512GB | 1TB |
| Disk | 600GB | 1.2TB | 2TB+ |
| CUDA | 12.0+ | 12.0+ | 12.0+ |
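The VRAM and disk figures above follow directly from bytes-per-parameter. A quick sanity check (weights only, ignoring KV cache, activations, and framework overhead, which the table's disk and RAM figures also absorb):

```python
# Rough weight-memory footprint per precision for a 1T-parameter model.
TOTAL_PARAMS = 1.0e12  # 1 trillion

# Approximate bytes stored per parameter at each precision
bytes_per_param = {"Q4 (GGUF)": 0.5, "FP8": 1.0, "BF16": 2.0}

footprints = {
    name: TOTAL_PARAMS * bpp / 1e9  # gigabytes of raw weights
    for name, bpp in bytes_per_param.items()
}
for name, gb in footprints.items():
    print(f"{name}: ~{gb:,.0f} GB of weights")
# Q4 ≈ 500 GB, FP8 ≈ 1,000 GB, BF16 ≈ 2,000 GB — consistent with the
# 600GB / 1.2TB / 2TB+ disk rows once format overhead is added.
```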

Recommended Clore.ai setup:

  • Quantized (Q4): 8× RTX 4090 (~$4–16/day) — usable for experimentation and moderate workloads

  • Production (FP8): 8× H100 (~$24–48/day) — full quality with good throughput

  • Note: This is an extremely large model. For budget-conscious users, consider the smaller models in the Ling family on HuggingFace.

Quick Start with vLLM

vLLM is the recommended serving framework for Ling-2.5-1T:
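A plausible launch sequence for an 8-GPU node is sketched below; the model path, port, and flag values are assumptions to adapt to your instance, and the exact flags depend on your vLLM version:

```shell
# Download the weights once to persistent storage (~1.2 TB for FP8)
huggingface-cli download inclusionAI/Ling-2.5-1T --local-dir /data/ling-2.5-1t

# Serve across all 8 GPUs. Start with a short context to verify the model
# loads, then raise --max-model-len once generation works.
vllm serve /data/ling-2.5-1t \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

`--trust-remote-code` is required because the hybrid attention layers ship as custom model code (see Troubleshooting below).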

Quick Start with llama.cpp (Quantized)

For consumer GPU setups, GGUF quantizations are available:
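A sketch assuming a community GGUF quantization has been published; the repository name and shard filename below are placeholders, so check HuggingFace for the actual upload:

```shell
# Placeholder repo -- substitute the actual community GGUF repository
huggingface-cli download SOME_ORG/Ling-2.5-1T-GGUF \
  --include "*Q4_K_M*" --local-dir /data/ling-gguf

# llama-server exposes an OpenAI-compatible API. Point -m at the first
# shard of the split GGUF; offload as many layers as fit on the GPUs and
# keep the initial context small.
llama-server \
  -m /data/ling-gguf/<first-shard>.gguf \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --port 8080
```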

Usage Examples

1. Chat Completion via OpenAI API

Once vLLM or llama-server is running:
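A minimal client sketch against the OpenAI-compatible endpoint, using only the standard library. It assumes vLLM on its default port 8000; adjust the URL for llama-server (default 8080). The snippet prints an error instead of failing if no server is reachable:

```python
import json
import urllib.request

payload = {
    "model": "inclusionAI/Ling-2.5-1T",
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain hybrid linear attention in two sentences."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:  # server not running / unreachable
    print(f"request failed: {exc}")
```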

2. Long-Context Document Analysis

Ling-2.5-1T's hybrid linear attention makes it exceptionally efficient for long documents:
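A sketch of stuffing a long document into a single prompt. The synthetic document stands in for a real file, and the ~4 characters-per-token figure is a rough English-text estimate for checking the prompt fits under `--max-model-len`:

```python
import json
import urllib.request

# Synthetic stand-in for a long document; a real use case would read a file.
document = "\n".join(
    f"Section {i}: revenue grew {i % 7}% quarter over quarter." for i in range(2000)
)
# ~4 characters per token is a rough estimate for English text
approx_prompt_tokens = len(document) // 4
print(f"approx prompt tokens: {approx_prompt_tokens}")

payload = {
    "model": "inclusionAI/Ling-2.5-1T",
    "messages": [
        {"role": "user",
         "content": "Summarize the main trends in this document:\n\n" + document},
    ],
    "temperature": 0.3,
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=1800) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except OSError as exc:
    print(f"request failed: {exc}")
```

For genuinely book-length inputs, make sure the server was launched with a `--max-model-len` large enough to hold both the document and the requested output.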

3. Agentic Tool Use

Ling-2.5-1T is trained with Agentic RL for tool calling:
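A sketch of an OpenAI-style tool-calling request; `get_weather` is a hypothetical tool for illustration, and depending on the vLLM version, tool calling may also require server-side flags such as `--enable-auto-tool-choice`:

```python
import json
import urllib.request

# Hypothetical tool schema in the OpenAI function-calling format
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "inclusionAI/Ling-2.5-1T",
    "messages": [{"role": "user", "content": "What's the weather in Berlin right now?"}],
    "tools": tools,
    "tool_choice": "auto",
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=600) as resp:
        message = json.load(resp)["choices"][0]["message"]
        # For a tool-use prompt the model should return tool_calls, not text
        print(message.get("tool_calls"))
except OSError as exc:
    print(f"request failed: {exc}")
```

Your application then executes the returned call and appends the result as a `role: "tool"` message before asking the model to continue.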

Ling-2.5-1T vs Ring-2.5-1T

| Aspect | Ling-2.5-1T | Ring-2.5-1T |
| --- | --- | --- |
| Type | Instant (fast) model | Thinking (reasoning) model |
| Architecture | Hybrid Linear Attention | Hybrid Linear Attention |
| Best For | General chat, coding, agentic tasks | Math, formal reasoning, complex problems |
| Output Style | Direct answers | Chain-of-thought reasoning |
| Token Efficiency | High (fewer output tokens) | Uses more tokens for reasoning |
| IMO 2025 | Competitive | Gold medal level |

Tips for Clore.ai Users

  1. This model needs serious hardware — At 1T parameters, even Q4 quantization requires ~500GB of storage and 192GB+ VRAM. Make sure your Clore.ai instance has sufficient disk and multi-GPU before downloading.

  2. Start with --max-model-len 8192 — When first testing, use a short context to verify the model loads and runs correctly. Scale up the context length once everything works.

  3. Use persistent storage — The model weighs 1–2TB. Attach a large persistent volume on Clore.ai to avoid re-downloading. Download once with huggingface-cli download.

  4. Consider Ring-2.5-1T for reasoning tasks — If your use case is primarily math, logic, or formal reasoning, the companion Ring-2.5-1T model is specifically optimized for chain-of-thought reasoning.

  5. Monitor GPU memory — With 8-GPU setups, use nvidia-smi -l 1 to monitor memory usage and watch for OOM during generation with long contexts.

Troubleshooting

| Issue | Solution |
| --- | --- |
| CUDA out of memory | Reduce `--max-model-len`; ensure `--tensor-parallel-size` matches GPU count; try `--gpu-memory-utilization 0.95` |
| Very slow generation | Linear attention needs warmup; the first few requests may be slow. Also check that you have NVLink between GPUs |
| Model download fails | The model is ~2TB in BF16. Ensure enough disk space and use the `--resume-download` flag with `huggingface-cli` |
| vLLM doesn't support the architecture | Ensure you're using vLLM ≥0.7.0 with `--trust-remote-code`; the custom attention layers require this flag |
| GGUF not available | Check Unsloth or other community quantizations; it may take time for the community to quantize a model this large |
| Poor quality responses | Use temperature ≤0.1 for factual tasks; add a system prompt; ensure you're not truncating the context |
