Ling-2.5-1T (1 Trillion Parameters)
Run Ling-2.5-1T — Ant Group's 1 trillion parameter open-source LLM with hybrid linear attention on Clore.ai GPUs
Ling-2.5-1T by Ant Group (released February 16, 2026) is one of the largest open-source language models ever released — 1 trillion total parameters with 63B active. It introduces a hybrid linear attention architecture that enables efficient inference on context lengths up to 1 million tokens. Alongside it, Ant Group released Ring-2.5-1T, the world's first hybrid linear-architecture thinking model. Together, they represent a new frontier in open-source AI — competitive with GPT-5.2, DeepSeek V3.2, and Kimi K2.5 on reasoning and agentic benchmarks.
HuggingFace: inclusionAI/Ling-2.5-1T
Companion model: inclusionAI/Ring-2.5-1T (thinking/reasoning variant)
License: Open source (Ant Group InclusionAI License)
Key Features
1 trillion total parameters, 63B active — massive scale with efficient MoE-style activation
Hybrid linear attention — combines MLA (Multi-head Linear Attention) with Lightning Linear Attention for exceptional throughput on long sequences
1M token context window — via YaRN extension from native 256K, handles entire codebases and book-length documents
Frontier reasoning — approaches thinking-model performance while using ~4× fewer output tokens
Agentic capabilities — trained with Agentic RL, compatible with Claude Code, OpenCode, and OpenClaw
Ring-2.5-1T companion — dedicated reasoning variant achieves IMO 2025 and CMO 2025 gold medal level
Architecture Details
| Spec | Value |
|---|---|
| Total Parameters | 1T (1,000B) |
| Active Parameters | 63B |
| Architecture | Hybrid Linear Attention (MLA + Lightning Linear) |
| Pre-training Data | 29T tokens |
| Native Context | 256K tokens |
| Extended Context | 1M tokens (YaRN) |
| Release Date | February 16, 2026 |
Requirements
Running Ling-2.5-1T at full precision requires substantial resources. Quantized versions make it more accessible.
| | Quantized (Q4) | FP8 | Full Precision (BF16) |
|---|---|---|---|
| GPU | 8× RTX 4090 | 8× H100 80GB | 16× H100 80GB |
| VRAM | 8×24GB (192GB) | 8×80GB (640GB) | 16×80GB (1.28TB) |
| RAM | 256GB | 512GB | 1TB |
| Disk | 600GB | 1.2TB | 2TB+ |
| CUDA | 12.0+ | 12.0+ | 12.0+ |
Recommended Clore.ai setup:
Quantized (Q4): 8× RTX 4090 (~$4–16/day) — usable for experimentation and moderate workloads
Production (FP8): 8× H100 (~$24–48/day) — full quality with good throughput
Note: This is an extremely large model. For budget-conscious users, consider the smaller models in the Ling family on HuggingFace.
Quick Start with vLLM
vLLM is the recommended serving framework for Ling-2.5-1T:
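A minimal launch might look like the following sketch. The flag values (tensor parallelism across 8 GPUs, a short initial context, `--trust-remote-code` for the custom attention layers) are assumptions based on the requirements above; verify them against the model card before relying on them.

```bash
# Sketch of an 8-GPU vLLM launch; flag values are assumptions, check the model card
vllm serve inclusionAI/Ling-2.5-1T \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

Starting with a small `--max-model-len` verifies that the weights load and shard correctly before you scale toward the 256K native or 1M extended context.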
Quick Start with llama.cpp (Quantized)
For consumer GPU setups, GGUF quantizations are available:
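A download-and-serve sketch is below. The GGUF repo name and the split-file name are hypothetical placeholders; check HuggingFace for the actual community quantizations before downloading.

```bash
# Hypothetical GGUF repo and filename -- substitute the real community quant
huggingface-cli download inclusionAI/Ling-2.5-1T-GGUF \
  --include "*Q4_K_M*" --local-dir ./ling-gguf

# Serve with llama.cpp; offload as many layers as fit, keep the context short at first
llama-server \
  -m ./ling-gguf/<first-split-file>.gguf \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --port 8080
```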
Usage Examples
1. Chat Completion via OpenAI API
Once vLLM or llama-server is running:
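A minimal Python sketch using only the standard library, assuming a server on `localhost:8000` exposing vLLM's OpenAI-compatible `/v1/chat/completions` endpoint (the model name and port are assumptions; match them to your launch command):

```python
import json
import urllib.request

def build_chat_request(messages, model="inclusionAI/Ling-2.5-1T", temperature=0.7):
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {"model": model, "messages": messages, "temperature": temperature}

payload = build_chat_request(
    [{"role": "user", "content": "Summarize hybrid linear attention in two sentences."}]
)

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```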
2. Long-Context Document Analysis
Ling-2.5-1T's hybrid linear attention makes it exceptionally efficient for long documents:
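One way to sketch this: pack an entire document into a single request and guard against blowing past the context window. The 4-characters-per-token estimate and the helper names are assumptions for illustration, not part of any official API.

```python
def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token for English text (an assumption)."""
    return len(text) // 4

def build_long_doc_request(document, question, max_context_tokens=1_000_000):
    """Pack a full document plus a question into one chat request,
    rejecting inputs that clearly exceed the 1M-token extended context."""
    if estimate_tokens(document) > max_context_tokens:
        raise ValueError("document likely exceeds the 1M-token context window")
    return {
        "model": "inclusionAI/Ling-2.5-1T",
        "messages": [
            {"role": "system", "content": "Answer strictly from the provided document."},
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
        ],
        "temperature": 0.1,  # low temperature for factual extraction
    }

payload = build_long_doc_request("<entire codebase or book text here>",
                                 "What does the main module do?")
# POST payload to the /v1/chat/completions endpoint as in the chat example.
```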
3. Agentic Tool Use
Ling-2.5-1T is trained with Agentic RL for tool calling:
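A sketch of the client side of a tool-use loop, assuming the OpenAI function-calling format that vLLM's chat endpoint generally accepts for tool-trained models. The `get_gpu_price` tool, its prices, and the dispatch helper are hypothetical stand-ins.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_gpu_price",
        "description": "Look up the current daily rental price for a GPU type.",
        "parameters": {
            "type": "object",
            "properties": {"gpu": {"type": "string"}},
            "required": ["gpu"],
        },
    },
}]

def get_gpu_price(gpu):
    """Stand-in implementation; a real agent would query a live API."""
    prices = {"RTX 4090": 2.0, "H100": 6.0}  # illustrative numbers only
    return prices.get(gpu, 0.0)

def dispatch_tool_call(tool_call):
    """Route a model-emitted tool call to the matching local function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "get_gpu_price":
        return get_gpu_price(**args)
    raise ValueError(f"unknown tool: {name}")

# A tool call as it might appear in choices[0].message.tool_calls:
example_call = {"function": {"name": "get_gpu_price",
                             "arguments": json.dumps({"gpu": "H100"})}}
print(dispatch_tool_call(example_call))  # -> 6.0
```

In a full loop, the tool's return value is appended as a `role: "tool"` message and the conversation is sent back to the model for the final answer.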
Ling-2.5-1T vs Ring-2.5-1T
| | Ling-2.5-1T | Ring-2.5-1T |
|---|---|---|
| Type | Instant (fast) model | Thinking (reasoning) model |
| Architecture | Hybrid Linear Attention | Hybrid Linear Attention |
| Best For | General chat, coding, agentic tasks | Math, formal reasoning, complex problems |
| Output Style | Direct answers | Chain-of-thought reasoning |
| Token Efficiency | High (fewer output tokens) | Uses more tokens for reasoning |
| IMO 2025 | Competitive | Gold medal level |
Tips for Clore.ai Users
- This model needs serious hardware — At 1T parameters, even Q4 quantization requires ~500GB of storage and 192GB+ VRAM. Make sure your Clore.ai instance has sufficient disk and multiple GPUs before downloading.
- Start with `--max-model-len 8192` — When first testing, use a short context to verify the model loads and runs correctly. Scale up the context length once everything works.
- Use persistent storage — The model weighs 1–2TB. Attach a large persistent volume on Clore.ai to avoid re-downloading, and download once with `huggingface-cli download`.
- Consider Ring-2.5-1T for reasoning tasks — If your use case is primarily math, logic, or formal reasoning, the companion Ring-2.5-1T model is specifically optimized for chain-of-thought reasoning.
- Monitor GPU memory — With 8-GPU setups, use `nvidia-smi -l 1` to monitor memory usage and watch for OOM during generation with long contexts.
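The persistent-storage tip above can be sketched as a one-time download; the mount path is an assumption, and `--resume-download` lets an interrupted transfer pick up where it left off.

```bash
# One-time weight download onto a persistent Clore.ai volume (path is an assumption)
huggingface-cli download inclusionAI/Ling-2.5-1T \
  --local-dir /mnt/persistent/ling-2.5-1t \
  --resume-download
```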
Troubleshooting
| Problem | Solution |
|---|---|
| CUDA out of memory | Reduce `--max-model-len`; ensure `--tensor-parallel-size` matches the GPU count; try `--gpu-memory-utilization 0.95` |
| Very slow generation | Linear attention needs warmup, so the first few requests may be slow. Also check that the GPUs are connected via NVLink |
| Model download fails | The model is ~2TB in BF16. Ensure enough disk space and use the `--resume-download` flag with `huggingface-cli` |
| vLLM doesn't support the architecture | Use vLLM ≥0.7.0 with `--trust-remote-code`; the custom attention layers require this flag |
| GGUF not available | Check Unsloth or other community quantizations; quantizing a 1T-parameter model may take the community some time |
| Poor quality responses | Use temperature ≤0.1 for factual tasks, add a system prompt, and make sure the context is not being truncated |
Further Reading
Official Announcement (BusinessWire) — Release details and benchmarks
HuggingFace — Ling-2.5-1T — Model weights and documentation
HuggingFace — Ring-2.5-1T — Thinking model companion
ModelScope Mirror — Faster downloads in Asia
vLLM Documentation — Serving framework