# Ling-2.5-1T (1 Trillion Parameters)

Ling-2.5-1T by Ant Group (released February 16, 2026) is one of the largest open-source language models ever released — **1 trillion total parameters with 63B active**. It introduces a hybrid linear attention architecture that enables efficient inference on context lengths up to 1 million tokens. Alongside it, Ant Group released Ring-2.5-1T, the world's first hybrid linear-architecture thinking model. Together, they represent a new frontier in open-source AI — competitive with GPT-5.2, DeepSeek V3.2, and Kimi K2.5 on reasoning and agentic benchmarks.

* **HuggingFace:** [inclusionAI/Ling-2.5-1T](https://huggingface.co/inclusionAI/Ling-2.5-1T)
* **Companion model:** [inclusionAI/Ring-2.5-1T](https://huggingface.co/inclusionAI/Ring-2.5-1T) (thinking/reasoning variant)
* **License:** Open source (Ant Group InclusionAI License)

## Key Features

* **1 trillion total parameters, 63B active** — massive scale with efficient MoE-style activation
* **Hybrid linear attention** — combines MLA (Multi-head Latent Attention) with Lightning Linear Attention for exceptional throughput on long sequences
* **1M token context window** — via YaRN extension from native 256K, handles entire codebases and book-length documents
* **Frontier reasoning** — approaches thinking-model performance while using \~4× fewer output tokens
* **Agentic capabilities** — trained with Agentic RL, compatible with Claude Code, OpenCode, and OpenClaw
* **Ring-2.5-1T companion** — dedicated reasoning variant achieves IMO 2025 and CMO 2025 gold medal level

## Architecture Details

| Component         | Details                                          |
| ----------------- | ------------------------------------------------ |
| Total Parameters  | 1T (1,000B)                                      |
| Active Parameters | 63B                                              |
| Architecture      | Hybrid Linear Attention (MLA + Lightning Linear) |
| Pre-training Data | 29T tokens                                       |
| Native Context    | 256K tokens                                      |
| Extended Context  | 1M tokens (YaRN)                                 |
| Release Date      | February 16, 2026                                |

## Requirements

Running Ling-2.5-1T at full precision requires substantial resources. Quantized versions make it more accessible.

| Configuration | Quantized (Q4 GGUF) | FP8            | BF16 (Full)      |
| ------------- | ------------------- | -------------- | ---------------- |
| GPU           | 8× RTX 4090         | 8× H100 80GB   | 16× H100 80GB    |
| VRAM          | 8×24GB (192GB)      | 8×80GB (640GB) | 16×80GB (1.28TB) |
| RAM           | 256GB               | 512GB          | 1TB              |
| Disk          | 600GB               | 1.2TB          | 2TB+             |
| CUDA          | 12.0+               | 12.0+          | 12.0+            |

**Recommended Clore.ai setup:**

* **Quantized (Q4):** 8× RTX 4090 (\~$4–16/day) — usable for experimentation and moderate workloads
* **Production (FP8):** 8× H100 (\~$24–48/day) — full quality with good throughput
* **Note:** This is an extremely large model. For budget-conscious users, consider the smaller models in the Ling family on [HuggingFace](https://huggingface.co/inclusionAI).

## Quick Start with vLLM

vLLM is the recommended serving framework for Ling-2.5-1T:

```bash
# Install vLLM
pip install vllm

# Serve Ling-2.5-1T with tensor parallelism across 8 GPUs
vllm serve inclusionAI/Ling-2.5-1T \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

# For reduced memory, limit context length:
vllm serve inclusionAI/Ling-2.5-1T \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
```
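To push past the native 256K window toward 1M tokens, the model relies on YaRN. If the shipped HF config does not already bake the scaling in, vLLM's `--rope-scaling` flag can override the RoPE setup. The values below are illustrative only (a 256K native window stretched \~4× to 1M); check the model card for the officially recommended settings before relying on them:

```shell
# Illustrative YaRN override — verify factor and
# original_max_position_embeddings against the model card.
vllm serve inclusionAI/Ling-2.5-1T \
    --tensor-parallel-size 8 \
    --max-model-len 1048576 \
    --rope-scaling '{"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
```

Note that a 1M-token KV cache is itself memory-hungry even with linear attention; scale `--max-model-len` up gradually.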

## Quick Start with llama.cpp (Quantized)

For consumer GPU setups, GGUF quantizations are available:

```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a quantized GGUF (check HuggingFace for available quants)
huggingface-cli download inclusionAI/Ling-2.5-1T-GGUF \
    --include "*.Q4_K_M.gguf" \
    --local-dir ./models/

# Serve with llama-server (adjust -ngl for your GPU count)
./build/bin/llama-server \
    -m ./models/Ling-2.5-1T-Q4_K_M.gguf \
    -ngl 99 \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8000
```

## Usage Examples

### 1. Chat Completion via OpenAI API

Once vLLM or llama-server is running:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="inclusionAI/Ling-2.5-1T",
    messages=[
        {"role": "system", "content": "You are a world-class reasoning assistant. Think step by step."},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
    temperature=0.1,
    max_tokens=4096
)

print(response.choices[0].message.content)
```

### 2. Long-Context Document Analysis

Ling-2.5-1T's hybrid linear attention makes it exceptionally efficient for long documents:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="n/a")

# Load a large document
with open("full_codebase.txt", "r") as f:
    codebase = f.read()  # Can be hundreds of thousands of tokens

response = client.chat.completions.create(
    model="inclusionAI/Ling-2.5-1T",
    messages=[
        {"role": "system", "content": "You are a senior software architect."},
        {"role": "user", "content": f"Analyze this codebase for security vulnerabilities and architectural issues:\n\n{codebase}"}
    ],
    temperature=0.1,
    max_tokens=8192
)

print(response.choices[0].message.content)
```

### 3. Agentic Tool Use

Ling-2.5-1T is trained with Agentic RL for tool calling:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="n/a")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the product database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "books"]},
                    "max_price": {"type": "number"}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="inclusionAI/Ling-2.5-1T",
    messages=[{"role": "user", "content": "Find me a laptop under $1000 with good reviews"}],
    tools=tools,
    tool_choice="auto"
)

print(response.choices[0].message.tool_calls)
```
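After the model returns `tool_calls`, your application must execute them and feed results back. A sketch of that dispatch step, shown on plain dicts for clarity (the real SDK objects expose the same fields as attributes, e.g. `tool_call.function.name`); `search_database` here is a stand-in for your real backend:

```python
import json

# Map tool names to local implementations, parse the JSON arguments,
# and build the "tool" message to send back to the model.

def search_database(query: str, category: str = None, max_price: float = None):
    # Stand-in backend; replace with a real database query.
    return [{"name": "Example laptop", "price": 899.0,
             "category": category or "electronics"}]

REGISTRY = {"search_database": search_database}

def dispatch(tool_call: dict) -> dict:
    """Execute one tool call and return the follow-up message."""
    fn = REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(fn(**args)),
    }

# Simulated tool call, shaped like the API response:
call = {"id": "call_1", "function": {"name": "search_database",
        "arguments": '{"query": "laptop", "max_price": 1000}'}}
print(dispatch(call)["content"])
```

Append the returned message to the conversation and call the model again so it can compose a final answer from the tool result.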

## Ling-2.5-1T vs Ring-2.5-1T

| Aspect           | Ling-2.5-1T                         | Ring-2.5-1T                              |
| ---------------- | ----------------------------------- | ---------------------------------------- |
| Type             | Instant (fast) model                | Thinking (reasoning) model               |
| Architecture     | Hybrid Linear Attention             | Hybrid Linear Attention                  |
| Best For         | General chat, coding, agentic tasks | Math, formal reasoning, complex problems |
| Output Style     | Direct answers                      | Chain-of-thought reasoning               |
| Token Efficiency | High (fewer output tokens)          | Uses more tokens for reasoning           |
| IMO 2025         | Competitive                         | Gold medal level                         |
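In practice the split in the table suggests routing requests between the two models. A toy illustration of that idea, using a naive keyword list as a stand-in for a real task classifier:

```python
# Send math/proof-style prompts to the thinking model, everything else
# to the fast model. The keyword list is purely illustrative.

REASONING_HINTS = ("prove", "theorem", "integral", "olympiad", "derive")

def pick_model(prompt: str) -> str:
    text = prompt.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return "inclusionAI/Ring-2.5-1T"
    return "inclusionAI/Ling-2.5-1T"

print(pick_model("Prove that sqrt(2) is irrational"))  # Ring-2.5-1T
print(pick_model("Refactor this Flask handler"))       # Ling-2.5-1T
```

A production router would classify with a small model rather than keywords, but the cost argument is the same: Ring's chain-of-thought output costs more tokens, so reserve it for tasks that need it.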

## Tips for Clore.ai Users

1. **This model needs serious hardware** — At 1T parameters, even the Q4 quantization requires \~600GB of storage and 192GB+ of VRAM. Make sure your Clore.ai instance has sufficient disk space and enough GPUs before downloading.
2. **Start with `--max-model-len 8192`** — When first testing, use a short context to verify the model loads and runs correctly. Scale up the context length once everything works.
3. **Use persistent storage** — The model weighs 1–2TB. Attach a large persistent volume on Clore.ai to avoid re-downloading. Download once with `huggingface-cli download`.
4. **Consider Ring-2.5-1T for reasoning tasks** — If your use case is primarily math, logic, or formal reasoning, the companion Ring-2.5-1T model is specifically optimized for chain-of-thought reasoning.
5. **Monitor GPU memory** — With 8-GPU setups, use `nvidia-smi -l 1` to monitor memory usage and watch for OOM during generation with long contexts.
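The `nvidia-smi` monitoring in tip 5 can be scripted. A small helper around the CSV query mode, with the parser separated out so it can be exercised without a GPU present:

```python
import subprocess

def parse_usage(csv_text: str) -> list[float]:
    """Parse 'used, total' MiB pairs into fractional usage per GPU."""
    fractions = []
    for line in csv_text.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        fractions.append(used / total)
    return fractions

def gpu_usage() -> list[float]:
    # Requires nvidia-smi on PATH; queries every GPU on the host.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"], text=True)
    return parse_usage(out)

# Example with captured output from a 2-GPU host:
sample = "78123, 81559\n12044, 81559\n"
print([round(f, 2) for f in parse_usage(sample)])
```

Polling `gpu_usage()` and alerting when any GPU crosses \~0.95 gives early warning before a long-context request triggers an OOM.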

## Troubleshooting

| Issue                                 | Solution                                                                                                                             |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `CUDA out of memory`                  | Reduce `--max-model-len`; ensure `--tensor-parallel-size` matches GPU count; try `--gpu-memory-utilization 0.95`                     |
| Very slow generation                  | Linear attention needs warmup; first few requests may be slow. Also check you have NVLink between GPUs                               |
| Model download fails                  | Model is \~2TB in BF16. Ensure enough disk space; re-running `huggingface-cli download` resumes an interrupted download              |
| vLLM doesn't support the architecture | Ensure you're using vLLM ≥0.7.0 with `--trust-remote-code`; the custom attention layers require this flag                            |
| GGUF not available                    | Check [unsloth](https://huggingface.co/unsloth) or community quantizations; the model may take time to be quantized by the community |
| Poor quality responses                | Use temperature ≤0.1 for factual tasks; add a system prompt; ensure you're not truncating the context                                |

## Further Reading

* [Official Announcement (BusinessWire)](https://www.businesswire.com/news/home/20260215551663/en/) — Release details and benchmarks
* [HuggingFace — Ling-2.5-1T](https://huggingface.co/inclusionAI/Ling-2.5-1T) — Model weights and documentation
* [HuggingFace — Ring-2.5-1T](https://huggingface.co/inclusionAI/Ring-2.5-1T) — Thinking model companion
* [ModelScope Mirror](https://www.modelscope.cn/models/inclusionAI/Ling-2.5-1T) — Faster downloads in Asia
* [vLLM Documentation](https://docs.vllm.ai/) — Serving framework
