Ling-2.5-1T (1 Trillion Parameters)
Run Ling-2.5-1T — Ant Group's 1 trillion parameter open-source LLM with hybrid linear attention on Clore.ai GPUs
Ling-2.5-1T by Ant Group (released February 16, 2026) is one of the largest open-source language models ever released — 1 trillion total parameters with 63B active. It introduces a hybrid linear attention architecture that enables efficient inference on context lengths up to 1 million tokens. Alongside it, Ant Group released Ring-2.5-1T, the world's first hybrid linear-architecture thinking model. Together, they represent a new frontier in open-source AI — competitive with GPT-5.2, DeepSeek V3.2, and Kimi K2.5 on reasoning and agentic benchmarks.
HuggingFace: inclusionAI/Ling-2.5-1T
Companion model: inclusionAI/Ring-2.5-1T (thinking/reasoning variant)
License: Open source (Ant Group InclusionAI License)
Key Features
1 trillion total parameters, 63B active — massive scale with efficient MoE-style activation
Hybrid linear attention — combines MLA (Multi-head Linear Attention) with Lightning Linear Attention for exceptional throughput on long sequences
1M token context window — via YaRN extension from native 256K, handles entire codebases and book-length documents
Frontier reasoning — approaches thinking-model performance while using ~4× fewer output tokens
Agentic capabilities — trained with Agentic RL, compatible with Claude Code, OpenCode, and OpenClaw
Ring-2.5-1T companion — dedicated reasoning variant achieves IMO 2025 and CMO 2025 gold medal level
Architecture Details
| Spec | Value |
|---|---|
| Total Parameters | 1T (1,000B) |
| Active Parameters | 63B |
| Architecture | Hybrid Linear Attention (MLA + Lightning Linear) |
| Pre-training Data | 29T tokens |
| Native Context | 256K tokens |
| Extended Context | 1M tokens (YaRN) |
| Release Date | February 16, 2026 |
Requirements
Running Ling-2.5-1T at full precision requires substantial resources. Quantized versions make it more accessible.
| | Quantized (Q4) | FP8 | Full Precision (BF16) |
|---|---|---|---|
| GPU | 8× RTX 4090 | 8× H100 80GB | 16× H100 80GB |
| VRAM | 8×24GB (192GB) | 8×80GB (640GB) | 16×80GB (1.28TB) |
| RAM | 256GB | 512GB | 1TB |
| Disk | 600GB | 1.2TB | 2TB+ |
| CUDA | 12.0+ | 12.0+ | 12.0+ |
Recommended Clore.ai setup:
Quantized (Q4): 8× RTX 4090 (~$4–16/day) — usable for experimentation and moderate workloads
Production (FP8): 8× H100 (~$24–48/day) — full quality with good throughput
Note: This is an extremely large model. For budget-conscious users, consider the smaller models in the Ling family on HuggingFace.
Quick Start with vLLM
vLLM is the recommended serving framework for Ling-2.5-1T:
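A minimal launch might look like the following sketch. The flag values (tensor parallelism across 8 GPUs, a short initial context, `--trust-remote-code` for the custom attention layers) are assumptions based on the requirements above; verify them against the model card before relying on them.

```bash
# Sketch of an 8-GPU vLLM launch; flag values are assumptions, check the model card
vllm serve inclusionAI/Ling-2.5-1T \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

Starting with a small `--max-model-len` verifies that the weights load and shard correctly before you scale toward the 256K native or 1M extended context.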
Quick Start with llama.cpp (Quantized)
For consumer GPU setups, GGUF quantizations are available:
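A download-and-serve sketch is below. The GGUF repo name and the split-file name are hypothetical placeholders; check HuggingFace for the actual community quantizations before downloading.

```bash
# Hypothetical GGUF repo and filename -- substitute the real community quant
huggingface-cli download inclusionAI/Ling-2.5-1T-GGUF \
  --include "*Q4_K_M*" --local-dir ./ling-gguf

# Serve with llama.cpp; offload as many layers as fit, keep the context short at first
llama-server \
  -m ./ling-gguf/<first-split-file>.gguf \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --port 8080
```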
Usage Examples
1. Chat Completion via OpenAI API
Once vLLM or llama-server is running:
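A minimal Python sketch using only the standard library, assuming a server on `localhost:8000` exposing vLLM's OpenAI-compatible `/v1/chat/completions` endpoint (the model name and port are assumptions; match them to your launch command):

```python
import json
import urllib.request

def build_chat_request(messages, model="inclusionAI/Ling-2.5-1T", temperature=0.7):
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {"model": model, "messages": messages, "temperature": temperature}

payload = build_chat_request(
    [{"role": "user", "content": "Summarize hybrid linear attention in two sentences."}]
)

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```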
2. Long-Context Document Analysis
Ling-2.5-1T's hybrid linear attention makes it exceptionally efficient for long documents:
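One way to sketch this: pack an entire document into a single request and guard against blowing past the context window. The 4-characters-per-token estimate and the helper names are assumptions for illustration, not part of any official API.

```python
def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token for English text (an assumption)."""
    return len(text) // 4

def build_long_doc_request(document, question, max_context_tokens=1_000_000):
    """Pack a full document plus a question into one chat request,
    rejecting inputs that clearly exceed the 1M-token extended context."""
    if estimate_tokens(document) > max_context_tokens:
        raise ValueError("document likely exceeds the 1M-token context window")
    return {
        "model": "inclusionAI/Ling-2.5-1T",
        "messages": [
            {"role": "system", "content": "Answer strictly from the provided document."},
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
        ],
        "temperature": 0.1,  # low temperature for factual extraction
    }

payload = build_long_doc_request("<entire codebase or book text here>",
                                 "What does the main module do?")
# POST payload to the /v1/chat/completions endpoint as in the chat example.
```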
3. Agentic Tool Use
Ling-2.5-1T is trained with Agentic RL for tool calling:
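A sketch of the client side of a tool-use loop, assuming the OpenAI function-calling format that vLLM's chat endpoint generally accepts for tool-trained models. The `get_gpu_price` tool, its prices, and the dispatch helper are hypothetical stand-ins.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_gpu_price",
        "description": "Look up the current daily rental price for a GPU type.",
        "parameters": {
            "type": "object",
            "properties": {"gpu": {"type": "string"}},
            "required": ["gpu"],
        },
    },
}]

def get_gpu_price(gpu):
    """Stand-in implementation; a real agent would query a live API."""
    prices = {"RTX 4090": 2.0, "H100": 6.0}  # illustrative numbers only
    return prices.get(gpu, 0.0)

def dispatch_tool_call(tool_call):
    """Route a model-emitted tool call to the matching local function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "get_gpu_price":
        return get_gpu_price(**args)
    raise ValueError(f"unknown tool: {name}")

# A tool call as it might appear in choices[0].message.tool_calls:
example_call = {"function": {"name": "get_gpu_price",
                             "arguments": json.dumps({"gpu": "H100"})}}
print(dispatch_tool_call(example_call))  # -> 6.0
```

In a full loop, the tool's return value is appended as a `role: "tool"` message and the conversation is sent back to the model for the final answer.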
Ling-2.5-1T vs Ring-2.5-1T
| | Ling-2.5-1T | Ring-2.5-1T |
|---|---|---|
| Type | Instant (fast) model | Thinking (reasoning) model |
| Architecture | Hybrid Linear Attention | Hybrid Linear Attention |
| Best For | General chat, coding, agentic tasks | Math, formal reasoning, complex problems |
| Output Style | Direct answers | Chain-of-thought reasoning |
| Token Efficiency | High (fewer output tokens) | Uses more tokens for reasoning |
| IMO 2025 | Competitive | Gold medal level |
Tips for Clore.ai Users
- This model needs serious hardware — At 1T parameters, even Q4 quantization requires ~500GB of storage and 192GB+ VRAM. Make sure your Clore.ai instance has sufficient disk and multiple GPUs before downloading.
- Start with `--max-model-len 8192` — When first testing, use a short context to verify the model loads and runs correctly. Scale up the context length once everything works.
- Use persistent storage — The model weighs 1–2TB. Attach a large persistent volume on Clore.ai to avoid re-downloading, and download once with `huggingface-cli download`.
- Consider Ring-2.5-1T for reasoning tasks — If your use case is primarily math, logic, or formal reasoning, the companion Ring-2.5-1T model is specifically optimized for chain-of-thought reasoning.
- Monitor GPU memory — With 8-GPU setups, use `nvidia-smi -l 1` to monitor memory usage and watch for OOM during generation with long contexts.
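The persistent-storage tip above can be sketched as a one-time download; the mount path is an assumption, and `--resume-download` lets an interrupted transfer pick up where it left off.

```bash
# One-time weight download onto a persistent Clore.ai volume (path is an assumption)
huggingface-cli download inclusionAI/Ling-2.5-1T \
  --local-dir /mnt/persistent/ling-2.5-1t \
  --resume-download
```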
Troubleshooting
| Problem | Solution |
|---|---|
| CUDA out of memory | Reduce `--max-model-len`; ensure `--tensor-parallel-size` matches the GPU count; try `--gpu-memory-utilization 0.95` |
| Very slow generation | Linear attention needs warmup, so the first few requests may be slow. Also check that the GPUs are connected via NVLink |
| Model download fails | The model is ~2TB in BF16. Ensure enough disk space and use the `--resume-download` flag with `huggingface-cli` |
| vLLM doesn't support the architecture | Use vLLM ≥0.7.0 with `--trust-remote-code`; the custom attention layers require this flag |
| GGUF not available | Check Unsloth or other community quantizations; quantizing a 1T-parameter model may take the community some time |
| Poor quality responses | Use temperature ≤0.1 for factual tasks, add a system prompt, and make sure the context is not being truncated |
Further Reading
Official Announcement (BusinessWire) — Release details and benchmarks
HuggingFace — Ling-2.5-1T — Model weights and documentation
HuggingFace — Ring-2.5-1T — Thinking model companion
ModelScope Mirror — Faster downloads in Asia
vLLM Documentation — Serving framework