# Qwen2.5

Run Alibaba's Qwen2.5 family of models on CLORE.AI GPUs - powerful multilingual LLMs with strong code and math capabilities.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Qwen2.5?

* **Versatile sizes** - 0.5B to 72B parameters
* **Multilingual** - 29 languages including Chinese
* **Long context** - Up to 128K tokens
* **Specialized variants** - Coder, Math editions
* **Open license** - Apache 2.0 for most sizes (the 3B and 72B variants ship under Qwen's own licenses)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
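
For example, a minimal Python client pointed at a rented instance (the `abc123.clorecloud.net` hostname below is a placeholder; substitute your own `http_pub` URL):

```python
from openai import OpenAI

# Placeholder hostname: replace with the http_pub URL from My Orders
client = OpenAI(
    base_url="https://abc123.clorecloud.net/v1",
    api_key="not-needed"  # vLLM does not require a key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(response.choices[0].message.content)
```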

### Verify It's Working

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models
```

{% hint style="warning" %}
If you get HTTP 502, wait 5-15 minutes - the model is still downloading from HuggingFace.
{% endhint %}
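
To script the wait instead of retrying by hand, a small polling loop works. A minimal sketch, assuming the standard `requests` library and the `/health` endpoint shown above:

```python
import time
import requests

BASE_URL = "https://your-http-pub.clorecloud.net"  # replace with your http_pub URL

# Poll /health until the server answers 200 (model finished downloading/loading)
for attempt in range(90):  # up to ~15 minutes at 10s intervals
    try:
        if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
            print("Service is ready")
            break
    except requests.RequestException:
        pass  # server not up yet (connection refused / 502 from proxy)
    time.sleep(10)
else:
    print("Timed out waiting for the service")
```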

## Qwen3 Reasoning Mode

{% hint style="info" %}
**New in Qwen3:** Some Qwen3 models support a reasoning mode that shows the model's thought process in `<think>` tags before the final answer.
{% endhint %}

When using Qwen3 models via vLLM, responses may include reasoning:

```json
{
  "content": "<think>\nLet me think about this step by step...\n</think>\n\nThe answer is..."
}
```

To use Qwen3 with reasoning:

```bash
vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000
```
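
If your client should hide the chain of thought, you can split the reasoning from the final answer on the `<think>` tags. A minimal sketch, assuming the reasoning block appears at the start of the response as in the example above:

```python
import re

def split_reasoning(content: str) -> tuple[str, str]:
    """Separate a leading <think>...</think> block from the final answer."""
    match = re.match(r"\s*<think>(.*?)</think>\s*", content, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), content[match.end():].strip()
    return "", content.strip()

reasoning, answer = split_reasoning(
    "<think>\nLet me think about this step by step...\n</think>\n\nThe answer is..."
)
print("Reasoning:", reasoning)
print("Answer:", answer)
```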

## Model Variants

### Base Models

| Model                | Parameters | VRAM (FP16) | Context | Notes               |
| -------------------- | ---------- | ----------- | ------- | ------------------- |
| Qwen2.5-0.5B         | 0.5B       | 2GB         | 32K     | Edge/testing        |
| Qwen2.5-1.5B         | 1.5B       | 4GB         | 32K     | Very light          |
| Qwen2.5-3B           | 3B         | 8GB         | 32K     | Budget              |
| Qwen2.5-7B           | 7B         | 16GB        | 128K    | Balanced            |
| Qwen2.5-14B          | 14B        | 32GB        | 128K    | High quality        |
| Qwen2.5-32B          | 32B        | 70GB        | 128K    | Very high quality   |
| Qwen2.5-72B          | 72B        | 150GB       | 128K    | **Best quality**    |
| Qwen2.5-72B-Instruct | 72B        | 150GB       | 128K    | Chat/instruct tuned |

### Specialized Variants

| Model                      | Focus       | Best For               | VRAM (FP16) |
| -------------------------- | ----------- | ---------------------- | ----------- |
| Qwen2.5-Coder-7B-Instruct  | Code        | Programming, debugging | 16GB        |
| Qwen2.5-Coder-14B-Instruct | Code        | Complex code tasks     | 32GB        |
| Qwen2.5-Coder-32B-Instruct | Code        | **Best code model**    | 70GB        |
| Qwen2.5-Math-7B-Instruct   | Mathematics | Calculations, proofs   | 16GB        |
| Qwen2.5-Math-72B-Instruct  | Mathematics | Research-grade math    | 150GB       |
| Qwen2.5-Instruct           | Chat        | General assistant      | varies      |

## Hardware Requirements

| Model     | Minimum GPU   | Recommended  | VRAM (Q4) |
| --------- | ------------- | ------------ | --------- |
| 0.5B-3B   | RTX 3060 12GB | RTX 3080     | 2-6GB     |
| 7B        | RTX 3090 24GB | RTX 4090     | 6GB       |
| 14B       | A100 40GB     | A100 80GB    | 12GB      |
| 32B       | A100 80GB     | 2x A100 40GB | 22GB      |
| 72B       | 2x A100 80GB  | 4x A100 80GB | 48GB      |
| Coder-32B | A100 80GB     | 2x A100 40GB | 22GB      |

## Installation

### Using vLLM (Recommended)

```bash
pip install vllm==0.7.3

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Using Ollama

```bash
# Standard models
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b       # New: largest Qwen2.5

# Specialized
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b  # New: best code model

# Run chat
ollama run qwen2.5:7b
```

### Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## API Usage

### OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "What is Python?"}
        ]
    }'
```

## Qwen2.5-72B-Instruct

The flagship Qwen2.5 model, the largest and most capable in the family. It is competitive with GPT-4-class models on many benchmarks. Note that unlike most smaller Qwen2.5 models (Apache 2.0), the 72B weights are released under Qwen's own license.

### Running via vLLM (Multi-GPU)

```bash
# 4x A100 80GB setup
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

# AWQ quantized — runs on 2x A100 80GB
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768
```

### Running via Ollama

```bash
# Pull 72B model (requires 48GB+ VRAM for Q4)
ollama pull qwen2.5:72b

# Run interactive session
ollama run qwen2.5:72b

# API access
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:72b",
  "messages": [{"role": "user", "content": "Analyze this complex scenario..."}],
  "stream": false
}'
```

### Python Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The 72B model excels at complex analytical tasks
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert analyst. Provide detailed, nuanced responses."
        },
        {
            "role": "user",
            "content": """Compare the architectural differences between transformer and 
            state space models (SSMs) for sequence modeling. Include efficiency tradeoffs."""
        }
    ],
    temperature=0.7,
    max_tokens=2000
)

print(response.choices[0].message.content)
```

## Qwen2.5-Coder-32B-Instruct

One of the strongest open-weight code models available. Qwen2.5-Coder-32B-Instruct matches or exceeds GPT-4o on several coding benchmarks and supports 40+ programming languages.

### Running via vLLM

```bash
# Single A100 80GB
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9

# Dual RTX 4090 (24GB each = 48GB total) using 4-bit AWQ quantization
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq
```

### Running via Ollama

```bash
# Pull Coder-32B (requires ~22GB VRAM for Q4)
ollama pull qwen2.5-coder:32b

# Run
ollama run qwen2.5-coder:32b

# Test with a coding prompt
ollama run qwen2.5-coder:32b "Write a Python async web scraper using aiohttp"
```

### Code Generation Examples

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Full-stack code generation
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert software engineer. Write clean, production-ready code with proper error handling and documentation."
        },
        {
            "role": "user",
            "content": """Write a Python FastAPI service that:
1. Accepts POST /summarize with JSON body {"text": "...", "max_length": 150}
2. Uses a local Ollama instance to summarize the text
3. Returns {"summary": "...", "original_length": N, "summary_length": N}
4. Includes proper error handling, input validation with Pydantic, and async support"""
        }
    ],
    temperature=0.1,  # Low temperature for code
    max_tokens=3000
)

print(response.choices[0].message.content)
```

````python
# Code review and debugging
code_to_review = """
def find_duplicates(lst):
    seen = []
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        seen.append(item)
    return duplicates
"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": f"Review this Python code for performance issues and suggest improvements:\n\n```python\n{code_to_review}\n```"
        }
    ],
    temperature=0.3
)

print(response.choices[0].message.content)
````

## Qwen2.5-Coder

Optimized for code generation:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --host 0.0.0.0

# Using Ollama
ollama run qwen2.5-coder:7b
```

```python
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Returns the median value
3. Handles empty lists gracefully
Include type hints and docstrings."""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
)

print(response.choices[0].message.content)
```

## Qwen2.5-Math

Specialized for mathematical reasoning:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Math-7B-Instruct \
    --host 0.0.0.0
```

```python
prompt = """Solve step by step:
Find all values of x where: x^3 - 6x^2 + 11x - 6 = 0"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1
)

print(response.choices[0].message.content)
```

## Multilingual Support

Qwen2.5 supports 29 languages:

```python
# Chinese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "用中文解释什么是人工智能"}]
)

# Japanese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "人工知能について日本語で説明してください"}]
)

# Korean
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "인공지능에 대해 한국어로 설명해주세요"}]
)
```

## Long Context (128K)

```python
# Read a long document
with open("long_document.txt", "r") as f:
    document = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{document}"}
    ],
    max_tokens=2000
)
```
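
Note that serving the full 128K window typically requires a matching `--max-model-len` (e.g. `--max-model-len 131072`) and enough VRAM for the KV cache; depending on the vLLM version, extending Qwen2.5 beyond 32K may also require enabling YaRN rope scaling as described on the model card. Before sending a large file, it can help to count tokens with the model's own tokenizer. A sketch using `transformers`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

with open("long_document.txt", "r", encoding="utf-8") as f:
    document = f.read()

n_tokens = len(tokenizer.encode(document))
print(f"Document is {n_tokens} tokens")

# Leave headroom for the chat template and the generated summary
if n_tokens > 120_000:
    print("Warning: document may exceed the 128K context window")
```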

## Quantization

### GGUF with Ollama

```bash
# 4-bit quantized
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama pull qwen2.5:72b-instruct-q4_K_M   # 72B in 4-bit (~48GB)

# 8-bit quantized
ollama pull qwen2.5:7b-instruct-q8_0

# Coder variants
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
```

### AWQ with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
```

### GGUF with llama.cpp

```bash
# Download GGUF
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

# Run server (-ngl 35 offloads 35 layers to the GPU)
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35
```

## Multi-GPU Setup

### Tensor Parallelism

```bash
# 72B on 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768

# 32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 2

# Coder-32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384
```

## Performance

### Throughput (tokens/sec)

| Model             | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| ----------------- | -------- | -------- | --------- | --------- |
| Qwen2.5-0.5B      | 250      | 320      | 380       | 400       |
| Qwen2.5-3B        | 150      | 200      | 250       | 280       |
| Qwen2.5-7B        | 75       | 100      | 130       | 150       |
| Qwen2.5-7B Q4     | 110      | 140      | 180       | 200       |
| Qwen2.5-14B       | -        | 55       | 70        | 85        |
| Qwen2.5-32B       | -        | -        | 35        | 50        |
| Qwen2.5-72B       | -        | -        | 20 (2x)   | 40 (2x)   |
| Qwen2.5-72B Q4    | -        | -        | -         | 55 (2x)   |
| Qwen2.5-Coder-32B | -        | -        | 32        | 48        |

### Time to First Token (TTFT)

| Model | RTX 4090 | A100 40GB  | A100 80GB  |
| ----- | -------- | ---------- | ---------- |
| 7B    | 60ms     | 40ms       | 35ms       |
| 14B   | 120ms    | 80ms       | 60ms       |
| 32B   | -        | 200ms      | 140ms      |
| 72B   | -        | 400ms (2x) | 280ms (2x) |

### Context Length vs VRAM (7B)

| Context | FP16 | Q8   | Q4   |
| ------- | ---- | ---- | ---- |
| 8K      | 16GB | 10GB | 6GB  |
| 32K     | 24GB | 16GB | 10GB |
| 64K     | 40GB | 26GB | 16GB |
| 128K    | 72GB | 48GB | 28GB |

## Benchmarks

| Model             | MMLU  | HumanEval | GSM8K | MATH  | LiveCodeBench |
| ----------------- | ----- | --------- | ----- | ----- | ------------- |
| Qwen2.5-7B        | 74.2% | 75.6%     | 85.4% | 55.2% | 42.1%         |
| Qwen2.5-14B       | 79.7% | 81.1%     | 89.5% | 65.8% | 51.3%         |
| Qwen2.5-32B       | 83.3% | 84.2%     | 91.2% | 72.1% | 60.7%         |
| Qwen2.5-72B       | 86.1% | 86.2%     | 93.2% | 79.5% | 67.4%         |
| Qwen2.5-Coder-7B  | 72.8% | 88.4%     | 86.1% | 58.4% | 64.2%         |
| Qwen2.5-Coder-32B | 83.1% | **92.7%** | 92.3% | 76.8% | **78.5%**     |

## Docker Compose

```yaml
version: '3.8'

services:
  qwen:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | Best For              |
| ------------- | ----------- | --------------------- |
| RTX 3090 24GB | \~$0.06     | 7B models             |
| RTX 4090 24GB | \~$0.10     | 7B-14B models         |
| A100 40GB     | \~$0.17     | 14B-32B models        |
| A100 80GB     | \~$0.25     | 32B models, Coder-32B |
| 2x A100 80GB  | \~$0.50     | 72B models            |
| 4x A100 80GB  | \~$1.00     | 72B max context       |

*Prices vary by provider. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
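
As a rough illustration of cost per token, you can combine the throughput and rate tables above. These are ballpark figures from this page, not quotes:

```python
# Illustrative cost-per-million-tokens from the tables above (ballpark only)
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Qwen2.5-7B on RTX 3090: ~$0.06/h at ~75 tok/s -> ~$0.22 per 1M tokens
print(f"7B on RTX 3090:  ${cost_per_million_tokens(0.06, 75):.2f} / 1M tokens")

# Qwen2.5-72B on 2x A100 80GB: ~$0.50/h at ~40 tok/s -> ~$3.47 per 1M tokens
print(f"72B on 2x A100:  ${cost_per_million_tokens(0.50, 40):.2f} / 1M tokens")
```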

**Save money:**

* Use **Spot** market for flexible workloads
* Pay with **CLORE** tokens
* Start with smaller models (7B) for testing

## Troubleshooting

### Out of Memory

```bash
# Reduce context
--max-model-len 8192

# Enable memory optimization
--gpu-memory-utilization 0.85

# Use quantized model
ollama pull qwen2.5:7b-instruct-q4_K_M
```

### Slow Generation

```bash
# Install FlashAttention 2 (speeds up Transformers-based inference)
pip install flash-attn

# Use vLLM for better throughput
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --enable-prefix-caching
```

### Chinese Characters Display

```python
# Ensure UTF-8 encoding
import sys
sys.stdout.reconfigure(encoding='utf-8')
```

### Model Not Found

```bash
# Verify the exact model name on the Hugging Face Hub (or browse https://huggingface.co/Qwen)
python -c "from huggingface_hub import list_models; print([m.id for m in list_models(author='Qwen', search='Qwen2.5', limit=20)])"

# Common names:
# Qwen/Qwen2.5-7B-Instruct
# Qwen/Qwen2.5-72B-Instruct       ← New
# Qwen/Qwen2.5-Coder-7B-Instruct
# Qwen/Qwen2.5-Coder-32B-Instruct ← New
# Qwen/Qwen2.5-Math-7B-Instruct
```

## Qwen2.5 vs Others

| Feature      | Qwen2.5-7B | Qwen2.5-72B | Llama 3.1 70B | GPT-4o      |
| ------------ | ---------- | ----------- | ------------- | ----------- |
| Context      | 128K       | 128K        | 128K          | 128K        |
| Multilingual | Excellent  | Excellent   | Good          | Excellent   |
| Code         | Excellent  | Excellent   | Good          | Excellent   |
| Math         | Excellent  | Excellent   | Good          | Excellent   |
| Chinese      | Excellent  | Excellent   | Poor          | Good        |
| License      | Apache 2.0 | Qwen license | Llama 3.1 Community | Proprietary |
| Cost         | Free       | Free        | Free          | Paid API    |

**Use Qwen2.5 when:**

* Chinese language support needed
* Math/code tasks are priority
* Long context is required
* Want a permissive license (Apache 2.0 for most sizes)
* Need best open-source code model (Coder-32B)

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production deployment
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy local setup
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Larger reasoning model
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Open-source reasoning model
* [Fine-tune LLM](https://docs.clore.ai/guides/training/finetune-llm) - Custom training


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/qwen25.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
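
For example, a minimal sketch of such a query from Python (`requests` handles URL-encoding of the question):

```python
import requests

# Ask the documentation a question; the `ask` parameter is URL-encoded automatically
resp = requests.get(
    "https://docs.clore.ai/guides/language-models/qwen25.md",
    params={"ask": "Which GPUs can run Qwen2.5-72B-Instruct with AWQ quantization?"},
    timeout=30,
)
print(resp.text)
```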
