# Qwen2.5

Run Alibaba's Qwen2.5 family of models on CLORE.AI GPUs - powerful multilingual LLMs with strong code and math capabilities.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Qwen2.5?

* **Versatile sizes** - 0.5B to 72B parameters
* **Multilingual** - 29 languages including Chinese
* **Long context** - Up to 128K tokens
* **Specialized variants** - Coder, Math editions
* **Open source** - Apache 2.0 license

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

### Verify It's Working

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models
```

{% hint style="warning" %}
If you get HTTP 502, wait 5-15 minutes - the model is still downloading from HuggingFace.
{% endhint %}
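Instead of re-running `curl` by hand while the model downloads, you can poll the health endpoint from a script. A minimal sketch using only the standard library; the hostname is a placeholder for your own `http_pub` URL:

```python
import time
import urllib.request
import urllib.error

def check_health(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(url: str, attempts: int = 60, delay: float = 15.0,
                     probe=check_health) -> bool:
    """Poll `url` until it responds, up to `attempts` times."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False

if __name__ == "__main__":
    # Replace with the http_pub hostname from My Orders
    ready = wait_until_ready("https://your-http-pub.clorecloud.net/health")
    print("ready" if ready else "timed out")
```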

## Qwen3 Reasoning Mode

{% hint style="info" %}
**New in Qwen3:** Some Qwen3 models support a reasoning mode that shows the model's thought process in `<think>` tags before the final answer.
{% endhint %}

When using Qwen3 models via vLLM, responses may include reasoning:

```json
{
  "content": "<think>\nLet me think about this step by step...\n</think>\n\nThe answer is..."
}
```

To use Qwen3 with reasoning:

```bash
vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000
```
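If you feed Qwen3 output to downstream code or a UI, you may want to strip the reasoning and keep only the final answer. A small helper, assuming the `<think>` tag format shown above:

```python
import re

def strip_think(content: str) -> str:
    """Remove <think>...</think> blocks from a model response."""
    # DOTALL so reasoning spanning multiple lines is matched
    cleaned = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    return cleaned.strip()

raw = "<think>\nLet me think about this step by step...\n</think>\n\nThe answer is 42."
print(strip_think(raw))  # -> The answer is 42.
```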

## Model Variants

### Base Models

| Model                | Parameters | VRAM (FP16) | Context | Notes               |
| -------------------- | ---------- | ----------- | ------- | ------------------- |
| Qwen2.5-0.5B         | 0.5B       | 2GB         | 32K     | Edge/testing        |
| Qwen2.5-1.5B         | 1.5B       | 4GB         | 32K     | Very light          |
| Qwen2.5-3B           | 3B         | 8GB         | 32K     | Budget              |
| Qwen2.5-7B           | 7B         | 16GB        | 128K    | Balanced            |
| Qwen2.5-14B          | 14B        | 32GB        | 128K    | High quality        |
| Qwen2.5-32B          | 32B        | 70GB        | 128K    | Very high quality   |
| Qwen2.5-72B          | 72B        | 150GB       | 128K    | **Best quality**    |
| Qwen2.5-72B-Instruct | 72B        | 150GB       | 128K    | Chat/instruct tuned |

### Specialized Variants

| Model                      | Focus       | Best For               | VRAM (FP16) |
| -------------------------- | ----------- | ---------------------- | ----------- |
| Qwen2.5-Coder-7B-Instruct  | Code        | Programming, debugging | 16GB        |
| Qwen2.5-Coder-14B-Instruct | Code        | Complex code tasks     | 32GB        |
| Qwen2.5-Coder-32B-Instruct | Code        | **Best code model**    | 70GB        |
| Qwen2.5-Math-7B-Instruct   | Mathematics | Calculations, proofs   | 16GB        |
| Qwen2.5-Math-72B-Instruct  | Mathematics | Research-grade math    | 150GB       |
| Qwen2.5-Instruct           | Chat        | General assistant      | varies      |

## Hardware Requirements

| Model     | Minimum GPU   | Recommended  | VRAM (Q4) |
| --------- | ------------- | ------------ | --------- |
| 0.5B-3B   | RTX 3060 12GB | RTX 3080     | 2-6GB     |
| 7B        | RTX 3090 24GB | RTX 4090     | 6GB       |
| 14B       | A100 40GB     | A100 80GB    | 12GB      |
| 32B       | A100 80GB     | 2x A100 40GB | 22GB      |
| 72B       | 2x A100 80GB  | 4x A100 80GB | 48GB      |
| Coder-32B | A100 80GB     | 2x A100 40GB | 22GB      |

## Installation

### Using vLLM (Recommended)

```bash
pip install vllm==0.7.3

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Using Ollama

```bash
# Standard models
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b       # New: largest Qwen2.5

# Specialized
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b  # New: best code model

# Run chat
ollama run qwen2.5:7b
```

### Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## API Usage

### OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "What is Python?"}
        ]
    }'
```

## Qwen2.5-72B-Instruct

The flagship Qwen2.5 model — the largest and most capable in the family. It competes with GPT-4 on many benchmarks and is fully open-source under Apache 2.0.

### Running via vLLM (Multi-GPU)

```bash
# 4x A100 80GB setup
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

# AWQ quantized — runs on 2x A100 80GB
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768
```

### Running via Ollama

```bash
# Pull 72B model (requires 48GB+ VRAM for Q4)
ollama pull qwen2.5:72b

# Run interactive session
ollama run qwen2.5:72b

# API access
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:72b",
  "messages": [{"role": "user", "content": "Analyze this complex scenario..."}],
  "stream": false
}'
```

### Python Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The 72B model excels at complex analytical tasks
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert analyst. Provide detailed, nuanced responses."
        },
        {
            "role": "user",
            "content": """Compare the architectural differences between transformer and 
            state space models (SSMs) for sequence modeling. Include efficiency tradeoffs."""
        }
    ],
    temperature=0.7,
    max_tokens=2000
)

print(response.choices[0].message.content)
```

## Qwen2.5-Coder-32B-Instruct

One of the strongest open-source code models available. Qwen2.5-Coder-32B-Instruct matches or exceeds GPT-4o on many coding benchmarks and supports 40+ programming languages.

### Running via vLLM

```bash
# Single A100 80GB
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9

# Dual RTX 4090 (24GB each = 48GB total) with AWQ 4-bit quantization
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq
```

### Running via Ollama

```bash
# Pull Coder-32B (requires ~22GB VRAM for Q4)
ollama pull qwen2.5-coder:32b

# Run
ollama run qwen2.5-coder:32b

# Test with a coding prompt
ollama run qwen2.5-coder:32b "Write a Python async web scraper using aiohttp"
```

### Code Generation Examples

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Full-stack code generation
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert software engineer. Write clean, production-ready code with proper error handling and documentation."
        },
        {
            "role": "user",
            "content": """Write a Python FastAPI service that:
1. Accepts POST /summarize with JSON body {"text": "...", "max_length": 150}
2. Uses a local Ollama instance to summarize the text
3. Returns {"summary": "...", "original_length": N, "summary_length": N}
4. Includes proper error handling, input validation with Pydantic, and async support"""
        }
    ],
    temperature=0.1,  # Low temperature for code
    max_tokens=3000
)

print(response.choices[0].message.content)
```
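Model replies usually wrap generated code in markdown fences, so to use the output programmatically you first need to pull those blocks out. A simple sketch; the fence format is an assumption about how the model typically formats its answers:

````python
import re
from typing import Optional

def extract_code_blocks(markdown: str, language: Optional[str] = None) -> list[str]:
    """Pull fenced code blocks out of a markdown-formatted model reply."""
    pattern = r"```(\w*)\n(.*?)```"
    blocks = []
    for lang, body in re.findall(pattern, markdown, flags=re.DOTALL):
        # If a language filter is given, keep only matching fences
        if language is None or lang == language:
            blocks.append(body.strip())
    return blocks

reply = "Here is the service:\n```python\nprint('hello')\n```\nDone."
print(extract_code_blocks(reply, "python"))  # -> ["print('hello')"]
````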

````python
# Code review and debugging (reuses the client from the previous example)
code_to_review = """
def find_duplicates(lst):
    seen = []
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        seen.append(item)
    return duplicates
"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": f"Review this Python code for performance issues and suggest improvements:\n\n```python\n{code_to_review}\n```"
        }
    ],
    temperature=0.3
)

print(response.choices[0].message.content)
````

## Qwen2.5-Coder

Optimized for code generation:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --host 0.0.0.0

# Using Ollama
ollama run qwen2.5-coder:7b
```

```python
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Returns the median value
3. Handles empty lists gracefully
Include type hints and docstrings."""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
)

print(response.choices[0].message.content)
```

## Qwen2.5-Math

Specialized for mathematical reasoning:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Math-7B-Instruct \
    --host 0.0.0.0
```

```python
prompt = """Solve step by step:
Find all values of x where: x^3 - 6x^2 + 11x - 6 = 0"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1
)

print(response.choices[0].message.content)
```
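Math-model answers are easy to sanity-check programmatically: substitute the claimed roots back into the polynomial. For this cubic the factorization is (x - 1)(x - 2)(x - 3), so the roots are 1, 2, and 3:

```python
def p(x: float) -> float:
    """The polynomial from the prompt: x^3 - 6x^2 + 11x - 6."""
    return x**3 - 6 * x**2 + 11 * x - 6

# x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
roots = [1, 2, 3]
print([p(r) for r in roots])  # -> [0, 0, 0]
```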

## Multilingual Support

Qwen2.5 supports 29 languages:

```python
# Chinese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "用中文解释什么是人工智能"}]
)

# Japanese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "人工知能について日本語で説明してください"}]
)

# Korean
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "인공지능에 대해 한국어로 설명해주세요"}]
)
```

## Long Context (128K)

```python
# Read a long document
with open("long_document.txt", "r") as f:
    document = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{document}"}
    ],
    max_tokens=2000
)
```
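Even with a 128K window, a large document plus the response budget can overflow the context. A rough guard, assuming ~4 characters per token (a common heuristic for English text, not an exact tokenizer count):

```python
def truncate_to_budget(text: str, max_context: int = 131072,
                       reserved_tokens: int = 2000,
                       chars_per_token: int = 4) -> str:
    """Trim text so prompt + response roughly fits the context window."""
    # Reserve room for the model's reply, then convert tokens to characters
    budget_chars = (max_context - reserved_tokens) * chars_per_token
    return text if len(text) <= budget_chars else text[:budget_chars]

doc = "x" * 1_000_000
print(len(truncate_to_budget(doc)))  # -> 516288
```

For exact counts, tokenize with the model's own tokenizer instead of the character heuristic.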

## Quantization

### GGUF with Ollama

```bash
# 4-bit quantized
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama pull qwen2.5:72b-instruct-q4_K_M   # 72B in 4-bit (~48GB)

# 8-bit quantized
ollama pull qwen2.5:7b-instruct-q8_0

# Coder variants
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
```

### AWQ with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
```

### GGUF with llama.cpp

```bash
# Download GGUF
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

# Run server
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35
```

## Multi-GPU Setup

### Tensor Parallelism

```bash
# 72B on 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768

# 32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 2

# Coder-32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384
```

## Performance

### Throughput (tokens/sec)

| Model             | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| ----------------- | -------- | -------- | --------- | --------- |
| Qwen2.5-0.5B      | 250      | 320      | 380       | 400       |
| Qwen2.5-3B        | 150      | 200      | 250       | 280       |
| Qwen2.5-7B        | 75       | 100      | 130       | 150       |
| Qwen2.5-7B Q4     | 110      | 140      | 180       | 200       |
| Qwen2.5-14B       | -        | 55       | 70        | 85        |
| Qwen2.5-32B       | -        | -        | 35        | 50        |
| Qwen2.5-72B       | -        | -        | 20 (2x)   | 40 (2x)   |
| Qwen2.5-72B Q4    | -        | -        | -         | 55 (2x)   |
| Qwen2.5-Coder-32B | -        | -        | 32        | 48        |

*(2x) = running across two GPUs with tensor parallelism.*

### Time to First Token (TTFT)

| Model | RTX 4090 | A100 40GB  | A100 80GB  |
| ----- | -------- | ---------- | ---------- |
| 7B    | 60ms     | 40ms       | 35ms       |
| 14B   | 120ms    | 80ms       | 60ms       |
| 32B   | -        | 200ms      | 140ms      |
| 72B   | -        | 400ms (2x) | 280ms (2x) |

### Context Length vs VRAM (7B)

| Context | FP16 | Q8   | Q4   |
| ------- | ---- | ---- | ---- |
| 8K      | 16GB | 10GB | 6GB  |
| 32K     | 24GB | 16GB | 10GB |
| 64K     | 40GB | 26GB | 16GB |
| 128K    | 72GB | 48GB | 28GB |

## Benchmarks

| Model             | MMLU  | HumanEval | GSM8K | MATH  | LiveCodeBench |
| ----------------- | ----- | --------- | ----- | ----- | ------------- |
| Qwen2.5-7B        | 74.2% | 75.6%     | 85.4% | 55.2% | 42.1%         |
| Qwen2.5-14B       | 79.7% | 81.1%     | 89.5% | 65.8% | 51.3%         |
| Qwen2.5-32B       | 83.3% | 84.2%     | 91.2% | 72.1% | 60.7%         |
| Qwen2.5-72B       | 86.1% | 86.2%     | 93.2% | 79.5% | 67.4%         |
| Qwen2.5-Coder-7B  | 72.8% | 88.4%     | 86.1% | 58.4% | 64.2%         |
| Qwen2.5-Coder-32B | 83.1% | **92.7%** | 92.3% | 76.8% | **78.5%**     |

## Docker Compose

```yaml
version: '3.8'

services:
  qwen:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | Best For              |
| ------------- | ----------- | --------------------- |
| RTX 3090 24GB | \~$0.06     | 7B models             |
| RTX 4090 24GB | \~$0.10     | 7B-14B models         |
| A100 40GB     | \~$0.17     | 14B-32B models        |
| A100 80GB     | \~$0.25     | 32B models, Coder-32B |
| 2x A100 80GB  | \~$0.50     | 72B models            |
| 4x A100 80GB  | \~$1.00     | 72B max context       |

*Prices vary by provider. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads
* Pay with **CLORE** tokens
* Start with smaller models (7B) for testing
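You can estimate a job's cost from the throughput and pricing tables above. For example, generating 1M tokens with Qwen2.5-7B on an RTX 4090 (~100 tok/s, ~$0.10/h per the tables; actual rates and throughput vary by provider):

```python
def job_cost_usd(tokens: int, tokens_per_sec: float, hourly_rate_usd: float) -> float:
    """Estimated rental cost for generating `tokens` at the given throughput."""
    hours = tokens / tokens_per_sec / 3600
    return hours * hourly_rate_usd

# 1M tokens on RTX 4090 running Qwen2.5-7B (values from the tables above)
cost = job_cost_usd(1_000_000, tokens_per_sec=100, hourly_rate_usd=0.10)
print(f"${cost:.2f}")  # -> $0.28
```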

## Troubleshooting

### Out of Memory

```bash
# Reduce context
--max-model-len 8192

# Enable memory optimization
--gpu-memory-utilization 0.85

# Use quantized model
ollama pull qwen2.5:7b-instruct-q4_K_M
```

### Slow Generation

```bash
# Enable flash attention
pip install flash-attn

# Use vLLM for better throughput
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --enable-prefix-caching
```

### Chinese Characters Display

```python
# Ensure UTF-8 encoding
import sys
sys.stdout.reconfigure(encoding='utf-8')
```

### Model Not Found

```bash
# Verify the exact model ID on https://huggingface.co/Qwen

# Common names:
# Qwen/Qwen2.5-7B-Instruct
# Qwen/Qwen2.5-72B-Instruct       ← New
# Qwen/Qwen2.5-Coder-7B-Instruct
# Qwen/Qwen2.5-Coder-32B-Instruct ← New
# Qwen/Qwen2.5-Math-7B-Instruct
```

## Qwen2.5 vs Others

| Feature      | Qwen2.5-7B | Qwen2.5-72B | Llama 3.1 70B | GPT-4o      |
| ------------ | ---------- | ----------- | ------------- | ----------- |
| Context      | 128K       | 128K        | 128K          | 128K        |
| Multilingual | Excellent  | Excellent   | Good          | Excellent   |
| Code         | Excellent  | Excellent   | Good          | Excellent   |
| Math         | Excellent  | Excellent   | Good          | Excellent   |
| Chinese      | Excellent  | Excellent   | Poor          | Good        |
| License      | Apache 2.0 | Apache 2.0  | Llama 3.1 Community | Proprietary |
| Cost         | Free       | Free        | Free          | Paid API    |

**Use Qwen2.5 when:**

* Chinese language support needed
* Math/code tasks are priority
* Long context is required
* Want Apache 2.0 license
* Need best open-source code model (Coder-32B)

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production deployment
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy local setup
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Larger general-purpose MoE model
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Open-source reasoning model
* [Fine-tune LLM](https://docs.clore.ai/guides/training/finetune-llm) - Custom training
