# Qwen2.5

Run Alibaba's Qwen2.5 family of models on CLORE.AI GPUs - powerful multilingual LLMs with strong code and math capabilities.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Qwen2.5?

* **Versatile sizes** - 0.5B to 72B parameters
* **Multilingual** - 29 languages including Chinese
* **Long context** - Up to 128K tokens
* **Specialized variants** - Coder, Math editions
* **Open source** - Apache 2.0 license

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

### Verify It's Working

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models
```

{% hint style="warning" %}
If you get HTTP 502, wait 5-15 minutes - the model is still downloading from HuggingFace.
{% endhint %}
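Instead of re-running `curl` by hand while the model downloads, you can poll the health endpoint from a script. A minimal sketch using only the standard library; the hostname is a placeholder for your own `http_pub` URL:

```python
import time
import urllib.request
import urllib.error

def check_health(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(url: str, attempts: int = 60, delay: float = 15.0,
                     probe=check_health) -> bool:
    """Poll `url` until it responds, up to `attempts` times."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False

if __name__ == "__main__":
    # Replace with the http_pub hostname from My Orders
    ready = wait_until_ready("https://your-http-pub.clorecloud.net/health")
    print("ready" if ready else "timed out")
```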

## Qwen3 Reasoning Mode

{% hint style="info" %}
**New in Qwen3:** Some Qwen3 models support a reasoning mode that shows the model's thought process in `<think>` tags before the final answer.
{% endhint %}

When using Qwen3 models via vLLM, responses may include reasoning:

```json
{
  "content": "<think>\nLet me think about this step by step...\n</think>\n\nThe answer is..."
}
```

To use Qwen3 with reasoning:

```bash
vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000
```
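If you feed Qwen3 output to downstream code or a UI, you may want to strip the reasoning and keep only the final answer. A small helper, assuming the `<think>` tag format shown above:

```python
import re

def strip_think(content: str) -> str:
    """Remove <think>...</think> blocks from a model response."""
    # DOTALL so reasoning spanning multiple lines is matched
    cleaned = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    return cleaned.strip()

raw = "<think>\nLet me think about this step by step...\n</think>\n\nThe answer is 42."
print(strip_think(raw))  # -> The answer is 42.
```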

## Model Variants

### Base Models

| Model                | Parameters | VRAM (FP16) | Context | Notes               |
| -------------------- | ---------- | ----------- | ------- | ------------------- |
| Qwen2.5-0.5B         | 0.5B       | 2GB         | 32K     | Edge/testing        |
| Qwen2.5-1.5B         | 1.5B       | 4GB         | 32K     | Very light          |
| Qwen2.5-3B           | 3B         | 8GB         | 32K     | Budget              |
| Qwen2.5-7B           | 7B         | 16GB        | 128K    | Balanced            |
| Qwen2.5-14B          | 14B        | 32GB        | 128K    | High quality        |
| Qwen2.5-32B          | 32B        | 70GB        | 128K    | Very high quality   |
| Qwen2.5-72B          | 72B        | 150GB       | 128K    | **Best quality**    |
| Qwen2.5-72B-Instruct | 72B        | 150GB       | 128K    | Chat/instruct tuned |

### Specialized Variants

| Model                      | Focus       | Best For               | VRAM (FP16) |
| -------------------------- | ----------- | ---------------------- | ----------- |
| Qwen2.5-Coder-7B-Instruct  | Code        | Programming, debugging | 16GB        |
| Qwen2.5-Coder-14B-Instruct | Code        | Complex code tasks     | 32GB        |
| Qwen2.5-Coder-32B-Instruct | Code        | **Best code model**    | 70GB        |
| Qwen2.5-Math-7B-Instruct   | Mathematics | Calculations, proofs   | 16GB        |
| Qwen2.5-Math-72B-Instruct  | Mathematics | Research-grade math    | 150GB       |
| Qwen2.5-Instruct           | Chat        | General assistant      | varies      |

## Hardware Requirements

| Model     | Minimum GPU   | Recommended  | VRAM (Q4) |
| --------- | ------------- | ------------ | --------- |
| 0.5B-3B   | RTX 3060 12GB | RTX 3080     | 2-6GB     |
| 7B        | RTX 3090 24GB | RTX 4090     | 6GB       |
| 14B       | A100 40GB     | A100 80GB    | 12GB      |
| 32B       | A100 80GB     | 2x A100 40GB | 22GB      |
| 72B       | 2x A100 80GB  | 4x A100 80GB | 48GB      |
| Coder-32B | A100 80GB     | 2x A100 40GB | 22GB      |

## Installation

### Using vLLM (Recommended)

```bash
pip install vllm==0.7.3

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

### Using Ollama

```bash
# Standard models
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b       # New: largest Qwen2.5

# Specialized
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b  # New: best code model

# Run chat
ollama run qwen2.5:7b
```

### Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## API Usage

### OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "What is Python?"}
        ]
    }'
```

## Qwen2.5-72B-Instruct

The flagship Qwen2.5 model — the largest and most capable in the family. It competes with GPT-4 on many benchmarks and is fully open-source under Apache 2.0.

### Running via vLLM (Multi-GPU)

```bash
# 4x A100 80GB setup
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

# AWQ quantized — runs on 2x A100 80GB
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768
```

### Running via Ollama

```bash
# Pull 72B model (requires 48GB+ VRAM for Q4)
ollama pull qwen2.5:72b

# Run interactive session
ollama run qwen2.5:72b

# API access
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:72b",
  "messages": [{"role": "user", "content": "Analyze this complex scenario..."}],
  "stream": false
}'
```

### Python Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The 72B model excels at complex analytical tasks
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert analyst. Provide detailed, nuanced responses."
        },
        {
            "role": "user",
            "content": """Compare the architectural differences between transformer and 
            state space models (SSMs) for sequence modeling. Include efficiency tradeoffs."""
        }
    ],
    temperature=0.7,
    max_tokens=2000
)

print(response.choices[0].message.content)
```

## Qwen2.5-Coder-32B-Instruct

One of the strongest open-source code models available. Qwen2.5-Coder-32B-Instruct matches or exceeds GPT-4o on many coding benchmarks and supports 40+ programming languages.

### Running via vLLM

```bash
# Single A100 80GB
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9

# Dual RTX 4090 (24GB each = 48GB total) with AWQ 4-bit quantization
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq
```

### Running via Ollama

```bash
# Pull Coder-32B (requires ~22GB VRAM for Q4)
ollama pull qwen2.5-coder:32b

# Run
ollama run qwen2.5-coder:32b

# Test with a coding prompt
ollama run qwen2.5-coder:32b "Write a Python async web scraper using aiohttp"
```

### Code Generation Examples

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Full-stack code generation
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert software engineer. Write clean, production-ready code with proper error handling and documentation."
        },
        {
            "role": "user",
            "content": """Write a Python FastAPI service that:
1. Accepts POST /summarize with JSON body {"text": "...", "max_length": 150}
2. Uses a local Ollama instance to summarize the text
3. Returns {"summary": "...", "original_length": N, "summary_length": N}
4. Includes proper error handling, input validation with Pydantic, and async support"""
        }
    ],
    temperature=0.1,  # Low temperature for code
    max_tokens=3000
)

print(response.choices[0].message.content)
```
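Model replies usually wrap generated code in markdown fences, so to use the output programmatically you first need to pull those blocks out. A simple sketch; the fence format is an assumption about how the model typically formats its answers:

````python
import re
from typing import Optional

def extract_code_blocks(markdown: str, language: Optional[str] = None) -> list[str]:
    """Pull fenced code blocks out of a markdown-formatted model reply."""
    pattern = r"```(\w*)\n(.*?)```"
    blocks = []
    for lang, body in re.findall(pattern, markdown, flags=re.DOTALL):
        # If a language filter is given, keep only matching fences
        if language is None or lang == language:
            blocks.append(body.strip())
    return blocks

reply = "Here is the service:\n```python\nprint('hello')\n```\nDone."
print(extract_code_blocks(reply, "python"))  # -> ["print('hello')"]
````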

````python
# Code review and debugging (reuses the client from the previous example)
code_to_review = """
def find_duplicates(lst):
    seen = []
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        seen.append(item)
    return duplicates
"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": f"Review this Python code for performance issues and suggest improvements:\n\n```python\n{code_to_review}\n```"
        }
    ],
    temperature=0.3
)

print(response.choices[0].message.content)
````

## Qwen2.5-Coder

Optimized for code generation:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --host 0.0.0.0

# Using Ollama
ollama run qwen2.5-coder:7b
```

```python
prompt = """Write a Python function that:
1. Takes a list of numbers
2. Returns the median value
3. Handles empty lists gracefully
Include type hints and docstrings."""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
)

print(response.choices[0].message.content)
```

## Qwen2.5-Math

Specialized for mathematical reasoning:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Math-7B-Instruct \
    --host 0.0.0.0
```

```python
prompt = """Solve step by step:
Find all values of x where: x^3 - 6x^2 + 11x - 6 = 0"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1
)

print(response.choices[0].message.content)
```
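Math-model answers are easy to sanity-check programmatically: substitute the claimed roots back into the polynomial. For this cubic the factorization is (x - 1)(x - 2)(x - 3), so the roots are 1, 2, and 3:

```python
def p(x: float) -> float:
    """The polynomial from the prompt: x^3 - 6x^2 + 11x - 6."""
    return x**3 - 6 * x**2 + 11 * x - 6

# x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
roots = [1, 2, 3]
print([p(r) for r in roots])  # -> [0, 0, 0]
```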

## Multilingual Support

Qwen2.5 supports 29 languages:

```python
# Chinese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "用中文解释什么是人工智能"}]
)

# Japanese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "人工知能について日本語で説明してください"}]
)

# Korean
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "인공지능에 대해 한국어로 설명해주세요"}]
)
```

## Long Context (128K)

```python
# Read a long document
with open("long_document.txt", "r") as f:
    document = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{document}"}
    ],
    max_tokens=2000
)
```
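Even with a 128K window, a large document plus the response budget can overflow the context. A rough guard, assuming ~4 characters per token (a common heuristic for English text, not an exact tokenizer count):

```python
def truncate_to_budget(text: str, max_context: int = 131072,
                       reserved_tokens: int = 2000,
                       chars_per_token: int = 4) -> str:
    """Trim text so prompt + response roughly fits the context window."""
    # Reserve room for the model's reply, then convert tokens to characters
    budget_chars = (max_context - reserved_tokens) * chars_per_token
    return text if len(text) <= budget_chars else text[:budget_chars]

doc = "x" * 1_000_000
print(len(truncate_to_budget(doc)))  # -> 516288
```

For exact counts, tokenize with the model's own tokenizer instead of the character heuristic.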

## Quantization

### GGUF with Ollama

```bash
# 4-bit quantized
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama pull qwen2.5:72b-instruct-q4_K_M   # 72B in 4-bit (~48GB)

# 8-bit quantized
ollama pull qwen2.5:7b-instruct-q8_0

# Coder variants
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
```

### AWQ with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
```

### GGUF with llama.cpp

```bash
# Download GGUF
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

# Run server
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35
```

## Multi-GPU Setup

### Tensor Parallelism

```bash
# 72B on 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768

# 32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 2

# Coder-32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384
```

## Performance

### Throughput (tokens/sec)

| Model             | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| ----------------- | -------- | -------- | --------- | --------- |
| Qwen2.5-0.5B      | 250      | 320      | 380       | 400       |
| Qwen2.5-3B        | 150      | 200      | 250       | 280       |
| Qwen2.5-7B        | 75       | 100      | 130       | 150       |
| Qwen2.5-7B Q4     | 110      | 140      | 180       | 200       |
| Qwen2.5-14B       | -        | 55       | 70        | 85        |
| Qwen2.5-32B       | -        | -        | 35        | 50        |
| Qwen2.5-72B       | -        | -        | 20 (2x)   | 40 (2x)   |
| Qwen2.5-72B Q4    | -        | -        | -         | 55 (2x)   |
| Qwen2.5-Coder-32B | -        | -        | 32        | 48        |

*(2x) = running across two GPUs with tensor parallelism.*

### Time to First Token (TTFT)

| Model | RTX 4090 | A100 40GB  | A100 80GB  |
| ----- | -------- | ---------- | ---------- |
| 7B    | 60ms     | 40ms       | 35ms       |
| 14B   | 120ms    | 80ms       | 60ms       |
| 32B   | -        | 200ms      | 140ms      |
| 72B   | -        | 400ms (2x) | 280ms (2x) |

### Context Length vs VRAM (7B)

| Context | FP16 | Q8   | Q4   |
| ------- | ---- | ---- | ---- |
| 8K      | 16GB | 10GB | 6GB  |
| 32K     | 24GB | 16GB | 10GB |
| 64K     | 40GB | 26GB | 16GB |
| 128K    | 72GB | 48GB | 28GB |

## Benchmarks

| Model             | MMLU  | HumanEval | GSM8K | MATH  | LiveCodeBench |
| ----------------- | ----- | --------- | ----- | ----- | ------------- |
| Qwen2.5-7B        | 74.2% | 75.6%     | 85.4% | 55.2% | 42.1%         |
| Qwen2.5-14B       | 79.7% | 81.1%     | 89.5% | 65.8% | 51.3%         |
| Qwen2.5-32B       | 83.3% | 84.2%     | 91.2% | 72.1% | 60.7%         |
| Qwen2.5-72B       | 86.1% | 86.2%     | 93.2% | 79.5% | 67.4%         |
| Qwen2.5-Coder-7B  | 72.8% | 88.4%     | 86.1% | 58.4% | 64.2%         |
| Qwen2.5-Coder-32B | 83.1% | **92.7%** | 92.3% | 76.8% | **78.5%**     |

## Docker Compose

```yaml
version: '3.8'

services:
  qwen:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | Hourly Rate | Best For              |
| ------------- | ----------- | --------------------- |
| RTX 3090 24GB | \~$0.06     | 7B models             |
| RTX 4090 24GB | \~$0.10     | 7B-14B models         |
| A100 40GB     | \~$0.17     | 14B-32B models        |
| A100 80GB     | \~$0.25     | 32B models, Coder-32B |
| 2x A100 80GB  | \~$0.50     | 72B models            |
| 4x A100 80GB  | \~$1.00     | 72B max context       |

*Prices vary by provider. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads
* Pay with **CLORE** tokens
* Start with smaller models (7B) for testing
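You can estimate a job's cost from the throughput and pricing tables above. For example, generating 1M tokens with Qwen2.5-7B on an RTX 4090 (~100 tok/s, ~$0.10/h per the tables; actual rates and throughput vary by provider):

```python
def job_cost_usd(tokens: int, tokens_per_sec: float, hourly_rate_usd: float) -> float:
    """Estimated rental cost for generating `tokens` at the given throughput."""
    hours = tokens / tokens_per_sec / 3600
    return hours * hourly_rate_usd

# 1M tokens on RTX 4090 running Qwen2.5-7B (values from the tables above)
cost = job_cost_usd(1_000_000, tokens_per_sec=100, hourly_rate_usd=0.10)
print(f"${cost:.2f}")  # -> $0.28
```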

## Troubleshooting

### Out of Memory

```bash
# Reduce context
--max-model-len 8192

# Enable memory optimization
--gpu-memory-utilization 0.85

# Use quantized model
ollama pull qwen2.5:7b-instruct-q4_K_M
```

### Slow Generation

```bash
# Enable flash attention
pip install flash-attn

# Use vLLM for better throughput
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --enable-prefix-caching
```

### Chinese Characters Display

```python
# Ensure UTF-8 encoding
import sys
sys.stdout.reconfigure(encoding='utf-8')
```

### Model Not Found

```bash
# Verify the exact model ID on https://huggingface.co/Qwen

# Common names:
# Qwen/Qwen2.5-7B-Instruct
# Qwen/Qwen2.5-72B-Instruct       ← New
# Qwen/Qwen2.5-Coder-7B-Instruct
# Qwen/Qwen2.5-Coder-32B-Instruct ← New
# Qwen/Qwen2.5-Math-7B-Instruct
```

## Qwen2.5 vs Others

| Feature      | Qwen2.5-7B | Qwen2.5-72B | Llama 3.1 70B | GPT-4o      |
| ------------ | ---------- | ----------- | ------------- | ----------- |
| Context      | 128K       | 128K        | 128K          | 128K        |
| Multilingual | Excellent  | Excellent   | Good          | Excellent   |
| Code         | Excellent  | Excellent   | Good          | Excellent   |
| Math         | Excellent  | Excellent   | Good          | Excellent   |
| Chinese      | Excellent  | Excellent   | Poor          | Good        |
| License      | Apache 2.0 | Apache 2.0  | Llama 3.1 Community | Proprietary |
| Cost         | Free       | Free        | Free          | Paid API    |

**Use Qwen2.5 when:**

* Chinese language support needed
* Math/code tasks are priority
* Long context is required
* Want Apache 2.0 license
* Need best open-source code model (Coder-32B)

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production deployment
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy local setup
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Larger general-purpose MoE model
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Open-source reasoning model
* [Fine-tune LLM](https://docs.clore.ai/guides/training/finetune-llm) - Custom training
