Qwen2.5

Run Alibaba's Qwen2.5 family of models - powerful multilingual LLMs with excellent code and math capabilities on CLORE.AI GPUs.


Why Qwen2.5?

  • Versatile sizes - 0.5B to 72B parameters

  • Multilingual - 29 languages including Chinese

  • Long context - Up to 128K tokens

  • Specialized variants - Coder, Math editions

  • Open source - Apache 2.0 license

Quick Deploy on CLORE.AI

Docker Image:

vllm/vllm-openai:latest

Ports:

22/tcp
8000/http

Command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

  1. Go to My Orders page

  2. Click on your order

  3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.

Verify It's Working

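Once the container is up, query the models endpoint (part of vLLM's OpenAI-compatible server) to confirm the service is live. Replace YOUR_HTTP_PUB_URL with the URL from My Orders:

```shell
# Should return a model list containing Qwen/Qwen2.5-7B-Instruct
curl https://YOUR_HTTP_PUB_URL/v1/models

# Minimal chat completion smoke test
curl https://YOUR_HTTP_PUB_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```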

Qwen3 Reasoning Mode


New in Qwen3: Some Qwen3 models support a reasoning mode that shows the model's thought process in <think> tags before the final answer.

When you query a Qwen3 model through vLLM, the response content may begin with a <think>...</think> block containing the model's reasoning, followed by the final answer. Serving works with the same vLLM command shown above, pointed at a Qwen3 checkpoint such as Qwen/Qwen3-8B.
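If you only want the final answer, the reasoning block can be stripped client-side. A minimal sketch (the `<think>` tag format follows the description above; exact output varies by model and vLLM version):

```python
import re

def split_reasoning(text):
    """Separate <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>The user greeted me, so I reply politely.</think>Hello! How can I help?"
)
```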

Model Variants

Base Models

| Model | Parameters | VRAM (FP16) | Context |
|-------|------------|-------------|---------|
| Qwen2.5-0.5B | 0.5B | 2GB | 32K |
| Qwen2.5-1.5B | 1.5B | 4GB | 32K |
| Qwen2.5-3B | 3B | 8GB | 32K |
| Qwen2.5-7B | 7B | 16GB | 128K |
| Qwen2.5-14B | 14B | 32GB | 128K |
| Qwen2.5-32B | 32B | 70GB | 128K |
| Qwen2.5-72B | 72B | 150GB | 128K |

Specialized Variants

| Model | Focus | Best For |
|-------|-------|----------|
| Qwen2.5-Coder | Code | Programming, debugging |
| Qwen2.5-Math | Mathematics | Calculations, proofs |
| Qwen2.5-Instruct | Chat | General assistant |

Hardware Requirements

| Model | Minimum GPU | Recommended |
|-------|-------------|-------------|
| 0.5B-3B | RTX 3060 12GB | RTX 3080 |
| 7B | RTX 3090 24GB | RTX 4090 |
| 14B | A100 40GB | A100 80GB |
| 32B | A100 80GB | 2x A100 40GB |
| 72B | 2x A100 80GB | 4x A100 80GB |

Installation

Using Ollama
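Qwen2.5 is available in the Ollama model library; a typical pull-and-run (tag names follow Ollama's qwen2.5 listing):

```shell
# Pull and chat with the 7B instruct model
ollama run qwen2.5:7b

# Other sizes use the same tag pattern
ollama run qwen2.5:0.5b
ollama run qwen2.5:72b
```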

Using Transformers
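A minimal Hugging Face Transformers example, following the pattern from the Qwen2.5 model card (requires a recent transformers release and enough VRAM for the model you pick):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
# Apply Qwen's chat template, then generate
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```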

API Usage

OpenAI-Compatible API
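The vLLM server speaks the OpenAI chat-completions protocol, so the standard `openai` Python client works out of the box. A dependency-free sketch using only the standard library (the payload shape is the OpenAI format; replace the URL with your http_pub address):

```python
import json
import urllib.request

BASE_URL = "https://YOUR_HTTP_PUB_URL"  # your http_pub URL from My Orders

def build_payload(prompt, model="Qwen/Qwen2.5-7B-Instruct", max_tokens=256):
    """Request body in the OpenAI chat-completions format vLLM expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (uncomment once your server is reachable):
# print(chat("Explain tensor parallelism in one sentence."))
```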

Streaming
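With `"stream": true` in the request body, the server sends Server-Sent Events where each `data:` line carries a JSON chunk with a `delta`. A hedged sketch of parsing those lines (chunk layout follows the OpenAI streaming format vLLM emits):

```python
import json

def parse_sse_line(line):
    """Return the text delta from one 'data:' line, or None for non-content lines."""
    if not line.startswith("data: ") or line.strip() == "data: [DONE]":
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content")

# Example chunk as the server would send it
sample = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_line(sample))  # Hello
```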

cURL
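The same endpoint from the command line (substitute your http_pub URL):

```shell
curl https://YOUR_HTTP_PUB_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is CLORE.AI?"}],
    "max_tokens": 200
  }'
```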

Qwen2.5-Coder

Optimized for code generation:
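To serve the Coder variant, swap the model name in the same vLLM command (Qwen/Qwen2.5-Coder-7B-Instruct is the instruct-tuned coder checkpoint on Hugging Face):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```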

Qwen2.5-Math

Specialized for mathematical reasoning:
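The Math variant is served the same way (Qwen/Qwen2.5-Math-7B-Instruct on Hugging Face):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Math-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```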

Multilingual Support

Qwen2.5 supports 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic.

Long Context (128K)
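The 7B and larger models accept up to 128K tokens, but the KV cache at that length needs substantial VRAM (see the table below). A typical launch using standard vLLM options; note that the Qwen model card says contexts beyond 32K may require enabling YaRN rope scaling in the model config:

```shell
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 131072 \
    --host 0.0.0.0 \
    --port 8000
```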

Quantization

GGUF with Ollama
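Ollama serves quantized GGUF builds via tags (tag names follow Ollama's library listing):

```shell
# 4-bit quantized 7B (roughly 5GB)
ollama run qwen2.5:7b-instruct-q4_K_M

# 8-bit for higher quality
ollama run qwen2.5:7b-instruct-q8_0
```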

AWQ with vLLM
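Qwen publishes AWQ-quantized checkpoints on Hugging Face; point vLLM at one and set the quantization flag:

```shell
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq \
    --host 0.0.0.0 \
    --port 8000
```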

GGUF with llama.cpp
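Qwen also publishes official GGUF repos; a sketch of downloading and serving with llama.cpp's OpenAI-compatible server (the exact .gguf filename may differ - check the repo's file listing):

```shell
# Download a quantized GGUF build
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF \
    qwen2.5-7b-instruct-q4_k_m.gguf --local-dir .

# Serve it
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf --host 0.0.0.0 --port 8000
```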

Multi-GPU Setup

Tensor Parallelism
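vLLM shards the model across GPUs with a single flag; the value must match the number of GPUs in your order:

```shell
# Split Qwen2.5-72B across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 \
    --port 8000
```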

Performance

Throughput (tokens/sec)

| Model | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
|-------|----------|----------|-----------|-----------|
| Qwen2.5-0.5B | 250 | 320 | 380 | 400 |
| Qwen2.5-3B | 150 | 200 | 250 | 280 |
| Qwen2.5-7B | 75 | 100 | 130 | 150 |
| Qwen2.5-7B Q4 | 110 | 140 | 180 | 200 |
| Qwen2.5-14B | - | 55 | 70 | 85 |
| Qwen2.5-32B | - | - | 35 | 50 |
| Qwen2.5-72B | - | - | 20 (2x) | 40 (2x) |

Time to First Token (TTFT)

| Model | RTX 4090 | A100 40GB | A100 80GB |
|-------|----------|-----------|-----------|
| 7B | 60ms | 40ms | 35ms |
| 14B | 120ms | 80ms | 60ms |
| 32B | - | 200ms | 140ms |
| 72B | - | 400ms (2x) | 280ms (2x) |

Context Length vs VRAM (7B)

| Context | FP16 | Q8 | Q4 |
|---------|------|----|----|
| 8K | 16GB | 10GB | 6GB |
| 32K | 24GB | 16GB | 10GB |
| 64K | 40GB | 26GB | 16GB |
| 128K | 72GB | 48GB | 28GB |

Benchmarks

| Model | MMLU | HumanEval | GSM8K | MATH |
|-------|------|-----------|-------|------|
| Qwen2.5-7B | 74.2% | 75.6% | 85.4% | 55.2% |
| Qwen2.5-14B | 79.7% | 81.1% | 89.5% | 65.8% |
| Qwen2.5-32B | 83.3% | 84.2% | 91.2% | 72.1% |
| Qwen2.5-72B | 86.1% | 86.2% | 93.2% | 79.5% |

Docker Compose
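A sketch of an equivalent compose file (assumes the host has the NVIDIA container toolkit; the vllm/vllm-openai image's entrypoint is the API server, so `command` holds only its arguments - adjust model and GPU count to your order):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```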

Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU | Hourly Rate | Best For |
|-----|-------------|----------|
| RTX 3090 24GB | ~$0.06 | 7B models |
| RTX 4090 24GB | ~$0.10 | 7B-14B models |
| A100 40GB | ~$0.17 | 14B-32B models |
| A100 80GB | ~$0.25 | 32B models |
| 2x A100 80GB | ~$0.50 | 72B models |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates.

Save money:

  • Use Spot market for flexible workloads

  • Pay with CLORE tokens

  • Start with smaller models (7B) for testing

Troubleshooting

Out of Memory
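Common mitigations, all standard vLLM flags - shrink the KV cache by capping context length, or switch to a quantized checkpoint:

```shell
# Cap context length and leave some VRAM headroom
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

# Or use the AWQ-quantized checkpoint
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq
```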

Slow Generation
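First confirm the model is actually running on the GPU and check how busy it is:

```shell
# GPU utilization and memory; a healthy vLLM process pins most of the VRAM
nvidia-smi

# If utilization stays low, try a quantized model, a smaller context
# length, or fewer concurrent requests
```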

Chinese Characters Display
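Garbled Chinese output in a terminal is usually a locale issue rather than a model issue; a hedged fix is to force UTF-8 in the shell or container:

```shell
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
```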

Model Not Found
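Check the exact Hugging Face repo id (ids are case-sensitive and follow the Qwen/Qwen2.5-<size>-<variant> pattern), and pre-download if your instance has slow first-start times:

```shell
huggingface-cli download Qwen/Qwen2.5-7B-Instruct

# Then point vLLM at the same id
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct
```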

Qwen2.5 vs Others

| Feature | Qwen2.5-7B | Llama 3.1 8B | Mistral 7B |
|---------|------------|--------------|------------|
| Context | 128K | 128K | 32K |
| Multilingual | Excellent | Good | Good |
| Code | Excellent | Good | Good |
| Math | Excellent | Good | Good |
| Chinese | Excellent | Poor | Poor |
| License | Apache 2.0 | Llama 3.1 Community | Apache 2.0 |

Use Qwen2.5 when:

  • Chinese language support needed

  • Math/code tasks are priority

  • Long context is required

  • Want Apache 2.0 license

Next Steps
