Qwen2.5

Run Alibaba's Qwen2.5 multilingual LLMs on Clore.ai GPUs

Run Alibaba's Qwen2.5 family of models, powerful multilingual LLMs with excellent code and math capabilities, on CLORE.AI GPUs.


Why Qwen2.5?

  • Versatile sizes - 0.5B to 72B parameters

  • Multilingual - 29 languages including Chinese

  • Long context - Up to 128K tokens

  • Specialized variants - Coder, Math editions

  • Open source - Apache 2.0 license

Quick Deploy on CLORE.AI

Docker Image:

vllm/vllm-openai:latest

Ports:

22/tcp
8000/http

Command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

  1. Go to My Orders page

  2. Click on your order

  3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.

Verify It's Working
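One quick check, sketched here with Python's standard library: list the models the server reports at the OpenAI-compatible /v1/models endpoint. `YOUR_HTTP_PUB_URL` is a placeholder for the http_pub URL from My Orders.

```python
import json
import urllib.request

def model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str) -> list:
    """Query the server and return the model IDs it is serving."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))

# Usage (with your deployment):
#   print(list_models("https://YOUR_HTTP_PUB_URL"))
# Expect the model you deployed, e.g. Qwen/Qwen2.5-7B-Instruct
```

If the model ID comes back, the server is up and ready for chat requests.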


Qwen3 Reasoning Mode


New in Qwen3: Some Qwen3 models support a reasoning mode that shows the model's thought process in <think> tags before the final answer.

When serving Qwen3 models via vLLM, responses may include this reasoning content ahead of the final answer, so clients should be prepared to parse or strip the <think> block. See the vLLM documentation for its reasoning-output options.
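As a sketch, the reasoning block can be separated from the final answer with plain string handling (assuming the response text uses the <think>…</think> format described above):

```python
import re

def split_reasoning(text: str):
    """Split a Qwen3-style response into (reasoning, answer).

    Returns ("", text) when no <think> block is present.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 is basic arithmetic.</think>\nThe answer is 4."
)
print(answer)  # The answer is 4.
```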

Model Variants

Base Models

| Model | Parameters | VRAM (FP16) | Context | Notes |
| --- | --- | --- | --- | --- |
| Qwen2.5-0.5B | 0.5B | 2GB | 32K | Edge/testing |
| Qwen2.5-1.5B | 1.5B | 4GB | 32K | Very light |
| Qwen2.5-3B | 3B | 8GB | 32K | Budget |
| Qwen2.5-7B | 7B | 16GB | 128K | Balanced |
| Qwen2.5-14B | 14B | 32GB | 128K | High quality |
| Qwen2.5-32B | 32B | 70GB | 128K | Very high quality |
| Qwen2.5-72B | 72B | 150GB | 128K | Best quality |
| Qwen2.5-72B-Instruct | 72B | 150GB | 128K | Chat/instruct tuned |

Specialized Variants

| Model | Focus | Best For | VRAM (FP16) |
| --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Instruct | Code | Programming, debugging | 16GB |
| Qwen2.5-Coder-14B-Instruct | Code | Complex code tasks | 32GB |
| Qwen2.5-Coder-32B-Instruct | Code | Best code model | 70GB |
| Qwen2.5-Math-7B-Instruct | Mathematics | Calculations, proofs | 16GB |
| Qwen2.5-Math-72B-Instruct | Mathematics | Research-grade math | 150GB |
| Qwen2.5-Instruct | Chat | General assistant | Varies |

Hardware Requirements

| Model | Minimum GPU | Recommended | VRAM (Q4) |
| --- | --- | --- | --- |
| 0.5B-3B | RTX 3060 12GB | RTX 3080 | 2-6GB |
| 7B | RTX 3090 24GB | RTX 4090 | 6GB |
| 14B | A100 40GB | A100 80GB | 12GB |
| 32B | A100 80GB | 2x A100 40GB | 22GB |
| 72B | 2x A100 80GB | 4x A100 80GB | 48GB |
| Coder-32B | A100 80GB | 2x A100 40GB | 22GB |

Installation

Using Ollama
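Assuming Ollama is running locally with a Qwen2.5 model pulled (e.g. `ollama pull qwen2.5:7b`), a minimal sketch against Ollama's REST API, stdlib only:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """Build a non-streaming Ollama /api/generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send one prompt to Ollama and return the completed text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (with Ollama running locally):
#   print(generate("qwen2.5:7b", "Why is the sky blue?"))
```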

Using Transformers
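A minimal sketch with Hugging Face Transformers, following the Qwen2.5 model-card quickstart. It assumes `transformers`, `torch`, and `accelerate` are installed and there is enough VRAM for the chosen checkpoint; the heavy imports are kept inside the function so the file loads without them.

```python
def build_messages(prompt: str) -> list:
    """Chat messages in the format apply_chat_template expects."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

def run(prompt: str, model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> str:
    # Requires: pip install transformers torch accelerate
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    text = tokenizer.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    new_tokens = output[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Usage (downloads the model weights on first run):
#   print(run("Give me a short introduction to large language models."))
```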

API Usage

OpenAI-Compatible API
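vLLM exposes the standard OpenAI chat-completions API, so any OpenAI-compatible client works. A dependency-free sketch with the standard library; `YOUR_HTTP_PUB_URL` is a placeholder for your http_pub URL:

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage:
#   print(chat("https://YOUR_HTTP_PUB_URL", "Qwen/Qwen2.5-7B-Instruct",
#              "Explain attention in two sentences."))
```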

Streaming
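With `"stream": true` in the request body, the server answers with Server-Sent Events: one `data: {...}` line per token chunk, terminated by `data: [DONE]`. A sketch of the client-side parsing (pure string handling, no network):

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from OpenAI-style SSE chat chunks."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example with captured stream lines:
lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(lines)))  # Hello
```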

cURL

Qwen2.5-72B-Instruct

The flagship Qwen2.5 model — the largest and most capable in the family. It competes with GPT-4 on many benchmarks and is fully open-source under Apache 2.0.

Running via vLLM (Multi-GPU)

Running via Ollama

Python Example
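A sketch of a multi-turn chat loop against the 72B endpoint, stdlib only; `YOUR_HTTP_PUB_URL` is a placeholder for your deployment:

```python
import json
import urllib.request

MODEL = "Qwen/Qwen2.5-72B-Instruct"

def add_turn(history: list, role: str, content: str) -> list:
    """Return the chat history extended with one message (non-mutating)."""
    return history + [{"role": role, "content": content}]

def chat_once(base_url: str, history: list) -> str:
    """Send the whole conversation so far and return the new reply."""
    body = json.dumps({"model": MODEL, "messages": history}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage: keep appending turns so the model sees the full conversation.
#   history = add_turn([], "user", "Summarize the Apache 2.0 license.")
#   reply = chat_once("https://YOUR_HTTP_PUB_URL", history)
#   history = add_turn(history, "assistant", reply)
```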

Qwen2.5-Coder-32B-Instruct

The best open-source code model available. Qwen2.5-Coder-32B-Instruct matches or exceeds GPT-4o on many coding benchmarks, supporting 40+ programming languages.

Running via vLLM

Running via Ollama

Code Generation Examples

Qwen2.5-Coder

Optimized for code generation, code reasoning, and code fixing.
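Besides chat, the Coder models support fill-in-the-middle (FIM) completion via special tokens sent to the /v1/completions endpoint. A sketch of building such a prompt; the token names below are those documented for Qwen2.5-Coder, but verify them against the model card for your checkpoint:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Qwen2.5-Coder fill-in-the-middle prompt.

    The model generates the code that belongs between prefix and suffix.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Ask the model to fill in the body of a function:
prompt = fim_prompt(
    prefix="def fibonacci(n):\n    ",
    suffix="\n    return a\n",
)
# Send `prompt` as the "prompt" field of a /v1/completions request.
```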

Qwen2.5-Math

Specialized for mathematical reasoning, in both English and Chinese.
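The Math models work best with chain-of-thought prompting; a sketch of building the request messages, using the system prompt given on the Qwen2.5-Math model card:

```python
MATH_SYSTEM_PROMPT = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)

def math_messages(problem: str) -> list:
    """Chat messages for Qwen2.5-Math-7B-Instruct with CoT prompting."""
    return [
        {"role": "system", "content": MATH_SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ]

msgs = math_messages("Find the value of x such that 2x + 3 = 11.")
# Send msgs as the "messages" field of a /v1/chat/completions request;
# the final answer arrives wrapped in \boxed{...}.
```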

Multilingual Support

Qwen2.5 supports 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic.

Long Context (128K)
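Before sending a large document, it helps to estimate whether it fits in the 128K window. A rough sketch using the common ~4-characters-per-token heuristic; this is an assumption for English text, and exact counts come from the model's tokenizer:

```python
CONTEXT_LIMIT = 128_000  # tokens for the 7B and larger checkpoints

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(text: str, reserve_for_output: int = 2_000) -> bool:
    """True if the prompt likely fits while leaving room for the reply."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_LIMIT

doc = "word " * 50_000  # ~250k characters
print(fits_context(doc))  # True (~62,500 estimated tokens)
```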

Quantization

GGUF with Ollama

AWQ with vLLM

GGUF with llama.cpp

Multi-GPU Setup

Tensor Parallelism

Performance

Throughput (tokens/sec)

| Model | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
| --- | --- | --- | --- | --- |
| Qwen2.5-0.5B | 250 | 320 | 380 | 400 |
| Qwen2.5-3B | 150 | 200 | 250 | 280 |
| Qwen2.5-7B | 75 | 100 | 130 | 150 |
| Qwen2.5-7B Q4 | 110 | 140 | 180 | 200 |
| Qwen2.5-14B | - | 55 | 70 | 85 |
| Qwen2.5-32B | - | - | 35 | 50 |
| Qwen2.5-72B | - | - | 20 (2x) | 40 (2x) |
| Qwen2.5-72B Q4 | - | - | - | 55 (2x) |
| Qwen2.5-Coder-32B | - | - | 32 | 48 |

Time to First Token (TTFT)

| Model | RTX 4090 | A100 40GB | A100 80GB |
| --- | --- | --- | --- |
| 7B | 60ms | 40ms | 35ms |
| 14B | 120ms | 80ms | 60ms |
| 32B | - | 200ms | 140ms |
| 72B | - | 400ms (2x) | 280ms (2x) |

Context Length vs VRAM (7B)

| Context | FP16 | Q8 | Q4 |
| --- | --- | --- | --- |
| 8K | 16GB | 10GB | 6GB |
| 32K | 24GB | 16GB | 10GB |
| 64K | 40GB | 26GB | 16GB |
| 128K | 72GB | 48GB | 28GB |

Benchmarks

| Model | MMLU | HumanEval | GSM8K | MATH | LiveCodeBench |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | 74.2% | 75.6% | 85.4% | 55.2% | 42.1% |
| Qwen2.5-14B | 79.7% | 81.1% | 89.5% | 65.8% | 51.3% |
| Qwen2.5-32B | 83.3% | 84.2% | 91.2% | 72.1% | 60.7% |
| Qwen2.5-72B | 86.1% | 86.2% | 93.2% | 79.5% | 67.4% |
| Qwen2.5-Coder-7B | 72.8% | 88.4% | 86.1% | 58.4% | 64.2% |
| Qwen2.5-Coder-32B | 83.1% | 92.7% | 92.3% | 76.8% | 78.5% |

Docker Compose
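A sketch of a docker-compose.yml matching the quick-deploy setup above. The GPU reservation syntax requires the NVIDIA Container Toolkit on the host, and the model name should be adjusted to your target; the vllm/vllm-openai image already has the API server as its entrypoint, so only the arguments are passed:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface   # cache downloaded weights
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Bring it up with `docker compose up -d` and verify the /v1/models endpoint as shown earlier.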

Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU | Hourly Rate | Best For |
| --- | --- | --- |
| RTX 3090 24GB | ~$0.06 | 7B models |
| RTX 4090 24GB | ~$0.10 | 7B-14B models |
| A100 40GB | ~$0.17 | 14B-32B models |
| A100 80GB | ~$0.25 | 32B models, Coder-32B |
| 2x A100 80GB | ~$0.50 | 72B models |
| 4x A100 80GB | ~$1.00 | 72B max context |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates.

Save money:

  • Use Spot market for flexible workloads

  • Pay with CLORE tokens

  • Start with smaller models (7B) for testing

Troubleshooting

Out of Memory

Slow Generation

Chinese Characters Display

Model Not Found

Qwen2.5 vs Others

| Feature | Qwen2.5-7B | Qwen2.5-72B | Llama 3.1 70B | GPT-4o |
| --- | --- | --- | --- | --- |
| Context | 128K | 128K | 128K | 128K |
| Multilingual | Excellent | Excellent | Good | Excellent |
| Code | Excellent | Excellent | Good | Excellent |
| Math | Excellent | Excellent | Good | Excellent |
| Chinese | Excellent | Excellent | Poor | Good |
| License | Apache 2.0 | Apache 2.0 | Llama 3.1 | Proprietary |
| Cost | Free | Free | Free | Paid API |

Use Qwen2.5 when:

  • Chinese language support needed

  • Math/code tasks are priority

  • Long context is required

  • Want Apache 2.0 license

  • Need best open-source code model (Coder-32B)

Next Steps
