Qwen2.5

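Run Alibaba's Qwen2.5 multilingual large language models on Clore.ai GPUs

Run Alibaba's Qwen2.5 model family on CLORE.AI GPUs: capable multilingual LLMs with strong code and math performance.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.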

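Why Qwen2.5?

Range of sizes - 0.5B to 72B parameters
Multilingual - 29 languages, including Chinese
Long context - up to 128K tokens
Specialized variants - Coder and Math editions
Open source - Apache 2.0 license for most sizes (the 3B and 72B models ship under Qwen-specific licenses)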

Quick deploy on CLORE.AI

Docker image:

vllm/vllm-openai:latest

Ports:

22/tcp
8000/http

Command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

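Accessing your service

After deployment, find your http_pub URL under My Orders:

Go to the My Orders page
Click your order
Find the http_pub URL (for example, abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in the examples below.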
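
As a small illustration of the substitution described above, the helper below turns an http_pub host into the base URL that an OpenAI-compatible client expects. The function name openai_base_url is hypothetical, and the /v1 suffix is assumed from the localhost examples later in this guide.

```python
def openai_base_url(http_pub_host: str) -> str:
    """Build an OpenAI-compatible base URL from a CLORE.AI http_pub host."""
    host = http_pub_host.strip()
    # Accept values pasted with or without a scheme or trailing slash.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    host = host.rstrip("/")
    if not host:
        raise ValueError("http_pub host is empty")
    return f"https://{host}/v1"


print(openai_base_url("abc123.clorecloud.net"))
```

The result can be passed as base_url wherever the examples use http://localhost:8000/v1.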

Verify it's working

# Check that the service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

If you receive HTTP 502, wait 5-15 minutes: the model is still downloading from HuggingFace.
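
Rather than retrying curl by hand, the wait can be scripted. The sketch below polls the /health endpoint until it answers 200 and gives up after 15 minutes, matching the upper end of the wait quoted above. The function names, the 30-second interval, and the injectable fetch and sleep parameters are illustrative choices, not part of vLLM or CLORE.AI.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 while the model is still loading
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(url, fetch=http_status, interval=30.0, max_wait=900.0, sleep=time.sleep):
    """Poll url until it returns 200. Returns False once max_wait seconds have passed."""
    waited = 0.0
    while True:
        if fetch(url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example call (needs a running deployment):
# wait_until_ready("https://your-http-pub.clorecloud.net/health")
```

Passing fetch and sleep as parameters keeps the loop easy to dry-run without a server.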

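Qwen3 reasoning mode

New in Qwen3: some Qwen3 models support a reasoning mode that shows the model's thinking inside <think> tags before the final answer.

When you serve a Qwen3 model through vLLM, the response may include the reasoning:

{
  "content": "<think>\nLet me think through this step by step...\n</think>\n\nThe answer is..."
}

To run Qwen3 with reasoning: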

vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000
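
If an application only needs the final answer, the <think> block can be separated on the client side. This is a minimal sketch, assuming the reasoning arrives inline in content exactly as in the sample response above; split_reasoning is a hypothetical helper name.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a response that may contain a <think> block."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think>...</think> pair is treated as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<think>\nLet me think through this step by step...\n</think>\n\nThe answer is 4."
)
print(answer)
```

Responses without a <think> block pass through unchanged, so the same helper works for Qwen2.5 output.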

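Model variants

Base models

| Model | Parameters | VRAM (FP16) | Context | Notes |
|---|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 2GB | 32K | Edge/testing |
| Qwen2.5-1.5B | 1.5B | 4GB | 32K | Very lightweight |
| Qwen2.5-3B | 3B | 8GB | 32K | Budget |
| Qwen2.5-7B | 7B | 16GB | 128K | Balanced |
| Qwen2.5-14B | 14B | 32GB | 128K | High quality |
| Qwen2.5-32B | 32B | 70GB | 128K | Very high quality |
| Qwen2.5-72B | 72B | 150GB | 128K | Best quality |
| Qwen2.5-72B-Instruct | 72B | 150GB | 128K | Chat/instruction-tuned |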
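
The FP16 VRAM figures above follow from simple arithmetic: each parameter takes 2 bytes, so the weights alone need roughly twice the parameter count in gigabytes. The sketch below uses the rounded parameter counts from the model names. The gap between its output and the listed figures is assumed here to be headroom for the KV cache and activations. weight_memory_gb is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


for name, params in [("Qwen2.5-7B", 7), ("Qwen2.5-32B", 32), ("Qwen2.5-72B", 72)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of FP16 weights")
```

Passing bits_per_param=4 gives a comparable lower bound for 4-bit quantized weights. Real quantized files are somewhat larger because of scales and mixed-precision layers.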

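Specialized variants

| Model | Focus | Best for | VRAM (FP16) |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | Code | Programming, debugging | 16GB |
| Qwen2.5-Coder-14B-Instruct | Code | Complex coding tasks | 32GB |
| Qwen2.5-Coder-32B-Instruct | Code | Strongest code model | 70GB |
| Qwen2.5-Math-7B-Instruct | Math | Calculations, proofs | 16GB |
| Qwen2.5-Math-72B-Instruct | Math | Research-level math | 150GB |
| Qwen2.5-Instruct | Chat | General-purpose assistant | Varies |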

Hardware requirements

| Model | Minimum GPU |
|---|---|

Installation

Using vLLM (recommended)

pip install vllm==0.7.3

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Using Ollama

# Standard models
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b       # new: the largest Qwen2.5

# Specialized models
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:32b  # new: the strongest code model

# Start a chat
ollama run qwen2.5:7b

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

API usage

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
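
When the full reply is needed after streaming, for logging or post-processing, the deltas can be accumulated as they arrive. This is a minimal sketch: collect_stream and fake_chunk are hypothetical names, and the stand-in chunks only imitate the attribute layout used in the loop above (choices[0].delta.content) so the snippet runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None and empty strings
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in object shaped like a streaming chunk, for a dry run."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hello"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk(", world")]
print(collect_stream(demo))
```

With a live server, the stream object returned by client.chat.completions.create(..., stream=True) can be passed in place of demo.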

cURL

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "What is Python?"}
        ]
    }'
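
The same request can be issued from Python without extra dependencies. This sketch mirrors the cURL call using only the standard library. build_chat_request and chat are hypothetical helper names, and the response shape (choices[0].message.content) is taken from the OpenAI client examples above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server):
# print(chat("http://localhost:8000/v1", "Qwen/Qwen2.5-7B-Instruct", "What is Python?"))
```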

Qwen2.5-72B-Instruct

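The flagship Qwen2.5 model: the largest and most capable in the series. It is competitive with GPT-4 on many benchmarks, and its weights are openly available under the Qwen license rather than Apache 2.0.

Running with vLLM (multi-GPU)

# 4x A100 80GB setup
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

# AWQ quantization: runs on 2x A100 80GB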
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768

Running with Ollama

# Pull the 72B model (Q4 needs 48GB+ VRAM)
ollama pull qwen2.5:72b

# Start an interactive session
ollama run qwen2.5:72b

# API access
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:72b",
  "messages": [{"role": "user", "content": "Analyze this complex scenario..."}],
  "stream": false
}'

Python example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# The 72B model does well on complex analytical tasks
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are an expert analyst. Provide detailed, nuanced answers."
        },
        {
            "role": "user",
            "content": """Compare the architectural differences between transformers and
            state space models (SSMs) for sequence modeling. Include efficiency trade-offs."""
        }
    ],
    temperature=0.7,
    max_tokens=2000
)

print(response.choices[0].message.content)

Qwen2.5-Coder-32B-Instruct

One of the strongest open-source code models available. Qwen2.5-Coder-32B-Instruct matches or exceeds GPT-4o on many coding benchmarks and supports 40+ programming languages.

Running with vLLM

# Single A100 80GB
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9

# Dual RTX 4090 (24GB each = 48GB total, using 4-bit AWQ quantization)
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --quantization awq

Running with Ollama

# Pull Coder-32B (Q4 needs ~22GB VRAM)
ollama pull qwen2.5-coder:32b

# Run it
ollama run qwen2.5-coder:32b

# Test with a coding prompt
ollama run qwen2.5-coder:32b "Write an async Python web scraper using aiohttp"

Code generation example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Full-stack code generation
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a senior software engineer. Write clean, production-ready code with proper error handling and documentation."
        },
        {
            "role": "user",
            "content": """Write a Python FastAPI service that:
1. Accepts POST /summarize with a JSON body {"text": "...", "max_length": 150}
2. Summarizes the text using a local Ollama instance
3. Returns {"summary": "...", "original_length": N, "summary_length": N}
4. Includes proper error handling, input validation with Pydantic, and async support"""
        }
    ],
    temperature=0.1,  # low temperature for code
    max_tokens=3000
)

print(response.choices[0].message.content)

# Code review and debugging
code_to_review = """
def find_duplicates(lst):
    seen = []
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        seen.append(item)
    return duplicates
"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": f"Review this Python code for performance issues and suggest improvements:\n\n```python\n{code_to_review}\n```"
        }
    ],
    temperature=0.3
)

print(response.choices[0].message.content)
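
For reference, the performance issue in find_duplicates is the list membership test, which makes the function quadratic in the input length. One fix a reviewer might suggest is sketched below. It keeps the original behavior (an item is reported each time it reappears) and assumes the items are hashable.

```python
def find_duplicates(lst):
    """Return repeated items in encounter order, using a set for O(1) membership tests."""
    seen = set()
    duplicates = []
    for item in lst:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates


print(find_duplicates([1, 2, 1, 3, 2, 1]))  # prints [1, 2, 1]
```

A known-good answer like this is useful for judging whether the model's review actually identifies the bottleneck.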

Qwen2.5-Coder

Optimized for code generation:

# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --host 0.0.0.0

# Using Ollama
ollama run qwen2.5-coder:7b

prompt = """Write a Python function that:
1. Accepts a list of numbers
2. Returns the median value
3. Handles empty lists gracefully
Include type hints and a docstring."""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
)

print(response.choices[0].message.content)

Qwen2.5-Math

A specialized model for mathematical reasoning:

# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Math-7B-Instruct \
    --host 0.0.0.0

prompt = """Solve step by step:
Find all values of x that satisfy: x^3 - 6x^2 + 11x - 6 = 0"""

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1
)

print(response.choices[0].message.content)
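
Because model output can contain arithmetic slips, it helps to check a claimed solution independently. This is a minimal sketch for the cubic above: by the rational root theorem, any integer root must divide the constant term 6, so scanning the integers from -6 to 6 is enough to find them.

```python
def p(x: int) -> int:
    """The polynomial from the prompt: x^3 - 6x^2 + 11x - 6."""
    return x**3 - 6 * x**2 + 11 * x - 6


# Integer roots must divide the constant term, so this range covers all candidates.
roots = [x for x in range(-6, 7) if p(x) == 0]
print(roots)  # prints [1, 2, 3]
```

A cubic has at most three roots, so finding three integer roots means the list is complete, and the model's answer can be compared against it.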

Multilingual support

Qwen2.5 supports 29 languages:

# Chinese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "用中文解释什么是人工智能"}]
)

# Japanese
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain what artificial intelligence is, in Japanese."}]
)

# Korean
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain artificial intelligence, in Korean."}]
)

Long context (128K)

# Read a long document
with open("long_document.txt", "r", encoding="utf-8") as f:
    document = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{document}"}
    ],
    max_tokens=2000
)
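
A request fails if the prompt plus max_tokens exceeds the context length the server was started with, so it can be worth checking the budget before sending a long document. The sketch below is only a rough guard. The 3-characters-per-token ratio is an assumed heuristic, not a property of the Qwen tokenizer, and the helper names are hypothetical. For exact counts, tokenize with the model's own tokenizer instead.

```python
import math


def estimated_tokens(text: str, chars_per_token: float = 3.0) -> int:
    """Rough token estimate from the character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, max_model_len: int, max_tokens: int, chars_per_token: float = 3.0) -> bool:
    """True if the estimated prompt plus the reply budget fits in the context window."""
    return estimated_tokens(text, chars_per_token) + max_tokens <= max_model_len


def truncate_to_budget(text: str, max_model_len: int, max_tokens: int, chars_per_token: float = 3.0) -> str:
    """Cut the text so that its estimated size leaves room for max_tokens of output."""
    budget_tokens = max(max_model_len - max_tokens, 0)
    return text[: int(budget_tokens * chars_per_token)]


document = "word " * 100_000  # stand-in for a long file (500,000 characters)
print(fits_context(document, max_model_len=131072, max_tokens=2000))
shortened = truncate_to_budget(document, max_model_len=131072, max_tokens=2000)
print(fits_context(shortened, max_model_len=131072, max_tokens=2000))
```

The estimate ignores the tokens added by the chat template, so leave some margin in practice.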

Quantization

GGUF with Ollama

# 4-bit quantization
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama pull qwen2.5:72b-instruct-q4_K_M   # 4-bit 72B (~48GB)

# 8-bit quantization
ollama pull qwen2.5:7b-instruct-q8_0

# Coder variant
ollama pull qwen2.5-coder:32b-instruct-q4_K_M

AWQ with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --tensor-parallel-size 2

GGUF with llama.cpp

# Download the GGUF file
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

# Run the server
./llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35

Multi-GPU setup

Tensor parallelism

# 72B on 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768

# 32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-32B-Instruct \
    --tensor-parallel-size 2

# Coder-32B on 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384
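
The GPU counts in these commands can be sanity-checked with a rough calculation: the FP16 weights (2 bytes per parameter) must fit in the memory vLLM is allowed to use, with room left for the KV cache. The sketch below is a starting-point heuristic only. The 20% KV-cache reserve, the rounding up to a power of two, and the function name are all assumptions made for illustration, and a shorter --max-model-len can make a smaller setup workable.

```python
import math


def suggest_tensor_parallel_size(
    params_billion: float,
    gpu_memory_gb: float,
    utilization: float = 0.9,   # mirrors --gpu-memory-utilization
    kv_reserve: float = 0.2,    # assumed share of usable memory kept for the KV cache
    bytes_per_param: float = 2.0,  # FP16
) -> int:
    """Suggest a power-of-two GPU count whose combined memory holds the weights."""
    weights_gb = params_billion * bytes_per_param
    usable_per_gpu = gpu_memory_gb * utilization * (1.0 - kv_reserve)
    needed = math.ceil(weights_gb / usable_per_gpu)
    size = 1
    while size < needed:
        size *= 2
    return size


print(suggest_tensor_parallel_size(72, 80))  # 72B on 80GB GPUs
print(suggest_tensor_parallel_size(32, 80))  # 32B on 80GB GPUs
```

Under these assumptions the sketch suggests 4 GPUs for the 72B model and 2 for the 32B model, in line with the commands above.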

Performance

Throughput (tokens/sec)

| Model | RTX 3090 | RTX 4090 | A100 40GB | A100 80GB |
|---|---|---|---|---|
| Qwen2.5-0.5B | 250 | 320 | 380 | 400 |
| Qwen2.5-3B | 150 | 200 | 250 | 280 |
| Qwen2.5-7B Q4 | 110 | 140 | 180 | 200 |

Partial figures (GPU column not identified):

Qwen2.5-7B: 100, 130, 150
Qwen2.5-72B: 20 (2x), 40 (2x)
Qwen2.5-72B Q4: 55 (2x)

No figures listed: Qwen2.5-14B, Qwen2.5-32B, Qwen2.5-Coder-32B

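Time to first token (TTFT)

| Model | RTX 4090 | A100 40GB | A100 80GB |
|---|---|---|---|
| | 60ms | 40ms | 35ms |
| 14B | 120ms | 80ms | 60ms |

Partial figures (GPU column not identified):

32B: 200ms, 140ms
72B: 400ms (2x), 280ms (2x)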
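
Throughput and TTFT combine into a rough end-to-end estimate: response time is approximately the time to first token plus the output length divided by the throughput. This is a minimal sketch with illustrative inputs (a 60ms TTFT, 100 tokens/sec, and a 500-token reply). The function name is hypothetical, and the model ignores queueing and network latency.

```python
def estimated_response_seconds(ttft_ms: float, output_tokens: int, tokens_per_second: float) -> float:
    """Approximate wall-clock time for one reply: TTFT plus steady-state generation."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_second


print(f"{estimated_response_seconds(60, 500, 100):.2f} s")  # prints 5.06 s
```

For long replies the generation term dominates, so throughput matters more than TTFT. For short interactive replies the reverse holds.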

Context length vs VRAM (7B)

| Context | FP16 | | |
|---|---|---|---|
| | 16GB | 10GB | 6GB |
| 32K | 24GB | 16GB | 10GB |
| 64K | 40GB | 26GB | 16GB |
| 128K | 72GB | 48GB | 28GB |

Benchmarks

| Model | MMLU | HumanEval | GSM8K | MATH | LiveCodeBench |
|---|---|---|---|---|---|
| Qwen2.5-7B | 74.2% | 75.6% | 85.4% | 55.2% | 42.1% |
| Qwen2.5-14B | 79.7% | 81.1% | 89.5% | 65.8% | 51.3% |
| Qwen2.5-32B | 83.3% | 84.2% | 91.2% | 72.1% | 60.7% |
| Qwen2.5-72B | 86.1% | 86.2% | 93.2% | 79.5% | 67.4% |
| Qwen2.5-Coder-7B | 72.8% | 88.4% | 86.1% | 58.4% | 64.2% |
| Qwen2.5-Coder-32B | 83.1% | 92.7% | 92.3% | 76.8% | 78.5% |

Docker Compose

version: '3.8'

services:
  qwen:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

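Cost estimate

Typical CLORE.AI marketplace prices:

| GPU | Hourly rate | Best for |
|---|---|---|
| RTX 3090 24GB | ~$0.06 | 7B models |
| RTX 4090 24GB | ~$0.10 | 7B-14B models |
| A100 40GB | ~$0.17 | 14B-32B models |
| A100 80GB | ~$0.25 | 32B models, Coder-32B |
| 2x A100 80GB | ~$0.50 | 72B models |
| 4x A100 80GB | ~$1.00 | 72B at maximum context |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates.

Save money:

Use the Spot market for flexible workloads
Pay with CLORE
Start with smaller models (7B) for testing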
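
To budget a session, multiply the rate by the planned duration. This is a minimal sketch, assuming the prices listed above are per hour in USD and already cover the whole GPU configuration. session_cost is a hypothetical helper, and the durations are illustrative.

```python
def session_cost(hourly_rate_usd: float, hours: float) -> float:
    """Estimated rental cost in USD for one server over the given duration."""
    if hourly_rate_usd < 0 or hours < 0:
        raise ValueError("rate and hours must be non-negative")
    return round(hourly_rate_usd * hours, 2)


# Illustrative sessions using the approximate rates listed above.
for gpu, rate, hours in [("RTX 3090 24GB", 0.06, 24), ("A100 80GB", 0.25, 4), ("4x A100 80GB", 1.00, 4)]:
    print(f"{gpu}: ~${session_cost(rate, hours):.2f} for {hours} h")
```

Actual marketplace rates change, so treat the output as an order-of-magnitude estimate.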

Troubleshooting

Out of memory

# Reduce the context length
--max-model-len 8192

# Lower the GPU memory utilization target
--gpu-memory-utilization 0.85

# Use a quantized model
ollama pull qwen2.5:7b-instruct-q4_K_M

Slow generation

# Enable flash attention
pip install flash-attn

# Use vLLM for better throughput
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --enable-prefix-caching

Chinese characters display incorrectly

# Make sure stdout uses UTF-8 encoding
import sys
sys.stdout.reconfigure(encoding='utf-8')

Model not found

# Check that the model name resolves (fetches only config.json)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct config.json

# Common names:
# Qwen/Qwen2.5-7B-Instruct
# Qwen/Qwen2.5-72B-Instruct       ← new
# Qwen/Qwen2.5-Coder-7B-Instruct
# Qwen/Qwen2.5-Coder-32B-Instruct ← new
# Qwen/Qwen2.5-Math-7B-Instruct

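Qwen2.5 compared with other models

| Feature | Qwen2.5-7B | Qwen2.5-72B | Llama 3.1 70B | GPT-4o |
|---|---|---|---|---|
| Context | 128K | 128K | 128K | 128K |
| Multilingual | Excellent | Excellent | Good | Excellent |
| Code | Excellent | Excellent | Good | Excellent |
| Math | Excellent | Excellent | Good | Excellent |
| Chinese | Excellent | Excellent | Poor | Good |
| License | Apache 2.0 | Qwen License | Llama 3.1 | Proprietary |
| Cost | Free | Free | Free | Paid API |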

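When to use Qwen2.5:

You need Chinese language support
Math or code tasks are the priority
You need long context
You want an Apache 2.0 license
You need the strongest open-source code model (Coder-32B)

Next steps

vLLM - production deployment
Ollama - simple local deployment
DeepSeek-V3 - larger reasoning model
DeepSeek-R1 - open-source reasoning model
Fine-tuning LLMs - custom training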
