Qwen3.5

Run Alibaba's Qwen3.5, the latest frontier model (February 2026), on Clore.ai

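Released on February 16, 2026, Qwen3.5 is Alibaba's latest flagship model family and one of the most closely watched open-source releases of 2026. The 397B MoE flagship beats Claude 4.5 Opus on the HMMT math benchmark, while the smaller 35B dense model fits on a single RTX 4090 when quantized to Q4. All models ship with agentic capabilities (tool use, function calling, autonomous task execution) and multimodal understanding out of the box.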

Key Features

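Three sizes: 9B (dense), 35B (dense), and 397B (MoE), covering a wide range of GPUs
The 397B flagship beats Claude 4.5 Opus on the HMMT math benchmark
Natively multimodal: text and image understanding
Agentic capabilities: tool use, function calling, autonomous workflows
128K context window: handles large documents and codebases
Apache 2.0 license: full commercial use, no restrictions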

Model Variants

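| Model | Parameters | Type | VRAM (Q4) | VRAM (FP16) | Strengths |
|---|---|---|---|---|---|
| Qwen3.5-9B | 9B | Dense | 6GB | 18GB | Fast and efficient |
| Qwen3.5-35B | 35B | Dense | 22GB | 70GB | Best single-GPU choice |
| Qwen3.5-397B | 397B | MoE | ~100GB | 400GB+ | Frontier-class |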
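The FP16 figures for the dense models follow from simple arithmetic: each parameter takes 2 bytes at 16-bit precision. The sketch below is a rough weights-only estimate, not a measurement; the function name `weight_memory_gb` is made up for this example. The Q4 column in the table is higher than the raw 4-bit number, presumably because real quantized formats carry extra metadata and the runtime also needs room for the KV cache (an assumption, not something this page states).

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


# Dense variants from the table above
for name, params in [("Qwen3.5-9B", 9), ("Qwen3.5-35B", 35)]:
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, raw 4-bit ~{q4:.1f} GB (weights only)")
```

Treat the result as a lower bound and leave headroom for context length and framework overhead.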

Requirements

| Component | 9B (Q4) | 35B (Q4) | 397B (multi-GPU) |
|---|---|---|---|
| GPU | RTX 3080 10GB | RTX 4090 24GB | 4× H100 80GB |
| VRAM | 8GB | 22GB | 320GB+ |
| RAM | 16GB | 32GB | 128GB |
| Disk | 15GB | 30GB | 250GB |

Recommended Clore.ai GPU: RTX 4090 24GB (about $0.5–2/day) for the 35B model, the best quality per dollar.
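When comparing rental offers, the VRAM row of the requirements table can be turned into a small lookup. This is only an illustrative sketch: the dictionary copies the minimum VRAM figures listed above, and `largest_model_that_fits` is a hypothetical helper, not part of any Clore.ai tooling.

```python
from typing import Optional

# Minimum VRAM in GB, copied from the requirements table on this page
REQUIRED_VRAM_GB = {
    "Qwen3.5-9B (Q4)": 8,
    "Qwen3.5-35B (Q4)": 22,
    "Qwen3.5-397B (multi-GPU)": 320,
}


def largest_model_that_fits(total_vram_gb: float) -> Optional[str]:
    """Return the most demanding variant whose VRAM requirement is met, or None."""
    fitting = [(need, name) for name, need in REQUIRED_VRAM_GB.items()
               if need <= total_vram_gb]
    return max(fitting)[1] if fitting else None


print(largest_model_that_fits(24))      # a single RTX 4090
print(largest_model_that_fits(4 * 80))  # four H100 80GB cards
```

For multi-GPU servers, pass the combined VRAM of all cards, as in the second call.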

Quick Start with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 9B: runs on any GPU with 8GB of VRAM
ollama run qwen3.5:9b

# 35B quantized: needs an RTX 4090 (24GB)
ollama run qwen3.5:35b

# Use as an API server (skip `ollama serve` if the server is already running)
ollama serve &
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:35b",
    "messages": [{"role": "user", "content": "Solve this: if f(x) = x^3 - 3x + 1, find all real roots"}]
  }'
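The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the Ollama server from the commands above is listening on `localhost:11434`; the helper names `build_chat_request` and `ask` are invented for this example.

```python
import json
import urllib.request

# Same OpenAI-compatible endpoint as the curl call above
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt, model="qwen3.5:35b", url=OLLAMA_URL):
    """Build a POST request carrying a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="qwen3.5:35b", timeout=300):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Find all real roots of x^3 - 3x + 1"))` prints the model's answer. The generous timeout allows for the model being loaded on the first request.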

vLLM Setup (for Production)

pip install vllm

# 35B on a single GPU
vllm serve Qwen/Qwen3.5-35B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# 9B with a long context
vllm serve Qwen/Qwen3.5-9B-Instruct \
  --max-model-len 65536

# 397B on a multi-GPU cluster
vllm serve Qwen/Qwen3.5-397B-A45B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768
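Large checkpoints take a while to load, so a deployment script usually waits for the server before sending traffic. The sketch below polls the OpenAI-compatible `/v1/models` route and assumes vLLM's default port 8000 (no `--port` flag appears in the commands above); `wait_until_ready` is a hypothetical helper name.

```python
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout=600.0, interval=5.0):
    """Poll /v1/models until the server answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers URLError, HTTPError, and refused connections while the model loads
            pass
        time.sleep(interval)
    return False
```

A script might call `wait_until_ready()` right after launching `vllm serve` in the background and abort if it returns `False`.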

HuggingFace Transformers

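import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen3.5-35B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit quantization (requires the bitsandbytes package) so the 35B model fits in 24GB
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."}
]

# add_generation_prompt=True appends the assistant turn marker so the model starts its answer
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))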

Agent / Tool Use Example

import json
from openai import OpenAI

# Point the OpenAI client at the local Ollama server; the API key is a placeholder
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_gpu_price",
        "description": "Get the current rental price of a GPU on Clore.ai",
        "parameters": {
            "type": "object",
            "properties": {
                "gpu_model": {"type": "string", "description": "GPU model name, e.g. RTX 4090"}
            },
            "required": ["gpu_model"]
        }
    }
}]

response = client.chat.completions.create(
    model="qwen3.5:35b",
    messages=[{"role": "user", "content": "What's the cheapest GPU I can rent for running a 7B model?"}],
    tools=tools,
    tool_choice="auto"
)

# Qwen3.5 is expected to reply with a get_gpu_price tool call; show each call's parsed arguments
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
print(response.choices[0].message)
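The example above stops at the model's tool call. To complete the loop, the application runs the requested function itself and sends the result back as a `tool` message. The sketch below shows one way to do that step; the local `get_gpu_price` implementation, the `TOOL_REGISTRY` dictionary, and the prices are all placeholders invented for illustration (the numbers reuse figures quoted on this page and are not live Clore.ai data).

```python
import json

# Placeholder prices in USD/day, reusing figures quoted on this page; not live data
PLACEHOLDER_PRICES = {"RTX 4090": 0.5, "RTX 3060": 0.15}


def get_gpu_price(gpu_model: str) -> dict:
    """Hypothetical local implementation of the get_gpu_price tool."""
    price = PLACEHOLDER_PRICES.get(gpu_model)
    if price is None:
        return {"gpu_model": gpu_model, "error": "unknown GPU model"}
    return {"gpu_model": gpu_model, "usd_per_day": price}


# Maps tool names the model may request to the Python functions that serve them
TOOL_REGISTRY = {"get_gpu_price": get_gpu_price}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and wrap the result as a 'tool' role message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        result = func(**json.loads(arguments_json))
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}
```

For each entry in `response.choices[0].message.tool_calls`, append the assistant message and then `run_tool_call(call.id, call.function.name, call.function.arguments)` to the message list, and call `client.chat.completions.create` again so the model can phrase the final answer.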

Why Use Qwen3.5 on Clore.ai?

The 35B model is arguably the best model you can run on a single RTX 4090:

Outperforms Llama 4 Scout on math and reasoning
Outperforms Gemma 3 27B on agentic tasks
Tool use and function calling work out of the box
Apache 2.0 = no licensing headaches

At $0.5–2 per day for an RTX 4090, you get frontier-class AI for the price of a coffee.

Tips for Clore.ai Users

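35B is the sweet spot: it fits on an RTX 4090 at Q4 and outperforms most 70B models
On a budget, pick 9B: even an RTX 3060 ($0.15/day) runs the 9B model well
Use Ollama to get started quickly: one command serves the model, with an OpenAI-compatible API included
Agentic workflows: Qwen3.5 excels at tool use; combine it with function calling for automation
New model = fewer cached copies: the first download takes time (about 20GB for 35B). Pull the model before your workload starts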

Troubleshooting

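| Problem | Solution |
|---|---|
| 35B runs out of memory (OOM) on 24GB | Use load_in_4bit=True or reduce --max-model-len |
| Ollama cannot find the model | Update Ollama by re-running the install script from https://ollama.com/install.sh (the curl command in the Ollama quick start) |
| First request is slow | Loading the model takes 30–60 seconds; subsequent requests are fast |
| Tool calls do not work | Make sure the tools parameter is passed; use only the Instruct variants |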

