# Qwen3.5

Qwen3.5, released February 16, 2026, is Alibaba's latest open-source flagship model family. The **397B MoE flagship** beat Claude 4.5 Opus on the HMMT math benchmark, while the smaller **35B dense model** fits on a single RTX 4090 when quantized to 4-bit. All models ship with agentic capabilities (tool use, function calling, autonomous task execution) and multimodal understanding out of the box.

## Key Features

* **Three sizes**: 9B (dense), 35B (dense), 397B (MoE) — something for every GPU
* **Beats Claude 4.5 Opus** on the HMMT math benchmark (397B flagship)
* **Natively multimodal**: Text + image understanding
* **Agentic capabilities**: Tool use, function calling, autonomous workflows
* **128K context window**: Handle large documents and codebases
* **Apache 2.0 license**: Commercial use permitted, subject to the license's attribution and notice terms

## Model Variants

| Model        | Params | Type  | VRAM (Q4) | VRAM (FP16) | Strength        |
| ------------ | ------ | ----- | --------- | ----------- | --------------- |
| Qwen3.5-9B   | 9B     | Dense | 6GB       | 18GB        | Fast, efficient |
| Qwen3.5-35B  | 35B    | Dense | 22GB      | 70GB        | Best single-GPU |
| Qwen3.5-397B | 397B   | MoE   | \~200GB   | \~800GB     | Frontier-class  |
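
The VRAM columns can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bytes per weight. The sketch below is a back-of-the-envelope estimate, not a measurement. The function names and the 20% overhead allowance for KV cache and activations are assumptions chosen for illustration, and real usage varies with context length and runtime.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billion * bytes_per_weight


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Rough fit check. overhead_fraction is an assumed allowance, not a measured value."""
    needed = estimate_weight_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


if __name__ == "__main__":
    for name, params in [("9B", 9), ("35B", 35), ("397B", 397)]:
        fp16 = estimate_weight_gb(params, 16)
        q4 = estimate_weight_gb(params, 4)
        print(f"{name}: FP16 ~{fp16:.0f} GB, 4-bit ~{q4:.1f} GB (weights only)")
```

Under these assumptions, `fits_in_vram(35, 4, 24)` is true while `fits_in_vram(35, 16, 24)` is false, which matches the guidance to run the 35B model quantized on a 24GB card.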

## Requirements

| Component | 9B (Q4)       | 35B (Q4)      | 397B (multi-GPU) |
| --------- | ------------- | ------------- | ---------------- |
| GPU       | RTX 3080 10GB | RTX 4090 24GB | 4× H100 80GB     |
| VRAM      | 8GB           | 22GB          | 320GB+           |
| RAM       | 16GB          | 32GB          | 128GB            |
| Disk      | 15GB          | 30GB          | 250GB            |

**Recommended Clore.ai GPU**: RTX 4090 24GB (\~$0.5–2/day) for 35B — best quality per dollar

## Quick Start with Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 9B: fits in 8GB of VRAM at Q4
ollama run qwen3.5:9b

# 35B quantized: needs a 24GB GPU such as the RTX 4090
ollama run qwen3.5:35b

# As an API server (skip `ollama serve` if the installer already started the service)
ollama serve &
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:35b",
    "messages": [{"role": "user", "content": "Solve this: if f(x) = x^3 - 3x + 1, find all real roots"}]
  }'
```
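
The same endpoint can be called from Python without extra dependencies. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `ask` are hypothetical, and the URL assumes Ollama is listening on its default local port as in the `curl` example above.

```python
import json
import urllib.request

# Assumed default: Ollama's OpenAI-compatible endpoint on the local machine.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3.5:9b",
                       url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    # A generous timeout, since the first request may wait for the model to load.
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("Find all real roots of x^3 - 3x + 1", model="qwen3.5:35b")` should return the reply text. Keeping request construction separate from sending makes the payload easy to inspect or log before any network call happens.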

## vLLM Setup (Production)

```bash
pip install vllm

# 35B on a single 80GB GPU (unquantized weights need ~70GB; a 24GB card needs a quantized checkpoint)
vllm serve Qwen/Qwen3.5-35B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# 9B with long context
vllm serve Qwen/Qwen3.5-9B-Instruct \
  --max-model-len 65536

# 397B on a multi-GPU node; set --tensor-parallel-size to the number of GPUs you shard across
vllm serve Qwen/Qwen3.5-397B-A45B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768
```
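
vLLM typically spends some time loading weights before it accepts requests, so scripts that launch the server usually need to wait before sending traffic. Below is a minimal polling sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on vLLM's default port 8000. The helper names, timeout, and polling interval are illustrative choices rather than vLLM settings.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not ready yet.
        return False


def wait_until_ready(url: str, timeout_s: float = 120.0, interval_s: float = 2.0,
                     probe=http_probe, sleep=time.sleep) -> bool:
    """Poll `url` until the probe succeeds or the timeout would be exceeded."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        sleep(interval_s)
```

A launch script might call `wait_until_ready("http://localhost:8000/v1/models", timeout_s=600)` and abort if it returns `False`. The `probe` and `sleep` parameters are injectable so the loop can be exercised without a running server.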

## HuggingFace Transformers

```python
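import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen3.5-35B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit quantization via bitsandbytes (pip install bitsandbytes) fits 35B on 24GB
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."}
]

# add_generation_prompt=True appends the assistant turn marker so the model answers
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))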
```

## Agentic / Tool Use Example

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_gpu_price",
        "description": "Get current rental price for a GPU model on Clore.ai",
        "parameters": {
            "type": "object",
            "properties": {
                "gpu_model": {"type": "string", "description": "GPU model name, e.g. RTX 4090"}
            },
            "required": ["gpu_model"]
        }
    }
}]

response = client.chat.completions.create(
    model="qwen3.5:35b",
    messages=[{"role": "user", "content": "What's the cheapest GPU I can rent for running a 7B model?"}],
    tools=tools,
    tool_choice="auto"
)

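# With tool_choice="auto" the model may answer directly or request a tool call
message = response.choices[0].message
for call in message.tool_calls or []:
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    print(call.function.name, args)
if not message.tool_calls:
    print(message.content)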
```
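
The example above stops once the model requests a tool. To finish the loop, the application runs the requested function and sends the result back as a `tool` message. The sketch below shows one way to do that step. The `get_gpu_price` body is a hypothetical stub with made-up placeholder prices, and `TOOL_REGISTRY` and `run_tool_call` are illustrative names, not part of any library.

```python
import json


def get_gpu_price(gpu_model: str) -> dict:
    """Hypothetical stub: a real version would query a pricing source."""
    placeholder_prices = {"RTX 4090": 1.0, "RTX 3060": 0.15}  # made-up USD/day values
    return {"gpu_model": gpu_model, "usd_per_day": placeholder_prices.get(gpu_model)}


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_gpu_price": get_gpu_price}


def run_tool_call(call_id: str, name: str, arguments_json: str,
                  registry=TOOL_REGISTRY) -> dict:
    """Execute one tool call and wrap the result as a `tool` role message."""
    try:
        func = registry[name]
        result = func(**json.loads(arguments_json))
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report failures back to the model instead of crashing the loop.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}
```

In practice, append the assistant message that contains the tool calls to `messages`, then append `run_tool_call(call.id, call.function.name, call.function.arguments)` for each call, and call `client.chat.completions.create` again so the model can write its final answer. Returning errors as tool output gives the model a chance to retry with corrected arguments.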

## Why Qwen3.5 on Clore.ai?

The 35B model is arguably the **best model you can run on a single RTX 4090**:

* Beats Llama 4 Scout on math and reasoning
* Beats Gemma 3 27B on agentic tasks
* Tool use / function calling works out of the box
* Apache 2.0 = no license headaches

At $0.5–2/day for an RTX 4090, you get frontier-class AI for the cost of a coffee.

## Tips for Clore.ai Users

* **35B is the sweet spot**: Fits on an RTX 4090 at Q4 and outperforms most 70B models
* **9B for budget**: Even an RTX 3060 ($0.15/day) runs the 9B model well
* **Use Ollama for quick start**: One command to serve; OpenAI-compatible API included
* **Agentic workflows**: Qwen3.5 excels at tool use — combine with function calling for automation
* **New model, cold cache**: The first download takes time (\~20GB for 35B), so pre-pull the model before your workload starts

## Troubleshooting

| Issue                  | Solution                                                        |
| ---------------------- | --------------------------------------------------------------- |
| 35B OOM on 24GB        | Load in 4-bit (`load_in_4bit=True` in Transformers) or reduce `--max-model-len` in vLLM |
| Ollama model not found | Update Ollama: `curl -fsSL https://ollama.com/install.sh \| sh` |
| Slow on first request  | Model loading takes 30-60s; subsequent requests are fast        |
| Tool calls not working | Pass the `tools` parameter and use an Instruct variant          |

## Further Reading

* [Qwen Blog](https://qwenlm.github.io/)
* [HuggingFace Models](https://huggingface.co/Qwen)
* [Ollama Library](https://ollama.com/library/qwen3.5)
