# Qwen3.6-27B (Dense, Single-GPU)

{% hint style="info" %}
**Status (April 2026):** Qwen3.6-27B was released by Alibaba on **April 21, 2026** under the **Apache 2.0** license. Weights live at [huggingface.co/Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B). It is a **dense** 27B model — not MoE — with a **262K-token native context** that extends to **1M tokens with YaRN**, and day-0 support across vLLM, SGLang, and Ollama.
{% endhint %}

The MoE giants of 2026 (DeepSeek V4, GLM-5.1, MiMo-V2.5-Pro) are exciting on benchmarks but punishing in practice: hundreds of GB of weights, multi-GPU racks, fragile expert-routing kernels, and inference bills that make finance teams flinch. Qwen3.6-27B goes the other direction. It is **dense**: every parameter activates on every token, so weight VRAM is fixed by the quantization you pick, and there is no expert routing to cause load imbalance or latency spikes.

For most teams the question is not "can we serve a 744B MoE" but "can we put one good card in our cluster and serve a frontier-class coding assistant on it?" Qwen3.6-27B is built for exactly that. Q4 fits a single **RTX 4090 24GB**, Q8 fits a single **RTX 5090 32GB**, and BF16 (about 54GB of weights) fits a single **A100 80GB** or **H100 80GB**. Alibaba reports **77.2% on SWE-Bench Verified** (vendor-claimed). One card, one container, one model.

### Key Specs

| Property         | Value                           |
| ---------------- | ------------------------------- |
| Parameters       | 27B (dense)                     |
| Architecture     | Dense decoder-only transformer  |
| Native Context   | 262,144 tokens                  |
| Extended Context | 1,000,000 tokens (YaRN)         |
| License          | Apache 2.0                      |
| Release Date     | April 21, 2026                  |
| Organization     | Alibaba (Qwen team)             |
| Primary Tooling  | vLLM, SGLang, Ollama, llama.cpp |

### Why Qwen3.6-27B?

* **Single-GPU economics** — Q4 on RTX 4090 from **$0.70–2.50/hr** on Clore.ai; no tensor-parallel orchestration to debug
* **Dense, not MoE** — fixed VRAM, no expert hot-spotting, no spiky latency at certain prompts
* **Apache 2.0** — fully commercial, fine-tunable, redistributable, no usage caps
* **262K native context, 1M with YaRN** — entire codebases, full books, hours of transcripts in one pass
* **Day-0 vLLM / SGLang / Ollama** — pick your serving stack; Qwen shipped configs for all three at release
* **77.2% SWE-Bench Verified** (vendor-claimed) — competitive with much larger MoE models on real coding tasks

***

## Requirements

{% hint style="success" %}
**The whole point is that this model is forgiving.** A single RTX 4090 from the Clore.ai marketplace is enough to run Qwen3.6-27B at Q4 with production-grade quality and speeds that are good enough for most use cases. No multi-GPU headaches.
{% endhint %}

| Component | Q4 (GGUF / AWQ)  | Q8 (GGUF / GPTQ) | BF16                         | Full FP16                  |
| --------- | ---------------- | ---------------- | ---------------------------- | -------------------------- |
| GPU       | 1× RTX 4090 24GB | 1× RTX 5090 32GB | 1× A100 80GB or 1× H100 80GB | 1× A100 80GB               |
| VRAM Used | \~16–18GB        | \~28–30GB        | \~54GB                       | \~54GB + KV cache headroom |
| RAM       | 32GB             | 32GB             | 64GB                         | 96GB                       |
| Disk      | 20GB NVMe        | 32GB NVMe        | 60GB NVMe                    | 60GB NVMe                  |
| CUDA      | 12.1+            | 12.4+            | 12.1+                        | 12.1+                      |

**Clore.ai pick:** For most teams, a single **RTX 4090 24GB** running Q4 (AWQ or GGUF) is the right answer. At the low end of the price range, a card running around the clock costs about $17 per day. Step up to an RTX 5090 32GB if you want Q8 for slightly better quality, or to an A100 80GB for full BF16 production inference.
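The VRAM figures in the table follow from simple arithmetic: parameter count times bits per weight. The sketch below shows that calculation. The bits-per-weight values are assumptions chosen for illustration (quantized formats store scales and metadata, so effective sizes vary), and the result covers weights only, before KV cache and activations.

```python
# Rough weight-memory estimate for a dense model: parameters x bits per weight.
# The bits-per-weight values below are illustrative assumptions, not measured
# file sizes. KV cache and activation memory come on top of these numbers.

PARAMS = 27e9  # 27B dense parameters (from the Key Specs table)

# Hypothetical effective bits per weight for each format.
BITS_PER_WEIGHT = {
    "Q4": 4.5,
    "Q8": 8.5,
    "BF16": 16.0,
}


def weight_gb(params: float, bits: float) -> float:
    """Return weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


def estimate_all(params: float = PARAMS) -> dict:
    """Weight-only footprint per format, rounded to one decimal place."""
    return {
        name: round(weight_gb(params, bits), 1)
        for name, bits in BITS_PER_WEIGHT.items()
    }


for name, gb in estimate_all().items():
    print(f"{name}: ~{gb} GB of weights")
```

Because the model is dense, this number does not change with the prompt. The only part of the memory budget that varies at runtime is the KV cache, which scales with context length and concurrency.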

***

## Option A — Ollama (Quantized, easiest)

Ollama is the fastest path from "I have a Clore.ai GPU" to "I have a chat endpoint."

```bash
# Start the server first (skip this if Ollama already runs as a service)
ollama serve &

# Pull Qwen3.6-27B (Q4_K_M by default, ~17GB download)
ollama pull qwen3.6:27b

# Run interactively
ollama run qwen3.6:27b

# Or call the OpenAI-compatible API, served on port 11434

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:27b",
    "messages": [
      {"role": "system", "content": "You are a senior Go engineer."},
      {"role": "user", "content": "Refactor this handler to use context.Context properly and add retries with exponential backoff."}
    ],
    "temperature": 0.6
  }'
```

{% hint style="info" %}
The default `qwen3.6:27b` tag in Ollama maps to Q4\_K\_M. Use `qwen3.6:27b-q8_0` for Q8 if you have an RTX 5090, or `qwen3.6:27b-fp16` for full precision (needs an A100 80GB).
{% endhint %}
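If you would rather call the endpoint from code than from `curl`, the same request can be sent with the Python standard library. This is a minimal sketch, assuming Ollama is listening on its default port on localhost. The helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

# Default local Ollama address; change it if you expose the server elsewhere.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3.6:27b",
                       temperature: float = 0.6) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Calling `chat("Write a Go HTTP handler with retries.")` returns the reply as a string once the model is loaded. The first call can be slow while Ollama loads the weights into VRAM, which is why the timeout is generous.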

***

## Option B — vLLM (Production)

vLLM is the recommended production server. The single-GPU config below targets an RTX 4090 with AWQ quantization. The tensor-parallel note after the config is there for completeness; with a 27B dense model, you almost never need it.

```yaml
# docker-compose.yml — single RTX 4090, Q4 AWQ
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model Qwen/Qwen3.6-27B-Instruct-AWQ
      --quantization awq
      --max-model-len 65536
      --gpu-memory-utilization 0.92
      --served-model-name qwen3.6-27b
      --enable-auto-tool-choice
      --tool-call-parser hermes
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: "8gb"

volumes:
  hf_cache:
```

```bash
# Test the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [
      {"role": "user", "content": "Explain the difference between MoE and dense models in 3 bullets."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

For full **BF16** on a single A100 80GB or H100 80GB, drop `--quantization awq` and point at the unquantized checkpoint (`Qwen/Qwen3.6-27B-Instruct`) with `--dtype bfloat16` and `--max-model-len 131072`. BF16 weights take about 54GB, so they do not fit a 48GB or 40GB card. For 2× RTX 4090 with tensor parallelism (longer context, bigger KV cache), keep the AWQ checkpoint and add `--tensor-parallel-size 2`.
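The variants above differ only in a handful of flags, so it can help to generate the argument list rather than hand-edit the compose file. The sketch below assembles those arguments using the model names and flags from this guide. `build_vllm_args` is a hypothetical convenience helper, not part of vLLM.

```python
from typing import List, Optional


def build_vllm_args(quantized: bool = True, tensor_parallel: int = 1,
                    max_model_len: Optional[int] = None) -> List[str]:
    """Return vLLM server arguments for the AWQ or BF16 variant."""
    if quantized:
        # Q4 AWQ checkpoint, sized for a 24GB card
        args = ["--model", "Qwen/Qwen3.6-27B-Instruct-AWQ", "--quantization", "awq"]
        default_len = 65536
    else:
        # Unquantized checkpoint served in BF16
        args = ["--model", "Qwen/Qwen3.6-27B-Instruct", "--dtype", "bfloat16"]
        default_len = 131072
    args += ["--max-model-len", str(max_model_len or default_len)]
    args += ["--served-model-name", "qwen3.6-27b"]
    if tensor_parallel > 1:
        # Shard the model across several GPUs
        args += ["--tensor-parallel-size", str(tensor_parallel)]
    return args


print(" ".join(build_vllm_args()))                    # single-GPU AWQ
print(" ".join(build_vllm_args(quantized=False)))     # BF16
print(" ".join(build_vllm_args(tensor_parallel=2)))   # AWQ on two GPUs
```

The printed strings can be pasted into the `command:` field of the compose file. Keeping the choices in one function makes it harder to end up with a mismatched pair such as `--quantization awq` pointing at the unquantized checkpoint.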

***

## Option C — SGLang

SGLang shines when you push past the native 262K window with YaRN. Pass a `rope_scaling` override through `--json-model-override-args` to extend to \~1M tokens.

```bash
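docker pull lmsysorg/sglang:latest
# Run the commands below inside that container, or in an environment with SGLang installed

# Single-GPU, native 262K context (AWQ needs the AWQ checkpoint)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B-Instruct-AWQ \
  --quantization awq \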
  --context-length 262144 \
  --mem-fraction-static 0.90 \
  --served-model-name qwen3.6-27b

# YaRN-extended to 1M tokens (needs more VRAM headroom)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B-Instruct \
  --dtype bfloat16 \
  --context-length 1000000 \
  --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":262144}}' \
  --mem-fraction-static 0.85
```
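The `factor` in the override is the ratio between the window you want and the native window. A small sketch of how the JSON string above could be derived, assuming you round the factor up to a whole number (that rounding is a choice made here for simplicity, not something the source prescribes):

```python
import json
import math

NATIVE_CONTEXT = 262_144  # native window from the Key Specs table


def yarn_override(target_tokens: int, native: int = NATIVE_CONTEXT) -> str:
    """Return a JSON string suitable for --json-model-override-args."""
    if target_tokens <= native:
        raise ValueError("target fits in the native window; no scaling needed")
    # Round up so the scaled window covers the whole target length.
    factor = float(math.ceil(target_tokens / native))
    override = {
        "rope_scaling": {
            "type": "yarn",
            "factor": factor,
            "original_max_position_embeddings": native,
        }
    }
    return json.dumps(override, separators=(",", ":"))


print(yarn_override(1_000_000))
```

For a 1M-token target the ratio is about 3.8, which rounds up to the `4.0` used in the launch command. Requests that fit in the native window are better served without the override at all.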

{% hint style="warning" %}
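**1M-context costs grow fast.** YaRN only rescales positions; it does not shrink the KV cache. At BF16 the cache for 1M tokens is roughly **40–60GB**, and it grows with the number of concurrent sequences. That sits on top of the weights (about 54GB at BF16), so a single A100 80GB or H100 only fills the window with quantized weights. Otherwise plan for more than one card.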
{% endhint %}
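To size the KV cache for your own deployment, the standard estimate is two tensors (keys and values) per attention layer, per KV head, per token. The sketch below implements that formula. The architecture values in the example call are placeholders, not the Qwen3.6-27B configuration, so the printed figure will not match the estimate above. Substitute the numbers from the model's `config.json`.

```python
def kv_cache_bytes(tokens: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value is 2 for BF16/FP16 and 1 for an 8-bit KV cache.
    """
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# Placeholder architecture values for illustration only. They are NOT taken
# from the Qwen3.6-27B config; read the real ones from config.json.
example = kv_cache_bytes(tokens=1_000_000, attn_layers=32, kv_heads=4, head_dim=128)
print(f"{example / 1e9:.1f} GB for one 1M-token sequence")
```

Multiply the result by the number of sequences you expect to hold in memory at once, then add the weight footprint to get a first-pass VRAM budget.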

***

## Clore.ai GPU Recommendations

| Setup                | VRAM | Mode        | Expected Performance         | Clore.ai Cost       |
| -------------------- | ---- | ----------- | ---------------------------- | ------------------- |
| **1× RTX 4090 24GB** | 24GB | Q4 AWQ      | 50–80 tok/s, 64K ctx         | **\~$0.70–2.50/hr** |
| 1× RTX 5090 32GB     | 32GB | Q8 GPTQ     | 60–90 tok/s, 96K ctx         | \~$1.50–3.50/hr     |
| 1× L40S 48GB         | 48GB | Q8 GPTQ     | 35–55 tok/s, 131K ctx        | \~$1.20–2.80/hr     |
| 1× A100 40GB         | 40GB | Q8 GPTQ     | 40–60 tok/s, 96K ctx         | \~$1.00–2.50/hr     |
| 1× A100 80GB         | 80GB | FP16 + 262K | 40–60 tok/s, full native ctx | \~$1.80–3.50/hr     |
| 2× RTX 4090          | 48GB | Q8 TP=2     | 60–80 tok/s, 262K ctx        | \~$1.50–4.50/hr     |

{% hint style="success" %}
**Best value, by a mile:** [1× RTX 4090 from $0.70/hr](https://clore.ai/rent-4090.html) running Q4 AWQ via vLLM or Q4\_K\_M via Ollama. You get a frontier-class coding model on a single consumer card for roughly $17 per day at the lowest listed rate.
{% endhint %}
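Hourly prices are easier to compare once they are converted to cost per token. The sketch below does that conversion for the RTX 4090 row of the table, assuming a single stream at steady throughput. Batched serving would push the per-token cost lower, and idle time would push it higher.

```python
def tokens_per_hour(tok_per_s: float) -> float:
    """Tokens generated in one hour at a steady decode rate."""
    return tok_per_s * 3600


def cost_per_million_tokens(price_per_hr: float, tok_per_s: float) -> float:
    """Dollars per 1M generated tokens for one stream."""
    return price_per_hr / tokens_per_hour(tok_per_s) * 1_000_000


def cost_per_day(price_per_hr: float, hours: float = 24.0) -> float:
    """Dollars for keeping the card rented for the given number of hours."""
    return price_per_hr * hours


# RTX 4090 row from the table: $0.70-2.50/hr, 50-80 tok/s
best = cost_per_million_tokens(0.70, 80)   # cheapest rate, fastest decode
worst = cost_per_million_tokens(2.50, 50)  # priciest rate, slowest decode
print(f"${best:.2f} to ${worst:.2f} per 1M tokens")
print(f"${cost_per_day(0.70):.2f} to ${cost_per_day(2.50):.2f} per 24h")
```

The spread between the best and worst case is wide, so it is worth plugging in the actual marketplace price and a throughput number measured on your own prompts.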

***

## Use Cases

* **Single-GPU production deployments** — one container on one Clore.ai 4090 and you have a real coding assistant
* **Coding agents** — 77.2% SWE-Bench Verified (vendor-claimed) puts it in the "useful for autonomous PRs" bracket
* **Long-context RAG** — 262K native is enough for entire codebases or weeks of chat logs
* **1M-token analysis** — with YaRN, drop a whole book or a multi-month git log into one prompt
* **On-prem / air-gapped** — Apache 2.0 ships with the product, no API dependency
* **Edge fine-tuning** — 27B dense is friendly to LoRA/QLoRA on a single card
* **Worker in agent-of-agents** — pair as a worker with a larger MoE planner like [GLM-5.1](/guides/language-models/glm-5-1.md)

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed — verify independently.** Numbers below come from Alibaba's April 21, 2026 release post. Independent reproductions (Aider, BigCodeBench, LiveCodeBench leaderboards) are still rolling in.
{% endhint %}

| Benchmark          | Qwen3.6-27B | Qwen3.5-35B | Gemma 3 27B | Llama 4 Scout |
| ------------------ | ----------- | ----------- | ----------- | ------------- |
| SWE-Bench Verified | **77.2%**   | \~71%       | \~58%       | \~54%         |
| HumanEval          | \~93%       | \~92%       | \~90%       | \~88%         |
| LiveCodeBench      | \~68%       | \~65%       | \~55%       | \~52%         |
| MMLU-Pro           | \~78%       | \~76%       | \~74%       | \~72%         |
| MATH               | \~87%       | \~85%       | \~78%       | \~76%         |

The headline number is **SWE-Bench Verified 77.2%**, which puts a single-GPU dense model into territory previously reserved for multi-GPU MoE systems. Treat it as a vendor claim until independent reproductions confirm it.

***

## Troubleshooting

| Issue                             | Solution                                                                                    |
| --------------------------------- | ------------------------------------------------------------------------------------------- |
| OOM on RTX 4090 (Q4)              | Lower `--max-model-len` to 32768; AWQ at 65K ctx is right at the edge of 24GB               |
| `qwen3.6:27b` not found in Ollama | Update Ollama; the tag landed late April 2026                                               |
| YaRN config rejected by vLLM      | Requires vLLM ≥ 0.7.x; pass the settings as one JSON value via `--rope-scaling` (or inside `--hf-overrides` if your release no longer accepts it), not as separate flags |
| Tool calls silently dropped       | Add `--enable-auto-tool-choice --tool-call-parser hermes` in vLLM                           |
| Slow prefill on long context      | Add `--enable-chunked-prefill` and reduce batch size                                        |
| KV cache OOM at 262K              | Use a smaller weight quantization (Q8 or Q4) to free VRAM, or move to L40S 48GB / A100 80GB |
| Bad quality near 1M ctx           | YaRN extends positions but quality degrades past \~600K; keep critical content near the end |
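
For the "tool calls silently dropped" row, a quick smoke test is to send one request with a tool definition and check whether the response carries `tool_calls`. The sketch below builds such a request and parses the reply, assuming the server follows the OpenAI chat completions schema. The `get_weather` tool and both helper names are hypothetical.

```python
import json


def build_tool_request(model: str = "qwen3.6-27b") -> dict:
    """Chat payload with one hypothetical tool, in the OpenAI tools format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for the smoke test
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from a chat completion, or []."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in calls
    ]
```

POST the payload from `build_tool_request()` to `/v1/chat/completions` and pass the decoded JSON to `extract_tool_calls`. An empty list on a prompt that clearly needs the tool suggests the parser flags are missing from the server command.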

***

## Next Steps

* **Predecessor:** [Qwen3.5](/guides/language-models/qwen35.md) — Qwen3.6-27B is the dense successor; same family, sharper coding, longer native ctx
* **Multimodal sibling:** [Qwen3.5-Omni](/guides/language-models/qwen35-omni.md) — text + audio + image + video if you need more than text
* **Similar dense-27B class:** [Gemma 3](/guides/language-models/gemma3.md) — Google's 27B dense competitor, good baseline comparison
* **MoE alternative:** [Llama 4 Scout](/guides/language-models/llama4.md) — single-GPU MoE if you want to compare architectures
* **Frontier MoE step-up:** [GLM-5.1](/guides/language-models/glm-5-1.md) — when 27B dense is not enough and you have multi-GPU budget

### Links

* [Qwen3.6-27B on HuggingFace](https://huggingface.co/Qwen/Qwen3.6-27B)
* [Qwen GitHub](https://github.com/QwenLM/Qwen)
* [Qwen Blog](https://qwenlm.github.io/)
* [vLLM docs](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [Ollama library](https://ollama.com/library/qwen3.6)
* **Rent a GPU:** [RTX 4090 from $0.70/hr](https://clore.ai/rent-4090.html) · [RTX 5090 32GB](https://clore.ai/rent-5090.html) · [Marketplace](https://clore.ai/marketplace)

