Qwen3.6-27B (Dense, Single-GPU)
Deploy Qwen3.6-27B by Alibaba on Clore.ai — a dense 27B that fits on one RTX 4090 and ships with 262K native context
Status (April 2026): Qwen3.6-27B was released by Alibaba on April 21, 2026 under the Apache 2.0 license. Weights live at huggingface.co/Qwen/Qwen3.6-27B. It is a dense 27B model — not MoE — with a 262K-token native context that extends to 1M tokens with YaRN, and day-0 support across vLLM, SGLang, and Ollama.
The MoE giants of 2026 — DeepSeek V4, GLM-5.1, MiMo-V2.5-Pro — are exciting on benchmarks but punishing in practice: hundreds of GB of weights, multi-GPU racks, fragile expert-routing kernels, and inference bills that make finance teams flinch. Qwen3.6-27B walks the other direction. It is dense, every parameter activates on every token, VRAM is predictable to the gigabyte, and there is no expert-routing surprise when you cross 8K context.
For most teams the question is not "can we serve a 744B MoE" — it is "can we put one good card in our cluster and serve a frontier-class coding assistant on it?" Qwen3.6-27B is built for exactly that. Q4 fits a single RTX 4090 24GB, Q8 fits a single RTX 5090 32GB, BF16 fits a single L40S 48GB or A100 40GB, and Alibaba is publishing 77.2% on SWE-Bench Verified (vendor-claimed). One card, one container, one model.
Key Specs
Parameters
27B (dense)
Architecture
Dense decoder-only transformer
Native Context
262,144 tokens
Extended Context
1,000,000 tokens (YaRN)
License
Apache 2.0
Release Date
April 21, 2026
Organization
Alibaba (Qwen team)
Primary Tooling
vLLM, SGLang, Ollama, llama.cpp
Why Qwen3.6-27B?
Single-GPU economics — Q4 on RTX 4090 from $0.70–2.50/hr on Clore.ai; no tensor-parallel orchestration to debug
Dense, not MoE — fixed VRAM, no expert hot-spotting, no spiky latency at certain prompts
Apache 2.0 — fully commercial, fine-tunable, redistributable, no usage caps
262K native context, 1M with YaRN — entire codebases, full books, hours of transcripts in one pass
Day-0 vLLM / SGLang / Ollama — pick your serving stack; Qwen shipped configs for all three at release
77.2% SWE-Bench Verified (vendor-claimed) — competitive with much larger MoE models on real coding tasks
Requirements
The whole point is that this model is forgiving. A single RTX 4090 from the Clore.ai marketplace is enough to run Qwen3.6-27B at production-grade quality (Q4) or "good enough for most use cases" speeds. No multi-GPU headaches.
GPU
1× RTX 4090 24GB
1× RTX 5090 32GB
1× L40S 48GB or 1× A100 40GB
1× A100 80GB
VRAM Used
~16–18GB
~28–30GB
~54GB
~54GB + KV cache headroom
RAM
32GB
32GB
64GB
96GB
Disk
20GB NVMe
32GB NVMe
60GB NVMe
60GB NVMe
CUDA
12.1+
12.4+
12.1+
12.1+
Clore.ai pick: For 90% of teams, a single RTX 4090 24GB running Q4 (AWQ or GGUF) is the right answer. You get frontier-class coding for the price of a couple of coffees per day. Step up to RTX 5090 32GB if you want Q8 for slightly better quality, or to L40S / A100 40GB for full BF16 production inference.
Option A — Ollama (Quantized, easiest)
Ollama is the fastest path from "I have a Clore.ai GPU" to "I have a chat endpoint."
The default qwen3.6:27b tag in Ollama maps to Q4_K_M. Use qwen3.6:27b-q8_0 for Q8 if you have an RTX 5090, or qwen3.6:27b-fp16 for full precision (needs an A100 80GB).
Option B — vLLM (Production)
vLLM is the recommended production server. The single-GPU config below targets RTX 4090 with AWQ quantization. The multi-GPU section is there for completeness — but with a 27B dense model, you almost never need it.
For full BF16 on a single L40S 48GB or A100 40GB, drop --quantization awq and point at the base checkpoint (Qwen/Qwen3.6-27B-Instruct, --dtype bfloat16, --max-model-len 131072). For 2× RTX 4090 with tensor parallelism (longer context, bigger KV cache), add --tensor-parallel-size 2.
Option C — SGLang
SGLang shines when you push past the native 262K window with YaRN. Pass --rope-scaling to extend to ~1M tokens.
1M-context costs grow fast. Even with YaRN, KV cache for 1M tokens at BF16 is roughly 40–60GB depending on batch size. Plan for an A100 80GB or H100 if you actually intend to fill the window.
Clore.ai GPU Recommendations
1× RTX 4090 24GB
24GB
Q4 AWQ
50–80 tok/s, 64K ctx
~$0.70–2.50/hr
1× RTX 5090 32GB
32GB
Q8 GPTQ
60–90 tok/s, 96K ctx
~$1.50–3.50/hr
1× L40S 48GB
48GB
BF16
35–55 tok/s, 131K ctx
~$1.20–2.80/hr
1× A100 40GB
40GB
BF16
40–60 tok/s, 96K ctx
~$1.00–2.50/hr
1× A100 80GB
80GB
FP16 + 262K
40–60 tok/s, full native ctx
~$1.80–3.50/hr
2× RTX 4090
48GB
BF16 TP=2
60–80 tok/s, 262K ctx
~$1.50–4.50/hr
Best value, by a mile: 1× RTX 4090 from $0.70/hr running Q4 AWQ via Ollama or vLLM. You get a frontier-class coding model on a single consumer card for less than the cost of a Claude Pro subscription per day.
Use Cases
Single-GPU production deployments — one container on one Clore.ai 4090 and you have a real coding assistant
Coding agents — 77.2% SWE-Bench Verified (vendor-claimed) puts it in the "useful for autonomous PRs" bracket
Long-context RAG — 262K native is enough for entire codebases or weeks of chat logs
1M-token analysis — with YaRN, drop a whole book or a multi-month git log into one prompt
On-prem / air-gapped — Apache 2.0 ships with the product, no API dependency
Edge fine-tuning — 27B dense is friendly to LoRA/QLoRA on a single card
Worker in agent-of-agents — pair as a worker with a larger MoE planner like GLM-5.1
Benchmarks
Vendor-claimed — verify independently. Numbers below come from Alibaba's April 21, 2026 release post. Independent reproductions (Aider, BigCodeBench, LiveCodeBench leaderboards) are still rolling in.
SWE-Bench Verified
77.2%
~71%
~58%
~54%
HumanEval
~93%
~92%
~90%
~88%
LiveCodeBench
~68%
~65%
~55%
~52%
MMLU-Pro
~78%
~76%
~74%
~72%
MATH
~87%
~85%
~78%
~76%
The headline number is SWE-Bench Verified 77.2% — that puts a single-GPU dense model into territory previously reserved for multi-GPU MoE systems. Treat it as a vendor claim until LMSYS / Aider boards confirm.
Troubleshooting
OOM on RTX 4090 (Q4)
Lower --max-model-len to 32768; AWQ at 65K ctx is right at the edge of 24GB
qwen3.6:27b not found in Ollama
Update Ollama; the tag landed late April 2026
YaRN config rejected by vLLM
Requires vLLM ≥ 0.7.x; pass via --rope-scaling JSON, not separate flags
Tool calls silently dropped
Add --enable-auto-tool-choice --tool-call-parser hermes in vLLM
Slow prefill on long context
Add --enable-chunked-prefill and reduce batch size
KV cache OOM at 262K
Drop to Q8 or move to L40S 48GB / A100 80GB
Bad quality near 1M ctx
YaRN extends positions but quality degrades past ~600K; keep critical content near the end
Next Steps
Predecessor: Qwen3.5 — Qwen3.6-27B is the dense successor; same family, sharper coding, longer native ctx
Multimodal sibling: Qwen3.5-Omni — text + audio + image + video if you need more than text
Similar dense-27B class: Gemma 3 — Google's 27B dense competitor, good baseline comparison
MoE alternative: Llama 4 Scout — single-GPU MoE if you want to compare architectures
Frontier MoE step-up: GLM-5.1 — when 27B dense is not enough and you have multi-GPU budget
Links
Rent a GPU: RTX 4090 from $0.70/hr · RTX 5090 32GB · Marketplace
Last updated
Was this helpful?