For the complete documentation index, see llms.txt. This page is also available as Markdown.

Qwen3.6-27B (Dense, Single-GPU)

Deploy Qwen3.6-27B by Alibaba on Clore.ai — a dense 27B that fits on one RTX 4090 and ships with 262K native context

Status (April 2026): Qwen3.6-27B was released by Alibaba on April 21, 2026 under the Apache 2.0 license. Weights live at huggingface.co/Qwen/Qwen3.6-27B. It is a dense 27B model — not MoE — with a 262K-token native context that extends to 1M tokens with YaRN, and day-0 support across vLLM, SGLang, and Ollama.

The MoE giants of 2026 — DeepSeek V4, GLM-5.1, MiMo-V2.5-Pro — are exciting on benchmarks but punishing in practice: hundreds of GB of weights, multi-GPU racks, fragile expert-routing kernels, and inference bills that make finance teams flinch. Qwen3.6-27B walks the other direction. It is dense, every parameter activates on every token, VRAM is predictable to the gigabyte, and there is no expert-routing surprise when you cross 8K context.

For most teams the question is not "can we serve a 744B MoE" — it is "can we put one good card in our cluster and serve a frontier-class coding assistant on it?" Qwen3.6-27B is built for exactly that. Q4 fits a single RTX 4090 24GB, Q8 fits a single RTX 5090 32GB, BF16 fits a single L40S 48GB or A100 40GB, and Alibaba is publishing 77.2% on SWE-Bench Verified (vendor-claimed). One card, one container, one model.

Key Specs

Property
Value

Parameters

27B (dense)

Architecture

Dense decoder-only transformer

Native Context

262,144 tokens

Extended Context

1,000,000 tokens (YaRN)

License

Apache 2.0

Release Date

April 21, 2026

Organization

Alibaba (Qwen team)

Primary Tooling

vLLM, SGLang, Ollama, llama.cpp

Why Qwen3.6-27B?

  • Single-GPU economics — Q4 on RTX 4090 from $0.70–2.50/hr on Clore.ai; no tensor-parallel orchestration to debug

  • Dense, not MoE — fixed VRAM, no expert hot-spotting, no spiky latency at certain prompts

  • Apache 2.0 — fully commercial, fine-tunable, redistributable, no usage caps

  • 262K native context, 1M with YaRN — entire codebases, full books, hours of transcripts in one pass

  • Day-0 vLLM / SGLang / Ollama — pick your serving stack; Qwen shipped configs for all three at release

  • 77.2% SWE-Bench Verified (vendor-claimed) — competitive with much larger MoE models on real coding tasks


Requirements

Component
Q4 (GGUF / AWQ)
Q8 (GGUF / GPTQ)
BF16
Full FP16

GPU

1× RTX 4090 24GB

1× RTX 5090 32GB

1× L40S 48GB or 1× A100 40GB

1× A100 80GB

VRAM Used

~16–18GB

~28–30GB

~54GB

~54GB + KV cache headroom

RAM

32GB

32GB

64GB

96GB

Disk

20GB NVMe

32GB NVMe

60GB NVMe

60GB NVMe

CUDA

12.1+

12.4+

12.1+

12.1+

Clore.ai pick: For 90% of teams, a single RTX 4090 24GB running Q4 (AWQ or GGUF) is the right answer. You get frontier-class coding for the price of a couple of coffees per day. Step up to RTX 5090 32GB if you want Q8 for slightly better quality, or to L40S / A100 40GB for full BF16 production inference.


Option A — Ollama (Quantized, easiest)

Ollama is the fastest path from "I have a Clore.ai GPU" to "I have a chat endpoint."

The default qwen3.6:27b tag in Ollama maps to Q4_K_M. Use qwen3.6:27b-q8_0 for Q8 if you have an RTX 5090, or qwen3.6:27b-fp16 for full precision (needs an A100 80GB).


Option B — vLLM (Production)

vLLM is the recommended production server. The single-GPU config below targets RTX 4090 with AWQ quantization. The multi-GPU section is there for completeness — but with a 27B dense model, you almost never need it.

For full BF16 on a single L40S 48GB or A100 40GB, drop --quantization awq and point at the base checkpoint (Qwen/Qwen3.6-27B-Instruct, --dtype bfloat16, --max-model-len 131072). For 2× RTX 4090 with tensor parallelism (longer context, bigger KV cache), add --tensor-parallel-size 2.


Option C — SGLang

SGLang shines when you push past the native 262K window with YaRN. Pass --rope-scaling to extend to ~1M tokens.


Clore.ai GPU Recommendations

Setup
VRAM
Mode
Expected Performance
Clore.ai Cost

1× RTX 4090 24GB

24GB

Q4 AWQ

50–80 tok/s, 64K ctx

~$0.70–2.50/hr

1× RTX 5090 32GB

32GB

Q8 GPTQ

60–90 tok/s, 96K ctx

~$1.50–3.50/hr

1× L40S 48GB

48GB

BF16

35–55 tok/s, 131K ctx

~$1.20–2.80/hr

1× A100 40GB

40GB

BF16

40–60 tok/s, 96K ctx

~$1.00–2.50/hr

1× A100 80GB

80GB

FP16 + 262K

40–60 tok/s, full native ctx

~$1.80–3.50/hr

2× RTX 4090

48GB

BF16 TP=2

60–80 tok/s, 262K ctx

~$1.50–4.50/hr


Use Cases

  • Single-GPU production deployments — one container on one Clore.ai 4090 and you have a real coding assistant

  • Coding agents — 77.2% SWE-Bench Verified (vendor-claimed) puts it in the "useful for autonomous PRs" bracket

  • Long-context RAG — 262K native is enough for entire codebases or weeks of chat logs

  • 1M-token analysis — with YaRN, drop a whole book or a multi-month git log into one prompt

  • On-prem / air-gapped — Apache 2.0 ships with the product, no API dependency

  • Edge fine-tuning — 27B dense is friendly to LoRA/QLoRA on a single card

  • Worker in agent-of-agents — pair as a worker with a larger MoE planner like GLM-5.1


Benchmarks

Benchmark
Qwen3.6-27B
Qwen3.5-35B
Gemma 3 27B
Llama 4 Scout

SWE-Bench Verified

77.2%

~71%

~58%

~54%

HumanEval

~93%

~92%

~90%

~88%

LiveCodeBench

~68%

~65%

~55%

~52%

MMLU-Pro

~78%

~76%

~74%

~72%

MATH

~87%

~85%

~78%

~76%

The headline number is SWE-Bench Verified 77.2% — that puts a single-GPU dense model into territory previously reserved for multi-GPU MoE systems. Treat it as a vendor claim until LMSYS / Aider boards confirm.


Troubleshooting

Issue
Solution

OOM on RTX 4090 (Q4)

Lower --max-model-len to 32768; AWQ at 65K ctx is right at the edge of 24GB

qwen3.6:27b not found in Ollama

Update Ollama; the tag landed late April 2026

YaRN config rejected by vLLM

Requires vLLM ≥ 0.7.x; pass via --rope-scaling JSON, not separate flags

Tool calls silently dropped

Add --enable-auto-tool-choice --tool-call-parser hermes in vLLM

Slow prefill on long context

Add --enable-chunked-prefill and reduce batch size

KV cache OOM at 262K

Drop to Q8 or move to L40S 48GB / A100 80GB

Bad quality near 1M ctx

YaRN extends positions but quality degrades past ~600K; keep critical content near the end


Next Steps

  • Predecessor: Qwen3.5 — Qwen3.6-27B is the dense successor; same family, sharper coding, longer native ctx

  • Multimodal sibling: Qwen3.5-Omni — text + audio + image + video if you need more than text

  • Similar dense-27B class: Gemma 3 — Google's 27B dense competitor, good baseline comparison

  • MoE alternative: Llama 4 Scout — single-GPU MoE if you want to compare architectures

  • Frontier MoE step-up: GLM-5.1 — when 27B dense is not enough and you have multi-GPU budget

Last updated

Was this helpful?