Hy3 Preview (Tencent Hunyuan 3, 295B MoE)
Deploy Tencent's Hy3 Preview (295B MoE, 21B active, 256K ctx) on Clore.ai — the first model from Tencent Hunyuan's rebuilt training stack, tuned for long-horizon reasoning and agentic coding
Status (April 2026): Hy3 Preview is the first public release from Tencent Hunyuan's rebuilt training infrastructure, published on April 13, 2026 and last updated April 23, 2026. Weights live at huggingface.co/tencent/Hy3-preview under the Tencent Hy Community License. Day-0 support landed in vLLM and SGLang.
Hy3 Preview is a 295B-parameter Mixture-of-Experts language model that activates only ~21B parameters per token (192 experts, top-8 routed). It targets two workloads where Tencent has been visibly catching up: long-horizon reasoning (FrontierScience-Olympiad, IMOAnswerBench, math-PhD exams) and agentic coding (SWE-bench Verified 74.4%, Terminal-Bench 2.0 54.4%, vendor-claimed). The 256K context window plus an MTP (Multi-Token Prediction) speculative-decoding layer make it practical for IDE-scale coding agents and document-heavy RAG.
For Clore.ai users, the headline number is 21B active. You don't need a full 8×H200 rack. A tensor-parallel deployment across 4×A100 80GB or 2×H100 80GB (BF16 with offload) is enough to serve it at usable throughput — frontier-class agentic coding for ~$10–20/day on the marketplace, with weights staying on your own box.
Key Specs
Total Parameters
295B (MoE)
Active Parameters
21B per forward pass
Experts
192 total, top-8 routed
Layers
80 transformer + 1 MTP
Attention
64 heads, GQA with 8 KV heads, head dim 128
Hidden Size
4096
Intermediate Size
13,312
Vocabulary
120,832
Context Window
256,000 tokens
Native Precision
BF16
License
Tencent Hy Community License
Release Date
April 13, 2026
Organization
Tencent Hunyuan
Primary Tooling
vLLM, SGLang, AngelSlim, LLaMA-Factory
Why Hy3 Preview?
First on Tencent's rebuilt RL stack — Tencent rewrote its training infrastructure for this release; expect rapid iteration through 2026
21B active MoE — pay the inference cost of a ~21B dense model, not 295B
256K context — enough for full repos, long agent traces, or multi-document RAG in one shot
MTP speculative layer — built-in multi-token prediction gives ~1.5–2× decode speedups on Hopper-class GPUs
Two reasoning modes —
reasoning_effort: "high"for chain-of-thought,"no_think"for fast direct answersAgentic-coding focus — explicitly tuned for SWE-bench-style multi-turn tool use and terminal agents
Open-source-friendly license — Tencent Hy Community License is Apache-style for most uses; verify the LICENSE file for your case
Requirements
Still a 295B-class model. "21B active" describes inference compute, not the memory footprint. The full BF16 weights are ~590GB and must live in VRAM (or be offloaded). Plan for 8×H100/H200 if you want unconstrained throughput; 4×A100 80GB works with offload and shorter contexts.
GPU VRAM
~80GB + 256GB RAM offload
4× A100 80GB (320GB)
8× H100 80GB or 8× H20-3e
RAM
256GB
384GB
512GB
Disk
700GB NVMe
1TB NVMe
1.5TB NVMe
CUDA
12.4+
12.4+
12.6+
Driver
550+
550+
560+
Clore.ai pick: For most teams, 4× A100 80GB with BF16 tensor-parallel and --max-model-len 65536 is the sweet spot (~$10–16/day). If you need full 256K context with concurrent users, jump to 8× H100.
Option A — Ollama / GGUF (Quantized, community builds)
Heads-up: Hy3 Preview is brand new (April 13, 2026) and uses a custom MoE architecture. Community llama.cpp / GGUF support typically lands 2–4 weeks after release. If you need it today, use vLLM (Option B). Check huggingface.co/models?search=hy3-preview+gguf for community quants before pulling.
For pre-GGUF days, AngelSlim (Tencent's own quantization toolkit) can produce W4A16 / W8A8 weights directly from the BF16 checkpoint.
Option B — vLLM (Production API, recommended)
vLLM is Tencent's first-class serving target for Hy3 Preview. The MTP speculative layer is wired in via --speculative-config.method mtp.
Reasoning modes. Set reasoning_effort: "high" to enable chain-of-thought traces (slower, much better on math/coding/agent tasks) or "no_think" for fast direct answers. The vendor-recommended sampling is temperature=0.9, top_p=1.0 — zero-temp sampling can break reasoning traces.
Tight on GPUs? Drop to --tensor-parallel-size 4 on 4× A100 80GB. Keep --max-model-len 32768 and add --enable-chunked-prefill to keep prefill latency reasonable.
Option C — SGLang
SGLang ships day-0 support and pairs the MTP layer with EAGLE speculative decoding for additional throughput on Hopper.
Expect a 1.5–2× throughput boost on long agent loops compared to vanilla decode.
Clore.ai GPU Recommendations
Best value: 4× A100 80GB with BF16 tensor-parallel and a 64K context window. You get an open-weight 295B-class agentic coder for roughly the price of a Claude Pro subscription, and the weights never leave your rented box.
Use Cases
Autonomous SWE agents — 74.4% SWE-bench Verified (vendor-claimed) and explicit tuning for long tool-call loops; pair with OpenHands, SWE-agent, or Aider
Terminal-driven agents — 54.4% Terminal-Bench 2.0 puts it in the top tier for shell/CLI workflows
Long-horizon reasoning — Olympiad-level math (IMOAnswerBench, FrontierScience-Olympiad) and PhD-grade STEM
Codebase-scale RAG — 256K ctx fits a full mid-sized repo plus tests in a single prompt
Search and browsing agents — BrowseComp / WideSearch tuning makes it a strong planner for multi-step web research
Agent-of-agents — use Hy3 Preview as the planner and lighter open models (Qwen3.5, GLM-4.7 Flash) as workers
Benchmarks
Vendor-claimed — verify independently. All numbers below come from Tencent's April 13, 2026 model card. Independent reproductions (especially on SWE-bench Verified) are still rolling in. Treat them as upper bounds until LMSYS / OpenCompass confirms.
SWE-bench Verified
74.4%
~79%
~71%
~78%
Terminal-Bench 2.0
54.4%
—
—
—
GPQA Diamond
87.2%
—
~84%
~88%
SuperGPQA
51.6%
—
—
—
HLE
~30
—
—
—
Tencent also reports strong results on proprietary CL-bench / CL-bench-Life context-learning benchmarks and the Tsinghua Qiuzhen Math PhD exam (Spring 2026).
Troubleshooting
OutOfMemoryError on load
BF16 needs ~590GB total VRAM. Drop to 4×A100 with --max-model-len 32768 or use AngelSlim W4A16 quants.
Slow HuggingFace download
Use huggingface-cli download tencent/Hy3-preview --local-dir ./weights --resume-download. Expect 590GB+.
Tool calls silently dropped
Make sure --tool-call-parser hy_v3 (vLLM) or --tool-call-parser hunyuan (SGLang) is set, and --enable-auto-tool-choice is on.
Reasoning trace empty / wrong
Use temperature=0.9, top_p=1.0. Zero-temp greedy decoding breaks the chain-of-thought. Confirm reasoning_effort: "high".
MTP speculative decoding errors
Requires recent vLLM (post-April 2026 build). Run pip install -U vllm --pre or pin to a tag that lists mtp in release notes.
256K context OOMs
Start at --max-model-len 32768, enable --enable-chunked-prefill, raise gradually. Full 256K realistically needs 8× H200.
Custom architecture rejected
Always pass --trust-remote-code. Hy3 ships custom modeling code with the checkpoint.
Ollama / GGUF not available
Community quants typically arrive 2–4 weeks post-release. Use vLLM or AngelSlim in the meantime.
Next Steps
Closest open-weight peer: GLM-5.1 — 744B / 40B-active MoE, MIT license, top SWE-bench Pro scores
Multimodal alternative: Qwen3.5-Omni — text + audio + image + video, runs on a single RTX 4090
Reasoning-only alternative: DeepSeek R1 — pure long-form reasoning specialist
Rent the hardware: Rent A100 80GB on Clore.ai — 4× A100 80GB instances from ~$10/day
Full marketplace: clore.ai/marketplace — H100, H200, A100, RTX 5090 from $0.50/day
Links
Last updated
Was this helpful?