For the complete documentation index, see llms.txt. This page is also available as Markdown.

Hy3 Preview (Tencent Hunyuan 3, 295B MoE)

Deploy Tencent's Hy3 Preview (295B MoE, 21B active, 256K ctx) on Clore.ai — the first model from Tencent Hunyuan's rebuilt training stack, tuned for long-horizon reasoning and agentic coding

Status (April 2026): Hy3 Preview is the first public release from Tencent Hunyuan's rebuilt training infrastructure, published on April 13, 2026 and last updated April 23, 2026. Weights live at huggingface.co/tencent/Hy3-preview under the Tencent Hy Community License. Day-0 support landed in vLLM and SGLang.

Hy3 Preview is a 295B-parameter Mixture-of-Experts language model that activates only ~21B parameters per token (192 experts, top-8 routed). It targets two workloads where Tencent has been visibly catching up: long-horizon reasoning (FrontierScience-Olympiad, IMOAnswerBench, math-PhD exams) and agentic coding (SWE-bench Verified 74.4%, Terminal-Bench 2.0 54.4%, vendor-claimed). The 256K context window plus an MTP (Multi-Token Prediction) speculative-decoding layer make it practical for IDE-scale coding agents and document-heavy RAG.

For Clore.ai users, the headline number is 21B active. You don't need a full 8×H200 rack. A tensor-parallel deployment across 4×A100 80GB or 2×H100 80GB (BF16 with offload) is enough to serve it at usable throughput — frontier-class agentic coding for ~$10–20/day on the marketplace, with weights staying on your own box.

Key Specs

Property
Value

Total Parameters

295B (MoE)

Active Parameters

21B per forward pass

Experts

192 total, top-8 routed

Layers

80 transformer + 1 MTP

Attention

64 heads, GQA with 8 KV heads, head dim 128

Hidden Size

4096

Intermediate Size

13,312

Vocabulary

120,832

Context Window

256,000 tokens

Native Precision

BF16

License

Tencent Hy Community License

Release Date

April 13, 2026

Organization

Tencent Hunyuan

Primary Tooling

vLLM, SGLang, AngelSlim, LLaMA-Factory

Why Hy3 Preview?

  • First on Tencent's rebuilt RL stack — Tencent rewrote its training infrastructure for this release; expect rapid iteration through 2026

  • 21B active MoE — pay the inference cost of a ~21B dense model, not 295B

  • 256K context — enough for full repos, long agent traces, or multi-document RAG in one shot

  • MTP speculative layer — built-in multi-token prediction gives ~1.5–2× decode speedups on Hopper-class GPUs

  • Two reasoning modesreasoning_effort: "high" for chain-of-thought, "no_think" for fast direct answers

  • Agentic-coding focus — explicitly tuned for SWE-bench-style multi-turn tool use and terminal agents

  • Open-source-friendly license — Tencent Hy Community License is Apache-style for most uses; verify the LICENSE file for your case


Requirements

Component
Minimum (Q4 GGUF, offload)
Recommended (BF16, TP)
Full BF16 (production)

GPU VRAM

~80GB + 256GB RAM offload

4× A100 80GB (320GB)

8× H100 80GB or 8× H20-3e

RAM

256GB

384GB

512GB

Disk

700GB NVMe

1TB NVMe

1.5TB NVMe

CUDA

12.4+

12.4+

12.6+

Driver

550+

550+

560+

Clore.ai pick: For most teams, 4× A100 80GB with BF16 tensor-parallel and --max-model-len 65536 is the sweet spot (~$10–16/day). If you need full 256K context with concurrent users, jump to 8× H100.


Option A — Ollama / GGUF (Quantized, community builds)

For pre-GGUF days, AngelSlim (Tencent's own quantization toolkit) can produce W4A16 / W8A8 weights directly from the BF16 checkpoint.


vLLM is Tencent's first-class serving target for Hy3 Preview. The MTP speculative layer is wired in via --speculative-config.method mtp.

Reasoning modes. Set reasoning_effort: "high" to enable chain-of-thought traces (slower, much better on math/coding/agent tasks) or "no_think" for fast direct answers. The vendor-recommended sampling is temperature=0.9, top_p=1.0 — zero-temp sampling can break reasoning traces.

Tight on GPUs? Drop to --tensor-parallel-size 4 on 4× A100 80GB. Keep --max-model-len 32768 and add --enable-chunked-prefill to keep prefill latency reasonable.


Option C — SGLang

SGLang ships day-0 support and pairs the MTP layer with EAGLE speculative decoding for additional throughput on Hopper.

Expect a 1.5–2× throughput boost on long agent loops compared to vanilla decode.


Clore.ai GPU Recommendations

Setup
VRAM
Expected Performance
Clore.ai Cost
Rent

4× A100 80GB

320GB

BF16 sharded, 64K ctx, ~15–25 tok/s

~$10–16/day

2× H100 80GB

160GB

BF16 with offload, smaller ctx, ~12–20 tok/s

~$12–18/day

8× H100 80GB

640GB

BF16 full, 256K ctx, 60+ tok/s with MTP

~$48–64/day

8× H200 141GB

1,128GB

BF16 full + max concurrency

~$64–96/day

1× RTX 5090

32GB

Q4 GGUF, RAM offload, single user

~$3.94/hr


Use Cases

  • Autonomous SWE agents — 74.4% SWE-bench Verified (vendor-claimed) and explicit tuning for long tool-call loops; pair with OpenHands, SWE-agent, or Aider

  • Terminal-driven agents — 54.4% Terminal-Bench 2.0 puts it in the top tier for shell/CLI workflows

  • Long-horizon reasoning — Olympiad-level math (IMOAnswerBench, FrontierScience-Olympiad) and PhD-grade STEM

  • Codebase-scale RAG — 256K ctx fits a full mid-sized repo plus tests in a single prompt

  • Search and browsing agents — BrowseComp / WideSearch tuning makes it a strong planner for multi-step web research

  • Agent-of-agents — use Hy3 Preview as the planner and lighter open models (Qwen3.5, GLM-4.7 Flash) as workers


Benchmarks

Benchmark
Hy3 Preview
GLM-5.1
DeepSeek R1
GPT-5.4

SWE-bench Verified

74.4%

~79%

~71%

~78%

Terminal-Bench 2.0

54.4%

GPQA Diamond

87.2%

~84%

~88%

SuperGPQA

51.6%

HLE

~30

Tencent also reports strong results on proprietary CL-bench / CL-bench-Life context-learning benchmarks and the Tsinghua Qiuzhen Math PhD exam (Spring 2026).


Troubleshooting

Issue
Solution

OutOfMemoryError on load

BF16 needs ~590GB total VRAM. Drop to 4×A100 with --max-model-len 32768 or use AngelSlim W4A16 quants.

Slow HuggingFace download

Use huggingface-cli download tencent/Hy3-preview --local-dir ./weights --resume-download. Expect 590GB+.

Tool calls silently dropped

Make sure --tool-call-parser hy_v3 (vLLM) or --tool-call-parser hunyuan (SGLang) is set, and --enable-auto-tool-choice is on.

Reasoning trace empty / wrong

Use temperature=0.9, top_p=1.0. Zero-temp greedy decoding breaks the chain-of-thought. Confirm reasoning_effort: "high".

MTP speculative decoding errors

Requires recent vLLM (post-April 2026 build). Run pip install -U vllm --pre or pin to a tag that lists mtp in release notes.

256K context OOMs

Start at --max-model-len 32768, enable --enable-chunked-prefill, raise gradually. Full 256K realistically needs 8× H200.

Custom architecture rejected

Always pass --trust-remote-code. Hy3 ships custom modeling code with the checkpoint.

Ollama / GGUF not available

Community quants typically arrive 2–4 weeks post-release. Use vLLM or AngelSlim in the meantime.


Next Steps

  • Closest open-weight peer: GLM-5.1 — 744B / 40B-active MoE, MIT license, top SWE-bench Pro scores

  • Multimodal alternative: Qwen3.5-Omni — text + audio + image + video, runs on a single RTX 4090

  • Reasoning-only alternative: DeepSeek R1 — pure long-form reasoning specialist

  • Rent the hardware: Rent A100 80GB on Clore.ai — 4× A100 80GB instances from ~$10/day

  • Full marketplace: clore.ai/marketplace — H100, H200, A100, RTX 5090 from $0.50/day

Last updated

Was this helpful?