# Hy3 Preview (Tencent Hunyuan 3, 295B MoE)

{% hint style="info" %}
**Status (April 2026):** Hy3 Preview is the first public release from **Tencent Hunyuan's rebuilt training infrastructure**, published on **April 13, 2026** and last updated **April 23, 2026**. Weights live at [huggingface.co/tencent/Hy3-preview](https://huggingface.co/tencent/Hy3-preview) under the **Tencent Hy Community License**. Day-0 support landed in vLLM and SGLang.
{% endhint %}

Hy3 Preview is a **295B-parameter Mixture-of-Experts** language model that activates only **\~21B parameters per token** (192 experts, top-8 routed). It targets two workloads where Tencent has been visibly catching up: **long-horizon reasoning** (FrontierScience-Olympiad, IMOAnswerBench, math-PhD exams) and **agentic coding** (SWE-bench Verified 74.4%, Terminal-Bench 2.0 54.4%, vendor-claimed). The 256K context window plus an MTP (Multi-Token Prediction) speculative-decoding layer make it practical for IDE-scale coding agents and document-heavy RAG.

For Clore.ai users, the headline number is **21B active**. You don't need a full 8×H200 rack. A tensor-parallel deployment across **4×A100 80GB** or **2×H100 80GB** (BF16 with offload) is enough to serve it at usable throughput — frontier-class agentic coding for \~$10–20/day on the marketplace, with weights staying on your own box.

### Key Specs

| Property          | Value                                       |
| ----------------- | ------------------------------------------- |
| Total Parameters  | 295B (MoE)                                  |
| Active Parameters | 21B per forward pass                        |
| Experts           | 192 total, top-8 routed                     |
| Layers            | 80 transformer + 1 MTP                      |
| Attention         | 64 heads, GQA with 8 KV heads, head dim 128 |
| Hidden Size       | 4096                                        |
| Intermediate Size | 13,312                                      |
| Vocabulary        | 120,832                                     |
| Context Window    | 256,000 tokens                              |
| Native Precision  | BF16                                        |
| License           | Tencent Hy Community License                |
| Release Date      | April 13, 2026                              |
| Organization      | Tencent Hunyuan                             |
| Primary Tooling   | vLLM, SGLang, AngelSlim, LLaMA-Factory      |

### Why Hy3 Preview?

* **First on Tencent's rebuilt RL stack** — Tencent rewrote its training infrastructure for this release; expect rapid iteration through 2026
* **21B active MoE** — pay the inference cost of a \~21B dense model, not 295B
* **256K context** — enough for full repos, long agent traces, or multi-document RAG in one shot
* **MTP speculative layer** — built-in multi-token prediction gives \~1.5–2× decode speedups on Hopper-class GPUs
* **Two reasoning modes** — `reasoning_effort: "high"` for chain-of-thought, `"no_think"` for fast direct answers
* **Agentic-coding focus** — explicitly tuned for SWE-bench-style multi-turn tool use and terminal agents
* **Open-source-friendly license** — Tencent Hy Community License is Apache-style for most uses; verify the LICENSE file for your case

***

## Requirements

{% hint style="warning" %}
**Still a 295B-class model.** "21B active" describes inference compute, not the memory footprint. The full BF16 weights are \~590GB and must live in VRAM (or be offloaded). Plan for 8×H100/H200 if you want unconstrained throughput; 4×A100 80GB works with offload and shorter contexts.
{% endhint %}

| Component | Minimum (Q4 GGUF, offload) | Recommended (BF16, TP) | Full BF16 (production)    |
| --------- | -------------------------- | ---------------------- | ------------------------- |
| GPU VRAM  | \~80GB + 256GB RAM offload | 4× A100 80GB (320GB)   | 8× H100 80GB or 8× H20-3e |
| RAM       | 256GB                      | 384GB                  | 512GB                     |
| Disk      | 700GB NVMe                 | 1TB NVMe               | 1.5TB NVMe                |
| CUDA      | 12.4+                      | 12.4+                  | 12.6+                     |
| Driver    | 550+                       | 550+                   | 560+                      |

**Clore.ai pick:** For most teams, **4× A100 80GB** with BF16 tensor-parallel and `--max-model-len 65536` is the sweet spot (\~$10–16/day). If you need full 256K context with concurrent users, jump to 8× H100.
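
To sanity-check these sizes, here is a rough back-of-the-envelope estimate built only from the spec table above (BF16 weights, 80 layers, 8 KV heads, head dim 128). It is a sketch that ignores activations, the extra MTP layer, and framework overhead, not an exact allocator trace:

```python
# Rough memory math for Hy3 Preview in BF16, using only the spec-table figures above.
# Ignores activations, the extra MTP layer, CUDA graphs, and framework overhead.
BYTES_BF16 = 2

total_params = 295e9                              # total MoE parameters
weights_gb = total_params * BYTES_BF16 / 1e9
print(f"BF16 weights: ~{weights_gb:.0f} GB")      # ~590 GB, matching the warning above

# GQA KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 80, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * BYTES_BF16   # ~0.33 MB per token
for ctx in (32_768, 65_536, 256_000):
    print(f"KV cache at {ctx:>7} tokens: ~{ctx * kv_per_token / 1e9:.0f} GB per sequence")

# Weight shard per GPU under tensor parallelism
# (this is why 4x A100 80GB needs offload or quantization)
for tp in (4, 8):
    print(f"TP={tp}: ~{weights_gb / tp:.0f} GB of weights per GPU")
```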

***

## Option A — Ollama / GGUF (Quantized, community builds)

{% hint style="warning" %}
**Heads-up:** Hy3 Preview is brand new (April 13, 2026) and uses a custom MoE architecture. Community llama.cpp / GGUF support typically lands **2–4 weeks** after release. If you need it today, use vLLM (Option B). Check [huggingface.co/models?search=hy3-preview+gguf](https://huggingface.co/models?search=hy3-preview+gguf) for community quants before pulling.
{% endhint %}

```bash
# Once a Q4_K_M build is published
docker exec ollama ollama pull hy3-preview:q4_K_M
docker exec ollama ollama run hy3-preview:q4_K_M

# Or with llama.cpp directly on a community GGUF
docker run --gpus all -it --rm -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/hy3-preview-q4_k_m.gguf \
  --n-gpu-layers 80 --ctx-size 32768 \
  --port 8080 --host 0.0.0.0
```
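
If you'd rather poll for community conversions programmatically instead of refreshing the Hub search page, a small sketch using `huggingface_hub` works (same search string as the hint above; expect an empty list until quantizers catch up with the new architecture):

```python
# List community GGUF conversions of Hy3 Preview on the HuggingFace Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

api = HfApi()
results = api.list_models(search="hy3-preview gguf", sort="downloads", direction=-1, limit=20)
for model in results:
    print(model.id, model.downloads)
```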

Until community GGUF builds land, AngelSlim (Tencent's own quantization toolkit) can produce W4A16 / W8A8 weights directly from the BF16 checkpoint.

***

## Option B — vLLM (Production API, recommended)

vLLM is Tencent's first-class serving target for Hy3 Preview. The MTP speculative layer is enabled through the `--speculative-config` JSON (`"method": "mtp"`), shown in the compose file below.

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model tencent/Hy3-preview
      --tensor-parallel-size 8
      --max-model-len 65536
      --gpu-memory-utilization 0.90
      --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
      --tool-call-parser hy_v3
      --reasoning-parser hy_v3
      --enable-auto-tool-choice
      --served-model-name hy3-preview
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

```bash
# Test the API with high reasoning effort
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hy3-preview",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Refactor this Python function to use async/await and add proper error handling."}
    ],
    "max_tokens": 4096,
    "temperature": 0.9,
    "top_p": 1.0,
    "reasoning_effort": "high"
  }'
```

{% hint style="info" %}
**Reasoning modes.** Set `reasoning_effort: "high"` to enable chain-of-thought traces (slower, much better on math/coding/agent tasks) or `"no_think"` for fast direct answers. The vendor-recommended sampling is `temperature=0.9, top_p=1.0` — zero-temp sampling can break reasoning traces.
{% endhint %}
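
The same toggle works from the OpenAI Python client. `reasoning_effort` is not a standard chat-completions field, so the sketch below forwards it through `extra_body`, pointing at the compose service above on `localhost:8000`:

```python
# Minimal sketch: toggling Hy3 Preview's reasoning modes through vLLM's OpenAI-compatible API.
# Assumes the docker-compose service above is running locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(prompt: str, effort: str) -> str:
    resp = client.chat.completions.create(
        model="hy3-preview",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,
        temperature=0.9,                              # vendor-recommended sampling
        top_p=1.0,
        extra_body={"reasoning_effort": effort},      # "high" or "no_think"
    )
    return resp.choices[0].message.content

# Deliberate chain-of-thought for hard problems:
print(ask("Prove that the sum of the first n odd numbers is n^2.", "high"))

# Fast direct answer when the reasoning trace is not worth the latency:
print(ask("What does HTTP status 429 mean?", "no_think"))
```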

{% hint style="info" %}
**Tight on GPUs?** Drop to `--tensor-parallel-size 4` on 4× A100 80GB. Keep `--max-model-len 32768` and add `--enable-chunked-prefill` to keep prefill latency reasonable.
{% endhint %}
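
Since the compose file enables `--enable-auto-tool-choice` and the `hy_v3` tool-call parser, tool calls should come back as structured `tool_calls` objects rather than raw text. A minimal sketch, where the `get_weather` function and its schema are illustrative rather than anything from the model card:

```python
# Minimal sketch: one tool-call round trip against the vLLM server above.
# The get_weather tool is a made-up example used only for illustration.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "Do I need an umbrella in Shenzhen today?"}],
    tools=tools,
    temperature=0.9,
    top_p=1.0,
)

# With auto tool choice the model decides whether to call the tool; guard for a plain answer.
message = resp.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```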

***

## Option C — SGLang

SGLang ships day-0 support and pairs the MTP layer with EAGLE speculative decoding for additional throughput on Hopper.

```bash
docker pull lmsysorg/sglang:latest

# Launch inside the container (or a local sglang install):
python3 -m sglang.launch_server \
  --model tencent/Hy3-preview \
  --tp 8 \
  --tool-call-parser hunyuan \
  --reasoning-parser hunyuan \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --mem-fraction-static 0.88 \
  --context-length 65536 \
  --served-model-name hy3-preview
```

Expect a 1.5–2× throughput boost on long agent loops compared to vanilla decode.
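
SGLang exposes the same OpenAI-compatible API, so the smoke test mirrors the vLLM one. A quick sketch, assuming the launcher above is left on SGLang's default port (30000):

```python
# Quick smoke test against the SGLang server above (default port 30000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "Write a one-line bash command that counts lines of Python in a repo."}],
    max_tokens=256,
    temperature=0.9,
    top_p=1.0,
)
print(resp.choices[0].message.content)
```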

***

## Clore.ai GPU Recommendations

| Setup         | VRAM    | Expected Performance                          | Clore.ai Cost | Rent                                                   |
| ------------- | ------- | --------------------------------------------- | ------------- | ------------------------------------------------------ |
| 4× A100 80GB  | 320GB   | BF16 sharded, 64K ctx, \~15–25 tok/s          | \~$10–16/day  | [Rent A100 80GB](https://clore.ai/rent-a100-80gb.html) |
| 2× H100 80GB  | 160GB   | BF16 with offload, smaller ctx, \~12–20 tok/s | \~$12–18/day  | [Rent H100](https://clore.ai/rent-h100.html)           |
| 8× H100 80GB  | 640GB   | BF16 full, 256K ctx, 60+ tok/s with MTP       | \~$48–64/day  | [Rent H100](https://clore.ai/rent-h100.html)           |
| 8× H200 141GB | 1,128GB | BF16 full + max concurrency                   | \~$64–96/day  | [Rent H200](https://clore.ai/rent-h200.html)           |
| 1× RTX 5090   | 32GB    | Q4 GGUF, RAM offload, single user             | \~$3.94/hr    | [Marketplace](https://clore.ai/marketplace)            |

{% hint style="success" %}
**Best value:** 4× A100 80GB with BF16 tensor-parallel and a 64K context window. You get an open-weight 295B-class agentic coder for roughly the price of a Claude Pro subscription, and the weights never leave your rented box.
{% endhint %}

***

## Use Cases

* **Autonomous SWE agents** — 74.4% SWE-bench Verified (vendor-claimed) and explicit tuning for long tool-call loops; pair with OpenHands, SWE-agent, or Aider
* **Terminal-driven agents** — 54.4% Terminal-Bench 2.0 puts it in the top tier for shell/CLI workflows
* **Long-horizon reasoning** — Olympiad-level math (IMOAnswerBench, FrontierScience-Olympiad) and PhD-grade STEM
* **Codebase-scale RAG** — 256K ctx fits a full mid-sized repo plus tests in a single prompt
* **Search and browsing agents** — BrowseComp / WideSearch tuning makes it a strong planner for multi-step web research
* **Agent-of-agents** — use Hy3 Preview as the planner and lighter open models ([Qwen3.5](/guides/language-models/qwen35.md), [GLM-4.7 Flash](/guides/language-models/glm-47-flash.md)) as workers
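
For the agent-of-agents pattern in the last bullet, the simplest wiring is two OpenAI-compatible endpoints and a thin routing layer. A minimal sketch follows; the endpoints, worker model name, and task split are illustrative assumptions, not a prescribed architecture:

```python
# Minimal planner/worker sketch: Hy3 Preview plans, a lighter model executes.
# Endpoints and the worker model name are illustrative assumptions.
from openai import OpenAI

planner = OpenAI(base_url="http://localhost:8000/v1", api_key="x")   # Hy3 Preview (vLLM, above)
worker = OpenAI(base_url="http://localhost:8001/v1", api_key="x")    # e.g. a smaller Qwen/GLM serve

def plan(goal: str) -> list[str]:
    """Ask the big MoE to break a goal into small, self-contained steps."""
    resp = planner.chat.completions.create(
        model="hy3-preview",
        messages=[
            {"role": "system", "content": "Break the user's goal into numbered, self-contained steps, one per line."},
            {"role": "user", "content": goal},
        ],
        temperature=0.9,
        top_p=1.0,
        extra_body={"reasoning_effort": "high"},
    )
    return [line for line in resp.choices[0].message.content.splitlines() if line.strip()]

def execute(step: str) -> str:
    """Hand each step to the cheaper worker model."""
    resp = worker.chat.completions.create(
        model="worker-model",
        messages=[{"role": "user", "content": step}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

for step in plan("Add retry-with-backoff to every HTTP call in our Python client library."):
    print(step, "->", execute(step)[:80])
```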

***

## Benchmarks

{% hint style="warning" %}
**Vendor-claimed — verify independently.** All numbers below come from Tencent's April 13, 2026 model card. Independent reproductions (especially on SWE-bench Verified) are still rolling in. Treat them as upper bounds until independent runs from LMSYS / OpenCompass confirm them.
{% endhint %}

| Benchmark          | Hy3 Preview | GLM-5.1 | DeepSeek R1 | GPT-5.4 |
| ------------------ | ----------- | ------- | ----------- | ------- |
| SWE-bench Verified | **74.4%**   | \~79%   | \~71%       | \~78%   |
| Terminal-Bench 2.0 | **54.4%**   | —       | —           | —       |
| GPQA Diamond       | **87.2%**   | —       | \~84%       | \~88%   |
| SuperGPQA          | 51.6%       | —       | —           | —       |
| HLE                | \~30%       | —       | —           | —       |

Tencent also reports strong results on proprietary CL-bench / CL-bench-Life context-learning benchmarks and the Tsinghua Qiuzhen Math PhD exam (Spring 2026).

***

## Troubleshooting

| Issue                           | Solution                                                                                                                            |
| ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on load      | BF16 needs \~590GB total VRAM. Drop to 4×A100 with `--max-model-len 32768` or use AngelSlim W4A16 quants.                           |
| Slow HuggingFace download       | Use `huggingface-cli download tencent/Hy3-preview --local-dir ./weights --resume-download`. Expect 590GB+.                          |
| Tool calls silently dropped     | Make sure `--tool-call-parser hy_v3` (vLLM) or `--tool-call-parser hunyuan` (SGLang) is set, and `--enable-auto-tool-choice` is on. |
| Reasoning trace empty / wrong   | Use `temperature=0.9, top_p=1.0`. Zero-temp greedy decoding breaks the chain-of-thought. Confirm `reasoning_effort: "high"`.        |
| MTP speculative decoding errors | Requires recent vLLM (post-April 2026 build). Run `pip install -U vllm --pre` or pin to a tag that lists `mtp` in release notes.    |
| 256K context OOMs               | Start at `--max-model-len 32768`, enable `--enable-chunked-prefill`, raise gradually. Full 256K realistically needs 8× H200.        |
| Custom architecture rejected    | Always pass `--trust-remote-code`. Hy3 ships custom modeling code with the checkpoint.                                              |
| Ollama / GGUF not available     | Community quants typically arrive 2–4 weeks post-release. Use vLLM or AngelSlim in the meantime.                                    |

***

## Next Steps

* **Closest open-weight peer:** [GLM-5.1](/guides/language-models/glm-5-1.md) — 744B / 40B-active MoE, MIT license, top SWE-bench Pro scores
* **Multimodal alternative:** [Qwen3.5-Omni](/guides/language-models/qwen35-omni.md) — text + audio + image + video, runs on a single RTX 4090
* **Reasoning-only alternative:** [DeepSeek R1](/guides/language-models/deepseek-r1.md) — pure long-form reasoning specialist
* **Rent the hardware:** [Rent A100 80GB on Clore.ai](https://clore.ai/rent-a100-80gb.html) — 4× A100 80GB instances from \~$10/day
* **Full marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace) — H100, H200, A100, RTX 5090 from $0.50/day

### Links

* [Hy3 Preview on HuggingFace](https://huggingface.co/tencent/Hy3-preview)
* [Hy3 Preview GitHub repo](https://github.com/Tencent-Hunyuan/Hy3-preview)
* [Tencent Hunyuan organization](https://huggingface.co/tencent)
* [vLLM docs](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [AngelSlim — Tencent's quantization toolkit](https://github.com/Tencent/AngelSlim)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/hy3-preview.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
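
For example, from Python (the question text is just an illustration):

```python
# Ask this documentation page a question, as described above.
import requests

url = "https://docs.clore.ai/guides/language-models/hy3-preview.md"
params = {"ask": "Which quantized builds of Hy3 Preview can run on a single 80GB GPU?"}

resp = requests.get(url, params=params, timeout=30)
print(resp.text)
```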
