# NVIDIA Nemotron 3 Super (120B MoE)

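> **Nemotron 3 Super** is NVIDIA's open hybrid Mamba-Transformer mixture-of-experts (MoE) model with 120B total and 12B active parameters, released on March 11, 2026. It is designed for complex **agentic AI systems**: autonomous coding, cybersecurity triage, and long multi-step research. It delivers **5× higher throughput** than dense models of comparable quality.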

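## Why run Nemotron 3 Super on Clore.ai?

Because of Nemotron 3 Super's MoE architecture, only 12B parameters are active in each forward pass, so you get frontier-level reasoning at the compute cost of a mid-sized model. On Clore.ai you can rent a single RTX 5090 (32GB) or two RTX 4090s and run it at production speed with full INT4/FP4 quantization.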

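**Key figures:**

* **120B total parameters**, 12B active (latent MoE)
* **Hybrid Mamba-Transformer** architecture (the first in the Nemotron family to use MTP layers)
* **1M-token context window**
* Pretrained in **NVFP4**, NVIDIA's native FP4 quantization format
* **5× throughput** compared with similar dense models
* NVIDIA Nemotron Open Model License: open weights, commercial use permitted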
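
As a rough illustration of the active-versus-total parameter point, the sketch below computes the active fraction from the figures stated above and applies the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. That rule of thumb is an approximation that ignores attention and routing costs; it is an assumption of this sketch, not a figure from NVIDIA.

```python
TOTAL_PARAMS = 120e9   # total parameters, as stated in this guide
ACTIVE_PARAMS = 12e9   # parameters active per forward pass, as stated in this guide

# Share of the model that participates in each forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule of thumb (approximation): ~2 FLOPs per active parameter per token.
moe_gflops_per_token = 2 * ACTIVE_PARAMS / 1e9
dense_gflops_per_token = 2 * TOTAL_PARAMS / 1e9

print(f"Active fraction: {active_fraction:.0%}")
print(f"MoE forward pass: ~{moe_gflops_per_token:.0f} GFLOPs/token")
print(f"Dense 120B forward pass: ~{dense_gflops_per_token:.0f} GFLOPs/token")
```

Note that memory does not shrink the same way: all 120B parameters still have to be resident, even though only a fraction is used per token.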

## Hardware requirements

| Configuration | VRAM             | Clore.ai cost   | Notes                     |
| ------------- | ---------------- | --------------- | ------------------------- |
| FP4 (native)  | 1× RTX 5090 32GB | \~$3.50–5/hour  | Fastest; native NVFP4     |
| INT4          | 2× RTX 4090 24GB | \~$4–6/hour     | Strong option             |
| INT4          | 1× A100 80GB     | \~$20/hour      | Full INT4 on a single GPU |
| INT8          | 4× RTX 4090      | \~$8–12/hour    | Near-full quality         |
| BF16 full     | 4× H100 80GB     | \~$24–40/hour   | Training / full fidelity  |

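> **Best value on Clore.ai:** 2× RTX 5090 (from about $7/hour) for native NVFP4 inference with extra VRAM headroom. BF16 weights for a 120B-parameter model take roughly 240 GB, so full-precision inference needs the 4× H100 configuration listed above.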
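
A quick way to sanity-check a configuration before renting is to estimate the size of the weights alone from the 120B total parameter count. The sketch below is a weights-only lower bound: it ignores the KV cache, activations, quantization scales, and runtime overhead, and the bits-per-weight values are nominal assumptions. By this estimate 4-bit weights come to about 60 GB, so it is worth confirming on a short rental that a smaller configuration actually loads before committing to it.

```python
# Weights-only VRAM estimate; ignores KV cache, activations, and runtime overhead.
TOTAL_PARAMS_BILLION = 120  # total parameter count stated in this guide

# Nominal bits per weight for each precision (scale/zero-point overhead ignored).
BITS_PER_WEIGHT = {"BF16": 16, "INT8": 8, "INT4": 4, "FP4": 4}


def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


estimates = {
    name: weight_memory_gb(TOTAL_PARAMS_BILLION, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```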

## Quick start: vLLM + Nemotron 3 Super

```bash
# Pull and run the vLLM Docker image (NVFP4 support requires vLLM >= 0.7.3)
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization fp4 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92
```

For multi-GPU setups (2× RTX 4090, INT4):

```bash
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization awq_marlin \
  --max-model-len 65536 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90
```

## SGLang (alternative: faster MoE serving)

For production MoE throughput, SGLang's RadixAttention can deliver 2–5× the throughput of vLLM on MoE models, depending on the workload:

```bash
docker run --gpus all --rm -it \
  -p 30000:30000 \
  -v /root/.cache:/root/.cache \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --tp 2 \
    --quantization fp8 \
    --context-length 131072 \
    --port 30000
```

## Deploying on Clore.ai: step-by-step guide

### 1. Rent a GPU

Go to [clore.ai/marketplace](https://clore.ai/marketplace):

* Filter: **RTX 5090** or **RTX 4090 × 2+**
* Sort by price (spot orders are 20–40% cheaper)
* Minimum total VRAM, per the hardware table above: 32GB (FP4), 48GB (INT4), 96GB (INT8), 320GB (BF16)

### 2. Launch the container

In the Clore.ai console, select **Custom Docker** and enter:

```
Image: vllm/vllm-openai:v0.7.3
Port: 8000
Command: --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 --quantization fp4 --max-model-len 32768
```

Or launch it with a single SSH command:

```bash
ssh root@<clore-server-ip> "docker run --gpus all -d \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  --name nemotron3 \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization fp4 \
  --max-model-len 32768 && echo 'Started'"
```

### 3. Test the API

```bash
curl http://<server-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a Python function to scrape GitHub issues and categorize them by severity."}
    ],
    "max_tokens": 2048,
    "temperature": 0.1
  }'
```
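
Loading a model of this size can take several minutes, so a request sent right after the container starts may fail. The sketch below, using only the Python standard library, polls the OpenAI-compatible `GET /v1/models` endpoint until the server responds and builds the same kind of chat request as the `curl` call above. The function names, timeouts, and default `max_tokens` are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.error
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.1,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_until_ready(base_url: str, timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll GET /v1/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet (model still loading or port closed)
        time.sleep(poll_s)
    return False
```

In use, call `wait_until_ready("http://<server-ip>:8000")` first and, once it returns `True`, pass the result of `build_chat_request(...)` to `urllib.request.urlopen`.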

## Agentic use case: multi-agent coding pipeline

Nemotron 3 Super is built for multi-agent workflows. Here is a minimal example that uses the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<server-ip>:8000/v1",
    api_key="none"
)

def planning_agent(task: str) -> str:
    """Decompose a high-level task into sub-tasks."""
    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=[
            {"role": "system", "content": "You are a senior engineering lead. Break down complex tasks into concrete sub-tasks with acceptance criteria."},
            {"role": "user", "content": f"Decompose this task: {task}"}
        ],
        max_tokens=1024,
        temperature=0.0
    )
    return response.choices[0].message.content

def coding_agent(subtask: str) -> str:
    """Implement a sub-task in code."""
    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=[
            {"role": "system", "content": "You are an expert Python engineer. Write production-quality code with tests."},
            {"role": "user", "content": subtask}
        ],
        max_tokens=2048,
        temperature=0.1
    )
    return response.choices[0].message.content

# Example: autonomous feature implementation
plan = planning_agent("Build a REST API for user authentication with JWT")
print("Plan:", plan)
code = coding_agent(f"Implement step 1 from this plan: {plan}")
print("Code:", code)
```
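
The example above hands the whole plan to the coding agent and asks for step 1 only. One possible extension is to split the plan into individual steps and call the coding agent once per step. The helper below is a hypothetical sketch that assumes the planner returns a numbered or bulleted list; real model output varies, so the pattern may need adjusting.

```python
import re


def split_plan_into_steps(plan: str) -> list[str]:
    """Extract steps from lines that start with '1.', '2)', '-' or '*'."""
    steps = []
    for line in plan.splitlines():
        # Match a list marker, at least one space, then the step text.
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)", line)
        if match:
            steps.append(match.group(1))
    return steps


sample_plan = "Plan:\n1. Design schema\n2) Add JWT login\n- Write tests\n\nNotes follow."
print(split_plan_into_steps(sample_plan))
```

Each returned step could then be passed to `coding_agent(step)` in a loop, keeping every request short and focused.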

## Benchmarks (March 2026)

| Benchmark          | Nemotron 3 Super | DeepSeek V3 | Llama 4 Maverick |
| ------------------ | ---------------- | ----------- | ---------------- |
| HumanEval          | 92.1%            | 90.8%       | 88.4%            |
| MATH-500           | 89.3%            | 90.2%       | 84.7%            |
| SWE-bench Verified | 65.2%            | 61.4%       | 55.8%            |
| MMLU               | 88.7%            | 87.2%       | 86.1%            |
| Throughput (tok/s) | 1,840            | 410         | 890              |

*Throughput measured on 2× H100 80GB with INT4 quantization.*
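
To put the throughput row in relative terms, the short calculation below divides Nemotron 3 Super's figure by each of the other two. The numbers are copied from the table; nothing here is newly measured.

```python
# Throughput figures (tokens/s) copied from the benchmark table above.
throughput = {
    "Nemotron 3 Super": 1840,
    "DeepSeek V3": 410,
    "Llama 4 Maverick": 890,
}

baseline = throughput["Nemotron 3 Super"]

# Ratio of Nemotron 3 Super's throughput to each other model's throughput.
ratios = {
    name: round(baseline / tps, 2)
    for name, tps in throughput.items()
    if name != "Nemotron 3 Super"
}

for name, ratio in ratios.items():
    print(f"Nemotron 3 Super vs {name}: {ratio}x")
```

This works out to roughly 4.5× DeepSeek V3 and 2.1× Llama 4 Maverick on the stated hardware.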

## Monitoring and production tips

```bash
# Watch GPU memory and utilization
watch -n2 nvidia-smi

# View vLLM throughput statistics
curl http://localhost:8000/metrics 2>/dev/null | grep vllm

# Docker logs (live)
docker logs -f nemotron3

# On OOM: lower --max-model-len or increase --tensor-parallel-size
```
```
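
The `/metrics` endpoint returns Prometheus text format, which is easy to read from a script as well as with `grep`. The sketch below parses such text into a dictionary. The sample input is illustrative rather than captured from a real server, and the parser assumes each line ends with the value and carries no trailing timestamp.

```python
def parse_metrics(text: str, prefix: str = "vllm") -> dict[str, float]:
    """Parse Prometheus text-format lines into {metric_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics from other exporters.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # last field is not a number; ignore the line
    return metrics


# Illustrative sample, not real server output.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
vllm:num_requests_waiting{model_name="demo"} 0.0
process_cpu_seconds_total 12.5
"""

parsed = parse_metrics(sample)
for name, value in parsed.items():
    print(name, value)
```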

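**Recommended settings for production on Clore.ai:**

* `--max-model-len 32768` suits most workloads (saves VRAM and covers 95% of requests)
* `--gpu-memory-utilization 0.90` (leaves a 10% buffer for MoE routing overhead)
* `--enable-chunked-prefill` for better latency on longer inputs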
* Use spot orders for batch workloads to save 20–40% on cost
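
The recommended flags can be assembled programmatically so that every deployment uses the same settings. The helper below is a hypothetical sketch: its name and defaults are illustrative, and it emits only flags that already appear in this guide.

```python
import shlex


def build_vllm_args(
    model: str,
    max_model_len: int = 32768,
    gpu_memory_utilization: float = 0.90,
    tensor_parallel_size: int = 1,
    chunked_prefill: bool = True,
) -> list[str]:
    """Return vLLM server arguments using the settings recommended above."""
    args = [
        "--model", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    return args


args = build_vllm_args("nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16")
command = shlex.join(args)  # single string, e.g. for the console's command field
print(command)
```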

## Cost comparison

| Provider                 | Configuration | $/hour   |
| ------------------------ | ------------- | -------- |
| **Clore.ai** (spot)      | 2× RTX 5090   | \~$5.60  |
| **Clore.ai** (on-demand) | 2× RTX 5090   | \~$7.00  |
| Azure AI                 | Managed API   | \~$15–20 |
| NVIDIA API               | Managed API   | \~$12–18 |

*For sustained workloads, self-hosting on Clore.ai is 2–3× cheaper than managed APIs.*
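
For budgeting, the hourly prices in the table can be turned into a monthly figure and a spot discount. The sketch below assumes continuous use over a 30-day month, which is an illustrative assumption; the hourly prices are the approximate values from the table.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30 days of continuous use

spot_per_hour = 5.60       # Clore.ai spot, 2x RTX 5090 (from the table above)
on_demand_per_hour = 7.00  # Clore.ai on-demand, 2x RTX 5090 (from the table above)

# Discount of spot relative to on-demand, in percent.
spot_discount_pct = round((1 - spot_per_hour / on_demand_per_hour) * 100, 1)

monthly_on_demand = round(on_demand_per_hour * HOURS_PER_MONTH, 2)
monthly_spot = round(spot_per_hour * HOURS_PER_MONTH, 2)
monthly_saving = round(monthly_on_demand - monthly_spot, 2)

print(f"Spot discount: {spot_discount_pct}%")
print(f"On-demand: ${monthly_on_demand:,.2f}/month")
print(f"Spot: ${monthly_spot:,.2f}/month (saves ${monthly_saving:,.2f})")
```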

## Related guides

* [vLLM serving](/guides/guides_v2-zh/yu-yan-mo-xing/vllm.md): production-grade LLM server with an OpenAI-compatible API
* [SGLang](/guides/guides_v2-zh/yu-yan-mo-xing/sglang.md): faster MoE throughput with RadixAttention
* [DeepSeek V4](/guides/guides_v2-zh/yu-yan-mo-xing/deepseek-v4.md): the upcoming 1T-parameter open model
* [CrewAI](/guides/guides_v2-zh/ai-ping-tai-yu-zhi-neng-ti/crewai.md): build multi-agent pipelines with role-based agents
* [OpenHands](/guides/guides_v2-zh/ai-ping-tai-yu-zhi-neng-ti/openhands.md): autonomous software engineering agent
* [GPU comparison](/guides/guides_v2-zh/kai-shi-shi-yong/gpu-comparison.md): choose the right GPU for your workload

***

*Last updated: March 16, 2026 | Model release: March 11, 2026 | License: NVIDIA Nemotron Open Model License*

