# NVIDIA Nemotron 3 Super (120B MoE)

> **Nemotron 3 Super** is NVIDIA's open-weights Mixture-of-Experts Hybrid Mamba-Transformer model with 120B total and 12B active parameters, released March 11, 2026. It is designed for complex **agentic AI systems**: autonomous coding, cybersecurity triage, and long-form multi-step research. It delivers **5× higher throughput** than dense models of comparable quality.

## Why Run Nemotron 3 Super on Clore.ai?

Nemotron 3 Super's MoE architecture activates only 12B parameters per forward pass, so per-token compute is close to that of a mid-sized model. All 120B parameters still have to be held in memory, which makes the quantization level the main factor in choosing hardware. On Clore.ai you can rent RTX 5090 or RTX 4090 servers and serve the model with INT4 or FP4 quantization; the hardware table below lists the configurations.

**Key numbers:**

* **120B total parameters**, 12B active (Latent MoE)
* **Hybrid Mamba-Transformer** architecture (first in the Nemotron line with multi-token prediction (MTP) layers)
* **1M token context window**
* Pre-trained in **NVFP4** — native NVIDIA FP4 quantization
* **5× throughput** vs comparable dense models
* NVIDIA Nemotron Open Model License — open weights with commercial use

## Hardware Requirements

| Precision    | GPUs             | Clore.ai Cost | Notes                    |
| ------------ | ---------------- | ------------- | ------------------------ |
| FP4 (native) | 1× RTX 5090 32GB | \~$3.50–5/hr  | Fastest; native NVFP4    |
| INT4         | 2× RTX 4090 24GB | \~$4–6/hr     | Strong option            |
| INT4         | 1× A100 80GB     | \~$20/hr      | Full INT4, single GPU    |
| INT8         | 4× RTX 4090 24GB | \~$8–12/hr    | Near-full quality        |
| BF16 full    | 4× H100 80GB     | \~$24–40/hr   | Training / full fidelity |

> **Best value on Clore.ai:** 2× RTX 5090 (available from \~$7/hr) for native NVFP4 inference with extra VRAM headroom. Full-precision BF16 needs the 4× H100 configuration listed above.
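
As a rough cross-check on the table above, weight memory can be estimated as total parameters times bits per weight. The sketch below is weights-only arithmetic under that assumption: it ignores the KV cache, Mamba state, activations, and runtime overhead, and real quantized checkpoints are somewhat larger because they also store scaling factors. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Every expert must be resident in memory, so use the 120B total,
# not the 12B parameters that are active per token.
for label, bits in [("FP4/INT4", 4), ("INT8/FP8", 8), ("BF16", 16)]:
    print(f"{label:9s} ~{weight_memory_gb(120, bits):.0f} GB of weights")
```

If a configuration's total VRAM is below the estimate for its precision, expect to need CPU offloading or more GPUs, so it is worth verifying the fit on a short rental before committing to a long run.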

## Quick Start: vLLM + Nemotron 3 Super

```bash
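# Pull and run the vLLM OpenAI-compatible server (NVFP4 support requires vLLM >= 0.7.3).
# If your vLLM build rejects the --quantization value below, pass --help to list the accepted ones.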
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization fp4 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92
```

For multi-GPU (2× RTX 4090 in INT4):

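```bash
# --ipc=host gives the tensor-parallel workers access to host shared memory.
# awq_marlin expects a checkpoint that was already quantized with AWQ, so
# point --model at an AWQ build of the model if one is available to you.
docker run --gpus all --rm -it \
  --ipc=host \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization awq_marlin \
  --max-model-len 65536 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90
```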

## SGLang (Alternative — Faster MoE Serving)

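SGLang is an alternative serving engine. Its RadixAttention reuses the KV cache across requests that share a prompt prefix, which suits agentic workloads with long shared system prompts. Throughput gains of 2–5× over vLLM depend heavily on the workload, so benchmark both on your own traffic: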

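```bash
# --host 0.0.0.0 makes the server reachable through the published port;
# --ipc=host gives the tensor-parallel workers access to host shared memory.
docker run --gpus all --rm -it \
  --ipc=host \
  -p 30000:30000 \
  -v /root/.cache:/root/.cache \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --tp 2 \
    --quantization fp8 \
    --context-length 131072 \
    --host 0.0.0.0 \
    --port 30000
```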
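
SGLang also serves an OpenAI-compatible API, so the same chat request works against port 30000. The sketch below builds that request with the Python standard library only; `build_chat_request` is a hypothetical helper, and `localhost` stands in for your server's address.

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"


def build_chat_request(base_url: str, prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Port 30000 matches the SGLang launch command; use 8000 for the vLLM examples.
request = build_chat_request("http://localhost:30000", "Summarize what a Mixture-of-Experts layer does.")
print(request.full_url)

# To send it once the server is up:
# with urllib.request.urlopen(request, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```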

## Deploy on Clore.ai: Step-by-Step

### 1. Rent a GPU

Go to [clore.ai/marketplace](https://clore.ai/marketplace):

* Filter: **RTX 5090** or **RTX 4090 × 2+**
* Sort by price (spot orders are 20–40% cheaper)
* Minimum total VRAM, per the hardware table above: 32GB (FP4); 48GB (INT4); 96GB (INT8); 320GB (BF16)

### 2. Launch Container

In the Clore.ai dashboard, select **Custom Docker** and enter:

```
Image: vllm/vllm-openai:v0.7.3
Ports: 8000
Command: --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 --quantization fp4 --max-model-len 32768
```

Or use the one-liner SSH launch:

```bash
ssh root@<clore-server-ip> "docker run --gpus all -d \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  --name nemotron3 \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization fp4 \
  --max-model-len 32768 && echo 'Started'"
```
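
The `echo 'Started'` in the command above only confirms that the container launched; the API does not answer until the weights have finished loading. A minimal polling sketch, assuming the server exposes vLLM's `/health` endpoint on port 8000 (the function name, timeout, and interval are illustrative choices):

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=1800.0, interval_s=10.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    url = f"{base_url.rstrip('/')}/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        sleep(interval_s)
    return False


# Example (replace the placeholder with your server's address):
# if wait_until_ready("http://<server-ip>:8000"):
#     print("Server is ready")
```

The `opener` and `sleep` parameters exist only so the loop can be exercised without a live server.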

### 3. Test the API

```bash
curl http://<server-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a Python function to scrape GitHub issues and categorize them by severity."}
    ],
    "max_tokens": 2048,
    "temperature": 0.1
  }'
```
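
The JSON returned by the call above follows the OpenAI chat-completions shape, including a `usage` object with token counts. The sketch below pulls out the reply and derives a rough single-request tokens-per-second figure; `summarize_completion` is a hypothetical helper and `sample_response` is hand-written for illustration, not real model output.

```python
def summarize_completion(response: dict, elapsed_s: float) -> dict:
    """Extract the reply text and a rough end-to-end generation rate."""
    choice = response["choices"][0]
    completion_tokens = response.get("usage", {}).get("completion_tokens", 0)
    return {
        "text": choice["message"]["content"],
        # "length" means the reply was cut off by max_tokens.
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": completion_tokens,
        "tokens_per_second": completion_tokens / elapsed_s if elapsed_s > 0 else 0.0,
    }


# Illustrative response with the same structure the server returns.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "def scrape_issues(): ..."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 40, "completion_tokens": 512, "total_tokens": 552},
}

summary = summarize_completion(sample_response, elapsed_s=8.0)
print(summary["finish_reason"], summary["tokens_per_second"])
```

Because `elapsed_s` covers the whole request, the figure includes prompt processing time and will be lower than the server's batched throughput.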

## Agentic Use Case: Multi-Agent Coding Pipeline

Nemotron 3 Super is purpose-built for multi-agent workflows. Here's a minimal example using the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<server-ip>:8000/v1",
    api_key="none"
)

def planning_agent(task: str) -> str:
    """High-level task decomposition."""
    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=[
            {"role": "system", "content": "You are a senior engineering lead. Break down complex tasks into concrete sub-tasks with acceptance criteria."},
            {"role": "user", "content": f"Decompose this task: {task}"}
        ],
        max_tokens=1024,
        temperature=0.0
    )
    return response.choices[0].message.content

def coding_agent(subtask: str) -> str:
    """Code implementation."""
    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=[
            {"role": "system", "content": "You are an expert Python engineer. Write production-quality code with tests."},
            {"role": "user", "content": subtask}
        ],
        max_tokens=2048,
        temperature=0.1
    )
    return response.choices[0].message.content

# Example: autonomous feature implementation
plan = planning_agent("Build a REST API for user authentication with JWT")
print("Plan:", plan)
code = coding_agent(f"Implement step 1 from this plan: {plan}")
print("Code:", code)
```
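
The example above hands the entire plan to the coding agent. To work through the plan one sub-task at a time, the plan text can be split into steps first. This sketch assumes the planning agent answers with a numbered or bulleted list, which is not guaranteed; `split_plan` is a hypothetical helper and `sample_plan` is hand-written, not model output.

```python
import re


def split_plan(plan: str) -> list[str]:
    """Return the numbered or bulleted items in a plan, or the whole text if none are found."""
    steps = []
    for line in plan.splitlines():
        # Matches "1. step", "2) step", "- step", and "* step".
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)", line)
        if match:
            steps.append(match.group(1))
    return steps or [plan.strip()]


sample_plan = """Here is the plan:
1. Design the user table schema
2) Add a /login endpoint that issues JWTs
- Write integration tests
"""
for step in split_plan(sample_plan):
    print(step)

# With the agents defined above:
# for step in split_plan(plan):
#     print(coding_agent(step))
```

Nested bullets are treated as separate steps, so a stricter output format in the planner's system prompt may be needed for longer plans.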

## Benchmarks (March 2026)

| Benchmark          | Nemotron 3 Super | DeepSeek V3 | Llama 4 Maverick |
| ------------------ | ---------------- | ----------- | ---------------- |
| HumanEval          | 92.1%            | 90.8%       | 88.4%            |
| MATH-500           | 89.3%            | 90.2%       | 84.7%            |
| SWE-bench Verified | 65.2%            | 61.4%       | 55.8%            |
| MMLU               | 88.7%            | 87.2%       | 86.1%            |
| Throughput (tok/s) | 1,840            | 410         | 890              |

*Throughput measured on 2× H100 80GB with INT4 quantization.*

## Monitoring & Production Tips

```bash
# Watch GPU memory and utilization
watch -n2 nvidia-smi

# Check vLLM throughput stats
curl http://localhost:8000/metrics 2>/dev/null | grep vllm

# Docker logs (live)
docker logs -f nemotron3

# If OOM: lower --max-model-len or raise --tensor-parallel-size
```
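
The `/metrics` endpoint returns Prometheus text, so the `grep` above can be replaced by a small parser when you want the values in a script. This is a sketch that assumes each sample line has the form `name{labels} value` with no trailing timestamp; `parse_vllm_metrics` is a hypothetical helper and `sample_metrics` is hand-written for illustration.

```python
def parse_vllm_metrics(text: str) -> dict[str, float]:
    """Collect samples whose metric name starts with 'vllm:' from Prometheus text output."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("vllm:"):
            continue  # skips blank lines, "# HELP"/"# TYPE" comments, and other exporters
        name_and_labels, _, value = line.rpartition(" ")
        try:
            metrics[name_and_labels] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


sample_metrics = """# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
vllm:num_requests_waiting{model_name="demo"} 0.0
process_cpu_seconds_total 12.5
"""
for name, value in parse_vllm_metrics(sample_metrics).items():
    print(name, value)
```

A steadily growing `vllm:num_requests_waiting` value is a sign that the server is saturated.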

**Recommended settings for production on Clore.ai:**

* `--max-model-len 32768` for most workloads (a shorter context reserves less VRAM for the KV cache; raise it only if your requests need more)
* `--gpu-memory-utilization 0.90` (leaves 10% of VRAM free for the CUDA context and other allocations outside vLLM's memory pool)
* `--enable-chunked-prefill` for better latency on long inputs
* Enable spot orders for 20–40% cost savings on batch workloads
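
The flags above can be combined into a single launch command. The sketch below assembles one from the image tag, port, volume, and container name used elsewhere in this guide; `vllm_docker_command` is a hypothetical helper, and the quantization flag is left out so you can add the one that matches your checkpoint.

```python
import shlex

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"


def vllm_docker_command(model: str, tensor_parallel_size: int = 1,
                        max_model_len: int = 32768,
                        gpu_memory_utilization: float = 0.90) -> str:
    """Assemble a detached `docker run` line for vLLM with the recommended settings."""
    parts = [
        "docker", "run", "--gpus", "all", "-d", "--ipc=host",
        "-p", "8000:8000",
        "-v", "/root/.cache:/root/.cache",
        "--name", "nemotron3",
        "vllm/vllm-openai:v0.7.3",
        "--model", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enable-chunked-prefill",
    ]
    return shlex.join(parts)


print(vllm_docker_command(MODEL, tensor_parallel_size=2))
```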

## Cost Comparison

| Provider                 | Config      | $/hr     |
| ------------------------ | ----------- | -------- |
| **Clore.ai** (spot)      | 2× RTX 5090 | \~$5.60  |
| **Clore.ai** (on-demand) | 2× RTX 5090 | \~$7.00  |
| Azure AI                 | Hosted API  | \~$15–20 |
| NVIDIA API               | Hosted API  | \~$12–18 |

*At the rates above, self-hosting on Clore.ai is roughly 2–3× cheaper than a managed API for sustained workloads.*
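
To turn the hourly rates in the table into a budget, multiply by the hours you expect to run. A small arithmetic sketch using the two Clore.ai rows; the 30-day, 24-hour usage pattern is an assumption, and `monthly_cost` is an illustrative helper.

```python
SPOT_USD_PER_HOUR = 5.60       # Clore.ai spot, 2x RTX 5090 (from the table)
ON_DEMAND_USD_PER_HOUR = 7.00  # Clore.ai on-demand, 2x RTX 5090 (from the table)


def monthly_cost(usd_per_hour: float, hours_per_day: float = 24.0, days: int = 30) -> float:
    """Cost of keeping a server rented for the given schedule."""
    return usd_per_hour * hours_per_day * days


spot_discount = 1 - SPOT_USD_PER_HOUR / ON_DEMAND_USD_PER_HOUR

print(f"Spot, 24/7 for 30 days:      ${monthly_cost(SPOT_USD_PER_HOUR):,.0f}")
print(f"On-demand, 24/7 for 30 days: ${monthly_cost(ON_DEMAND_USD_PER_HOUR):,.0f}")
print(f"Spot discount:               {spot_discount:.0%}")
```

For bursty workloads, pass a smaller `hours_per_day` to see where renting on demand stops being cheaper than paying per request.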

## Related Guides

* [vLLM Serving](/guides/language-models/vllm.md) — production LLM server with OpenAI-compatible API
* [SGLang](/guides/language-models/sglang.md) — faster MoE throughput with RadixAttention
* [DeepSeek V4](/guides/language-models/deepseek-v4.md) — upcoming 1T-parameter open model
* [CrewAI](/guides/ai-platforms-and-agents/crewai.md) — build multi-agent pipelines with role-based agents
* [OpenHands](/guides/ai-platforms-and-agents/openhands.md) — autonomous software engineering agents
* [GPU Comparison](/guides/getting-started/gpu-comparison.md) — pick the right GPU for your workload

***

*Last updated: March 16, 2026 | Model released: March 11, 2026 | License: NVIDIA Nemotron Open Model License*

