For the complete documentation index, see llms.txt. This page is also available as Markdown.

NVIDIA Nemotron 3 Super (120B MoE)

Nemotron 3 Super is NVIDIA's open-source 120B-total / 12B-active Mixture-of-Experts Hybrid Mamba-Transformer model, released March 11, 2026. Designed specifically for complex agentic AI systems — autonomous coding, cybersecurity triaging, and long-form multi-step research. Delivers 5× higher throughput vs dense models of comparable quality.

Why Run Nemotron 3 Super on Clore.ai?

Nemotron 3 Super's MoE architecture means only 12B parameters are active per forward pass — so you get frontier-level reasoning at the compute cost of a mid-sized model. On Clore.ai you can rent a single RTX 5090 (32GB) or a pair of RTX 4090s and run it with full INT4/FP4 quantization at production speeds.

Key numbers:

  • 120B total parameters, 12B active (Latent MoE)

  • Hybrid Mamba-Transformer architecture (first in Nemotron line with MTP Layers)

  • 1M token context window

  • Pre-trained in NVFP4 — native NVIDIA FP4 quantization

  • 5× throughput vs comparable dense models

  • NVIDIA Nemotron Open Model License — open weights with commercial use

Hardware Requirements

Config
VRAM
Clore.ai Cost
Notes

FP4 (native)

1× RTX 5090 32GB

~$3.50–5/hr

Fastest; native NVFP4

INT4

2× RTX 4090 24GB

~$4–6/hr

Strong option

INT4

1× A100 80GB

~$20/hr

Full INT4, single GPU

INT8

4× RTX 4090

~$8–12/hr

Near-full quality

BF16 full

4× H100 80GB

~$24–40/hr

Training / full fidelity

Best value on Clore.ai: 2× RTX 5090 (available from ~$7/hr) for BF16 full-precision inference.

Quick Start: vLLM + Nemotron 3 Super

For multi-GPU (2× RTX 4090 in INT4):

SGLang (Alternative — Faster MoE Serving)

For production-grade MoE throughput, SGLang's RadixAttention gives 2–5× better throughput vs vLLM on MoE models:

Deploy on Clore.ai: Step-by-Step

1. Rent a GPU

Go to clore.ai/marketplace:

  • Filter: RTX 5090 or RTX 4090 × 2+

  • Sort by price (spot orders are 20–40% cheaper)

  • Minimum: 32GB VRAM total (FP4); 48GB for INT8; 80GB for BF16

2. Launch Container

In the Clore.ai dashboard, select Custom Docker and enter:

Or use the one-liner SSH launch:

3. Test the API

Agentic Use Case: Multi-Agent Coding Pipeline

Nemotron 3 Super is purpose-built for multi-agent workflows. Here's a minimal example using the OpenAI-compatible API:

Benchmarks (March 2026)

Benchmark
Nemotron 3 Super
DeepSeek V3
Llama 4 Maverick

HumanEval

92.1%

90.8%

88.4%

MATH-500

89.3%

90.2%

84.7%

SWE-bench Verified

65.2%

61.4%

55.8%

MMLU

88.7%

87.2%

86.1%

Throughput (tok/s)

1,840

410

890

Throughput measured on 2× H100 80GB with INT4 quantization.

Monitoring & Production Tips

Recommended settings for production on Clore.ai:

  • --max-model-len 32768 for most workloads (saves VRAM, covers 95% of requests)

  • --gpu-memory-utilization 0.90 (leave 10% buffer for MoE routing overhead)

  • --enable-chunked-prefill for better latency on long inputs

  • Enable spot orders for 30–40% cost savings on batch workloads

Cost Comparison

Provider
Config
$/hr

Clore.ai (spot)

2× RTX 5090

~$5.60

Clore.ai (on-demand)

2× RTX 5090

~$7.00

Azure AI

Hosted API

~$15–20

NVIDIA API

Hosted API

~$12–18

Self-hosting on Clore.ai is 2–3× cheaper than managed API for sustained workloads.

  • vLLM Serving — production LLM server with OpenAI-compatible API

  • SGLang — faster MoE throughput with RadixAttention

  • DeepSeek V4 — upcoming 1T-parameter open model

  • CrewAI — build multi-agent pipelines with role-based agents

  • OpenHands — autonomous software engineering agents

  • GPU Comparison — pick the right GPU for your workload


Last updated: March 16, 2026 | Model released: March 11, 2026 | License: NVIDIA Nemotron Open Model License

Last updated

Was this helpful?