# NVIDIA Nemotron 3 Super (120B MoE)

> **Nemotron 3 Super** is NVIDIA's open-source hybrid Mamba-Transformer Mixture-of-Experts model with 120B total and 12B active parameters, released on March 11, 2026. It is built for complex **agentic AI systems**: autonomous coding, cybersecurity triage, and extensive multi-step research. It delivers **5× higher throughput** than dense models of comparable quality.

## Why Run Nemotron 3 Super on Clore.ai?

Because Nemotron 3 Super is a MoE model, only 12B parameters are active per forward pass, so you get frontier-level reasoning at the compute cost of a mid-sized model. On Clore.ai you can rent a single RTX 5090 (32 GB) or a pair of RTX 4090s and run it with INT4/FP4 quantization at production speed.

**Key numbers:**

* **120B total parameters**, 12B active (latent MoE)
* **Hybrid Mamba-Transformer** architecture (the first in the Nemotron line with MTP, i.e. multi-token prediction, layers)
* **1M-token context window**
* Pretrained in **NVFP4**, NVIDIA's native FP4 quantization format
* **5× throughput** compared with dense models of similar quality
* NVIDIA Nemotron Open Model License: open weights with commercial use permitted

## Hardware Requirements

| Configuration | VRAM              | Clore.ai cost | Notes                     |
| ------------- | ----------------- | ------------- | ------------------------- |
| FP4 (native)  | 1× RTX 5090 32 GB | \~$3.50–5/hr  | Fastest; native NVFP4     |
| INT4          | 2× RTX 4090 24 GB | \~$4–6/hr     | Strong option             |
| INT4          | 1× A100 80 GB     | \~$20/hr      | Full INT4 on a single GPU |
| INT8          | 4× RTX 4090       | \~$8–12/hr    | Near-full quality         |
| Full BF16     | 4× H100 80GB      | \~$24–40/hr   | Training / full fidelity  |

> **Best value on Clore.ai:** 2× RTX 5090 (available from \~$7/hr) for 4-bit (NVFP4) inference. Full-precision BF16 needs the 4× H100 configuration listed in the table.
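
As a sanity check on the table above, the weight footprint can be estimated from the parameter count and the bits per parameter. The sketch below is a weights-only approximation: it ignores KV cache, activations, and runtime overhead, and it assumes all 120B parameters stay resident because the router can select any expert. The name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(total_params: int, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


# Assumption: every expert is kept in memory, not only the 12B active parameters.
TOTAL_PARAMS = 120_000_000_000

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4 / FP4", 4)]:
    print(f"{label:<10} ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Where the estimate exceeds the total VRAM of a configuration, the model either fails to load or needs part of its weights offloaded to host memory, which costs throughput. A short test rental is a cheap way to confirm a configuration before committing to a long one.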

## Quick Start: vLLM + Nemotron 3 Super

```bash
# Pull and run the vLLM Docker image (NVFP4 support requires vLLM >= 0.7.3)
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization fp4 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92
```

For multi-GPU (2× RTX 4090 with INT4):

```bash
docker run --gpus all --rm -it \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization awq_marlin \
  --max-model-len 65536 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90
```

## SGLang (alternative: faster MoE serving)

For production-grade MoE throughput, SGLang's RadixAttention prefix caching can deliver 2–5× higher throughput than vLLM on MoE models, most visibly when requests share long prompt prefixes:

```bash
docker run --gpus all --rm -it \
  -p 30000:30000 \
  -v /root/.cache:/root/.cache \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --tp 2 \
    --quantization fp8 \
    --context-length 131072 \
    --port 30000
```

## Deploying on Clore.ai: Step by Step

### 1. Rent a GPU

Go to [clore.ai/marketplace](https://clore.ai/marketplace):

* Filter: **RTX 5090** or **RTX 4090 × 2+**
* Sort by price (spot orders are 20–40% cheaper)
* Minimum total VRAM, per the hardware table above: 32 GB (FP4), 48 GB (INT4), 96 GB (INT8), 320 GB (BF16)

### 2. Start the Container

In the Clore.ai dashboard, select **Custom Docker** and enter:

```
Image: vllm/vllm-openai:v0.7.3
Ports: 8000
Command: --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 --quantization fp4 --max-model-len 32768
```

Or start it over SSH with a one-liner:

```bash
ssh root@<clore-server-ip> "docker run --gpus all -d \
  -p 8000:8000 \
  -v /root/.cache:/root/.cache \
  --name nemotron3 \
  vllm/vllm-openai:v0.7.3 \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --quantization fp4 \
  --max-model-len 32768 && echo 'Started'"
```
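
After a detached start the container still has to download and load the weights, so the API is not reachable right away. One way to wait for it is to poll the OpenAI-compatible `/v1/models` endpoint until it answers. The sketch below uses only the standard library; `http_ok`, `wait_until_ready`, and the default retry counts are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url: str, attempts: int = 60, delay: float = 10.0,
                     probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call (replace the placeholder with your server address):
# wait_until_ready("http://<server-ip>:8000/v1/models")
```

The `probe` parameter exists so the polling loop can be exercised without a running server, for example by passing a stub function in a test.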

### 3. Test the API

```bash
curl http://<server-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a Python function that searches GitHub issues and categorizes them by severity."}
    ],
    "max_tokens": 2048,
    "temperature": 0.1
  }'
```
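
The request above returns a JSON body in the OpenAI chat-completions shape. As a minimal sketch, the helper below pulls out the generated text together with the `finish_reason`, which is useful for spotting answers cut off by `max_tokens`. The name `extract_reply` is illustrative and not part of any library.

```python
import json
from typing import Optional, Tuple


def extract_reply(raw_body: str) -> Tuple[str, Optional[str]]:
    """Return (assistant text, finish_reason) from a chat-completions response body."""
    data = json.loads(raw_body)
    choices = data.get("choices")
    if not choices:
        # Error responses carry no choices; surface the start of the body instead.
        raise ValueError(f"no choices in response: {raw_body[:200]}")
    first = choices[0]
    return first["message"]["content"], first.get("finish_reason")
```

A `finish_reason` of `"length"` means generation stopped at the token limit, so the answer is probably incomplete and `max_tokens` may need to be raised.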

## Agentic Use Case: Multi-Agent Coding Pipeline

Nemotron 3 Super is designed specifically for multi-agent workflows. Here is a minimal example using the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<server-ip>:8000/v1",
    api_key="none"
)

def planning_agent(task: str) -> str:
    """Break a high-level task into subtasks."""
    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=[
            {"role": "system", "content": "You are a senior engineering manager. Break complex tasks into concrete subtasks with acceptance criteria."},
            {"role": "user", "content": f"Break down this task: {task}"}
        ],
        max_tokens=1024,
        temperature=0.0
    )
    return response.choices[0].message.content

def coding_agent(subtask: str) -> str:
    """Implement a subtask as code."""
    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=[
            {"role": "system", "content": "You are an expert Python engineer. Write production-ready code with tests."},
            {"role": "user", "content": subtask}
        ],
        max_tokens=2048,
        temperature=0.1
    )
    return response.choices[0].message.content

# Example: autonomous feature implementation
plan = planning_agent("Build a REST API for user authentication with JWT")
print("Plan:", plan)
code = coding_agent(f"Implement step 1 from this plan: {plan}")
print("Code:", code)
```
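
The example above hands the whole plan to the coding agent and asks for step 1 only. To run the coding agent once per step, the plan text first has to be split into individual steps. The sketch below assumes the planner answers with a numbered or bulleted list, one step per line; `split_plan` and its pattern are hypothetical helpers, and real model output may need a more forgiving parser.

```python
import re

# Matches lines such as "1. Do X", "2) Do Y", "- Do Z" or "* Do W".
STEP_PATTERN = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$")


def split_plan(plan: str) -> list[str]:
    """Extract list items from a plan, ignoring headings and free text."""
    steps = []
    for line in plan.splitlines():
        match = STEP_PATTERN.match(line)
        if match:
            steps.append(match.group(1))
    return steps


# Usage with the agents defined above:
# for step in split_plan(plan):
#     print(coding_agent(step))
```

Passing one step per call keeps each prompt short, at the cost of the coding agent not seeing the other steps unless they are added to the prompt explicitly.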

## Benchmarks (March 2026)

| Benchmark          | Nemotron 3 Super | DeepSeek V3 | Llama 4 Maverick |
| ------------------ | ---------------- | ----------- | ---------------- |
| HumanEval          | 92.1%            | 90.8%       | 88.4%            |
| MATH-500           | 89.3%            | 90.2%       | 84.7%            |
| SWE-bench Verified | 65.2%            | 61.4%       | 55.8%            |
| MMLU               | 88.7%            | 87.2%       | 86.1%            |
| Throughput (tok/s) | 1,840            | 410         | 890              |

*Throughput was measured on 2× H100 80GB with INT4 quantization.*

## Monitoring and Production Tips

```bash
# Monitor GPU memory and utilization
watch -n2 nvidia-smi

# Check vLLM throughput statistics
curl http://localhost:8000/metrics 2>/dev/null | grep vllm

# Docker logs (live)
docker logs -f nemotron3

# On OOM: lower --max-model-len or raise --tensor-parallel-size
```
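
The `/metrics` endpoint returns Prometheus text format, which is as easy to read from a script as with `grep`. The following sketch keeps only lines whose metric name starts with `vllm`, mirroring the `grep` above, and assumes samples carry no trailing timestamp. `parse_metrics` is a hypothetical helper.

```python
def parse_metrics(text: str, prefix: str = "vllm") -> dict[str, float]:
    """Map each matching sample (name plus labels) to its numeric value."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, "# HELP" / "# TYPE" comments, and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # line does not end in a number; ignore it
    return metrics


# Usage, assuming the server from this guide is running locally:
# import urllib.request
# body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
# for name, value in parse_metrics(body).items():
#     print(name, value)
```

Keeping the labels in the key avoids collisions when the same metric is reported for several label sets.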

**Recommended production settings on Clore.ai:**

* `--max-model-len 32768` for most workloads (saves VRAM, covers 95% of requests)
* `--gpu-memory-utilization 0.90` (leaves a 10% buffer for MoE routing overhead)
* `--enable-chunked-prefill` for better latency on long inputs
* Enable spot orders for 20–40% cost savings on batch workloads

## Cost Comparison

| Provider                 | Configuration | $/hr      |
| ------------------------ | ------------- | --------- |
| **Clore.ai** (spot)      | 2× RTX 5090   | \~$5.60   |
| **Clore.ai** (on-demand) | 2× RTX 5090   | \~$7.00   |
| Azure AI                 | Hosted API    | \~$15–20  |
| NVIDIA API               | Hosted API    | \~$12–18  |

*For sustained workloads, self-hosting on Clore.ai is 2–3× cheaper than a managed API.*
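
The hourly rates in the table translate into monthly figures with simple arithmetic. The sketch below assumes an instance that runs continuously for an average month of 730 hours; that figure and the helper names are assumptions for illustration, and the rates are the approximate ones from the table.

```python
HOURS_PER_MONTH = 730  # assumption: average month, instance never stopped


def monthly_cost(rate_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running at a flat hourly rate for the given number of hours."""
    return rate_per_hour * hours


def savings_fraction(baseline_rate: float, rate: float) -> float:
    """Fraction saved by paying `rate` instead of `baseline_rate`."""
    return (baseline_rate - rate) / baseline_rate


spot, on_demand = 5.60, 7.00  # $/hr for 2x RTX 5090, taken from the table

print(f"spot discount:       {savings_fraction(on_demand, spot):.0%}")
print(f"spot per month:      ${monthly_cost(spot):,.0f}")
print(f"on-demand per month: ${monthly_cost(on_demand):,.0f}")
```

For batch jobs that only run part of the day, pass the actual number of hours to `monthly_cost` instead of relying on the default.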

## Related Guides

* [vLLM Serving](/guides/guides_v2-de/sprachmodelle/vllm.md): production LLM server with an OpenAI-compatible API
* [SGLang](/guides/guides_v2-de/sprachmodelle/sglang.md): faster MoE throughput with RadixAttention
* [DeepSeek V4](/guides/guides_v2-de/sprachmodelle/deepseek-v4.md): upcoming open 1T-parameter model
* [CrewAI](/guides/guides_v2-de/ki-plattformen-and-agenten/crewai.md): build multi-agent pipelines with role-based agents
* [OpenHands](/guides/guides_v2-de/ki-plattformen-and-agenten/openhands.md): autonomous software-engineering agents
* [GPU Comparison](/guides/guides_v2-de/erste-schritte/gpu-comparison.md): choose the right GPU for your workload

***

*Last updated: March 16, 2026 | Model released: March 11, 2026 | License: NVIDIA Nemotron Open Model License*

