Llama 4 (Scout & Maverick)

Run Meta Llama 4 Scout & Maverick MoE models on Clore.ai GPUs

Meta's Llama 4, released April 2025, marks a fundamental shift to a Mixture of Experts (MoE) architecture. Instead of activating all parameters for every token, Llama 4 routes each token to specialized "expert" sub-networks — delivering frontier performance at a fraction of the compute cost. Two open-weight models are available: Scout (ideal for a single GPU) and Maverick (a multi-GPU powerhouse).

Key Features

  • MoE Architecture: Only 17B parameters active per token (out of 109B/400B total)

  • Massive Context Windows: Scout supports 10M tokens, Maverick supports 1M tokens

  • Natively Multimodal: Understands both text and images out of the box

  • Two Models: Scout (16 experts, single-GPU friendly) and Maverick (128 experts, multi-GPU)

  • Competitive Performance: Scout matches Gemma 3 27B; Maverick competes with GPT-4o class models

  • Open Weights: Llama Community License (free for most commercial uses)

Model Variants

| Model | Total Params | Active Params | Experts | Context | Min VRAM (Q4) | Min VRAM (FP16) |
|---|---|---|---|---|---|---|
| Scout | 109B | 17B | 16 | 10M | 12GB | 80GB |
| Maverick | 400B | 17B | 128 | 1M | 48GB (multi) | 320GB (multi) |

Requirements

| Component | Scout (Q4) | Scout (FP16) | Maverick (Q4) |
|---|---|---|---|
| GPU | 1× RTX 4090 | 1× H100 | 4× RTX 4090 |
| VRAM | 24GB | 80GB | 4×24GB |
| RAM | 32GB | 64GB | 128GB |
| Disk | 50GB | 120GB | 250GB |
| CUDA | 11.8+ | 12.0+ | 12.0+ |

Recommended Clore.ai GPU: RTX 4090 24GB (~$0.5–2/day) for Scout — best value

Quick Start with Ollama

The fastest way to get Llama 4 running:
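A minimal sketch, assuming the models are published in the Ollama library under the `llama4:scout` and `llama4:maverick` tags (verify the exact tags on ollama.com before pulling):

```shell
# Install Ollama (official Linux install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and chat with Llama 4 Scout interactively
# (tag assumed -- check the Ollama library for the current name)
ollama run llama4:scout
```

The first `run` downloads the quantized weights, so expect a long initial wait on a fresh Clore.ai instance.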

Ollama as API Server
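Ollama also exposes an OpenAI-compatible HTTP API on port 11434. A sketch, assuming the `llama4:scout` tag from above; binding to `0.0.0.0` lets you reach the rented machine remotely:

```shell
# Serve the API on all interfaces (Ollama's default port is 11434)
OLLAMA_HOST=0.0.0.0 ollama serve &

# Query the OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [{"role": "user", "content": "Summarize MoE in one sentence."}]
  }'
```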

vLLM Setup (Production)

For production workloads with higher throughput:
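A sketch of a vLLM deployment, assuming the Hugging Face repo id `meta-llama/Llama-4-Scout-17B-16E-Instruct` (adjust to the checkpoint you actually use, and accept the license on HF first):

```shell
# Install vLLM (CUDA build from PyPI)
pip install vllm

# Serve Scout on one GPU; keep context modest at first -- it reserves VRAM
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
# For Maverick on a 4-GPU machine, add: --tensor-parallel-size 4
```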

Query vLLM Server
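vLLM serves an OpenAI-compatible API on port 8000 by default. A stdlib-only Python sketch (the model id mirrors the assumed repo name above; any OpenAI client library would work equally well):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default port


def build_payload(prompt: str,
                  model: str = "meta-llama/Llama-4-Scout-17B-16E-Instruct") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }


def ask(prompt: str) -> str:
    """POST the prompt to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("Explain Mixture of Experts in two sentences."))
```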

HuggingFace Transformers
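For direct use in Python, a hedged sketch with the Transformers `pipeline` API (repo id assumed as above; bf16 weights need roughly 80GB of VRAM, so quantize for a 24GB card):

```python
# Repo id assumed -- accept the Llama 4 license on Hugging Face and run
# `huggingface-cli login` before the first download.
MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"


def chat_messages(prompt: str) -> list[dict]:
    """Wrap a user prompt in the chat format the pipeline expects."""
    return [{"role": "user", "content": prompt}]


if __name__ == "__main__":
    # Heavy imports kept here so the helpers above work without torch installed
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype=torch.bfloat16,  # full bf16; use 4-bit quantization on 24GB GPUs
        device_map="auto",           # shard layers across all visible GPUs
    )
    out = pipe(chat_messages("Why are MoE models cheap to run?"), max_new_tokens=200)
    print(out[0]["generated_text"][-1]["content"])
```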

Docker Quick Start
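The same vLLM server can run from its official container image, which is convenient on a freshly rented Clore.ai machine. A sketch, assuming the `vllm/vllm-openai` image and the Scout repo id used above; `<your-hf-token>` is a placeholder for your own token:

```shell
# Run vLLM's OpenAI-compatible server in Docker; mount the HF cache so
# the ~50GB+ of weights survive container restarts
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=<your-hf-token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 8192
```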

Why MoE Matters on Clore.ai

Traditional dense models (like Llama 3.3 70B) need massive VRAM because all 70B parameters are active. Llama 4 Scout has 109B total but only activates 17B per token — meaning:

  • Same quality as 70B+ dense models at a fraction of the VRAM cost

  • Fits on a single RTX 4090 in quantized mode

  • 10M token context — process entire codebases, long documents, books

  • Cheaper to rent — $0.5–2/day instead of $6–12/day for 70B models
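The routing idea behind these savings can be sketched in a few lines: a learned gate scores every expert for each token, only the top-k experts actually execute, so compute tracks active rather than total parameters. A toy illustration (not Meta's implementation; the gate values are random stand-ins, and Llama 4's real router differs in detail):

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 16  # Scout has 16 experts
TOP_K = 1         # only a small fraction of experts run per token


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route(gate_scores, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalize their weights."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))


# One token's (random) gate logits: only TOP_K of NUM_EXPERTS experts execute,
# so per-token FLOPs scale with 17B active params, not 109B total.
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
print(route(logits))
```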

Tips for Clore.ai Users

  • Start with Scout Q4: Best bang for buck on RTX 4090 — $0.5–2/day, covers 95% of use cases

  • Use --max-model-len wisely: Don't set context higher than you need — it reserves VRAM. Start at 8192, increase as needed

  • Tensor Parallel for Maverick: Rent 4× RTX 4090 machines for Maverick; use --tensor-parallel-size 4

  • HuggingFace Login Required: huggingface-cli login — you need to accept the Llama license on HF first

  • Ollama for Quick Tests, vLLM for Production: Ollama is faster to set up; vLLM gives higher throughput for API serving

  • Monitor GPU Memory: watch nvidia-smi — MoE models can spike VRAM on long sequences

Troubleshooting

| Issue | Solution |
|---|---|
| OutOfMemoryError | Reduce --max-model-len, use Q4 quantization, or upgrade GPU |
| Model download fails | Run huggingface-cli login and accept the Llama 4 license at hf.co |
| Slow generation | Ensure the GPU is being used (nvidia-smi); check --gpu-memory-utilization |
| vLLM crashes on start | Reduce context length; ensure CUDA 11.8+ is installed |
| Ollama shows wrong model | Run ollama list to verify; ollama rm + ollama pull to re-download |
