# Llama 4 (Scout & Maverick)

Meta's Llama 4, released April 2025, marks a fundamental shift to **Mixture of Experts (MoE)** architecture. Instead of activating all parameters for every token, Llama 4 routes each token to specialized "expert" sub-networks — delivering frontier performance at a fraction of the compute cost. Two open-weight models are available: **Scout** (ideal for single-GPU) and **Maverick** (multi-GPU powerhouse).

## Key Features

* **MoE Architecture**: Only 17B parameters active per token (out of 109B/400B total)
* **Massive Context Windows**: Scout supports 10M tokens, Maverick supports 1M tokens
* **Natively Multimodal**: Understands both text and images out of the box
* **Two Models**: Scout (16 experts, single-GPU friendly) and Maverick (128 experts, multi-GPU)
* **Competitive Performance**: Scout matches Gemma 3 27B; Maverick competes with GPT-4o class models
* **Open Weights**: Llama Community License (free for most commercial uses)

## Model Variants

| Model        | Total Params | Active Params | Experts | Context | Min VRAM (Q4) | Min VRAM (FP16) |
| ------------ | ------------ | ------------- | ------- | ------- | ------------- | --------------- |
| **Scout**    | 109B         | 17B           | 16      | 10M     | 12GB          | 80GB            |
| **Maverick** | 400B         | 17B           | 128     | 1M      | 48GB (multi)  | 320GB (multi)   |

## Requirements

| Component | Scout (Q4)  | Scout (FP16) | Maverick (Q4) |
| --------- | ----------- | ------------ | ------------- |
| GPU       | 1× RTX 4090 | 1× H100      | 4× RTX 4090   |
| VRAM      | 24GB        | 80GB         | 4×24GB        |
| RAM       | 32GB        | 64GB         | 128GB         |
| Disk      | 50GB        | 120GB        | 250GB         |
| CUDA      | 11.8+       | 12.0+        | 12.0+         |

**Recommended Clore.ai GPU**: RTX 4090 24GB (\~$0.5–2/day) for Scout — best value

## Quick Start with Ollama

The fastest way to get Llama 4 running:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Scout (quantized, ~12GB VRAM)
ollama run llama4:scout

# For a longer context window (uses more VRAM), set it from inside
# the interactive session:
#   /set parameter num_ctx 32768
```
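
The context window can also be set per request through Ollama's native REST API via the `options.num_ctx` field. A minimal sketch, assuming the Ollama server is reachable on its default port (see the API server section below); the prompt and context size are illustrative:

```bash
# One-off generation request with a 32K context window
# (larger num_ctx values need more VRAM/RAM)
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Summarize the key ideas behind Mixture of Experts.",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'
```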

### Ollama as API Server

```bash
# Start the server in the background (skip if the installer already runs Ollama as a service)
ollama serve &

# Pull model
ollama pull llama4:scout

# Query via OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4-scout",
    "messages": [{"role": "user", "content": "Explain MoE architecture in 3 sentences"}]
  }'
```
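
Because the endpoint is OpenAI-compatible, the same `openai` Python client shown later for vLLM works here as well; only the base URL and model tag change. A short sketch (the model tag should match whatever `ollama list` reports):

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the API key is required by the
# client library but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Explain MoE architecture in 3 sentences"}],
)
print(response.choices[0].message.content)
```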

## vLLM Setup (Production)

For production workloads with higher throughput:

```bash
# Install vLLM
pip install vllm

# Serve Scout on a single GPU (on 24GB cards you will likely need a
# pre-quantized checkpoint or vLLM's --quantization option)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# Serve Scout on 2 GPUs (longer context)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.90

# Serve Maverick on 4 GPUs
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 65536
```
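
Once the server reports it is ready, you can sanity-check it with the OpenAI-compatible endpoints before wiring up clients:

```bash
# List the models the vLLM server is exposing
curl http://localhost:8000/v1/models

# Minimal chat request to confirm generation works
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 16
  }'
```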

### Query vLLM Server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=1024
)
print(response.choices[0].message.content)
```
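
For interactive use you can stream tokens as they are generated instead of waiting for the full response. A short sketch reusing the same client (the prompt and limits are illustrative):

```python
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences"}],
    temperature=0.7,
    max_tokens=256,
    stream=True,  # receive incremental chunks instead of one response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```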

## HuggingFace Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit quantization so the weights fit on 24GB GPUs (requires bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quant_config,
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a REST API with FastAPI that manages a todo list"}
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=2048, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
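
If you want tokens printed as they are produced rather than after the whole generation finishes, transformers ships a simple streamer. A minimal sketch reusing the objects above:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated,
# skipping the prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    streamer=streamer,
)
```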

## Docker Quick Start

```bash
# Using vLLM Docker image
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 32768
```
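
Llama 4 weights are gated on HuggingFace, so the container needs your access token unless the weights are already present in the mounted cache. A sketch with the token passed as an environment variable (the token value is a placeholder):

```bash
# Pass a HuggingFace access token so the container can download gated weights
docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN="hf_your_token_here" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 32768
```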

## Why MoE Matters on Clore.ai

Traditional dense models (like Llama 3.3 70B) run all 70B parameters for every token, so compute and latency scale with the full model size. Llama 4 Scout stores 109B parameters but routes each token through only 17B of them (see the toy routing sketch after this list) — meaning:

* **Quality competitive with 70B+ dense models** at a fraction of the per-token compute
* **Fits on a single RTX 4090** in quantized mode
* **10M token context** — process entire codebases, long documents, books
* **Cheaper to rent** — $0.5–2/day instead of $6–12/day for 70B models
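
The numbers above follow from how MoE routing works: a small gating network scores the experts for each token, and only the top-scoring experts actually run. A toy, framework-free sketch of top-k routing (sizes and names are illustrative, not Llama 4's real dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 16, 1               # toy sizes, not Llama 4's actual ones
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))  # gating network weights

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token to its top-k experts; the other experts do no work."""
    logits = token @ router                      # score each expert for this token
    chosen = np.argsort(logits)[-top_k:]         # indices of the best-scoring experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over chosen
    # Only `top_k` of the expert matrices are multiplied — the rest stay idle,
    # which is why active parameters are far fewer than total parameters.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.standard_normal(d_model))
print(out.shape)  # (64,)
```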

## Tips for Clore.ai Users

* **Start with Scout Q4**: Best bang for buck on RTX 4090 — $0.5–2/day, covers 95% of use cases
* **Use `--max-model-len` wisely**: Don't set context higher than you need — it reserves VRAM. Start at 8192, increase as needed
* **Tensor Parallel for Maverick**: Rent 4× RTX 4090 machines for Maverick; use `--tensor-parallel-size 4`
* **HuggingFace Login Required**: `huggingface-cli login` — you need to accept the Llama license on HF first
* **Ollama for Quick Tests, vLLM for Production**: Ollama is faster to set up; vLLM gives higher throughput for API serving
* **Monitor GPU Memory**: `watch nvidia-smi` — MoE models can spike VRAM on long sequences (see the logging snippet below)
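
For a lightweight record of memory use over a long run, `nvidia-smi` can log selected fields on an interval; a sketch (the interval and output path are arbitrary choices):

```bash
# Append GPU memory and utilization to a CSV every 10 seconds
nvidia-smi \
  --query-gpu=timestamp,name,memory.used,memory.total,utilization.gpu \
  --format=csv \
  -l 10 >> gpu_usage.csv
```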

## Troubleshooting

| Issue                    | Solution                                                                  |
| ------------------------ | ------------------------------------------------------------------------- |
| `OutOfMemoryError`       | Reduce `--max-model-len`, use Q4 quantization, or upgrade GPU             |
| Model download fails     | Run `huggingface-cli login` and accept Llama 4 license at hf.co           |
| Slow generation          | Ensure GPU is being used (`nvidia-smi`); check `--gpu-memory-utilization` |
| vLLM crashes on start    | Reduce context length; ensure CUDA 11.8+ installed                        |
| Ollama shows wrong model | Run `ollama list` to verify; `ollama rm` + `ollama pull` to re-download   |

## Further Reading

* [Meta Llama 4 Blog Post](https://llama.meta.com/)
* [HuggingFace Model Card](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
* [vLLM Documentation](https://docs.vllm.ai/)
* [Ollama Model Library](https://ollama.com/library/llama4)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/llama4.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
