# Llama 4 (Scout & Maverick)

Meta's Llama 4, released April 2025, marks a fundamental shift to **Mixture of Experts (MoE)** architecture. Instead of activating all parameters for every token, Llama 4 routes each token to specialized "expert" sub-networks — delivering frontier performance at a fraction of the compute cost. Two open-weight models are available: **Scout** (ideal for single-GPU) and **Maverick** (multi-GPU powerhouse).

## Key Features

* **MoE Architecture**: Only 17B parameters active per token (out of 109B/400B total)
* **Massive Context Windows**: Scout supports 10M tokens, Maverick supports 1M tokens
* **Natively Multimodal**: Understands both text and images out of the box
* **Two Models**: Scout (16 experts, single-GPU friendly) and Maverick (128 experts, multi-GPU)
* **Competitive Performance**: Scout matches Gemma 3 27B; Maverick competes with GPT-4o class models
* **Open Weights**: Llama Community License (free for most commercial uses)

## Model Variants

| Model        | Total Params | Active Params | Experts | Context | Min VRAM (Q4) | Min VRAM (FP16) |
| ------------ | ------------ | ------------- | ------- | ------- | ------------- | --------------- |
| **Scout**    | 109B         | 17B           | 16      | 10M     | 24GB          | 80GB            |
| **Maverick** | 400B         | 17B           | 128     | 1M      | 96GB (multi, 4×24GB) | 320GB (multi) |
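The active/total split in the table is the key number: per-token compute scales with the 17B *active* parameters, not the full parameter count (weight memory, by contrast, still scales with total parameters, which is why quantization matters). A back-of-envelope comparison, using the common rule of thumb of roughly 2 FLOPs per active parameter per generated token (illustrative arithmetic only):

```python
def flops_per_token(active_params: float) -> float:
    """Rough decode cost: ~2 FLOPs per active parameter per generated token."""
    return 2 * active_params

scout = flops_per_token(17e9)      # Scout: 17B active parameters
dense_70b = flops_per_token(70e9)  # dense 70B model: all parameters active

print(f"Scout:     {scout:.1e} FLOPs/token")
print(f"Dense 70B: {dense_70b:.1e} FLOPs/token")
print(f"Ratio:     ~{dense_70b / scout:.1f}x fewer FLOPs per token for Scout")
```

Real throughput also depends on memory bandwidth, batch size, and quantization, but the roughly 4x gap in per-token compute is what makes MoE decoding fast on modest hardware.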

## Requirements

| Component | Scout (Q4)  | Scout (FP16) | Maverick (Q4) |
| --------- | ----------- | ------------ | ------------- |
| GPU       | 1× RTX 4090 | 1× H100      | 4× RTX 4090   |
| VRAM      | 24GB        | 80GB         | 4×24GB        |
| RAM       | 32GB        | 64GB         | 128GB         |
| Disk      | 50GB        | 120GB        | 250GB         |
| CUDA      | 11.8+       | 12.0+        | 12.0+         |

**Recommended Clore.ai GPU**: RTX 4090 24GB (\~$0.5–2/day) for Scout — best value

## Quick Start with Ollama

The fastest way to get Llama 4 running:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Scout (quantized, ~24GB VRAM)
ollama run llama4:scout

# For longer context (uses more VRAM), raise num_ctx inside the chat session:
#   /set parameter num_ctx 32768
```

### Ollama as API Server

```bash
# Start server in background
ollama serve &

# Pull model
ollama pull llama4:scout

# Query via OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4-scout",
    "messages": [{"role": "user", "content": "Explain MoE architecture in 3 sentences"}]
  }'
```
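The same endpoint can be queried from Python with nothing but the standard library. A minimal sketch, assuming Ollama is serving on its default port 11434 and the model tag matches what you pulled:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, model: str = "llama4:scout",
         base_url: str = "http://localhost:11434/v1") -> str:
    """Send one chat turn to Ollama's OpenAI-compatible endpoint."""
    payload = build_chat_payload(model, prompt)
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain MoE architecture in 3 sentences"))
```

Because the endpoint speaks the OpenAI wire format, the official `openai` client works here too — just point `base_url` at port 11434.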

## vLLM Setup (Production)

For production workloads with higher throughput:

```bash
# Install vLLM
pip install vllm

# Serve Scout on single GPU (quantized)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

# Serve Scout on 2 GPUs (longer context)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.90

# Serve Maverick on 4 GPUs
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 65536
```
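Before sending traffic, it's worth confirming the server is up and which model id it registered (you must pass that exact id in requests). A small health-check sketch against `/v1/models`, part of the OpenAI-compatible API vLLM exposes, assuming the default port 8000:

```python
import json
import urllib.request

def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]

if __name__ == "__main__":
    with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
        print(served_model_ids(json.load(resp)))
```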

### Query vLLM Server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=1024
)
print(response.choices[0].message.content)
```
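For interactive use, streaming avoids waiting for the whole completion. A sketch with the same client (`stream=True` is standard in the OpenAI SDK; the small delta-joining helper is ours):

```python
def join_deltas(deltas) -> str:
    """Join streamed content deltas, skipping None/empty chunks."""
    return "".join(d for d in deltas if d)

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": "Summarize MoE in one paragraph"}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # tokens appear as they arrive
            parts.append(delta)
    answer = join_deltas(parts)  # full text, reassembled
```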

## HuggingFace Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit quantization (requires bitsandbytes), bf16 compute dtype
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a REST API with FastAPI that manages a todo list"}
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=2048, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

## Docker Quick Start

```bash
# Using vLLM Docker image
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 32768
```

## Why MoE Matters on Clore.ai

Traditional dense models (like Llama 3.3 70B) run all 70B parameters through every forward pass. Llama 4 Scout has 109B parameters in total but activates only 17B per token — meaning:

* **Quality competitive with 70B+ dense models** at a fraction of the per-token compute
* **Fits on a single RTX 4090** in quantized mode
* **10M token context** — process entire codebases, long documents, books
* **Cheaper to rent** — $0.5–2/day instead of $6–12/day for 70B models
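The routing idea is simple enough to sketch. Below is a toy top-2 router in plain Python (illustrative only, not Llama 4's actual gating code): a gate scores every expert for each token, only the top-k experts run, and their outputs are combined weighted by the renormalized gate scores.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(gate_scores, k=2):
    """Pick the top-k experts and renormalize their gate weights to sum to 1."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

def moe_layer(token, gate_scores, experts, k=2):
    """Run only the selected experts; combine their outputs by gate weight."""
    return sum(w * experts[i](token) for i, w in route_top_k(gate_scores, k))

# 4 toy "experts", each just a scalar function of the token value
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_layer(1.0, gate_scores=[0.1, 0.2, 3.0, 2.0], experts=experts, k=2))
```

The point of the sketch: with 16 experts and k of them active, most expert weights sit idle for any given token — that is the compute saving, while the full set still has to live somewhere in memory.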

## Tips for Clore.ai Users

* **Start with Scout Q4**: Best bang for buck on RTX 4090 — $0.5–2/day, covers 95% of use cases
* **Use `--max-model-len` wisely**: Don't set context higher than you need — it reserves VRAM. Start at 8192, increase as needed
* **Tensor Parallel for Maverick**: Rent 4× RTX 4090 machines for Maverick; use `--tensor-parallel-size 4`
* **HuggingFace Login Required**: `huggingface-cli login` — you need to accept the Llama license on HF first
* **Ollama for Quick Tests, vLLM for Production**: Ollama is faster to set up; vLLM gives higher throughput for API serving
* **Monitor GPU Memory**: `watch nvidia-smi` — MoE models can spike VRAM on long sequences
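The memory-monitoring tip is easy to script. A small poller sketch built on `nvidia-smi --query-gpu` (the CSV query flags are standard `nvidia-smi`; the parsing helper is ours):

```python
import subprocess
import time

def parse_memory_csv(output: str) -> list:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv` output into MiB ints."""
    lines = [line.strip() for line in output.strip().splitlines()]
    return [int(line.split()[0]) for line in lines[1:]]  # skip the header row

def gpu_memory_used_mib() -> list:
    """Current memory.used per GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)

if __name__ == "__main__":
    while True:  # Ctrl-C to stop
        print(gpu_memory_used_mib())
        time.sleep(2)
```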

## Troubleshooting

| Issue                    | Solution                                                                  |
| ------------------------ | ------------------------------------------------------------------------- |
| `OutOfMemoryError`       | Reduce `--max-model-len`, use Q4 quantization, or upgrade GPU             |
| Model download fails     | Run `huggingface-cli login` and accept Llama 4 license at hf.co           |
| Slow generation          | Ensure GPU is being used (`nvidia-smi`); check `--gpu-memory-utilization` |
| vLLM crashes on start    | Reduce context length; ensure CUDA 11.8+ installed                        |
| Ollama shows wrong model | Run `ollama list` to verify; `ollama rm` + `ollama pull` to re-download   |

## Further Reading

* [Meta Llama 4 Blog Post](https://llama.meta.com/)
* [HuggingFace Model Card](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
* [vLLM Documentation](https://docs.vllm.ai/)
* [Ollama Model Library](https://ollama.com/library/llama4)
