# Llama 3.3 70B

{% hint style="info" %}
**Newer version available!** Meta released [**Llama 4**](https://docs.clore.ai/guides/language-models/llama4) in April 2025 with MoE architecture — Scout (17B active, fits on RTX 4090) delivers similar quality at a fraction of the VRAM. Consider upgrading.
{% endhint %}

Meta's most capable and efficient 70B-class model, deployed on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.3?

* **Best 70B model** - Matches Llama 3.1 405B performance at a fraction of the cost
* **Multilingual** - Supports 8 languages natively
* **128K context** - Long document processing
* **Open weights** - Free for commercial use

## Model Overview

| Spec           | Value                          |
| -------------- | ------------------------------ |
| Parameters     | 70B                            |
| Context Length | 128K tokens                    |
| Training Data  | 15T+ tokens                    |
| Languages      | EN, DE, FR, IT, PT, HI, ES, TH |
| License        | Llama 3.3 Community License    |

### Performance vs Other Models

| Benchmark    | Llama 3.3 70B | Llama 3.1 405B | GPT-4o |
| ------------ | ------------- | -------------- | ------ |
| MMLU         | 86.0          | 87.3           | 88.7   |
| HumanEval    | 88.4          | 89.0           | 90.2   |
| MATH         | 77.0          | 73.8           | 76.6   |
| Multilingual | 91.1          | 91.6           | -      |

## GPU Requirements

| Setup        | VRAM  | Performance | Cost                      |
| ------------ | ----- | ----------- | ------------------------- |
| Q4 quantized | 40GB  | Good        | A100 40GB (\~$0.17/hr)    |
| Q8 quantized | 70GB  | Better      | A100 80GB (\~$0.25/hr)    |
| FP16 full    | 140GB | Best        | 2x A100 80GB (\~$0.50/hr) |

**Recommended:** A100 40GB with Q4 quantization for best price/performance.
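
The table's VRAM figures follow from a simple rule: weight memory is roughly `parameters × bits-per-weight / 8` bytes. A quick sketch (runtime overhead for activations and the KV cache is a few extra GB, which is why Q4 wants a 40GB card rather than exactly 35GB):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory taken by the weights alone, in GB. The runtime adds a few
    GB on top for activations and the KV cache."""
    return params_billions * bits_per_weight / 8

print(weight_vram_gb(70, 4))   # 35.0 GB of weights -> A100 40GB
print(weight_vram_gb(70, 8))   # 70.0 GB -> A100 80GB
print(weight_vram_gb(70, 16))  # 140.0 GB -> 2x A100 80GB
```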

## Quick Deploy on CLORE.AI

### Using Ollama (Easiest)

**Docker Image:**

```
ollama/ollama
```

**Ports:**

```
22/tcp
11434/http
```

**After deploy:**

```bash
ollama pull llama3.3
ollama run llama3.3
```

### Using vLLM (Production)

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --host 0.0.0.0
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation Methods

### Method 1: Ollama (Recommended for Testing)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Llama 3.3 (auto-downloads Q4 version)
ollama pull llama3.3

# Run interactively
ollama run llama3.3

# Or serve API
ollama serve
```

**API usage:**

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Explain quantum computing in simple terms"
}'
```
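
By default `/api/generate` streams newline-delimited JSON, with each chunk carrying a piece of the answer in its `response` field and a final chunk marked `"done": true`. A small helper to reassemble the full completion (the sample chunks below are illustrative, not real model output):

```python
import json

def join_ollama_stream(ndjson_text: str) -> str:
    """Join the "response" fields of Ollama's streamed NDJSON chunks
    back into the full completion."""
    parts = []
    for line in ndjson_text.strip().splitlines():
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Example input shaped like Ollama's stream:
sample = "\n".join([
    '{"model":"llama3.3","response":"Quantum ","done":false}',
    '{"model":"llama3.3","response":"computing...","done":false}',
    '{"model":"llama3.3","response":"","done":true}',
])
print(join_ollama_stream(sample))  # Quantum computing...
```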

### Method 2: vLLM (Production)

```bash
pip install vllm

# Single GPU (A100 40GB with AWQ quantization)
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --host 0.0.0.0

# Multi-GPU (2x A100 for full precision)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --host 0.0.0.0
```

**API usage (OpenAI-compatible):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)
```
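
With `--max-model-len 32768`, the prompt plus the completion must fit in 32K tokens. A rough way to budget requests before sending them (the 4-characters-per-token ratio is a common heuristic for English text, not exact; use the model's tokenizer for precise counts):

```python
def rough_token_count(text: str) -> int:
    # Heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_new_tokens: int, max_model_len: int = 32768) -> bool:
    """Check whether prompt + requested completion fit the server's context."""
    return rough_token_count(prompt) + max_new_tokens <= max_model_len

print(fits_context("hello " * 10000, max_new_tokens=1024))  # True (~15K prompt tokens)
```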

### Method 3: Transformers + bitsandbytes

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python web scraper using BeautifulSoup"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Method 4: llama.cpp (CPU+GPU hybrid)

```bash
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1  # newer llama.cpp releases use cmake with -DGGML_CUDA=ON

# Download GGUF model
wget https://huggingface.co/bartowski/Llama-3.3-70B-Instruct-GGUF/resolve/main/Llama-3.3-70B-Instruct-Q4_K_M.gguf

# Run server
./llama-server \
    -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
    -c 8192 \
    -ngl 80 \
    --host 0.0.0.0 \
    --port 8080
```
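
`-ngl 80` offloads all of the model's layers to the GPU (Llama 3.3 70B has 80 transformer layers). If VRAM is short, lower `-ngl` and the remaining layers run on CPU. A back-of-envelope way to pick a value, assuming the ~40GB Q4_K_M file splits evenly across layers (a simplification for illustration):

```python
def max_gpu_layers(vram_gb: float, model_size_gb: float = 40.0,
                   n_layers: int = 80, reserve_gb: float = 2.0) -> int:
    """How many layers fit in VRAM, leaving headroom for the KV cache
    and scratch buffers. Rough estimate only."""
    per_layer_gb = model_size_gb / n_layers  # ~0.5 GB per layer at Q4_K_M
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

print(max_gpu_layers(24))  # e.g. a 24GB card: -ngl 44
print(max_gpu_layers(48))  # enough VRAM: all 80 layers
```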

## Benchmarks

### Throughput (tokens/second)

| GPU          | Q4    | Q8    | FP16  |
| ------------ | ----- | ----- | ----- |
| A100 40GB    | 25-30 | -     | -     |
| A100 80GB    | 35-40 | 25-30 | -     |
| 2x A100 80GB | 50-60 | 40-45 | 30-35 |
| H100 80GB    | 60-70 | 45-50 | 35-40 |

### Time to First Token (TTFT)

| GPU          | Q4       | FP16     |
| ------------ | -------- | -------- |
| A100 40GB    | 0.8-1.2s | -        |
| A100 80GB    | 0.6-0.9s | -        |
| 2x A100 80GB | 0.4-0.6s | 0.8-1.0s |

### Context Length vs VRAM

| Context | Q4 VRAM | Q8 VRAM |
| ------- | ------- | ------- |
| 4K      | 38GB    | 72GB    |
| 8K      | 40GB    | 75GB    |
| 16K     | 44GB    | 80GB    |
| 32K     | 52GB    | 90GB    |
| 64K     | 68GB    | 110GB   |
| 128K    | 100GB   | 150GB   |
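
The growth in the table is driven by the KV cache, which scales linearly with context length. For Llama 3.3 70B (80 layers, 8 KV heads via grouped-query attention, head dimension 128), an FP16 KV cache costs about 320 KB per token; the table's figures run higher because they include additional runtime overhead. A sketch:

```python
def kv_cache_gb(context_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache for one sequence: 2 (K and V) x layers x kv_heads
    x head_dim bytes per token, times context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1024**3

print(f"{kv_cache_gb(128 * 1024):.0f} GB")  # 40 GB at the full 128K context
```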

## Use Cases

### Code Generation

```python
messages = [
    {"role": "system", "content": "You are an expert programmer. Write clean, efficient, well-documented code."},
    {"role": "user", "content": "Create a REST API in FastAPI with user authentication using JWT tokens"}
]
```

### Document Analysis (Long Context)

```python
# Load long document
with open("large_document.txt") as f:
    document = f.read()

messages = [
    {"role": "system", "content": "You are a document analyst. Provide detailed, accurate analysis."},
    {"role": "user", "content": f"Analyze this document and provide a summary with key points:\n\n{document}"}
]
```

### Multilingual Tasks

```python
messages = [
    {"role": "system", "content": "You are a multilingual assistant."},
    {"role": "user", "content": "Translate this to German, French, and Spanish: 'The quick brown fox jumps over the lazy dog'"}
]
```

### Reasoning & Analysis

```python
messages = [
    {"role": "system", "content": "Think step by step. Show your reasoning."},
    {"role": "user", "content": "A train leaves Station A at 9:00 AM traveling at 60 mph. Another train leaves Station B (300 miles away) at 10:00 AM traveling toward Station A at 90 mph. When and where do they meet?"}
]
```
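
For reasoning prompts like this, it's worth sanity-checking the model's answer with a quick computation. For the train problem above:

```python
# Train A: departs 9:00 AM at 60 mph. Train B: departs 10:00 AM from
# 300 miles away, heading toward A at 90 mph.
head_start = 60 * 1.0                  # miles A covers before B departs
closing_speed = 60 + 90                # mph once both trains are moving
hours_after_10 = (300 - head_start) / closing_speed
meet_miles_from_a = head_start + 60 * hours_after_10

print(hours_after_10)      # 1.6 -> they meet at 11:36 AM
print(meet_miles_from_a)   # 156.0 miles from Station A
```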

## Optimization Tips

### Memory Optimization

```bash
# vLLM with memory optimization
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192
```

### Speed Optimization

```bash
# Tensor parallelism + prefix caching (vLLM uses FlashAttention by default)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --enable-prefix-caching
```

### Batch Processing

```python
# Generate several candidate completions for one prompt in a single request
responses = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    n=4,  # returns 4 choices in responses.choices
    temperature=0.8
)
```

## Comparison with Other Models

| Feature   | Llama 3.3 70B | Llama 3.1 70B | Qwen 2.5 72B | Mixtral 8x22B |
| --------- | ------------- | ------------- | ------------ | ------------- |
| MMLU      | 86.0          | 83.6          | 85.3         | 77.8          |
| Coding    | 88.4          | 80.5          | 85.4         | 75.5          |
| Math      | 77.0          | 68.0          | 80.0         | 60.0          |
| Context   | 128K          | 128K          | 128K         | 64K           |
| Languages | 8             | 8             | 29           | 8             |
| License   | Open          | Open          | Open         | Open          |

**Verdict:** Llama 3.3 70B offers the best overall performance in its class, especially for coding and reasoning tasks.

## Troubleshooting

### Out of Memory

```bash
# Use AWQ quantization (most memory efficient)
--model casperhansen/llama-3.3-70b-instruct-awq --quantization awq

# Reduce context length
--max-model-len 8192

# Use tensor parallelism
--tensor-parallel-size 2
```

### Slow First Response

* The first request loads the model onto the GPU - allow 30-60 seconds
* Use `--enable-prefix-caching` for faster subsequent requests
* Pre-warm the server with a dummy request

### Hugging Face Access

```bash
# Login to HF (required for gated model)
huggingface-cli login

# Or set environment variable
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx
```

## Cost Estimate

| Setup       | GPU            | $/hour  | tokens/$ |
| ----------- | -------------- | ------- | -------- |
| Budget      | A100 40GB (Q4) | \~$0.17 | \~530K   |
| Balanced    | A100 80GB (Q4) | \~$0.25 | \~500K   |
| Performance | 2x A100 80GB   | \~$0.50 | \~360K   |
| Maximum     | H100 80GB      | \~$0.50 | \~500K   |
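
The tokens/$ column is derived as throughput × 3600 seconds / hourly price, using the Q4 throughput figures from the benchmark table above:

```python
def tokens_per_dollar(tokens_per_sec: float, price_per_hour: float) -> float:
    """Tokens generated per dollar at sustained throughput."""
    return tokens_per_sec * 3600 / price_per_hour

print(f"{tokens_per_dollar(25, 0.17):,.0f}")  # ~530K for the budget tier
print(f"{tokens_per_dollar(35, 0.25):,.0f}")  # ~500K for the balanced tier
```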

## Next Steps

* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) - Production deployment
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) - Easy local setup
* [Multi-GPU Setup](https://docs.clore.ai/guides/advanced/multi-gpu-setup) - Scale to larger models
* [API Integration](https://docs.clore.ai/guides/advanced/api-integration) - Build applications
