# Llama 3.3 70B

{% hint style="info" %}
**Newer version available!** Meta released [**Llama 4**](https://docs.clore.ai/guides/language-models/llama4) in April 2025 with MoE architecture — Scout (17B active, fits on RTX 4090) delivers similar quality at a fraction of the VRAM. Consider upgrading.
{% endhint %}

Meta's most efficient 70B-class model, deployed on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.3?

* **Best 70B model** - Matches Llama 3.1 405B performance at a fraction of the cost
* **Multilingual** - Supports 8 languages natively
* **128K context** - Long document processing
* **Open weights** - Free for commercial use

## Model Overview

| Spec           | Value                          |
| -------------- | ------------------------------ |
| Parameters     | 70B                            |
| Context Length | 128K tokens                    |
| Training Data  | 15T+ tokens                    |
| Languages      | EN, DE, FR, IT, PT, HI, ES, TH |
| License        | Llama 3.3 Community License    |

### Performance vs Other Models

| Benchmark    | Llama 3.3 70B | Llama 3.1 405B | GPT-4o |
| ------------ | ------------- | -------------- | ------ |
| MMLU         | 86.0          | 87.3           | 88.7   |
| HumanEval    | 88.4          | 89.0           | 90.2   |
| MATH         | 77.0          | 73.8           | 76.6   |
| Multilingual | 91.1          | 91.6           | -      |

## GPU Requirements

| Setup        | VRAM  | Performance | Cost                      |
| ------------ | ----- | ----------- | ------------------------- |
| Q4 quantized | 40GB  | Good        | A100 40GB (\~$0.17/hr)    |
| Q8 quantized | 70GB  | Better      | A100 80GB (\~$0.25/hr)    |
| FP16 full    | 140GB | Best        | 2x A100 80GB (\~$0.50/hr) |

**Recommended:** A100 40GB with Q4 quantization for best price/performance.

## Quick Deploy on CLORE.AI

### Using Ollama (Easiest)

**Docker Image:**

```
ollama/ollama
```

**Ports:**

```
22/tcp
11434/http
```

**After deploy:**

```bash
ollama pull llama3.3
ollama run llama3.3
```

### Using vLLM (Production)

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
# Single GPU: use the AWQ-quantized weights (full precision needs 2x A100 80GB, see Method 2 below)
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --host 0.0.0.0
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
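
For example, the OpenAI-compatible Python examples further down would point at the public URL instead of `localhost` (the hostname below is a placeholder):

```python
from openai import OpenAI

# Replace the hostname with the http_pub URL from My Orders (placeholder shown)
client = OpenAI(
    base_url="https://abc123.clorecloud.net/v1",
    api_key="dummy",  # vLLM does not check the key unless you configure one
)
```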

## Installation Methods

### Method 1: Ollama (Recommended for Testing)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Llama 3.3 (auto-downloads Q4 version)
ollama pull llama3.3

# Run interactively
ollama run llama3.3

# Or serve API
ollama serve
```

**API usage:**

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Explain quantum computing in simple terms"
}'
```
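
Recent Ollama versions also expose an OpenAI-compatible endpoint at `/v1`, so the same Python client pattern used later in this guide works here; a minimal sketch:

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint (default port 11434); the key is ignored
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(response.choices[0].message.content)
```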

### Method 2: vLLM (Production)

```bash
pip install vllm

# Single GPU (A100 40GB with AWQ quantization)
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --host 0.0.0.0

# Multi-GPU (2x A100 for full precision)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --host 0.0.0.0
```

**API usage (OpenAI-compatible):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)
```
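
For interactive use you can also stream tokens as they are generated, which makes long responses feel much faster; a sketch using the same client:

```python
# Stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain the CAP theorem briefly."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```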

### Method 3: Transformers + bitsandbytes

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python web scraper using BeautifulSoup"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

### Method 4: llama.cpp (CPU+GPU hybrid)

```bash
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1

# Download GGUF model
wget https://huggingface.co/bartowski/Llama-3.3-70B-Instruct-GGUF/resolve/main/Llama-3.3-70B-Instruct-Q4_K_M.gguf

# Run server
./llama-server \
    -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
    -c 8192 \
    -ngl 80 \
    --host 0.0.0.0 \
    --port 8080
```
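
In recent llama.cpp builds, `llama-server` also exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the port above, so the same Python client pattern applies; a minimal sketch:

```python
from openai import OpenAI

# llama-server serves an OpenAI-compatible API on the port passed via --port
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

response = client.chat.completions.create(
    model="llama-3.3-70b",  # name is informational; the server answers with the loaded GGUF
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```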

## Benchmarks

### Throughput (tokens/second)

| GPU          | Q4    | Q8    | FP16  |
| ------------ | ----- | ----- | ----- |
| A100 40GB    | 25-30 | -     | -     |
| A100 80GB    | 35-40 | 25-30 | -     |
| 2x A100 80GB | 50-60 | 40-45 | 30-35 |
| H100 80GB    | 60-70 | 45-50 | 35-40 |

### Time to First Token (TTFT)

| GPU          | Q4       | FP16     |
| ------------ | -------- | -------- |
| A100 40GB    | 0.8-1.2s | -        |
| A100 80GB    | 0.6-0.9s | -        |
| 2x A100 80GB | 0.4-0.6s | 0.8-1.0s |

### Context Length vs VRAM

| Context | Q4 VRAM | Q8 VRAM |
| ------- | ------- | ------- |
| 4K      | 38GB    | 72GB    |
| 8K      | 40GB    | 75GB    |
| 16K     | 44GB    | 80GB    |
| 32K     | 52GB    | 90GB    |
| 64K     | 68GB    | 110GB   |
| 128K    | 100GB   | 150GB   |

## Use Cases

### Code Generation

```python
messages = [
    {"role": "system", "content": "You are an expert programmer. Write clean, efficient, well-documented code."},
    {"role": "user", "content": "Create a REST API in FastAPI with user authentication using JWT tokens"}
]
```

### Document Analysis (Long Context)

```python
# Load long document
with open("large_document.txt") as f:
    document = f.read()

messages = [
    {"role": "system", "content": "You are a document analyst. Provide detailed, accurate analysis."},
    {"role": "user", "content": f"Analyze this document and provide a summary with key points:\n\n{document}"}
]
```

### Multilingual Tasks

```python
messages = [
    {"role": "system", "content": "You are a multilingual assistant."},
    {"role": "user", "content": "Translate this to German, French, and Spanish: 'The quick brown fox jumps over the lazy dog'"}
]
```

### Reasoning & Analysis

```python
messages = [
    {"role": "system", "content": "Think step by step. Show your reasoning."},
    {"role": "user", "content": "A train leaves Station A at 9:00 AM traveling at 60 mph. Another train leaves Station B (300 miles away) at 10:00 AM traveling toward Station A at 90 mph. When and where do they meet?"}
]
```

## Optimization Tips

### Memory Optimization

```bash
# vLLM with memory optimization
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192
```

### Speed Optimization

```bash
# Tensor parallelism across 2 GPUs plus prefix caching for repeated prompts
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --enable-prefix-caching
```

### Batch Processing

```python
# Generate multiple completions for the same prompt in one request
responses = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    n=4,  # Generate 4 responses
    temperature=0.8
)
```
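
vLLM also batches concurrent requests automatically, so many *different* prompts are usually best sent in parallel rather than via `n`; a sketch using the async OpenAI client (the prompts are illustrative):

```python
import asyncio

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    prompts = [
        "Summarize the difference between TCP and UDP.",
        "Explain Python list comprehensions with one example.",
        "What problem does RAID 5 solve?",
    ]
    tasks = [
        client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct",
            messages=[{"role": "user", "content": p}],
            max_tokens=256,
        )
        for p in prompts
    ]
    # Concurrent requests are served together by vLLM's continuous batching
    for result in await asyncio.gather(*tasks):
        print(result.choices[0].message.content)

asyncio.run(main())
```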

## Comparison with Other Models

| Feature   | Llama 3.3 70B | Llama 3.1 70B | Qwen 2.5 72B | Mixtral 8x22B |
| --------- | ------------- | ------------- | ------------ | ------------- |
| MMLU      | 86.0          | 83.6          | 85.3         | 77.8          |
| Coding    | 88.4          | 80.5          | 85.4         | 75.5          |
| Math      | 77.0          | 68.0          | 80.0         | 60.0          |
| Context   | 128K          | 128K          | 128K         | 64K           |
| Languages | 8             | 8             | 29           | 8             |
| License   | Open          | Open          | Open         | Open          |

**Verdict:** Llama 3.3 70B offers the best overall performance in its class, especially for coding and reasoning tasks.

## Troubleshooting

### Out of Memory

```bash
# Use AWQ quantization (most memory efficient)
--model casperhansen/llama-3.3-70b-instruct-awq --quantization awq

# Reduce context length
--max-model-len 8192

# Use tensor parallelism
--tensor-parallel-size 2
```

### Slow First Response

* The first request loads the model onto the GPU - allow 30-60 seconds
* Use `--enable-prefix-caching` to speed up requests that share a common prefix
* Pre-warm with a dummy request (see the sketch below)
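
A pre-warm request can be as small as a single token; a sketch against the vLLM endpoint from Method 2:

```python
from openai import OpenAI

# One tiny request after startup so the first real user does not pay the warm-up cost
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)
```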

### Hugging Face Access

```bash
# Login to HF (required for gated model)
huggingface-cli login

# Or set environment variable
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx
```

## Cost Estimate

| Setup       | GPU            | $/hour  | tokens/$ |
| ----------- | -------------- | ------- | -------- |
| Budget      | A100 40GB (Q4) | \~$0.17 | \~530K   |
| Balanced    | A100 80GB (Q4) | \~$0.25 | \~500K   |
| Performance | 2x A100 80GB   | \~$0.50 | \~360K   |
| Maximum     | H100 80GB      | \~$0.50 | \~500K   |

## Next Steps

* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) - Production deployment
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) - Easy local setup
* [Multi-GPU Setup](https://docs.clore.ai/guides/advanced/multi-gpu-setup) - Scale to larger models
* [API Integration](https://docs.clore.ai/guides/advanced/api-integration) - Build applications


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/llama33.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
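
For example, from Python (the question string is illustrative):

```python
import requests

# Query the documentation with a natural-language question
response = requests.get(
    "https://docs.clore.ai/guides/language-models/llama33.md",
    params={"ask": "Which quantization of Llama 3.3 70B fits on an A100 40GB?"},
)
print(response.text)
```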
