# Llama 3.3 70B

{% hint style="info" %}
**Newer version available!** Meta released [**Llama 4**](https://docs.clore.ai/guides/language-models/llama4) in April 2025 with MoE architecture — Scout (17B active, fits on RTX 4090) delivers similar quality at a fraction of the VRAM. Consider upgrading.
{% endhint %}

Meta's most efficient 70B-class model, deployed on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Why Llama 3.3?

* **Best 70B model** - Matches Llama 3.1 405B performance at a fraction of the cost
* **Multilingual** - Supports 8 languages natively
* **128K context** - Long document processing
* **Open weights** - Free for commercial use

## Model Overview

| Spec           | Value                          |
| -------------- | ------------------------------ |
| Parameters     | 70B                            |
| Context Length | 128K tokens                    |
| Training Data  | 15T+ tokens                    |
| Languages      | EN, DE, FR, IT, PT, HI, ES, TH |
| License        | Llama 3.3 Community License    |

### Performance vs Other Models

| Benchmark    | Llama 3.3 70B | Llama 3.1 405B | GPT-4o |
| ------------ | ------------- | -------------- | ------ |
| MMLU         | 86.0          | 87.3           | 88.7   |
| HumanEval    | 88.4          | 89.0           | 90.2   |
| MATH         | 77.0          | 73.8           | 76.6   |
| Multilingual | 91.1          | 91.6           | -      |

## GPU Requirements

| Setup        | VRAM  | Performance | Cost                      |
| ------------ | ----- | ----------- | ------------------------- |
| Q4 quantized | 40GB  | Good        | A100 40GB (\~$0.17/hr)    |
| Q8 quantized | 70GB  | Better      | A100 80GB (\~$0.25/hr)    |
| FP16 full    | 140GB | Best        | 2x A100 80GB (\~$0.50/hr) |

**Recommended:** A100 40GB with Q4 quantization for best price/performance.

## Quick Deploy on CLORE.AI

### Using Ollama (Easiest)

**Docker Image:**

```
ollama/ollama
```

**Ports:**

```
22/tcp
11434/http
```

**After deploy:**

```bash
ollama pull llama3.3
ollama run llama3.3
```

### Using vLLM (Production)

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
# Single GPU: use the AWQ-quantized weights (full precision needs 2x A100 80GB, see Method 2 below)
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --host 0.0.0.0
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
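
For example, the OpenAI-compatible Python examples further down would point at the public URL instead of `localhost` (the hostname below is a placeholder):

```python
from openai import OpenAI

# Replace the hostname with the http_pub URL from My Orders (placeholder shown)
client = OpenAI(
    base_url="https://abc123.clorecloud.net/v1",
    api_key="dummy",  # vLLM does not check the key unless you configure one
)
```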

## Installation Methods

### Method 1: Ollama (Recommended for Testing)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Llama 3.3 (auto-downloads Q4 version)
ollama pull llama3.3

# Run interactively
ollama run llama3.3

# Or serve API
ollama serve
```

**API usage:**

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Explain quantum computing in simple terms"
}'
```
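
Recent Ollama versions also expose an OpenAI-compatible endpoint at `/v1`, so the same Python client pattern used later in this guide works here; a minimal sketch:

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint (default port 11434); the key is ignored
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(response.choices[0].message.content)
```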

### Method 2: vLLM (Production)

```bash
pip install vllm

# Single GPU (A100 40GB with AWQ quantization)
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --host 0.0.0.0

# Multi-GPU (2x A100 for full precision)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --host 0.0.0.0
```

**API usage (OpenAI-compatible):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)
```
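
For interactive use you can also stream tokens as they are generated, which makes long responses feel much faster; a sketch using the same client:

```python
# Stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain the CAP theorem briefly."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```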

### Method 3: Transformers + bitsandbytes

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python web scraper using BeautifulSoup"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

### Method 4: llama.cpp (CPU+GPU hybrid)

```bash
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1

# Download GGUF model
wget https://huggingface.co/bartowski/Llama-3.3-70B-Instruct-GGUF/resolve/main/Llama-3.3-70B-Instruct-Q4_K_M.gguf

# Run server
./llama-server \
    -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
    -c 8192 \
    -ngl 80 \
    --host 0.0.0.0 \
    --port 8080
```
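
In recent llama.cpp builds, `llama-server` also exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the port above, so the same Python client pattern applies; a minimal sketch:

```python
from openai import OpenAI

# llama-server serves an OpenAI-compatible API on the port passed via --port
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

response = client.chat.completions.create(
    model="llama-3.3-70b",  # name is informational; the server answers with the loaded GGUF
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```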

## Benchmarks

### Throughput (tokens/second)

| GPU          | Q4    | Q8    | FP16  |
| ------------ | ----- | ----- | ----- |
| A100 40GB    | 25-30 | -     | -     |
| A100 80GB    | 35-40 | 25-30 | -     |
| 2x A100 80GB | 50-60 | 40-45 | 30-35 |
| H100 80GB    | 60-70 | 45-50 | 35-40 |

### Time to First Token (TTFT)

| GPU          | Q4       | FP16     |
| ------------ | -------- | -------- |
| A100 40GB    | 0.8-1.2s | -        |
| A100 80GB    | 0.6-0.9s | -        |
| 2x A100 80GB | 0.4-0.6s | 0.8-1.0s |

### Context Length vs VRAM

| Context | Q4 VRAM | Q8 VRAM |
| ------- | ------- | ------- |
| 4K      | 38GB    | 72GB    |
| 8K      | 40GB    | 75GB    |
| 16K     | 44GB    | 80GB    |
| 32K     | 52GB    | 90GB    |
| 64K     | 68GB    | 110GB   |
| 128K    | 100GB   | 150GB   |

## Use Cases

### Code Generation

```python
messages = [
    {"role": "system", "content": "You are an expert programmer. Write clean, efficient, well-documented code."},
    {"role": "user", "content": "Create a REST API in FastAPI with user authentication using JWT tokens"}
]
```

### Document Analysis (Long Context)

```python
# Load long document
with open("large_document.txt") as f:
    document = f.read()

messages = [
    {"role": "system", "content": "You are a document analyst. Provide detailed, accurate analysis."},
    {"role": "user", "content": f"Analyze this document and provide a summary with key points:\n\n{document}"}
]
```

### Multilingual Tasks

```python
messages = [
    {"role": "system", "content": "You are a multilingual assistant."},
    {"role": "user", "content": "Translate this to German, French, and Spanish: 'The quick brown fox jumps over the lazy dog'"}
]
```

### Reasoning & Analysis

```python
messages = [
    {"role": "system", "content": "Think step by step. Show your reasoning."},
    {"role": "user", "content": "A train leaves Station A at 9:00 AM traveling at 60 mph. Another train leaves Station B (300 miles away) at 10:00 AM traveling toward Station A at 90 mph. When and where do they meet?"}
]
```

## Optimization Tips

### Memory Optimization

```bash
# vLLM with memory optimization
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.3-70b-instruct-awq \
    --quantization awq \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192
```

### Speed Optimization

```bash
# Tensor parallelism across 2 GPUs plus prefix caching for repeated prompts
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --enable-prefix-caching
```

### Batch Processing

```python
# Generate multiple completions for the same prompt in one request
responses = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    n=4,  # Generate 4 responses
    temperature=0.8
)
```
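
vLLM also batches concurrent requests automatically, so many *different* prompts are usually best sent in parallel rather than via `n`; a sketch using the async OpenAI client (the prompts are illustrative):

```python
import asyncio

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    prompts = [
        "Summarize the difference between TCP and UDP.",
        "Explain Python list comprehensions with one example.",
        "What problem does RAID 5 solve?",
    ]
    tasks = [
        client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct",
            messages=[{"role": "user", "content": p}],
            max_tokens=256,
        )
        for p in prompts
    ]
    # Concurrent requests are served together by vLLM's continuous batching
    for result in await asyncio.gather(*tasks):
        print(result.choices[0].message.content)

asyncio.run(main())
```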

## Comparison with Other Models

| Feature   | Llama 3.3 70B | Llama 3.1 70B | Qwen 2.5 72B | Mixtral 8x22B |
| --------- | ------------- | ------------- | ------------ | ------------- |
| MMLU      | 86.0          | 83.6          | 85.3         | 77.8          |
| Coding    | 88.4          | 80.5          | 85.4         | 75.5          |
| Math      | 77.0          | 68.0          | 80.0         | 60.0          |
| Context   | 128K          | 128K          | 128K         | 64K           |
| Languages | 8             | 8             | 29           | 8             |
| License   | Open          | Open          | Open         | Open          |

**Verdict:** Llama 3.3 70B offers the best overall performance in its class, especially for coding and reasoning tasks.

## Troubleshooting

### Out of Memory

```bash
# Use AWQ quantization (most memory efficient)
--model casperhansen/llama-3.3-70b-instruct-awq --quantization awq

# Reduce context length
--max-model-len 8192

# Use tensor parallelism
--tensor-parallel-size 2
```

### Slow First Response

* The first request loads the model onto the GPU - allow 30-60 seconds
* Use `--enable-prefix-caching` to speed up requests that share a common prefix
* Pre-warm with a dummy request (see the sketch below)
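
A pre-warm request can be as small as a single token; a sketch against the vLLM endpoint from Method 2:

```python
from openai import OpenAI

# One tiny request after startup so the first real user does not pay the warm-up cost
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)
```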

### Hugging Face Access

```bash
# Login to HF (required for gated model)
huggingface-cli login

# Or set environment variable
export HUGGING_FACE_HUB_TOKEN=hf_xxxxx
```

## Cost Estimate

| Setup       | GPU            | $/hour  | tokens/$ |
| ----------- | -------------- | ------- | -------- |
| Budget      | A100 40GB (Q4) | \~$0.17 | \~530K   |
| Balanced    | A100 80GB (Q4) | \~$0.25 | \~500K   |
| Performance | 2x A100 80GB   | \~$0.50 | \~360K   |
| Maximum     | H100 80GB      | \~$0.50 | \~500K   |

## Next Steps

* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) - Production deployment
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) - Easy local setup
* [Multi-GPU Setup](https://docs.clore.ai/guides/advanced/multi-gpu-setup) - Scale to larger models
* [API Integration](https://docs.clore.ai/guides/advanced/api-integration) - Build applications


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/llama33.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
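
For example, from Python (the question string is illustrative):

```python
import requests

# Query the documentation with a natural-language question
response = requests.get(
    "https://docs.clore.ai/guides/language-models/llama33.md",
    params={"ask": "Which quantization of Llama 3.3 70B fits on an A100 40GB?"},
)
print(response.text)
```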
