# ExLlamaV2

Run LLMs at maximum speed with ExLlamaV2.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is ExLlamaV2?

ExLlamaV2 is one of the fastest inference engines for running large language models on local GPUs:

* Often 2-3x faster than comparable local engines for single-stream generation
* EXL2 quantization with fractional bits per weight
* Low VRAM usage
* Supports speculative decoding

## Requirements

| Model Size | Min VRAM | Recommended |
| ---------- | -------- | ----------- |
| 7B         | 6GB      | RTX 3060    |
| 13B        | 10GB     | RTX 3090    |
| 34B        | 20GB     | RTX 4090    |
| 70B        | 40GB     | A100        |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
8080/http
```

**Command:**

```bash
pip install exllamav2 huggingface_hub && \
huggingface-cli download turboderp/Llama2-7B-exl2 --revision 4.0bpw --local-dir ./model && \
python -m exllamav2.server --model_dir ./model --host 0.0.0.0 --port 8080
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
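
For example, you can verify the deployment from your local machine with a quick request. The hostname below is a placeholder (substitute your own `http_pub` URL), and the endpoint assumes the OpenAI-style completions route used in the server examples further down:

```python
import requests

# Placeholder: replace with the http_pub URL from My Orders
BASE_URL = "https://abc123.clorecloud.net"

# Assumes the OpenAI-style completions endpoint shown in the API examples below
response = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"prompt": "Hello!", "max_tokens": 32},
    timeout=60,
)
print(response.status_code)
print(response.json())
```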

## Installation

```bash

# Install from PyPI
pip install exllamav2

# Or from source (latest features)
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .
```

## Download Models

### EXL2 Quantized Models

```bash

# Llama 2 7B (4.0 bpw)
huggingface-cli download turboderp/Llama2-7B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-7b-exl2

# Llama 2 13B (4.0 bpw)
huggingface-cli download turboderp/Llama2-13B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-13b-exl2

# Mistral 7B (4.0 bpw)
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mistral-7b-exl2

# Mixtral 8x7B
huggingface-cli download turboderp/Mixtral-8x7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mixtral-exl2
```
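
If you prefer to script downloads, the same repositories can be fetched with the `huggingface_hub` Python package. A minimal sketch, with the repository and revision names taken from the commands above:

```python
from huggingface_hub import snapshot_download

# Download the 4.0 bpw branch of the Llama 2 7B EXL2 repo
snapshot_download(
    repo_id="turboderp/Llama2-7B-exl2",
    revision="4.0bpw",
    local_dir="./llama2-7b-exl2",
)
```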

### Bits Per Weight (bpw)

| BPW | Quality   | VRAM (7B) |
| --- | --------- | --------- |
| 2.0 | Low       | \~3GB     |
| 3.0 | Good      | \~4GB     |
| 4.0 | Great     | \~5GB     |
| 5.0 | Excellent | \~6GB     |
| 6.0 | Near-FP16 | \~7GB     |
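
As a rough rule of thumb, weight memory is about `parameters × bpw / 8` bytes, plus headroom for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bpw: float, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weights at `bpw` bits each, plus ~20% headroom
    for KV cache and activations (the overhead factor is a guess, not a benchmark)."""
    weight_gb = params_billion * 1e9 * bpw / 8 / 1e9
    return weight_gb * overhead

print(f"7B @ 4.0 bpw  ~ {estimate_vram_gb(7, 4.0):.1f} GB")   # ~4.2 GB
print(f"70B @ 4.0 bpw ~ {estimate_vram_gb(70, 4.0):.1f} GB")  # ~42 GB
```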

## Python API

### Basic Generation

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load model
config = ExLlamaV2Config()
config.model_dir = "./llama2-7b-exl2"
config.prepare()

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

# Create generator
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Set sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9

# Generate
prompt = "The future of artificial intelligence is"
output = generator.generate_simple(prompt, settings, num_tokens=200)
print(output)
```

### Streaming Generation

```python
from exllamav2.generator import ExLlamaV2StreamingGenerator

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

prompt = "Write a short story about a robot:"
input_ids = tokenizer.encode(prompt)

generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(input_ids, settings)

max_new_tokens = 300
generated_tokens = 0

while True:
    chunk, eos, _ = generator.stream()
    print(chunk, end="", flush=True)
    generated_tokens += 1
    if eos or generated_tokens >= max_new_tokens:
        break
```

### Chat Format

```python
def format_chat(messages):
    text = ""
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "system":
            text += f"[INST] <<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == "user":
            text += f"{content} [/INST]"
        elif role == "assistant":
            text += f" {content}</s><s>[INST] "
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

prompt = format_chat(messages)
output = generator.generate_simple(prompt, settings, num_tokens=300)
```
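
Hand-rolled templates are easy to get subtly wrong. If the model directory ships tokenizer files with a chat template (most recent instruct models do), you can let `transformers` build the prompt instead. A sketch that reuses `messages`, `generator`, and `settings` from above and assumes `transformers` is installed and the model folder includes a `chat_template`:

```python
from transformers import AutoTokenizer

# Build the prompt from the model's own chat template, then feed it to the
# ExLlamaV2 generator as a plain string
hf_tokenizer = AutoTokenizer.from_pretrained("./llama2-7b-exl2")
prompt = hf_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
output = generator.generate_simple(prompt, settings, num_tokens=300)
```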

## Server Mode

### Start Server

```bash
python -m exllamav2.server \
    --model_dir ./llama2-7b-exl2 \
    --host 0.0.0.0 \
    --port 8080 \
    --max_seq_len 4096 \
    --cache_size 4096
```

### API Usage

```python
import requests

response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Hello, how are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

print(response.json()["choices"][0]["text"])
```

### Chat Completions

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

print(response.choices[0].message.content)
```
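
The OpenAI client can also stream tokens as they are produced, provided the server supports streaming responses (most OpenAI-compatible servers, including TabbyAPI below, do):

```python
# Stream tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```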

## TabbyAPI (Recommended Server)

TabbyAPI provides a feature-rich ExLlamaV2 server:

```bash

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Install
pip install -r requirements.txt

# Configure

# Edit config.yml with your model path

# Run
python main.py
```

### TabbyAPI Features

* OpenAI-compatible API
* Multiple model support
* LoRA hot-swapping
* Streaming
* Function calling
* Admin API
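
Because TabbyAPI exposes the same OpenAI-compatible protocol, the client code shown earlier works against it as well. A sketch assuming TabbyAPI's default port and an API key taken from its generated `api_tokens.yml` (both may differ in your configuration):

```python
import openai

# Assumes TabbyAPI's default port (5000) and an API key from api_tokens.yml;
# adjust both to match your config.yml and deployment
client = openai.OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="YOUR_TABBY_API_KEY",
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello from TabbyAPI!"}],
)
print(response.choices[0].message.content)
```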

## Speculative Decoding

Use a smaller model to accelerate generation:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load main model (13B)
main_config = ExLlamaV2Config()
main_config.model_dir = "./llama2-13b-exl2"
main_config.prepare()
main_model = ExLlamaV2(main_config)
main_model.load()

# Load draft model (7B)
draft_config = ExLlamaV2Config()
draft_config.model_dir = "./llama2-7b-exl2"
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_model.load()

# Each model needs its own cache; the tokenizer is shared
tokenizer = ExLlamaV2Tokenizer(main_config)
cache_main = ExLlamaV2Cache(main_model)
cache_draft = ExLlamaV2Cache(draft_model)

# The streaming generator accepts a draft model/cache for speculative decoding
# (keyword names may vary slightly between exllamav2 versions)
generator = ExLlamaV2StreamingGenerator(
    main_model, cache_main, tokenizer,
    draft_model=draft_model,
    draft_cache=cache_draft
)

settings = ExLlamaV2Sampler.Settings()
prompt = "Explain speculative decoding in one paragraph."

# Generate (faster when the draft model's guesses are accepted)
output = generator.generate_simple(prompt, settings, num_tokens=500)
print(output)
```

## Quantize Your Own Models

### Convert to EXL2

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config
from exllamav2.conversion import convert_model

# Source: HuggingFace model

# Target: EXL2 quantized

convert_model(
    input_dir="./llama-3.1-8b-hf",
    output_dir="./llama-3.1-8b-exl2-4bpw",
    cal_dataset="wikitext",  # Calibration dataset
    bits=4.0,  # Bits per weight
    head_bits=6,  # Higher precision for attention
)
```

### Command Line

```bash
# -i: source HF model, -o: working directory for the job, -cf: output folder
python convert.py \
    -i ./llama-3.1-8b-hf \
    -o ./exl2-workdir \
    -cf ./llama-3.1-8b-exl2 \
    -b 4.0 \
    -hb 6
```

## Memory Management

### Cache Allocation

```python

# Fixed cache size (allocated up front)
cache = ExLlamaV2Cache(model, max_seq_len=4096)

# Lazy cache: allocation is deferred until the model is loaded with
# model.load_autosplit(cache), typically when splitting across GPUs (see below)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
```
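
Recent exllamav2 releases also include quantized KV cache variants that significantly cut cache VRAM. The class names below (`ExLlamaV2Cache_8bit`, `ExLlamaV2Cache_Q4`) exist in current versions, but check them against the version you have installed:

```python
# Quantized KV caches trade a little accuracy for much lower cache VRAM.
# Names assumed from recent exllamav2 versions; verify against your install.
from exllamav2 import ExLlamaV2Cache_8bit, ExLlamaV2Cache_Q4

cache_fp8 = ExLlamaV2Cache_8bit(model, max_seq_len=4096)  # roughly half the FP16 cache size
cache_q4 = ExLlamaV2Cache_Q4(model, max_seq_len=4096)     # roughly a quarter of the FP16 cache size
```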

### Multi-GPU

```python
config = ExLlamaV2Config()
config.model_dir = "./large-model"

# Split across GPUs
config.set_auto_split([0.5, 0.5])  # 50% each GPU

model = ExLlamaV2(config)
model.load()
```

## Performance Comparison

| Model        | Engine    | GPU      | Tokens/sec |
| ------------ | --------- | -------- | ---------- |
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | \~150      |
| Llama 3.1 8B | llama.cpp | RTX 3090 | \~100      |
| Llama 3.1 8B | vLLM      | RTX 3090 | \~120      |
| Mixtral 8x7B | ExLlamaV2 | A100     | \~70       |
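
Throughput depends heavily on quantization level, context length, and driver/CUDA versions, so it is worth measuring on the exact hardware you rent. A minimal timing sketch, reusing the `generator` and `settings` from the Python API section:

```python
import time

prompt = "Explain the theory of relativity in simple terms."
num_tokens = 256

start = time.time()
output = generator.generate_simple(prompt, settings, num_tokens=num_tokens)
elapsed = time.time() - start

# Rough throughput estimate, assuming close to num_tokens were actually generated
print(f"~{num_tokens / elapsed:.1f} tokens/sec over {elapsed:.1f}s")
```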

## Advanced Settings

### Sampling Parameters

```python
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.1
settings.token_frequency_penalty = 0.0
settings.token_presence_penalty = 0.0
settings.mirostat = False
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1
```

### Batch Generation

```python
prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "Climate change is"
]

# Note: this loops over the prompts sequentially, one generation at a time
outputs = []
for prompt in prompts:
    output = generator.generate_simple(prompt, settings, num_tokens=100)
    outputs.append(output)
```
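
The loop above runs the prompts one after another. For genuinely parallel batching, newer exllamav2 versions include a dynamic generator that accepts a list of prompts. A sketch, assuming a recent version where `ExLlamaV2DynamicGenerator` is available (constructor and `generate` arguments may differ slightly between versions):

```python
from exllamav2.generator import ExLlamaV2DynamicGenerator

# The dynamic generator batches requests internally (recent exllamav2 versions)
dyn_generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
)

# Passing a list of prompts returns a list of completions
outputs = dyn_generator.generate(
    prompt=prompts,
    max_new_tokens=100,
    gen_settings=settings,
)
for out in outputs:
    print(out)
```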

## Troubleshooting

### CUDA Out of Memory

```python

# Use smaller cache
cache = ExLlamaV2Cache(model, max_seq_len=2048)

# Or lower bpw model (3.0 instead of 4.0)
```

### Slow Loading

```python

# Enable fast loading
config.fasttensors = True
```

### Model Not Found

```bash

# Check model files exist
ls ./model/

# Should contain: config.json, *.safetensors, tokenizer.json
```

## Integration with LangChain

```python
from langchain.llms.base import LLM
from typing import Optional, List

from exllamav2 import ExLlamaV2, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

class ExLlamaV2LLM(LLM):
    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    generator: ExLlamaV2StreamingGenerator
    settings: ExLlamaV2Sampler.Settings

    class Config:
        arbitrary_types_allowed = True  # the fields above are not pydantic models

    @property
    def _llm_type(self) -> str:
        return "exllamav2"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        return self.generator.generate_simple(prompt, self.settings, num_tokens=500)

# Usage
llm = ExLlamaV2LLM(model=model, tokenizer=tokenizer, generator=generator, settings=settings)
result = llm("What is quantum computing?")
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* vLLM Inference - High throughput serving
* [llama.cpp Server](https://docs.clore.ai/guides/language-models/llamacpp-server) - Cross-platform
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Web interface

