# ExLlamaV2

Run LLMs at maximum speed with ExLlamaV2.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is ExLlamaV2?

ExLlamaV2 is a fast inference engine for running large language models locally:

* Very high single-GPU generation speed, often faster than comparable engines
* Flexible EXL2 quantization (fractional bits per weight)
* Low VRAM usage
* Supports speculative decoding

## Requirements

| Model Size | Min VRAM | Recommended |
| ---------- | -------- | ----------- |
| 7B         | 6GB      | RTX 3060    |
| 13B        | 10GB     | RTX 3090    |
| 34B        | 20GB     | RTX 4090    |
| 70B        | 40GB     | A100        |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
8080/http
```

**Command:**

```bash

# turboderp's EXL2 repos keep weights on revision branches, hence --revision
pip install exllamav2 && \
huggingface-cli download turboderp/Llama2-7B-exl2 --revision 4.0bpw --local-dir ./model && \
python -m exllamav2.server --model_dir ./model --host 0.0.0.0 --port 8080
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Installation

```bash

# Install from PyPI
pip install exllamav2

# Or from source (latest features)
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .
```

## Download Models

### EXL2 Quantized Models

```bash

# Llama 2 7B (4.0 bpw)
huggingface-cli download turboderp/Llama2-7B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-7b-exl2

# Llama 2 13B (4.0 bpw)
huggingface-cli download turboderp/Llama2-13B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-13b-exl2

# Mistral 7B (4.0 bpw)
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mistral-7b-exl2

# Mixtral 8x7B
huggingface-cli download turboderp/Mixtral-8x7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mixtral-exl2
```

### Bits Per Weight (bpw)

| BPW | Quality   | VRAM (7B) |
| --- | --------- | --------- |
| 2.0 | Low       | \~3GB     |
| 3.0 | Good      | \~4GB     |
| 4.0 | Great     | \~5GB     |
| 5.0 | Excellent | \~6GB     |
| 6.0 | Near-FP16 | \~7GB     |
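
As a rule of thumb behind these tables, weight memory is roughly parameters × bpw ÷ 8 (gigaparams × bits → GB), plus headroom for the KV cache and activations. A minimal sketch; the 1.5 GB overhead figure is a rough assumption for illustration, not an ExLlamaV2 number:

```python
def estimate_weight_vram_gb(n_params_b: float, bpw: float, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weights at `bpw` bits each, plus a fixed
    allowance for KV cache and activations (the overhead is a guess)."""
    weights_gb = n_params_b * bpw / 8  # billions of params * bits / 8 bits-per-byte = GB
    return weights_gb + overhead_gb

# A 7B model at 4.0 bpw: 3.5 GB of weights plus headroom, ~5 GB total
print(round(estimate_weight_vram_gb(7, 4.0), 1))
```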

## Python API

### Basic Generation

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load model
config = ExLlamaV2Config()
config.model_dir = "./llama2-7b-exl2"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # lazy cache is allocated as the model loads

tokenizer = ExLlamaV2Tokenizer(config)

# Create generator
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Set sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9

# Generate
prompt = "The future of artificial intelligence is"
output = generator.generate_simple(prompt, settings, num_tokens=200)
print(output)
```

### Streaming Generation

```python
from exllamav2.generator import ExLlamaV2StreamingGenerator

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

prompt = "Write a short story about a robot:"
input_ids = tokenizer.encode(prompt)

generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(input_ids, settings)

while True:
    chunk, eos, _ = generator.stream()
    if eos:
        break
    print(chunk, end="", flush=True)
```
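
The `stream()` loop above can be wrapped in a plain Python generator so application code can iterate over chunks directly. A sketch assuming only the `(chunk, eos, _)` tuple contract shown above; `FakeStream` is a test stand-in, not part of ExLlamaV2:

```python
def iter_stream(generator):
    """Wrap the stream() loop in a Python generator so callers can use a
    plain for-loop over text chunks."""
    while True:
        chunk, eos, _ = generator.stream()
        if eos:
            break
        if chunk:
            yield chunk

# Works with anything exposing the same stream() contract, e.g. this stub:
class FakeStream:
    def __init__(self, chunks):
        self._chunks = iter(chunks)

    def stream(self):
        try:
            return next(self._chunks), False, None
        except StopIteration:
            return "", True, None

text = "".join(iter_stream(FakeStream(["Hello", ", ", "world"])))
print(text)  # Hello, world
```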

### Chat Format

Build a Llama-2-style chat prompt (other model families use different templates; check the model card):

```python
def format_chat(messages):
    text = ""
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "system":
            text += f"[INST] <<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == "user":
            text += f"{content} [/INST]"
        elif role == "assistant":
            text += f" {content}</s><s>[INST] "
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

prompt = format_chat(messages)
output = generator.generate_simple(prompt, settings, num_tokens=300)
```

## Server Mode

### Start Server

```bash
python -m exllamav2.server \
    --model_dir ./llama2-7b-exl2 \
    --host 0.0.0.0 \
    --port 8080 \
    --max_seq_len 4096 \
    --cache_size 4096
```

### API Usage

```python
import requests

response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Hello, how are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

print(response.json()["choices"][0]["text"])
```

### Chat Completions

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

print(response.choices[0].message.content)
```

## TabbyAPI (Recommended Server)

TabbyAPI provides a feature-rich ExLlamaV2 server:

```bash

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Install
pip install -r requirements.txt

# Configure: edit config.yml with your model path

# Run
python main.py
```

### TabbyAPI Features

* OpenAI-compatible API
* Multiple model support
* LoRA hot-swapping
* Streaming
* Function calling
* Admin API

## Speculative Decoding

Use a smaller model to accelerate generation:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Load main model (13B)
main_config = ExLlamaV2Config()
main_config.model_dir = "./llama2-13b-exl2"
main_config.prepare()
main_model = ExLlamaV2(main_config)
main_model.load()

# Load draft model (7B) -- it must share the main model's tokenizer
draft_config = ExLlamaV2Config()
draft_config.model_dir = "./llama2-7b-exl2"
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_model.load()

# Caches for both models, plus the shared tokenizer
cache_main = ExLlamaV2Cache(main_model)
cache_draft = ExLlamaV2Cache(draft_model)
tokenizer = ExLlamaV2Tokenizer(main_config)

# Passing a draft model to the streaming generator enables speculation
generator = ExLlamaV2StreamingGenerator(
    main_model, cache_main, tokenizer,
    draft_model=draft_model, draft_cache=cache_draft
)

# Generate (faster when the draft model's guesses are accepted)
output = generator.generate_simple(prompt, settings, num_tokens=500)
```
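
To see why this helps, here is a toy greedy version of the idea in plain Python (an illustration, not ExLlamaV2's implementation): the draft proposes `k` tokens cheaply, and the target keeps them up to the first disagreement, so several tokens can be accepted per verification round.

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target verifies them; tokens are kept up to the first mismatch,
    where the target's own token is substituted instead."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap model)
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target checks each proposal against its own greedy choice
        for t in proposal:
            expected = target(seq)
            if t == expected:
                seq.append(t)        # accepted draft token
            else:
                seq.append(expected) # rejected: take the target's token
                break
            if len(seq) - len(prompt) >= n_tokens:
                break
    return seq[len(prompt):]

# Toy "models": next token = last token + 1 (draft agrees with target here)
target_lm = lambda ctx: ctx[-1] + 1
draft_lm = lambda ctx: ctx[-1] + 1
print(speculative_decode(target_lm, draft_lm, [0], 5))  # [1, 2, 3, 4, 5]
```

Output quality is unchanged by construction: every emitted token is one the target model would have chosen anyway.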

## Quantize Your Own Models

### Convert to EXL2

```python
from exllamav2.conversion import convert_model

# Source: HuggingFace model; target: EXL2 quantized
convert_model(
    input_dir="./llama-3.1-8b-hf",
    output_dir="./llama-3.1-8b-exl2-4bpw",
    cal_dataset="wikitext",  # calibration dataset
    bits=4.0,                # bits per weight
    head_bits=6,             # higher precision for the output head layer
)
```

### Command Line

```bash

# -i: input HF model, -o: working directory for temporary files,
# -cf: output folder for the quantized model, -b: bits per weight, -hb: head bits
python convert.py \
    -i ./llama-3.1-8b-hf \
    -o ./exl2-work \
    -cf ./llama-3.1-8b-exl2 \
    -b 4.0 \
    -hb 6
```

## Memory Management

### Cache Allocation

```python

# Fixed cache size (allocated immediately for a loaded model)
cache = ExLlamaV2Cache(model, max_seq_len=4096)

# Lazy cache: allocation is deferred until model.load_autosplit(cache)
cache = ExLlamaV2Cache(model, lazy=True)
```
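
When choosing `max_seq_len`, note that an FP16 KV cache needs K and V tensors of `n_kv_heads × head_dim × seq_len` elements per layer. A sketch using Llama-2 7B's published shape (32 layers, 32 KV heads, head dim 128) as an illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """FP16 KV cache size: K and V tensors (hence the factor 2), each
    n_kv_heads * head_dim * seq_len elements per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-2 7B shape at 4096 context: 2 GiB of cache on top of the weights
gb = kv_cache_bytes(32, 32, 128, 4096) / 2**30
print(f"{gb:.1f} GiB")  # 2.0 GiB
```

Halving `max_seq_len` halves this figure, which is why a smaller cache is the first fix for out-of-memory errors.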

### Multi-GPU

```python
config = ExLlamaV2Config()
config.model_dir = "./large-model"
config.prepare()

model = ExLlamaV2(config)

# Manual split: VRAM budget per GPU, in GB
model.load(gpu_split=[20, 20])

# Or split automatically using a lazy cache:
# cache = ExLlamaV2Cache(model, lazy=True)
# model.load_autosplit(cache)
```

## Performance Comparison

| Model        | Engine    | GPU      | Tokens/sec |
| ------------ | --------- | -------- | ---------- |
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | \~150      |
| Llama 3.1 8B | llama.cpp | RTX 3090 | \~100      |
| Llama 3.1 8B | vLLM      | RTX 3090 | \~120      |
| Mixtral 8x7B | ExLlamaV2 | A100     | \~70       |

*Indicative single-stream figures; actual throughput depends on quantization, context length, and batch size.*

## Advanced Settings

### Sampling Parameters

```python
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.1
settings.token_frequency_penalty = 0.0
settings.token_presence_penalty = 0.0
settings.mirostat = False
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1
```
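
To make the main knobs concrete, here is an illustrative top-k / top-p sampler over a toy vocabulary. This is not ExLlamaV2's sampler, just what the parameters mean; the `rng` argument is an assumption added so the draw can be made deterministic:

```python
import math
import random

def sample(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random.random):
    """Illustrative top-k / top-p (nucleus) sampling over a {token: logit}
    dict. Temperature divides logits before softmax: lower = sharper."""
    ranked = sorted(((t, l / temperature) for t, l in logits.items()),
                    key=lambda kv: kv[1], reverse=True)[:top_k]  # keep top_k
    m = ranked[0][1]
    probs = [(t, math.exp(l - m)) for t, l in ranked]  # stable softmax
    z = sum(p for _, p in probs)
    probs = [(t, p / z) for t, p in probs]
    # top_p keeps the smallest high-probability prefix whose mass >= top_p
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the kept prefix and draw one token
    z = sum(p for _, p in kept)
    r = rng() * z
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]

logits = {"cat": 4.0, "dog": 3.0, "fish": 0.5}
print(sample(logits, temperature=0.7, top_k=2, top_p=1.0, rng=lambda: 0.0))  # cat
```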

### Batch Generation

The simple approach below runs prompts one after another; it batches the work, not the compute:

```python
prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "Climate change is"
]

outputs = []
for prompt in prompts:
    output = generator.generate_simple(prompt, settings, num_tokens=100)
    outputs.append(output)
```

## Troubleshooting

### CUDA Out of Memory

```python

# Use smaller cache
cache = ExLlamaV2Cache(model, max_seq_len=2048)

# Or lower bpw model (3.0 instead of 4.0)
```

### Slow Loading

```python

# Enable fast loading
config.fasttensors = True
```

### Model Not Found

```bash

# Check model files exist
ls ./model/

# Should contain: config.json, *.safetensors, tokenizer.json
```
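
The same check can be scripted. A small helper (hypothetical, not part of ExLlamaV2) that reports which of the required files are missing; note some models ship `tokenizer.model` instead of `tokenizer.json`:

```python
from pathlib import Path

def check_model_dir(model_dir):
    """Return the required pieces missing from an EXL2 model directory
    (per the file list above)."""
    d = Path(model_dir)
    missing = []
    if not (d / "config.json").is_file():
        missing.append("config.json")
    if not list(d.glob("*.safetensors")):
        missing.append("*.safetensors")
    if not any((d / n).is_file() for n in ("tokenizer.json", "tokenizer.model")):
        missing.append("tokenizer.json or tokenizer.model")
    return missing

print(check_model_dir("./model"))  # an empty list means the directory is complete
```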

## Integration with LangChain

```python
from typing import List, Optional

from langchain.llms.base import LLM

class ExLlamaV2LLM(LLM):
    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    generator: ExLlamaV2StreamingGenerator
    settings: ExLlamaV2Sampler.Settings

    class Config:
        arbitrary_types_allowed = True  # let pydantic hold ExLlamaV2 objects

    @property
    def _llm_type(self) -> str:
        return "exllamav2"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        return self.generator.generate_simple(prompt, self.settings, num_tokens=500)

# Usage
llm = ExLlamaV2LLM(model=model, tokenizer=tokenizer, generator=generator, settings=settings)
result = llm.invoke("What is quantum computing?")
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
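
For quick budgeting, the table translates directly into a small helper. The rates are the approximate 2024 figures above; the spot discount is a user-supplied assumption, not a guaranteed rate:

```python
RATES_PER_HOUR = {  # approximate 2024 on-demand rates from the table above (USD)
    "RTX 3060": 0.03,
    "RTX 3090": 0.06,
    "RTX 4090": 0.10,
    "A100 40GB": 0.17,
    "A100 80GB": 0.25,
}

def session_cost(gpu, hours, spot_discount=0.0):
    """Estimated rental cost; spot_discount is a fraction, e.g. 0.4 for 40%."""
    return RATES_PER_HOUR[gpu] * hours * (1 - spot_discount)

print(f"${session_cost('RTX 4090', 4):.2f}")       # on-demand 4-hour session
print(f"${session_cost('RTX 4090', 4, 0.4):.2f}")  # same session at a 40% spot discount
```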

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* vLLM Inference - High throughput serving
* [llama.cpp Server](https://docs.clore.ai/guides/language-models/llamacpp-server) - Cross-platform
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Web interface
