# ExLlamaV2

Run LLMs at maximum speed with ExLlamaV2.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is ExLlamaV2?

ExLlamaV2 is one of the fastest inference engines for running large language models on local GPUs:

* Often 2-3x faster than comparable local engines for single-stream generation
* EXL2 quantization with fractional bits per weight
* Low VRAM usage
* Supports speculative decoding

## Requirements

| Model Size | Min VRAM | Recommended |
| ---------- | -------- | ----------- |
| 7B         | 6GB      | RTX 3060    |
| 13B        | 10GB     | RTX 3090    |
| 34B        | 20GB     | RTX 4090    |
| 70B        | 40GB     | A100        |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```

**Ports:**

```
22/tcp
8080/http
```

**Command:**

```bash
pip install exllamav2 huggingface_hub && \
huggingface-cli download turboderp/Llama2-7B-exl2 --revision 4.0bpw --local-dir ./model && \
python -m exllamav2.server --model_dir ./model --host 0.0.0.0 --port 8080
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
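
For example, you can verify the deployment from your local machine with a quick request. The hostname below is a placeholder (substitute your own `http_pub` URL), and the endpoint assumes the OpenAI-style completions route used in the server examples further down:

```python
import requests

# Placeholder: replace with the http_pub URL from My Orders
BASE_URL = "https://abc123.clorecloud.net"

# Assumes the OpenAI-style completions endpoint shown in the API examples below
response = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"prompt": "Hello!", "max_tokens": 32},
    timeout=60,
)
print(response.status_code)
print(response.json())
```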

## Installation

```bash

# Install from PyPI
pip install exllamav2

# Or from source (latest features)
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .
```

## Download Models

### EXL2 Quantized Models

```bash

# Llama 2 7B (4.0 bpw)
huggingface-cli download turboderp/Llama2-7B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-7b-exl2

# Llama 2 13B (4.0 bpw)
huggingface-cli download turboderp/Llama2-13B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-13b-exl2

# Mistral 7B (4.0 bpw)
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mistral-7b-exl2

# Mixtral 8x7B
huggingface-cli download turboderp/Mixtral-8x7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mixtral-exl2
```
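
If you prefer to script downloads, the same repositories can be fetched with the `huggingface_hub` Python package. A minimal sketch, with the repository and revision names taken from the commands above:

```python
from huggingface_hub import snapshot_download

# Download the 4.0 bpw branch of the Llama 2 7B EXL2 repo
snapshot_download(
    repo_id="turboderp/Llama2-7B-exl2",
    revision="4.0bpw",
    local_dir="./llama2-7b-exl2",
)
```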

### Bits Per Weight (bpw)

| BPW | Quality   | VRAM (7B) |
| --- | --------- | --------- |
| 2.0 | Low       | \~3GB     |
| 3.0 | Good      | \~4GB     |
| 4.0 | Great     | \~5GB     |
| 5.0 | Excellent | \~6GB     |
| 6.0 | Near-FP16 | \~7GB     |
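
As a rough rule of thumb, weight memory is about `parameters × bpw / 8` bytes, plus headroom for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bpw: float, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weights at `bpw` bits each, plus ~20% headroom
    for KV cache and activations (the overhead factor is a guess, not a benchmark)."""
    weight_gb = params_billion * 1e9 * bpw / 8 / 1e9
    return weight_gb * overhead

print(f"7B @ 4.0 bpw  ~ {estimate_vram_gb(7, 4.0):.1f} GB")   # ~4.2 GB
print(f"70B @ 4.0 bpw ~ {estimate_vram_gb(70, 4.0):.1f} GB")  # ~42 GB
```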

## Python API

### Basic Generation

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load model
config = ExLlamaV2Config()
config.model_dir = "./llama2-7b-exl2"
config.prepare()

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

# Create generator
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Set sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9

# Generate
prompt = "The future of artificial intelligence is"
output = generator.generate_simple(prompt, settings, num_tokens=200)
print(output)
```

### Streaming Generation

```python
from exllamav2.generator import ExLlamaV2StreamingGenerator

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

prompt = "Write a short story about a robot:"
input_ids = tokenizer.encode(prompt)

generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(input_ids, settings)

max_new_tokens = 300
generated_tokens = 0

while True:
    chunk, eos, _ = generator.stream()
    print(chunk, end="", flush=True)
    generated_tokens += 1
    if eos or generated_tokens >= max_new_tokens:
        break
```

### Chat Format

```python
def format_chat(messages):
    text = ""
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "system":
            text += f"[INST] <<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == "user":
            text += f"{content} [/INST]"
        elif role == "assistant":
            text += f" {content}</s><s>[INST] "
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

prompt = format_chat(messages)
output = generator.generate_simple(prompt, settings, num_tokens=300)
```
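
Hand-rolled templates are easy to get subtly wrong. If the model directory ships tokenizer files with a chat template (most recent instruct models do), you can let `transformers` build the prompt instead. A sketch that reuses `messages`, `generator`, and `settings` from above and assumes `transformers` is installed and the model folder includes a `chat_template`:

```python
from transformers import AutoTokenizer

# Build the prompt from the model's own chat template, then feed it to the
# ExLlamaV2 generator as a plain string
hf_tokenizer = AutoTokenizer.from_pretrained("./llama2-7b-exl2")
prompt = hf_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
output = generator.generate_simple(prompt, settings, num_tokens=300)
```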

## Server Mode

### Start Server

```bash
python -m exllamav2.server \
    --model_dir ./llama2-7b-exl2 \
    --host 0.0.0.0 \
    --port 8080 \
    --max_seq_len 4096 \
    --cache_size 4096
```

### API Usage

```python
import requests

response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Hello, how are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

print(response.json()["choices"][0]["text"])
```

### Chat Completions

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

print(response.choices[0].message.content)
```
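
The OpenAI client can also stream tokens as they are produced, provided the server supports streaming responses (most OpenAI-compatible servers, including TabbyAPI below, do):

```python
# Stream tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```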

## TabbyAPI (Recommended Server)

TabbyAPI provides a feature-rich ExLlamaV2 server:

```bash

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Install
pip install -r requirements.txt

# Configure

# Edit config.yml with your model path

# Run
python main.py
```

### TabbyAPI Features

* OpenAI-compatible API
* Multiple model support
* LoRA hot-swapping
* Streaming
* Function calling
* Admin API
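
Because TabbyAPI exposes the same OpenAI-compatible protocol, the client code shown earlier works against it as well. A sketch assuming TabbyAPI's default port and an API key taken from its generated `api_tokens.yml` (both may differ in your configuration):

```python
import openai

# Assumes TabbyAPI's default port (5000) and an API key from api_tokens.yml;
# adjust both to match your config.yml and deployment
client = openai.OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="YOUR_TABBY_API_KEY",
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello from TabbyAPI!"}],
)
print(response.choices[0].message.content)
```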

## Speculative Decoding

Use a smaller model to accelerate generation:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load main model (13B)
main_config = ExLlamaV2Config()
main_config.model_dir = "./llama2-13b-exl2"
main_config.prepare()
main_model = ExLlamaV2(main_config)
main_model.load()

# Load draft model (7B)
draft_config = ExLlamaV2Config()
draft_config.model_dir = "./llama2-7b-exl2"
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_model.load()

# Each model needs its own cache; the tokenizer is shared
tokenizer = ExLlamaV2Tokenizer(main_config)
cache_main = ExLlamaV2Cache(main_model)
cache_draft = ExLlamaV2Cache(draft_model)

# The streaming generator accepts a draft model/cache for speculative decoding
# (keyword names may vary slightly between exllamav2 versions)
generator = ExLlamaV2StreamingGenerator(
    main_model, cache_main, tokenizer,
    draft_model=draft_model,
    draft_cache=cache_draft
)

settings = ExLlamaV2Sampler.Settings()
prompt = "Explain speculative decoding in one paragraph."

# Generate (faster when the draft model's guesses are accepted)
output = generator.generate_simple(prompt, settings, num_tokens=500)
print(output)
```

## Quantize Your Own Models

### Convert to EXL2

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config
from exllamav2.conversion import convert_model

# Source: HuggingFace model

# Target: EXL2 quantized

convert_model(
    input_dir="./llama-3.1-8b-hf",
    output_dir="./llama-3.1-8b-exl2-4bpw",
    cal_dataset="wikitext",  # Calibration dataset
    bits=4.0,  # Bits per weight
    head_bits=6,  # Higher precision for attention
)
```

### Command Line

```bash
# -i: source HF model, -o: working directory for the job, -cf: output folder
python convert.py \
    -i ./llama-3.1-8b-hf \
    -o ./exl2-workdir \
    -cf ./llama-3.1-8b-exl2 \
    -b 4.0 \
    -hb 6
```

## Memory Management

### Cache Allocation

```python

# Fixed cache size (allocated up front)
cache = ExLlamaV2Cache(model, max_seq_len=4096)

# Lazy cache: allocation is deferred until the model is loaded with
# model.load_autosplit(cache), typically when splitting across GPUs (see below)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
```
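
Recent exllamav2 releases also include quantized KV cache variants that significantly cut cache VRAM. The class names below (`ExLlamaV2Cache_8bit`, `ExLlamaV2Cache_Q4`) exist in current versions, but check them against the version you have installed:

```python
# Quantized KV caches trade a little accuracy for much lower cache VRAM.
# Names assumed from recent exllamav2 versions; verify against your install.
from exllamav2 import ExLlamaV2Cache_8bit, ExLlamaV2Cache_Q4

cache_fp8 = ExLlamaV2Cache_8bit(model, max_seq_len=4096)  # roughly half the FP16 cache size
cache_q4 = ExLlamaV2Cache_Q4(model, max_seq_len=4096)     # roughly a quarter of the FP16 cache size
```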

### Multi-GPU

```python
config = ExLlamaV2Config()
config.model_dir = "./large-model"

# Split across GPUs
config.set_auto_split([0.5, 0.5])  # 50% each GPU

model = ExLlamaV2(config)
model.load()
```

## Performance Comparison

| Model        | Engine    | GPU      | Tokens/sec |
| ------------ | --------- | -------- | ---------- |
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | \~150      |
| Llama 3.1 8B | llama.cpp | RTX 3090 | \~100      |
| Llama 3.1 8B | vLLM      | RTX 3090 | \~120      |
| Mixtral 8x7B | ExLlamaV2 | A100     | \~70       |
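
Throughput depends heavily on quantization level, context length, and driver/CUDA versions, so it is worth measuring on the exact hardware you rent. A minimal timing sketch, reusing the `generator` and `settings` from the Python API section:

```python
import time

prompt = "Explain the theory of relativity in simple terms."
num_tokens = 256

start = time.time()
output = generator.generate_simple(prompt, settings, num_tokens=num_tokens)
elapsed = time.time() - start

# Rough throughput estimate, assuming close to num_tokens were actually generated
print(f"~{num_tokens / elapsed:.1f} tokens/sec over {elapsed:.1f}s")
```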

## Advanced Settings

### Sampling Parameters

```python
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.1
settings.token_frequency_penalty = 0.0
settings.token_presence_penalty = 0.0
settings.mirostat = False
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1
```

### Batch Generation

```python
prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "Climate change is"
]

# Note: this loops over the prompts sequentially, one generation at a time
outputs = []
for prompt in prompts:
    output = generator.generate_simple(prompt, settings, num_tokens=100)
    outputs.append(output)
```
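
The loop above runs the prompts one after another. For genuinely parallel batching, newer exllamav2 versions include a dynamic generator that accepts a list of prompts. A sketch, assuming a recent version where `ExLlamaV2DynamicGenerator` is available (constructor and `generate` arguments may differ slightly between versions):

```python
from exllamav2.generator import ExLlamaV2DynamicGenerator

# The dynamic generator batches requests internally (recent exllamav2 versions)
dyn_generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
)

# Passing a list of prompts returns a list of completions
outputs = dyn_generator.generate(
    prompt=prompts,
    max_new_tokens=100,
    gen_settings=settings,
)
for out in outputs:
    print(out)
```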

## Troubleshooting

### CUDA Out of Memory

```python

# Use smaller cache
cache = ExLlamaV2Cache(model, max_seq_len=2048)

# Or lower bpw model (3.0 instead of 4.0)
```

### Slow Loading

```python

# Enable fast loading
config.fasttensors = True
```

### Model Not Found

```bash

# Check model files exist
ls ./model/

# Should contain: config.json, *.safetensors, tokenizer.json
```

## Integration with LangChain

```python
from langchain.llms.base import LLM
from typing import Optional, List

from exllamav2 import ExLlamaV2, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

class ExLlamaV2LLM(LLM):
    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    generator: ExLlamaV2StreamingGenerator
    settings: ExLlamaV2Sampler.Settings

    class Config:
        arbitrary_types_allowed = True  # the fields above are not pydantic models

    @property
    def _llm_type(self) -> str:
        return "exllamav2"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        return self.generator.generate_simple(prompt, self.settings, num_tokens=500)

# Usage
llm = ExLlamaV2LLM(model=model, tokenizer=tokenizer, generator=generator, settings=settings)
result = llm("What is quantum computing?")
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* vLLM Inference - High throughput serving
* [llama.cpp Server](https://docs.clore.ai/guides/language-models/llamacpp-server) - Cross-platform
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Web interface

