For the complete documentation index, see llms.txt. This page is also available as Markdown.

ExLlamaV2

Maximum speed LLM inference with ExLlamaV2 on Clore.ai GPUs

Run LLMs at maximum speed with ExLlamaV2.

Renting on CLORE.AI

  1. Filter by GPU type, VRAM, and price

  2. Choose On-Demand (fixed rate) or Spot (bid price)

  3. Configure your order:

    • Select Docker image

    • Set ports (TCP for SSH, HTTP for web UIs)

    • Add environment variables if needed

    • Enter startup command

  4. Select payment: CLORE, BTC, or USDT/USDC

  5. Create order and wait for deployment

Access Your Server

  • Find connection details in My Orders

  • Web interfaces: Use the HTTP port URL

  • SSH: ssh -p <port> root@<proxy-address>

What is ExLlamaV2?

ExLlamaV2 is the fastest inference engine for large language models:

  • 2-3x faster than other engines

  • Excellent quantization (EXL2)

  • Low VRAM usage

  • Supports speculative decoding

Requirements

Model Size
Min VRAM
Recommended

7B

6GB

RTX 3060

13B

10GB

RTX 3090

34B

20GB

RTX 4090

70B

40GB

A100

Quick Deploy

Docker Image:

Ports:

Command:

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

  1. Go to My Orders page

  2. Click on your order

  3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.

Installation

Download Models

EXL2 Quantized Models

Bits Per Weight (bpw)

BPW
Quality
VRAM (7B)

2.0

Low

~3GB

3.0

Good

~4GB

4.0

Great

~5GB

5.0

Excellent

~6GB

6.0

Near-FP16

~7GB

Python API

Basic Generation

Streaming Generation

Chat Format

Server Mode

Start Server

API Usage

Chat Completions

TabbyAPI provides a feature-rich ExLlamaV2 server:

TabbyAPI Features

  • OpenAI-compatible API

  • Multiple model support

  • LoRA hot-swapping

  • Streaming

  • Function calling

  • Admin API

Speculative Decoding

Use a smaller model to accelerate generation:

Quantize Your Own Models

Convert to EXL2

Command Line

Memory Management

Cache Allocation

Multi-GPU

Performance Comparison

Model
Engine
GPU
Tokens/sec

Llama 3.1 8B

ExLlamaV2

RTX 3090

~150

Llama 3.1 8B

llama.cpp

RTX 3090

~100

Llama 3.1 8B

vLLM

RTX 3090

~120

Llama 3.1 8B

ExLlamaV2

RTX 3090

~90

Mixtral 8x7B

ExLlamaV2

A100

~70

Advanced Settings

Sampling Parameters

Batch Generation

Troubleshooting

CUDA Out of Memory

Slow Loading

Model Not Found

Integration with LangChain

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

GPU
Hourly Rate
Daily Rate
4-Hour Session

RTX 3060

~$0.03

~$0.70

~$0.12

RTX 3090

~$0.06

~$1.50

~$0.25

RTX 4090

~$0.10

~$2.30

~$0.40

A100 40GB

~$0.17

~$4.00

~$0.70

A100 80GB

~$0.25

~$6.00

~$1.00

Prices vary by provider and demand. Check CLORE.AI Marketplace for current rates.

Save money:

  • Use Spot market for flexible workloads (often 30-50% cheaper)

  • Pay with CLORE tokens

  • Compare prices across different providers

Next Steps

Last updated

Was this helpful?