ExLlamaV2

Maximum-speed LLM inference with ExLlamaV2 on Clore.ai GPUs

Run LLMs at maximum speed with ExLlamaV2.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.

Renting on CLORE.AI

Visit the CLORE.AI Marketplace
Filter by GPU type, VRAM, and price
Choose On-Demand (fixed price) or Spot (bid price)
Configure your order:
- Select a Docker image
- Set ports (TCP for SSH, HTTP for web UIs)
- Add environment variables if needed
- Enter a startup command
Select a payment method: CLORE, BTC, or USDT/USDC
Create the order and wait for deployment

Accessing your server

Find the connection details under My Orders
Web interfaces: use the HTTP port URL
SSH: ssh -p <port> root@<proxy-address>

What is ExLlamaV2?

ExLlamaV2 is one of the fastest inference engines for large language models:

Up to 2–3x faster than other engines, depending on model and hardware
High-quality quantization (EXL2)
Low VRAM usage
Supports speculative decoding

Requirements

Model size | Min. VRAM

Quick deployment

Docker image:

pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

Ports:

22/tcp
8080/http

Command:

pip install exllamav2 && \
huggingface-cli download turboderp/Llama2-7B-exl2 --revision 4.0bpw --local-dir ./model && \
python -m exllamav2.server --model_dir ./model --host 0.0.0.0 --port 8080

Accessing your service

After deployment, find your http_pub URL under My Orders:

Go to the My Orders page
Click your order
Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in the examples below.
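As a small illustration of that substitution, the sketch below rewrites a localhost example URL so it points at a deployed service. The helper name `to_public_url` is hypothetical, and `abc123.clorecloud.net` is only the placeholder host from the steps above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_public_url(local_url: str, http_pub: str) -> str:
    """Swap the scheme and host of a localhost URL for the http_pub host."""
    parts = urlsplit(local_url)
    # Keep path, query, and fragment; switch to HTTPS and the public host.
    return urlunsplit(("https", http_pub, parts.path, parts.query, parts.fragment))


print(to_public_url("http://localhost:8080/v1/completions", "abc123.clorecloud.net"))
```

The port is dropped on purpose, since the http_pub URL already maps to the HTTP port configured for the order.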

Installation


# Install from PyPI
pip install exllamav2

# Or from source (latest features)
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .

Downloading models

EXL2 quantized models

# Llama 2 7B (4.0 bpw)
huggingface-cli download turboderp/Llama2-7B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-7b-exl2

# Llama 2 13B (4.0 bpw)
huggingface-cli download turboderp/Llama2-13B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-13b-exl2

# Mistral 7B (4.0 bpw)
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mistral-7b-exl2

# Mixtral 8x7B (4.0 bpw)
huggingface-cli download turboderp/Mixtral-8x7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mixtral-exl2

Bits per weight (bpw)

| BPW | Quality | VRAM (7B) |
|-----|---------|-----------|
| 2.0 | Low | ~3 GB |
| 3.0 | Good | ~4 GB |
| 4.0 | Very good | ~5 GB |
| 5.0 | Excellent | ~6 GB |
| 6.0 | Near FP16 | ~7 GB |
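The VRAM column can be sanity-checked with simple arithmetic: quantized weights take roughly parameters × bpw / 8 bytes. The sketch below computes only that weight footprint. The cache and runtime overhead come on top, which is presumably why the table values are higher. The function name is illustrative, and 7e9 parameters is a round-number assumption for a 7B model.

```python
def weight_size_gb(num_params: float, bpw: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bpw / 8 / 1e9


for bpw in (2.0, 3.0, 4.0, 5.0, 6.0):
    print(f"{bpw} bpw: ~{weight_size_gb(7e9, bpw):.2f} GB of weights")
```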

Python API

Basic generation

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load the model configuration
config = ExLlamaV2Config()
config.model_dir = "./llama2-7b-exl2"
config.prepare()

# Create the model and a lazy cache, then load the weights;
# load_autosplit allocates the cache while it loads the model
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# Create the generator
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Set the sampling parameters
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9

# Generate
prompt = "The future of artificial intelligence is"
output = generator.generate_simple(prompt, settings, num_tokens=200)
print(output)

Streaming generation

from exllamav2.generator import ExLlamaV2StreamingGenerator

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

prompt = "Write a short story about a robot:"
input_ids = tokenizer.encode(prompt)

# Stop at the end-of-sequence token
generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(input_ids, settings)

while True:
    chunk, eos, _ = generator.stream()
    # Print before checking eos so the final chunk is not dropped
    print(chunk, end="", flush=True)
    if eos:
        break

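Chat format

def format_chat(messages):
    """Build a Llama 2 chat prompt from a list of role/content messages."""
    text = ""
    inst_open = False  # True while an [INST] block is waiting for user content
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "system":
            # The system prompt sits inside the first [INST] block
            text += f"[INST] <<SYS>>\n{content}\n<</SYS>>\n\n"
            inst_open = True
        elif role == "user":
            # Open [INST] here if no system prompt or assistant turn did
            prefix = "" if inst_open else "[INST] "
            text += f"{prefix}{content} [/INST]"
            inst_open = False
        elif role == "assistant":
            text += f" {content}</s><s>[INST] "
            inst_open = True
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

prompt = format_chat(messages)
output = generator.generate_simple(prompt, settings, num_tokens=300)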

Server mode

Starting the server

python -m exllamav2.server \
    --model_dir ./llama2-7b-exl2 \
    --host 0.0.0.0 \
    --port 8080 \
    --max_seq_len 4096 \
    --cache_size 4096

API usage

import requests

response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Hello, how are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

print(response.json()["choices"][0]["text"])
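A client may want to validate the response before indexing into it. The following sketch assumes the server returns an OpenAI-style completions body, as the example above implies. `extract_text` is a hypothetical helper, and the sample dictionary is made-up illustration data, not real server output.

```python
from typing import Any, Dict


def extract_text(body: Dict[str, Any]) -> str:
    """Return the first completion text, or raise a descriptive error."""
    if "error" in body:
        raise RuntimeError(f"server returned an error: {body['error']}")
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0].get("text", "")


# Made-up sample in the OpenAI completions shape (assumption).
sample = {"choices": [{"text": " I am fine, thank you.", "finish_reason": "stop"}]}
print(extract_text(sample).strip())
```

With the request from the example above, this would be called as `extract_text(response.json())`.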

Chat completions

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

print(response.choices[0].message.content)

TabbyAPI (recommended server)

TabbyAPI provides a feature-rich server for ExLlamaV2:

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Install dependencies
pip install -r requirements.txt

# Configure: edit config.yml and set your model path

# Run
python main.py

TabbyAPI features

OpenAI-compatible API
Multi-model support
LoRA hot-swapping
Streaming
Function calling
Admin API

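Speculative decoding

Use a smaller draft model to speed up generation. The draft model proposes tokens and the main model verifies them, so both models must share the same tokenizer:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Load the main model (13B)
main_config = ExLlamaV2Config()
main_config.model_dir = "./llama2-13b-exl2"
main_config.prepare()
main_model = ExLlamaV2(main_config)
main_cache = ExLlamaV2Cache(main_model, lazy=True)
main_model.load_autosplit(main_cache)

# Load the draft model (7B)
draft_config = ExLlamaV2Config()
draft_config.model_dir = "./llama2-7b-exl2"
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)

tokenizer = ExLlamaV2Tokenizer(main_config)

# Create a speculative generator: the streaming generator takes an
# optional draft model and draft cache
generator = ExLlamaV2StreamingGenerator(
    main_model, main_cache, tokenizer,
    draft_model=draft_model, draft_cache=draft_cache,
    num_speculative_tokens=5,
)

# Generate (faster with speculation); reuses prompt and settings from above.
# Speculation runs in the streaming path, so collect the streamed chunks.
generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(tokenizer.encode(prompt), settings)
output = ""
for _ in range(500):  # at most 500 streaming steps
    chunk, eos, _tokens = generator.stream()
    output += chunk
    if eos:
        break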
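To make the idea concrete, here is a toy, library-free sketch of greedy speculative decoding. It is not ExLlamaV2's implementation. The two "models" are hypothetical stand-in functions over digit tokens, chosen so that the draft is usually right and sometimes wrong. The point it illustrates is that the output matches plain greedy decoding with the target model while needing fewer verification rounds.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2) The target checks the proposal. A real engine scores all k
        #    positions in one batched forward pass; here it counts as one round.
        rounds += 1
        ctx = list(tokens)
        for tok in proposal:
            expected = target(ctx)
            ctx.append(expected)  # always keep the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        else:
            ctx.append(target(ctx))  # whole draft accepted: one bonus token
        tokens = ctx
    return tokens[: len(prompt) + max_new], rounds


# Toy "models" over digits: the target counts upward, the draft errs after 5.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


spec_tokens, spec_rounds = speculative_decode(target_model, draft_model, [0], max_new=12)
print(spec_tokens, "in", spec_rounds, "verification rounds")
```

The speedup in practice depends on how often the draft agrees with the main model. A draft that rarely matches adds work instead of saving it.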

Quantizing your own models

Converting to EXL2

from exllamav2 import ExLlamaV2, ExLlamaV2Config
from exllamav2.conversion import convert_model

# Source: Hugging Face model
# Target: EXL2 quantized model
convert_model(
    input_dir="./llama-3.1-8b-hf",
    output_dir="./llama-3.1-8b-exl2-4bpw",
    cal_dataset="wikitext",  # Calibration dataset
    bits=4.0,  # Bits per weight
    head_bits=6,  # Higher precision for the output (head) layer
)

Command line

python convert.py \
    -i ./llama-3.1-8b-hf \
    -o ./llama-3.1-8b-exl2 \
    -cf ./llama-3.1-8b-exl2 \
    -b 4.0 \
    -hb 6

Memory management

Cache allocation

# Fixed cache size
cache = ExLlamaV2Cache(model, max_seq_len=4096)

# Lazy cache: allocation is deferred until the model is loaded with load_autosplit
cache = ExLlamaV2Cache(model, lazy=True)
cache.current_seq_len = 0  # Start from an empty cache
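How much memory a cache of a given length needs can be estimated from the model architecture. The sketch below uses the standard key/value cache formula for an FP16 cache. The Llama 2 7B figures (32 layers, 32 KV heads, head dimension 128) come from that model's published configuration rather than from this page, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


for seq_len in (2048, 4096):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 1024**3
    print(f"max_seq_len={seq_len}: {gib:.1f} GiB")
```

Because the size scales linearly with `max_seq_len`, halving the sequence length halves the cache.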

Multi-GPU

config = ExLlamaV2Config()
config.model_dir = "./large-model"
config.prepare()

# Split across multiple GPUs: load_autosplit fills the visible GPUs in turn
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

Performance comparison

| Model | Engine | GPU | Tokens/sec |
|-------|--------|-----|------------|
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | ~150 |
| Llama 3.1 8B | llama.cpp | RTX 3090 | ~100 |
| Llama 3.1 8B | vLLM | RTX 3090 | ~120 |
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | ~90 |
| Mixtral 8x7B | ExLlamaV2 | A100 | ~70 |
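To read these figures as wall-clock time, the sketch below converts throughput into seconds per 1000 generated tokens and into a relative speed. It uses only the first three RTX 3090 rows for Llama 3.1 8B. The numbers are copied from the table, not measured here, and the helper name is illustrative.

```python
# Tokens/sec for Llama 3.1 8B on an RTX 3090, copied from the first three rows.
throughput = {"ExLlamaV2": 150, "llama.cpp": 100, "vLLM": 120}


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `tokens` at a steady throughput."""
    return tokens / tokens_per_sec


baseline = throughput["ExLlamaV2"]
for engine, tps in throughput.items():
    print(f"{engine}: {seconds_for(1000, tps):.1f} s per 1000 tokens, "
          f"ExLlamaV2 is {baseline / tps:.2f}x as fast")
```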

Advanced settings

Sampling parameters

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.1
settings.token_frequency_penalty = 0.0
settings.token_presence_penalty = 0.0
settings.mirostat = False
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1
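For intuition about what `temperature`, `top_k`, and `top_p` do, here is a library-free sketch of the general technique. It is not ExLlamaV2's internal sampler. The order of the filters (top-k first, then top-p on the renormalized survivors) is an assumption of this sketch, and the logits are made-up values.

```python
import math
from typing import Dict, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs: List[float], top_k: int, top_p: float) -> Dict[int, float]:
    """Keep the top_k tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {index: p / kept_mass for index, p in kept}  # renormalized


probs = softmax_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.7)
print(filter_top_k_top_p(probs, top_k=3, top_p=0.9))
```

The returned dictionary maps token index to its renormalized probability. A sampler would then draw the next token from this reduced distribution.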

Batch generation

prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "Climate change is"
]

# Process the prompts one at a time and collect the outputs
outputs = []
for prompt in prompts:
    output = generator.generate_simple(prompt, settings, num_tokens=100)
    outputs.append(output)

Troubleshooting

CUDA out of memory

# Use a smaller cache
cache = ExLlamaV2Cache(model, max_seq_len=2048)

# Or use a model with a lower bpw (3.0 instead of 4.0)

Slow loading

# Enable fast loading
config.fasttensors = True

Model not found

# Check that the model files exist
ls ./model/

# Should contain: config.json, *.safetensors, tokenizer.json
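The same check can be done from Python before loading, which gives a clearer message than a failed load. This is a minimal sketch using only the standard library. The function name is hypothetical, and the expected file list is the one given above.

```python
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in ("config.json", "tokenizer.json")
               if not (root / name).is_file()]
    # At least one weights file with the .safetensors extension is required.
    has_weights = root.is_dir() and any(root.glob("*.safetensors"))
    if not has_weights:
        missing.append("*.safetensors")
    return missing


print(missing_model_files("./model") or "all expected files found")
```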

Integration with LangChain

from typing import Any, List, Optional

from langchain.llms.base import LLM
from exllamav2 import ExLlamaV2, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

class ExLlamaV2LLM(LLM):
    # Reuses the model, tokenizer, generator, and settings created earlier
    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    generator: ExLlamaV2StreamingGenerator
    settings: ExLlamaV2Sampler.Settings

    @property
    def _llm_type(self) -> str:
        return "exllamav2"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # Stop sequences are not applied in this minimal wrapper
        return self.generator.generate_simple(prompt, self.settings, num_tokens=500)

# Usage
llm = ExLlamaV2LLM(model=model, tokenizer=tokenizer, generator=generator, settings=settings)
result = llm.invoke("What is quantum computing?")

Cost estimation

Typical CLORE.AI Marketplace rates (as of 2024):

| GPU | Hourly rate | Daily rate | 4-hour session |
|-----|-------------|------------|----------------|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current prices.

Saving money:

Use the Spot market for flexible workloads (often 30–50% cheaper)
Pay with CLORE tokens
Compare prices across providers
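The session and daily figures follow from the hourly rate, and the spot savings can be estimated the same way. The sketch below is a rough calculator under the assumptions stated on this page: the rate is the approximate RTX 3090 value from the table, and the 30–50% spot discount is the range quoted above, not a guaranteed price. The function name is illustrative.

```python
def session_cost(hourly_rate: float, hours: float, discount: float = 0.0) -> float:
    """Cost in USD for a rental, with an optional fractional discount."""
    return hourly_rate * hours * (1.0 - discount)


rate = 0.06  # approximate RTX 3090 hourly rate from the table
print(f"4 h on-demand:  ${session_cost(rate, 4):.2f}")
print(f"24 h on-demand: ${session_cost(rate, 24):.2f}")
print(f"4 h spot at 30-50% off: ${session_cost(rate, 4, 0.5):.2f} "
      f"to ${session_cost(rate, 4, 0.3):.2f}")
```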

Next steps

vLLM inference - high-throughput serving
llama.cpp server - cross-platform
Text Generation WebUI - web interface
