ExLlamaV2

Maximum-speed LLM inference with ExLlamaV2 on Clore.ai GPUs

Run LLMs at maximum speed with ExLlamaV2.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.

Renting on CLORE.AI

1. Go to the CLORE.AI Marketplace
2. Filter by GPU type, VRAM, and price
3. Choose On-Demand (fixed rate) or Spot (bid price)
4. Configure your order:
   - Select a Docker image
   - Set ports (TCP for SSH, HTTP for the web UI)
   - Add environment variables if needed
   - Enter the startup command
5. Choose a payment method: CLORE, BTC, or USDT/USDC
6. Create the order and wait for deployment

Accessing your server

- Find the connection details under My Orders
- Web interface: use the HTTP port URL
- SSH: ssh -p <port> root@<proxy-address>

What is ExLlamaV2?

ExLlamaV2 is a fast inference engine for running quantized large language models on GPUs:

- Reported as 2-3x faster than other engines, depending on model, quantization, and hardware
- High-quality quantization (EXL2)
- Low VRAM usage
- Supports speculative decoding

Requirements

| Model size | Minimum VRAM | Recommended GPU |
|---|---|---|
| 7B | 6GB | RTX 3060 |
| 13B | 10GB | RTX 3090 |
| 34B | 20GB | RTX 4090 |
| 70B | 40GB | A100 |

Quick deployment

Docker image:

pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

Ports:

22/tcp
8080/http

Command:

pip install exllamav2 && \
huggingface-cli download turboderp/Llama2-7B-exl2 --revision 4.0bpw --local-dir ./model && \
python -m exllamav2.server --model_dir ./model --host 0.0.0.0 --port 8080

Accessing your service

After deployment, find your http_pub URL under My Orders:

1. Go to the My Orders page
2. Click your order
3. Find the http_pub URL (e.g., abc123.clorecloud.net)

In the examples below, use https://YOUR_HTTP_PUB_URL instead of localhost.
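
One way to avoid editing every example by hand is to resolve the base URL in a single place. The sketch below is only an illustration: the environment variable name `CLORE_HTTP_PUB_URL` is a hypothetical choice, not something CLORE.AI defines, and the default assumes the server listens on port 8080 as in the quick deployment.

```python
import os


def api_base_url(default="http://localhost:8080"):
    """Return the server base URL, preferring a remote http_pub host if one is set."""
    # CLORE_HTTP_PUB_URL is a hypothetical variable name used for illustration only.
    host = os.environ.get("CLORE_HTTP_PUB_URL", "").strip()
    if not host:
        return default
    # Accept either a bare host (abc123.clorecloud.net) or a full URL.
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host.rstrip("/")


print(api_base_url() + "/v1/completions")
```

With the variable unset, the examples keep talking to localhost; exporting it once switches them all to the rented server.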

Installation

# Install from PyPI
pip install exllamav2

# Or from source (latest features)
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .

Downloading models

EXL2 quantized models

# Llama 2 7B (4.0 bpw)
huggingface-cli download turboderp/Llama2-7B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-7b-exl2

# Llama 2 13B (4.0 bpw)
huggingface-cli download turboderp/Llama2-13B-exl2 \
    --revision 4.0bpw \
    --local-dir ./llama2-13b-exl2

# Mistral 7B Instruct (4.0 bpw)
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mistral-7b-exl2

# Mixtral 8x7B Instruct (4.0 bpw)
huggingface-cli download turboderp/Mixtral-8x7B-instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./mixtral-exl2

Bits per weight (bpw)

| bpw | Quality | VRAM (7B) |
|---|---|---|
| 2.0 | Low | ~3GB |
| 3.0 | Good | ~4GB |
| 4.0 | Very good | ~5GB |
| 5.0 | Excellent | ~6GB |
| 6.0 | Near-FP16 | ~7GB |
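
As a rough cross-check of the VRAM column, the size of the quantized weights alone can be estimated as parameters × bpw / 8 bytes. The sketch below uses that arithmetic; the function name is illustrative, and the result covers only the weights, so the table's figures are presumably higher because they also include the cache and runtime overhead.

```python
def weight_memory_gb(num_params, bpw):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bpw / 8 / 1e9


# Weights-only estimate for a 7B model at each bpw level from the table
for bpw in (2.0, 3.0, 4.0, 5.0, 6.0):
    print(f"{bpw} bpw: ~{weight_memory_gb(7e9, bpw):.2f} GB of weights")
```

For example, a 7B model at 4.0 bpw needs about 3.5 GB for weights, which leaves roughly 1.5 GB of the ~5GB figure for everything else.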

Python API

Basic generation

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load the model configuration
config = ExLlamaV2Config()
config.model_dir = "./llama2-7b-exl2"
config.prepare()

model = ExLlamaV2(config)

# Create the cache lazily, then let load_autosplit allocate it while loading the weights
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# Create the generator
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Configure sampling
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9

# Generate
prompt = "The future of artificial intelligence is"
output = generator.generate_simple(prompt, settings, num_tokens=200)
print(output)

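Streaming generation

from exllamav2.generator import ExLlamaV2StreamingGenerator

# Reuses model, cache, tokenizer, and settings from the basic generation example
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

prompt = "Write a short story about a robot:"
input_ids = tokenizer.encode(prompt)

generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(input_ids, settings)

max_new_tokens = 300  # upper bound so the loop always terminates
for _ in range(max_new_tokens):
    chunk, eos, _tokens = generator.stream()
    # Print before checking EOS so the final chunk is not dropped
    print(chunk, end="", flush=True)
    if eos:
        break
print()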

Chat format

def format_chat(messages):
    """Build a Llama 2 chat prompt from OpenAI-style messages."""
    text = ""
    inst_open = False  # True while an "[INST] " block is open for the current turn
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "system":
            text += f"[INST] <<SYS>>\n{content}\n<</SYS>>\n\n"
            inst_open = True
        elif role == "user":
            if not inst_open:
                text += "[INST] "  # open the turn when no system message did
            text += f"{content} [/INST]"
            inst_open = False
        elif role == "assistant":
            text += f" {content} </s><s>"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

prompt = format_chat(messages)
output = generator.generate_simple(prompt, settings, num_tokens=300)
print(output)

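Server mode

Start the server

The exllamav2.server entry point and the flags below depend on the installed ExLlamaV2 version. If the module is not available in your installation, use TabbyAPI (described later on this page) as the server.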

python -m exllamav2.server \
    --model_dir ./llama2-7b-exl2 \
    --host 0.0.0.0 \
    --port 8080 \
    --max_seq_len 4096 \
    --cache_size 4096

API usage

import requests

response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Hello, how are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

print(response.json()["choices"][0]["text"])

Chat completions

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama2-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

print(response.choices[0].message.content)

TabbyAPI (recommended server)

TabbyAPI provides a feature-rich server for ExLlamaV2:

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Install dependencies
pip install -r requirements.txt

# Configure: edit config.yml and set your model path

# Run
python main.py

TabbyAPI features

- OpenAI-compatible API
- Multi-model support
- LoRA hot-swapping
- Streaming
- Function calling
- Admin API

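Speculative decoding

Use a smaller draft model to speed up generation. The draft model proposes several tokens ahead and the main model verifies them:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load_model(model_dir):
    """Load a model and allocate its cache; returns (config, model, cache)."""
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    model.load()
    cache = ExLlamaV2Cache(model)
    return config, model, cache

# Main model (13B) and draft model (7B); both must use the same tokenizer
main_config, main_model, main_cache = load_model("./llama2-13b-exl2")
_, draft_model, draft_cache = load_model("./llama2-7b-exl2")
tokenizer = ExLlamaV2Tokenizer(main_config)

# Passing a draft model and its cache to the streaming generator enables speculation
# (argument names follow the ExLlamaV2 examples; verify against your installed version)
generator = ExLlamaV2StreamingGenerator(
    main_model, main_cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,
)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# Stream the output; the draft model is consulted inside stream()
prompt = "The future of artificial intelligence is"
generator.set_stop_conditions([tokenizer.eos_token_id])
generator.begin_stream(tokenizer.encode(prompt), settings)
for _ in range(500):
    chunk, eos, _tokens = generator.stream()
    print(chunk, end="", flush=True)
    if eos:
        break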
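
To make the idea concrete, here is a toy, library-independent sketch of greedy speculative decoding. It is not ExLlamaV2's implementation: the two "models" are hypothetical deterministic functions that map a token sequence to the next token, and a real engine would verify all proposed positions in one batched forward pass rather than one call per position.

```python
def greedy_decode(next_token, prompt, num_tokens):
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding with a draft model proposing k tokens at a time."""
    seq = list(prompt)
    goal = len(prompt) + num_tokens
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. The target model checks each proposed position.
        accepted = []
        for i in range(k):
            expected = target_next(seq + proposal[:i])
            accepted.append(expected)
            if expected != proposal[i]:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # 3. Every proposal matched, so the target also contributes one extra token.
            accepted.append(target_next(seq + proposal))
        seq.extend(accepted)
    return seq[:goal]


# Hypothetical toy models over a 50-token vocabulary
def target_next(seq):
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq):
    # Agrees with the target except when the sequence length is a multiple of five.
    token = target_next(seq)
    return token if len(seq) % 5 else (token + 1) % 50


reference = greedy_decode(target_next, [1, 2, 3], 20)
speculated = speculative_decode(target_next, draft_next, [1, 2, 3], 20)
assert speculated == reference  # speculation changes speed, not the output
```

The final assertion captures the key property: under greedy decoding the result is identical to running the main model alone, and the gain comes from accepting several draft tokens per verification step.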

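Quantizing your own models

Converting to EXL2

# Note: the convert_model helper and its arguments are reproduced as given and may not
# exist in every ExLlamaV2 release; the convert.py script in the next section is the
# conversion path shipped in the repository.
from exllamav2.conversion import convert_model

# Source: a Hugging Face model directory
# Target: an EXL2-quantized model directory
convert_model(
    input_dir="./llama-3.1-8b-hf",
    output_dir="./llama-3.1-8b-exl2-4bpw",
    cal_dataset="wikitext",  # calibration dataset
    bits=4.0,  # target bits per weight
    head_bits=6,  # bits for the output (head) layer
)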

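Command line

# Run from the root of the cloned exllamav2 repository
# -o is a scratch working directory; -cf is where the finished model is written
python convert.py \
    -i ./llama-3.1-8b-hf \
    -o ./exl2-work \
    -cf ./llama-3.1-8b-exl2 \
    -b 4.0 \
    -hb 6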

Memory management

Cache allocation

# Fixed cache size, allocated immediately
cache = ExLlamaV2Cache(model, max_seq_len=4096)

# Lazy cache: allocation is deferred until the model is loaded with load_autosplit
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
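
To see why max_seq_len matters for memory, the cache size can be estimated from the model shape: one key and one value vector per layer, per KV head, per position. The sketch below assumes an FP16 cache (2 bytes per value) and uses the published Llama 2 7B shape (32 layers, 32 KV heads, head dimension 128); the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the key/value cache in bytes for the given model shape."""
    # Factor 2 covers the separate key and value tensors.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * max_seq_len * batch_size


# Llama 2 7B: 32 layers, 32 KV heads, head dimension 128, FP16 cache
for seq_len in (2048, 4096):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 1024**3
    print(f"max_seq_len={seq_len}: {gib:.1f} GiB")
```

Under these assumptions a 4096-token cache takes 2 GiB and halving max_seq_len halves it, which is why a smaller cache is the first thing to try when VRAM is tight.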

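Multi-GPU

config = ExLlamaV2Config()
config.model_dir = "./large-model"
config.prepare()

model = ExLlamaV2(config)

# Split across GPUs manually: VRAM budget in GB for each device
model.load(gpu_split=[20, 20])

# Alternatively, fill the available GPUs automatically:
# cache = ExLlamaV2Cache(model, lazy=True)
# model.load_autosplit(cache)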

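Performance comparison

| Model | Engine | GPU | Tokens/sec |
|---|---|---|---|
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | ~150 |
| Llama 3.1 8B | llama.cpp | RTX 3090 | ~100 |
| Llama 3.1 8B | vLLM | RTX 3090 | ~120 |
| Llama 3.1 8B | ExLlamaV2 | RTX 3090 | ~90 |
| Mixtral 8x7B | ExLlamaV2 | A100 | ~70 |

Figures are approximate; throughput depends on quantization level, context length, and batch size, which are not specified for these rows.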
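
Because such figures vary with hardware and settings, it can be worth measuring throughput on your own rented GPU. The helper below is a minimal sketch: `generate` stands for any callable that produces the requested number of tokens (for example a lambda wrapping `generate_simple`), and the `fake_generate` stand-in exists only so the snippet runs without a model.

```python
import time


def measure_tokens_per_second(generate, num_tokens, clock=time.perf_counter):
    """Time one call to generate(num_tokens) and return tokens per second."""
    start = clock()
    generate(num_tokens)
    elapsed = clock() - start
    return num_tokens / elapsed


# Stand-in for a real generator call; pretends each token takes about 1 ms.
def fake_generate(n):
    time.sleep(0.001 * n)


print(f"{measure_tokens_per_second(fake_generate, 50):.0f} tokens/s")
```

Run a short warm-up generation first when measuring a real model, since the first call often includes one-time initialization.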

Advanced settings

Sampling parameters

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_k = 50
settings.top_p = 0.9
settings.token_repetition_penalty = 1.1
settings.token_frequency_penalty = 0.0
settings.token_presence_penalty = 0.0
settings.mirostat = False
settings.mirostat_tau = 5.0
settings.mirostat_eta = 0.1

Batch generation

prompts = [
    "The meaning of life is",
    "Artificial intelligence will",
    "Climate change is"
]

# Process the prompts one after another with the same generator and settings
outputs = []
for prompt in prompts:
    output = generator.generate_simple(prompt, settings, num_tokens=100)
    outputs.append(output)

Troubleshooting

CUDA out of memory

# Use a smaller cache
cache = ExLlamaV2Cache(model, max_seq_len=2048)

# Or use a lower-bpw model (3.0 instead of 4.0)

Slow loading

# Enable fast loading (set before loading the model)
config.fasttensors = True

Model not found

# Check that the model files exist
ls ./model/

# Expected contents: config.json, *.safetensors, tokenizer.json

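LangChain integration

from typing import Any, List, Optional

from langchain_core.language_models.llms import LLM

class ExLlamaV2LLM(LLM):
    """Minimal LangChain wrapper around an ExLlamaV2 generator."""
    # Typed as Any so pydantic does not try to validate the ExLlamaV2 objects
    model: Any
    tokenizer: Any
    generator: Any
    settings: Any

    @property
    def _llm_type(self) -> str:
        return "exllamav2"

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Any = None, **kwargs: Any) -> str:
        # Note: stop sequences are not applied in this minimal wrapper
        output = self.generator.generate_simple(prompt, self.settings, num_tokens=500)
        # generate_simple echoes the prompt; return only the completion when it does
        return output[len(prompt):] if output.startswith(prompt) else output

# Usage (model, tokenizer, generator, and settings come from the basic generation example)
llm = ExLlamaV2LLM(model=model, tokenizer=tokenizer, generator=generator, settings=settings)
result = llm.invoke("What is quantum computing?")
print(result)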

Cost estimates

Typical CLORE.AI Marketplace rates (as of 2024):

| GPU | Hourly rate | Daily rate | 4-hour session |
|---|---|---|---|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.
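
The session and daily figures follow from the hourly rate, and combining a rate with a measured throughput gives a cost per million generated tokens. The sketch below shows that arithmetic with illustrative function names; the $0.06/hour rate and the 150 tokens/s throughput are the approximate RTX 3090 figures quoted on this page, not guaranteed values.

```python
def session_cost(hourly_rate_usd, hours):
    """Cost of renting one GPU for the given number of hours."""
    return hourly_rate_usd * hours


def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    """Rental cost of generating one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# RTX 3090 at roughly $0.06/hour
print(f"4-hour session: ${session_cost(0.06, 4):.2f}")
print(f"24 hours: ${session_cost(0.06, 24):.2f}")
print(f"per 1M tokens at 150 tok/s: ${cost_per_million_tokens(0.06, 150):.3f}")
```

A 4-hour session at that rate comes to about $0.24, in line with the rounded ~$0.25 in the table.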

Saving money:

- Use the Spot market for flexible workloads (often 30-50% cheaper)
- Pay with CLORE tokens
- Compare prices across providers

Next steps

- vLLM Inference - high-throughput serving
- llama.cpp Server - cross-platform
- Text Generation WebUI - web interface
