ExLlamaV2
Renting on CLORE.AI
Access Your Server
What is ExLlamaV2?
Requirements
Model Size | Min VRAM | Recommended
Quick Deploy
Accessing Your Service
Installation
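For a manual setup, ExLlamaV2 is published on PyPI; a typical install (assuming a CUDA-enabled PyTorch build is already present) looks like this:

```shell
# Install ExLlamaV2 from PyPI (requires a CUDA-enabled PyTorch already installed)
pip install exllamav2

# Quick sanity check that the package imports
python -c "import exllamav2; print('exllamav2 OK')"
```

Prebuilt wheels matched to specific CUDA/PyTorch versions are also published on the project's GitHub releases page if the PyPI build fails to compile.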
Download Models
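EXL2 quants are typically hosted on the Hugging Face Hub, often with one branch per bitrate. A sketch using the official `huggingface-cli` tool (the repository name and revision below are illustrative, not a specific recommendation):

```shell
# Install the Hub CLI, then fetch an EXL2 quant.
# Repo and revision are examples -- search the Hub for "exl2" repos.
pip install -U "huggingface_hub[cli]"
huggingface-cli download turboderp/Llama-3-8B-Instruct-exl2 \
    --revision 4.0bpw \
    --local-dir ./models/llama3-8b-exl2
```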
EXL2 Quantized Models
Bits Per Weight (bpw)
BPW | Quality | VRAM (7B)
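The weight footprint scales linearly with bpw: parameters times bits, divided by 8 to get bytes. A minimal sketch of that rule of thumb (`est_weight_vram_gb` is a hypothetical helper; real usage also grows with context length and cache type):

```python
def est_weight_vram_gb(params_billion: float, bpw: float, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for EXL2 weights: params * bits / 8 gives GB of
    weights; overhead_gb is a flat allowance for activations and CUDA context.
    Illustrative only -- the KV cache adds more on top at long contexts."""
    weight_gb = params_billion * bpw / 8  # 1e9 params * bpw bits -> GB of weights
    return weight_gb + overhead_gb

# A 7B model at 4.0 bpw: ~3.5 GB of weights plus overhead
print(round(est_weight_vram_gb(7, 4.0), 1))  # -> 5.0
```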
Python API
Basic Generation
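A minimal generation sketch following the pattern in ExLlamaV2's own examples (the model path is a placeholder; this needs a CUDA GPU and a downloaded EXL2 model to run):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at a directory containing an EXL2 model (placeholder path)
config = ExLlamaV2Config("/models/llama3-8b-exl2")
model = ExLlamaV2(config)

# Lazy cache + autosplit lets the loader place weights across available VRAM
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

output = generator.generate_simple("Explain EXL2 quantization briefly:",
                                   settings, num_tokens=200)
print(output)
```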
Streaming Generation
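For token-by-token output, ExLlamaV2 provides a streaming generator. A sketch of the streaming loop, assuming `model`, `cache`, and `tokenizer` are set up as in the basic example above (exact method names can differ between versions, so verify against the examples shipped with your installed release):

```python
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# model, cache, tokenizer: loaded as in the basic generation example
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

input_ids = tokenizer.encode("Explain EXL2 quantization in one sentence.")
generator.begin_stream_ex(input_ids, settings)

# Emit decoded text chunks as they are generated, up to a token budget
for _ in range(200):
    res = generator.stream_ex()
    print(res["chunk"], end="", flush=True)
    if res["eos"]:
        break
```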
Chat Format
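Chat models expect their prompt rendered in a specific template. A sketch for ChatML, which many instruct-tuned EXL2 models use (`chatml_prompt` is a hypothetical helper; Llama-family models use a different template, so always check the model card):

```python
def chatml_prompt(messages):
    """Render a list of {role, content} dicts into a ChatML prompt string,
    ending with an open assistant turn for the model to complete."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```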
Server Mode
Start Server
API Usage
Chat Completions
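Servers built on ExLlamaV2 (such as TabbyAPI) expose the OpenAI chat-completions convention, so any OpenAI-style client works. A stdlib-only sketch; the host, port, model name, and API key are placeholders that depend on your server configuration:

```python
import json
import urllib.request

# OpenAI-style chat completions payload (model name is a placeholder)
payload = {
    "model": "my-exl2-model",
    "messages": [
        {"role": "user", "content": "Say hello in one word."}
    ],
    "max_tokens": 32,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:5000/v1/chat/completions",  # port depends on server config
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",   # if your server enforces keys
    },
)

# Uncomment to send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```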
TabbyAPI (Recommended Server)
TabbyAPI Features
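A typical way to get TabbyAPI running, sketched from its public repository layout (script and config file names follow its README; verify against the current docs before relying on them):

```shell
# Clone and launch TabbyAPI, an OpenAI-compatible server built on ExLlamaV2.
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Copy the sample config and point it at your model directory
cp config_sample.yml config.yml   # edit: set model_dir and model_name

# The start script creates a venv and installs dependencies on first run
./start.sh
```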
Speculative Decoding
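Speculative decoding pairs the main model with a small draft model that proposes tokens for the large model to verify, trading a little VRAM for higher throughput. An illustrative TabbyAPI config fragment (key names vary between versions, so check `config_sample.yml` in your checkout; the draft model name is a placeholder and must share the main model's tokenizer):

```yaml
# Illustrative fragment only -- verify key names against your TabbyAPI version
draft_model:
  draft_model_dir: models
  draft_model_name: my-small-draft-exl2   # placeholder; must match main model's vocabulary
```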
Quantize Your Own Models
Convert to EXL2
Command Line
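The ExLlamaV2 repository ships a `convert.py` script for producing EXL2 quants from an FP16 Hugging Face model. A sketch of the documented invocation (paths are placeholders; run from the repo root, and expect the measurement pass to take a while):

```shell
# -i: source FP16 model dir, -o: scratch working dir,
# -cf: final output dir, -b: target bits per weight
python convert.py \
    -i /models/Llama-3-8B-Instruct \
    -o /tmp/exl2-work \
    -cf /models/Llama-3-8B-Instruct-4.0bpw \
    -b 4.0
```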
Memory Management
Cache Allocation
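Beyond the weights, the KV cache grows linearly with context length, so long contexts can dominate VRAM. A rough estimator (`kv_cache_bytes` is a hypothetical helper; ExLlamaV2 also offers quantized caches such as `ExLlamaV2Cache_Q4`, which cut the per-element cost to roughly a quarter of FP16):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float = 2.0) -> int:
    """Approximate KV-cache size: one K and one V tensor per layer.
    bytes_per_elem: 2.0 for an FP16 cache, ~0.5 for a Q4 cache."""
    return int(2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem)

# A Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128) at 4096 ctx, FP16:
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30, "GiB")  # -> 2.0 GiB
```

Models with grouped-query attention (fewer KV heads than attention heads) shrink this figure proportionally.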
Multi-GPU
Performance Comparison
Model | Engine | GPU | Tokens/sec
Advanced Settings
Sampling Parameters
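The common sampling knobs live on `ExLlamaV2Sampler.Settings`. A sketch of typical values (the numbers are illustrative starting points, not tuned recommendations):

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7                # randomness; lower = more deterministic
settings.top_k = 50                       # keep only the 50 most likely tokens
settings.top_p = 0.9                      # nucleus sampling cutoff
settings.min_p = 0.05                     # drop tokens below 5% of the top token's probability
settings.token_repetition_penalty = 1.1   # discourage repeating recent tokens
```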
Batch Generation
Troubleshooting
CUDA Out of Memory
Slow Loading
Model Not Found
Integration with LangChain
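Since TabbyAPI speaks the OpenAI protocol, the simplest LangChain integration is to point LangChain's OpenAI chat client at the local endpoint. A sketch assuming the `langchain-openai` package; host, port, key, and model name are placeholders:

```python
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI client at the local OpenAI-compatible server
llm = ChatOpenAI(
    base_url="http://localhost:5000/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",               # placeholder; some servers skip auth
    model="my-exl2-model",                # placeholder model name
)

# Uncomment with a running server:
# print(llm.invoke("One-sentence summary of EXL2 quantization?").content)
```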
Cost Estimate
GPU | Hourly Rate | Daily Rate | 4-Hour Session
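Rental cost is simply rate times duration. A trivial sketch for budgeting (`session_cost` is a hypothetical helper, and the $0.40/h rate below is an arbitrary example; check live CLORE.AI listings for real prices):

```python
def session_cost(hourly_rate: float, hours: float) -> float:
    """Linear rental cost in the listing's currency, rounded to cents."""
    return round(hourly_rate * hours, 2)

print(session_cost(0.40, 4))   # 4-hour session at a hypothetical $0.40/h -> 1.6
print(session_cost(0.40, 24))  # one full day at the same rate -> 9.6
```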
Next Steps