# Mistral.rs

**Blazing-fast LLM inference written in Rust** — production-ready server with GGUF, GGML, SafeTensors support and OpenAI-compatible API.

> 🦀 **Built in Rust** for maximum performance | GGUF & vision model support | Apache-2.0 License

***

## What is Mistral.rs?

Mistral.rs is a high-performance LLM inference engine written entirely in **Rust**. Originally focused on Mistral models, it now supports the full landscape of modern LLMs. The Rust foundation provides:

* **Zero-cost abstractions** — high-level Rust compiles to tight native code, with no garbage-collection pauses during inference
* **Memory safety** — no null-pointer dereferences or use-after-free bugs
* **Deterministic performance** — consistent latency without JVM/Python overhead
* **Compile-time optimizations** — SIMD, threading, and GPU kernels optimized at build time

### Key Features

* **GGUF support** — run pre-quantized GGUF models (Q4\_K\_M, Q8\_0, etc.)
* **ISQ (In-Situ Quantization)** — quantize on the fly at load time
* **PagedAttention** — efficient KV cache with continuous batching
* **Vision Language Models** — LLaVA, Phi-3 Vision, Idefics support
* **Speculative decoding** — faster inference with draft models
* **X-LoRA** — scalable fine-tuned adapter support
* **OpenAI-compatible REST API** — drop-in replacement

### Supported Model Families

| Family          | Format            | Engine    |
| --------------- | ----------------- | --------- |
| Llama 2/3       | GGUF, SafeTensors | Rust CUDA |
| Mistral/Mixtral | GGUF, SafeTensors | Rust CUDA |
| Phi-2/3         | GGUF, SafeTensors | Rust CUDA |
| Gemma           | GGUF, SafeTensors | Rust CUDA |
| Qwen 2          | GGUF, SafeTensors | Rust CUDA |
| Starcoder 2     | GGUF              | Rust CUDA |
| LLaVA 1.5/1.6   | SafeTensors       | Vision    |
| Phi-3 Vision    | SafeTensors       | Vision    |

***

## Quick Start on Clore.ai

### Step 1: Find a GPU Server

On [clore.ai](https://clore.ai) marketplace:

* **Minimum:** 8GB VRAM (for 7B Q4 models)
* **Recommended:** RTX 3090/4090 (24GB) for larger models
* CUDA 11.8+ required

### Step 2: Deploy Mistral.rs Docker

```
Docker Image: ghcr.io/ericlbuehler/mistral.rs:cuda
```

**Port mappings:**

| Container Port | Purpose         |
| -------------- | --------------- |
| `22`           | SSH access      |
| `8080`         | REST API server |

**Available image variants:**

```bash
# CUDA (most Clore.ai servers)
ghcr.io/ericlbuehler/mistral.rs:cuda

# CPU only
ghcr.io/ericlbuehler/mistral.rs:cpu

# Metal (Apple Silicon - not for Clore.ai)
ghcr.io/ericlbuehler/mistral.rs:metal
```
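
Outside the Clore.ai marketplace UI, the same image can also be launched directly with Docker. The following is a minimal sketch that mirrors the Docker Compose example further down this page (the model, flags, and cache mount are illustrative; if the image already sets `mistralrs-server` as its entrypoint, drop the leading binary name):

```bash
# Sketch: run the CUDA image with GPU access, API on 8080, and a persistent HF cache
docker run --gpus all -p 8080:8080 \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/ericlbuehler/mistral.rs:cuda \
  mistralrs-server --port 8080 --host 0.0.0.0 \
  --token-source env:HF_TOKEN \
  plain -m meta-llama/Meta-Llama-3-8B-Instruct --isq Q4K
```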

### Step 3: Connect and Verify

```bash
ssh root@<clore-node-ip> -p <ssh-port>

# Check mistral.rs binary
mistralrs-server --help
```
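
It is also worth confirming that the node actually has the GPU and driver you rented (the Step 1 requirements):

```bash
# GPU model, VRAM, driver version, and the highest CUDA version the driver supports
nvidia-smi

# Compact view: just the GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```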

***

## Running the Server

### Quick Start with GGUF Model

```bash
# Serve a GGUF model directly from HuggingFace
mistralrs-server \
  --port 8080 \
  --log info \
  gguf \
  -m TheBloke/Llama-2-7B-Chat-GGUF \
  -f llama-2-7b-chat.Q4_K_M.gguf
```

### Serve Mistral 7B (SafeTensors)

```bash
mistralrs-server \
  --port 8080 \
  plain \
  -m mistralai/Mistral-7B-Instruct-v0.3 \
  --isq Q4K
```

### Serve with In-Situ Quantization (ISQ)

ISQ quantizes the model at load time — no pre-quantized model needed:

```bash
# Load Llama 3 8B and quantize to Q4K on the fly
mistralrs-server \
  --port 8080 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K

# Available ISQ options:
# Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
# Q2K, Q3K, Q4K, Q5K, Q6K, Q8K
# HQQ4, HQQ8 (Half-Quadratic Quantization)
```

### Vision Language Model

```bash
mistralrs-server \
  --port 8080 \
  vision-plain \
  -m llava-hf/llava-1.5-7b-hf \
  --isq Q4K
```

### Speculative Decoding

```bash
# Use a small draft model to speed up generation
mistralrs-server \
  --port 8080 \
  speculative \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K \
  -d meta-llama/Llama-3.2-1B-Instruct \
  --draft-isq Q4K \
  -n 5  # Speculative tokens
```

{% hint style="success" %}
**Speculative decoding** can provide **2–3x speedup** for most conversational workloads where the small draft model accurately predicts the next tokens.
{% endhint %}

***

## API Usage

### OpenAI-Compatible Endpoints

| Endpoint                 | Method | Description              |
| ------------------------ | ------ | ------------------------ |
| `/v1/chat/completions`   | POST   | Chat completions         |
| `/v1/completions`        | POST   | Text completions         |
| `/v1/models`             | GET    | List models              |
| `/v1/images/generations` | POST   | Image generation (diffusion models) |
| `/v1/re_isq`             | POST   | Re-quantize loaded model |
| `/health`                | GET    | Health check             |

### Python Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"  # No auth required by default
)

# Chat completion
response = client.chat.completions.create(
    model="llama-3-8b",  # Model name is flexible
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ],
    temperature=0.1,  # Low temp for code generation
    max_tokens=1024
)
print(response.choices[0].message.content)
```

### Streaming Response

```python
with client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Tell me a story about a robot."}],
    stream=True,
    max_tokens=512
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)
print()
```

### Vision/Image Input

```python
import base64
from pathlib import Path

# Load image
image_data = base64.b64encode(Path("photo.jpg").read_bytes()).decode()

response = client.chat.completions.create(
    model="llava-1.5-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "What do you see in this image?"
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
```

### cURL Examples

```bash
# Basic chat
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "What is Rust?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'

# List models
curl http://localhost:8080/v1/models

# Health check
curl http://localhost:8080/health
```

***

## Configuration Options

### Server Flags

```bash
# Flag reference:
#   --port             API port (default: 1234)
#   --host             Bind address
#   --log              Log level: off/error/warn/info/debug/trace
#   --token-source     HuggingFace token source
#   --max-seqs         Maximum concurrent sequences
#   --no-paged-attn    Disable PagedAttention (useful for debugging)
#   --prefix-cache-n   Prefix cache entries
#   plain              Model type subcommand
mistralrs-server \
  --port 8080 \
  --host 0.0.0.0 \
  --log info \
  --token-source env:HF_TOKEN \
  --max-seqs 16 \
  --no-paged-attn \
  --prefix-cache-n 16 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

### ISQ Quantization Reference

| ISQ Option | Bits | Quality | VRAM (7B) |
| ---------- | ---- | ------- | --------- |
| `Q2K`      | 2    | ★★☆☆☆   | \~2.5GB   |
| `Q3K`      | 3    | ★★★☆☆   | \~3.5GB   |
| `Q4_0`     | 4    | ★★★★☆   | \~4.5GB   |
| `Q4K`      | 4    | ★★★★☆   | \~4.5GB   |
| `Q5K`      | 5    | ★★★★★   | \~5.5GB   |
| `Q6K`      | 6    | ★★★★★   | \~6.5GB   |
| `Q8_0`     | 8    | ★★★★★   | \~8GB     |
| `HQQ4`     | 4    | ★★★★☆   | \~4.5GB   |
| `HQQ8`     | 8    | ★★★★★   | \~8GB     |

{% hint style="info" %}
**HQQ (Half-Quadratic Quantization)** often achieves better quality than GGUF Q4 at the same bit level, especially for instruction-following tasks.
{% endhint %}
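
Selecting HQQ uses the same `--isq` flag as the other options; a minimal sketch (the model repo is just an example):

```bash
mistralrs-server \
  --port 8080 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq HQQ4
```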

***

## Advanced Features

### X-LoRA (Mixture of LoRA Adapters)

Run multiple fine-tuned adapters dynamically selected per token:

```bash
mistralrs-server \
  --port 8080 \
  x-lora-plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K \
  -x ./xlora-config.json
```

### Re-Quantize at Runtime

```bash
# Change quantization without restarting
curl http://localhost:8080/v1/re_isq \
  -H "Content-Type: application/json" \
  -d '{"isq_type": "Q8_0"}'
```

### Request Logging

```bash
# Enable request logging to file
mistralrs-server \
  --port 8080 \
  --log info \
  --request-logging-file ./requests.jsonl \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

***

## Performance Tuning

### Optimize for Throughput

```bash
# Higher max-seqs for concurrent requests
mistralrs-server \
  --port 8080 \
  --max-seqs 32 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

### Optimize for Low Latency

```bash
# Lower max-seqs, disable prefix cache sharing
mistralrs-server \
  --port 8080 \
  --max-seqs 4 \
  --prefix-cache-n 0 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

### Monitor Performance

```bash
# Watch GPU usage during inference
watch -n 1 nvidia-smi

# Monitor per-process GPU usage with nvtop
apt-get install -y nvtop && nvtop
```
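
For a rough end-to-end throughput number, time a single request and divide the completion tokens by the elapsed seconds. A minimal sketch, assuming the server returns an OpenAI-style `usage` block, `jq` is installed, and the request takes at least one second:

```bash
# Rough tokens/sec from one request (integer math)
START=$(date +%s)
TOKENS=$(curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "messages": [{"role": "user", "content": "Write 300 words about Rust."}], "max_tokens": 512}' \
  | jq '.usage.completion_tokens')
END=$(date +%s)
echo "~$(( TOKENS / (END - START) )) tok/s"
```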

***

## Docker Compose

```yaml
version: '3.8'
services:
  mistral-rs:
    image: ghcr.io/ericlbuehler/mistral.rs:cuda
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    ports:
      - "8080:8080"
    volumes:
      - hf-cache:/root/.cache/huggingface
    command: >
      mistralrs-server
      --port 8080
      --host 0.0.0.0
      --log info
      --max-seqs 16
      --token-source env:HF_TOKEN
      plain
      -m meta-llama/Meta-Llama-3-8B-Instruct
      --isq Q4K
    restart: unless-stopped

volumes:
  hf-cache:
```
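
Assuming the file above is saved as `docker-compose.yml`, bring the stack up and smoke-test it:

```bash
docker compose up -d

# Follow the logs while the model downloads and quantizes
docker compose logs -f mistral-rs

# Verify the API is listening
curl http://localhost:8080/health
```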

***

## Building from Source

If the Docker image doesn't match your CUDA version:

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Clone and build
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

# Build with CUDA support
cargo build --release --features cuda

# Binary location
./target/release/mistralrs-server --help
```

{% hint style="warning" %}
**Build time:** Rust compilation is slow. Expect 10–20 minutes for a full build. Use `sccache` to speed up incremental builds: `cargo install sccache && RUSTC_WRAPPER=sccache cargo build --release --features cuda`
{% endhint %}

***

## Troubleshooting

### CUDA Library Not Found

```bash
# Check CUDA libraries
ldconfig -p | grep libcuda
ls /usr/local/cuda/lib64/

# Set library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

### Model Download Fails

```bash
# Set HuggingFace token
export HF_TOKEN=your_token_here

# Or use --token-source flag
mistralrs-server \
  --token-source env:HF_TOKEN \
  ...

# Or download manually first
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama3-8b
mistralrs-server ... plain -m ./llama3-8b --isq Q4K
```

### Port 8080 In Use

```bash
# Find and kill process
fuser -k 8080/tcp

# Use different port
mistralrs-server --port 9090 ...
```

### Out of Memory During Quantization

```bash
# ISQ quantizes on GPU — reduce other GPU usage first
# Or switch to GGUF (pre-quantized, lower peak memory)
mistralrs-server \
  gguf \
  -m TheBloke/Llama-2-7B-Chat-GGUF \
  -f llama-2-7b-chat.Q4_K_M.gguf
```

{% hint style="danger" %}
**ISQ vs GGUF:** ISQ quantizes at load time using GPU memory (temporary spike). If you're tight on VRAM, use pre-quantized GGUF files from TheBloke or similar — they use lower peak memory during loading.
{% endhint %}
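
Before retrying, check how much VRAM is actually free:

```bash
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```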

***

## Clore.ai GPU Recommendations

Mistral.rs is a Rust-native engine — its low overhead means you get more throughput per GPU dollar vs Python-based servers.

| GPU       | VRAM  | Clore.ai Price | Recommended Use                              | Throughput (Mistral 7B Q4) |
| --------- | ----- | -------------- | -------------------------------------------- | -------------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | Best budget option — 7B Q4/Q8, vision models | \~120 tok/s                |
| RTX 4090  | 24 GB | \~$0.70/hr     | High-throughput 7B–34B, speculative decoding | \~200 tok/s                |
| A100 40GB | 40 GB | \~$1.20/hr     | Production 34B–70B Q4 serving                | \~160 tok/s                |
| A100 80GB | 80 GB | \~$2.00/hr     | Full-precision 70B, multi-model              | \~185 tok/s                |

**Why RTX 3090 excels here:** Mistral.rs's Rust CUDA kernels avoid Python GIL overhead and garbage collection pauses that hurt Python servers. An RTX 3090 running Mistral 7B Q4\_K\_M delivers \~120 tok/s — comparable to vLLM on the same hardware at a fraction of the cost (\~$0.12/hr vs cloud providers charging $1–2/hr).

**Speculative decoding:** Pair a large model (34B) with a small draft model (3B) for 2–3× speedup with no quality loss. RTX 4090 is ideal for this pattern.
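
The pairing uses the same `speculative` subcommand shown in the Running the Server section; a sketch with placeholder model IDs (the draft model must share a tokenizer with the target):

```bash
mistralrs-server \
  --port 8080 \
  speculative \
  -m <target-model> \
  --isq Q4K \
  -d <draft-model> \
  --draft-isq Q4K \
  -n 5
```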

***

## Resources

* 🐙 **GitHub:** [github.com/EricLBuehler/mistral.rs](https://github.com/EricLBuehler/mistral.rs)
* 📦 **Container Registry:** [ghcr.io/ericlbuehler/mistral.rs](https://ghcr.io/ericlbuehler/mistral.rs)
* 📚 **Documentation:** [ericlbuehler.github.io/mistral.rs](https://ericlbuehler.github.io/mistral.rs/mistralrs/)
* 💬 **Discord:** [discord.gg/SZrecqK8qw](https://discord.gg/SZrecqK8qw)
* 🤗 **GGUF Models:** [huggingface.co/TheBloke](https://huggingface.co/TheBloke)
