# Mistral.rs

**Blazing-fast LLM inference written in Rust** — a production-ready server with GGUF, GGML, and SafeTensors support and an OpenAI-compatible API.

> 🦀 **Built in Rust** for maximum performance | GGUF & vision model support | Apache-2.0 License

***

## What is Mistral.rs?

Mistral.rs is a high-performance LLM inference engine written entirely in **Rust**. Originally focused on Mistral models, it now supports the full landscape of modern LLMs. The Rust foundation provides:

* **No garbage collector** — no GC pauses during inference
* **Memory safety** — no null-pointer dereferences, use-after-free bugs, or data races
* **Predictable performance** — consistent latency without interpreter or JIT warm-up overhead
* **Compile-time optimization** — SIMD, threading, and GPU kernels compiled and specialized at build time

### Key Features

* **GGUF support** — run any quantized model (Q4\_K\_M, Q8\_0, etc.)
* **ISQ (In-Situ Quantization)** — quantize on the fly at load time
* **PagedAttention** — efficient KV cache with continuous batching
* **Vision Language Models** — LLaVA, Phi-3 Vision, Idefics support
* **Speculative decoding** — faster inference with draft models
* **X-LoRA** — scalable fine-tuned adapter support
* **OpenAI-compatible REST API** — drop-in replacement

### Supported Model Families

| Family          | Format            | Engine    |
| --------------- | ----------------- | --------- |
| Llama 2/3       | GGUF, SafeTensors | Rust CUDA |
| Mistral/Mixtral | GGUF, SafeTensors | Rust CUDA |
| Phi-2/3         | GGUF, SafeTensors | Rust CUDA |
| Gemma           | GGUF, SafeTensors | Rust CUDA |
| Qwen 2          | GGUF, SafeTensors | Rust CUDA |
| Starcoder 2     | GGUF              | Rust CUDA |
| LLaVA 1.5/1.6   | SafeTensors       | Vision    |
| Phi-3 Vision    | SafeTensors       | Vision    |

***

## Quick Start on Clore.ai

### Step 1: Find a GPU Server

On [clore.ai](https://clore.ai) marketplace:

* **Minimum:** 8GB VRAM (for 7B Q4 models; see the rough estimate below)
* **Recommended:** RTX 3090/4090 (24GB) for larger models
* CUDA 11.8+ required
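
As a rough rule of thumb (weights only, ignoring KV cache growth with context length), the VRAM needed is about the parameter count in billions times the bits per weight divided by eight, plus a couple of gigabytes of headroom. A quick sketch of that arithmetic, with illustrative numbers:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weight bytes plus fixed headroom for KV cache and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params * bytes per param
    return weight_gb + overhead_gb

print(estimate_vram_gb(7, 4.5))    # 7B at ~Q4: ~5.9 GB, fits an 8 GB card
print(estimate_vram_gb(34, 4.5))   # 34B at ~Q4: ~21 GB, needs a 24 GB card
```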

### Step 2: Deploy Mistral.rs Docker

```
Docker Image: ghcr.io/ericlbuehler/mistral.rs:cuda
```

**Port mappings:**

| Container Port | Purpose         |
| -------------- | --------------- |
| `22`           | SSH access      |
| `8080`         | REST API server |

**Available image variants:**

```bash
# CUDA (most Clore.ai servers)
ghcr.io/ericlbuehler/mistral.rs:cuda

# CPU only
ghcr.io/ericlbuehler/mistral.rs:cpu

# Metal (Apple Silicon - not for Clore.ai)
ghcr.io/ericlbuehler/mistral.rs:metal
```

### Step 3: Connect and Verify

```bash
ssh root@<clore-node-ip> -p <ssh-port>

# Check mistral.rs binary
mistralrs-server --help
```

***

## Running the Server

### Quick Start with GGUF Model

```bash
# Serve a GGUF model directly from HuggingFace
mistralrs-server \
  --port 8080 \
  --log info \
  gguf \
  -m TheBloke/Llama-2-7B-Chat-GGUF \
  -f llama-2-7b-chat.Q4_K_M.gguf
```

### Serve Mistral 7B (SafeTensors)

```bash
mistralrs-server \
  --port 8080 \
  plain \
  -m mistralai/Mistral-7B-Instruct-v0.3 \
  --isq Q4K
```

### Serve with In-Situ Quantization (ISQ)

ISQ quantizes the model at load time — no pre-quantized model needed:

```bash
# Load Llama 3 8B and quantize to Q4K on the fly
mistralrs-server \
  --port 8080 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K

# Available ISQ options:
# Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
# Q2K, Q3K, Q4K, Q5K, Q6K, Q8K
# HQQ4, HQQ8 (Half-Quadratic Quantization)
```

### Vision Language Model

```bash
mistralrs-server \
  --port 8080 \
  vision-plain \
  -m llava-hf/llava-1.5-7b-hf \
  --isq Q4K
```

### Speculative Decoding

```bash
# Use a small draft model to speed up generation
mistralrs-server \
  --port 8080 \
  speculative \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K \
  -d meta-llama/Llama-3.2-1B-Instruct \
  --draft-isq Q4K \
  -n 5  # Speculative tokens
```

{% hint style="success" %}
**Speculative decoding** can provide up to a **2–3x speedup** on conversational workloads where the small draft model's predictions are accepted often.
{% endhint %}

***

## API Usage

### OpenAI-Compatible Endpoints

| Endpoint                 | Method | Description              |
| ------------------------ | ------ | ------------------------ |
| `/v1/chat/completions`   | POST   | Chat completions         |
| `/v1/completions`        | POST   | Text completions         |
| `/v1/models`             | GET    | List models              |
| `/v1/images/generations` | POST   | Image generation (diffusion models) |
| `/v1/re_isq`             | POST   | Re-quantize loaded model |
| `/health`                | GET    | Health check             |

### Python Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<api-port>/v1",
    api_key="none"  # No auth required by default
)

# Chat completion
response = client.chat.completions.create(
    model="llama-3-8b",  # Model name is flexible
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ],
    temperature=0.1,  # Low temp for code generation
    max_tokens=1024
)
print(response.choices[0].message.content)
```

### Streaming Response

```python
# Streaming returns chunks as they are generated
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Tell me a story about a robot."}],
    stream=True,
    max_tokens=512
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```

### Vision/Image Input

```python
import base64
from pathlib import Path

# Load image
image_data = base64.b64encode(Path("photo.jpg").read_bytes()).decode()

response = client.chat.completions.create(
    model="llava-1.5-7b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "What do you see in this image?"
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
```

### cURL Examples

```bash
# Basic chat
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "What is Rust?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'

# List models
curl http://localhost:8080/v1/models

# Health check
curl http://localhost:8080/health
```
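
On first start the server may spend several minutes downloading weights and running ISQ, so it is worth polling `/health` before sending traffic. A minimal readiness check (standard library only; adjust the base URL to your node and port):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str = "http://localhost:8080", timeout_s: float = 600.0) -> bool:
    """Poll /health until the server answers 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not up yet: likely still downloading or quantizing the model
        time.sleep(5)
    return False

if wait_for_server():
    print("mistral.rs is ready")
```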

***

## Configuration Options

### Server Flags

```bash
# Flag reference:
#   --port             API port (default: 1234)
#   --host             Bind address
#   --log              Log level: off/error/warn/info/debug/trace
#   --token-source     HuggingFace token source (e.g. env:HF_TOKEN)
#   --max-seqs         Maximum concurrent sequences
#   --no-paged-attn    Disable PagedAttention (useful for debugging)
#   --prefix-cache-n   Number of prefix cache entries
# The model type subcommand (here: plain) and its options come last.
mistralrs-server \
  --port 8080 \
  --host 0.0.0.0 \
  --log info \
  --token-source env:HF_TOKEN \
  --max-seqs 16 \
  --prefix-cache-n 16 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

### ISQ Quantization Reference

| ISQ Option | Bits | Quality | VRAM (7B) |
| ---------- | ---- | ------- | --------- |
| `Q2K`      | 2    | ★★☆☆☆   | \~2.5GB   |
| `Q3K`      | 3    | ★★★☆☆   | \~3.5GB   |
| `Q4_0`     | 4    | ★★★★☆   | \~4.5GB   |
| `Q4K`      | 4    | ★★★★☆   | \~4.5GB   |
| `Q5K`      | 5    | ★★★★★   | \~5.5GB   |
| `Q6K`      | 6    | ★★★★★   | \~6.5GB   |
| `Q8_0`     | 8    | ★★★★★   | \~8GB     |
| `HQQ4`     | 4    | ★★★★☆   | \~4.5GB   |
| `HQQ8`     | 8    | ★★★★★   | \~8GB     |

{% hint style="info" %}
**HQQ (Half-Quadratic Quantization)** often achieves better quality than GGUF Q4 at the same bit level, especially for instruction-following tasks.
{% endhint %}
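
As a quick illustration of how to read the table, here is a small helper that picks the highest-quality ISQ level fitting a VRAM budget for a ~7B model. The footprints are the approximate values from the table above; real usage also depends on context length and batch size:

```python
# Approximate 7B weight footprints per ISQ level, taken from the table above (GB).
ISQ_VRAM_GB_7B = {"Q2K": 2.5, "Q3K": 3.5, "Q4K": 4.5, "Q5K": 5.5, "Q6K": 6.5, "Q8_0": 8.0}

def pick_isq(vram_budget_gb: float, headroom_gb: float = 2.0) -> str:
    """Return the highest-quality level whose footprint plus KV-cache headroom fits the budget."""
    fitting = {k: v for k, v in ISQ_VRAM_GB_7B.items() if v + headroom_gb <= vram_budget_gb}
    if not fitting:
        raise ValueError("not enough VRAM for any listed ISQ level")
    return max(fitting, key=fitting.get)

print(pick_isq(8.0))   # -> Q5K on an 8 GB card
print(pick_isq(24.0))  # -> Q8_0 with plenty of headroom
```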

***

## Advanced Features

### X-LoRA (Mixture of LoRA Adapters)

Run multiple fine-tuned adapters dynamically selected per token:

```bash
mistralrs-server \
  --port 8080 \
  x-lora-plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K \
  -x ./xlora-config.json
```

### Re-Quantize at Runtime

```bash
# Change quantization without restarting
curl http://localhost:8080/v1/re_isq \
  -H "Content-Type: application/json" \
  -d '{"isq_type": "Q8_0"}'
```

### Request Logging

```bash
# Enable request logging to file
mistralrs-server \
  --port 8080 \
  --log info \
  --request-logging-file ./requests.jsonl \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

***

## Performance Tuning

### Optimize for Throughput

```bash
# Higher max-seqs for concurrent requests
mistralrs-server \
  --port 8080 \
  --max-seqs 32 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

### Optimize for Low Latency

```bash
# Lower max-seqs, disable prefix cache sharing
mistralrs-server \
  --port 8080 \
  --max-seqs 4 \
  --prefix-cache-n 0 \
  plain \
  -m meta-llama/Meta-Llama-3-8B-Instruct \
  --isq Q4K
```

### Monitor Performance

```bash
# Watch GPU usage during inference
watch -n 1 nvidia-smi

# Profile with nvtop
apt-get install nvtop && nvtop
```
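
`nvidia-smi` shows utilization but not end-to-end speed. For a rough client-side throughput number, you can time a streamed completion and count content chunks. This reuses the OpenAI client from the API examples; one chunk per token is typical but not guaranteed, so treat the result as an estimate:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # adjust to your node/port

# Time a streamed completion and count generated chunks to approximate decode throughput.
start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Explain the borrow checker in Rust."}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        tokens += 1  # one chunk per token is typical, but not guaranteed
elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tok/s (approximate, includes prompt processing time)")
```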

***

## Docker Compose

```yaml
version: '3.8'
services:
  mistral-rs:
    image: ghcr.io/ericlbuehler/mistral.rs:cuda
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    ports:
      - "8080:8080"
    volumes:
      - hf-cache:/root/.cache/huggingface
    command: >
      mistralrs-server
      --port 8080
      --host 0.0.0.0
      --log info
      --max-seqs 16
      --token-source env:HF_TOKEN
      plain
      -m meta-llama/Meta-Llama-3-8B-Instruct
      --isq Q4K
    restart: unless-stopped

volumes:
  hf-cache:
```

***

## Building from Source

If the Docker image doesn't match your CUDA version:

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Clone and build
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

# Build with CUDA support
cargo build --release --features cuda

# Binary location
./target/release/mistralrs-server --help
```

{% hint style="warning" %}
**Build time:** Rust compilation is slow. Expect 10–20 minutes for a full build. Use `sccache` to speed up incremental builds: `cargo install sccache && RUSTC_WRAPPER=sccache cargo build --release --features cuda`
{% endhint %}

***

## Troubleshooting

### CUDA Library Not Found

```bash
# Check CUDA libraries
ldconfig -p | grep libcuda
ls /usr/local/cuda/lib64/

# Set library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

### Model Download Fails

```bash
# Set HuggingFace token
export HF_TOKEN=your_token_here

# Or use --token-source flag
mistralrs-server \
  --token-source env:HF_TOKEN \
  ...

# Or download manually first
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama3-8b
mistralrs-server ... plain -m ./llama3-8b --isq Q4K
```

### Port 8080 In Use

```bash
# Find and kill process
fuser -k 8080/tcp

# Use different port
mistralrs-server --port 9090 ...
```

### Out of Memory During Quantization

```bash
# ISQ quantizes on GPU — reduce other GPU usage first
# Or switch to GGUF (pre-quantized, lower peak memory)
mistralrs-server \
  gguf \
  -m TheBloke/Llama-2-7B-Chat-GGUF \
  -f llama-2-7b-chat.Q4_K_M.gguf
```

{% hint style="danger" %}
**ISQ vs GGUF:** ISQ quantizes at load time using GPU memory (temporary spike). If you're tight on VRAM, use pre-quantized GGUF files from TheBloke or similar — they use lower peak memory during loading.
{% endhint %}

***

## Clore.ai GPU Recommendations

Mistral.rs is a Rust-native engine — its low overhead means you get more throughput per GPU dollar vs Python-based servers.

| GPU       | VRAM  | Clore.ai Price | Recommended Use                              | Throughput (Mistral 7B Q4) |
| --------- | ----- | -------------- | -------------------------------------------- | -------------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | Best budget option — 7B Q4/Q8, vision models | \~120 tok/s                |
| RTX 4090  | 24 GB | \~$0.70/hr     | High-throughput 7B–34B, speculative decoding | \~200 tok/s                |
| A100 40GB | 40 GB | \~$1.20/hr     | Production 34B–70B Q4 serving                | \~160 tok/s                |
| A100 80GB | 80 GB | \~$2.00/hr     | Full-precision 70B, multi-model              | \~185 tok/s                |

**Why RTX 3090 excels here:** Mistral.rs's Rust CUDA kernels avoid the Python GIL overhead and garbage collection pauses that hurt Python servers. An RTX 3090 running Mistral 7B Q4\_K\_M delivers \~120 tok/s — comparable to vLLM on the same hardware at a fraction of the cost (\~$0.12/hr vs cloud providers charging $1–2/hr).

**Speculative decoding:** Pair a large model (34B) with a small draft model (3B) for 2–3× speedup with no quality loss. RTX 4090 is ideal for this pattern.
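
To put "throughput per GPU dollar" in concrete terms, here is a quick calculation from the approximate prices and throughputs in the table above (both are estimates and will vary by listing and workload):

```python
# Approximate figures from the table above: (hourly price in USD, decode throughput in tok/s).
gpus = {
    "RTX 3090":  (0.12, 120),
    "RTX 4090":  (0.70, 200),
    "A100 40GB": (1.20, 160),
    "A100 80GB": (2.00, 185),
}

for name, (price_per_hr, tok_per_s) in gpus.items():
    tokens_per_hr = tok_per_s * 3600
    cost_per_million = price_per_hr / tokens_per_hr * 1_000_000
    print(f"{name}: ~${cost_per_million:.3f} per million generated tokens")
```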

***

## Resources

* 🐙 **GitHub:** [github.com/EricLBuehler/mistral.rs](https://github.com/EricLBuehler/mistral.rs)
* 📦 **Container Registry:** [ghcr.io/ericlbuehler/mistral.rs](https://ghcr.io/ericlbuehler/mistral.rs)
* 📚 **Documentation:** [ericlbuehler.github.io/mistral.rs](https://ericlbuehler.github.io/mistral.rs/mistralrs/)
* 💬 **Discord:** [discord.gg/SZrecqK8qw](https://discord.gg/SZrecqK8qw)
* 🤗 **GGUF Models:** [huggingface.co/TheBloke](https://huggingface.co/TheBloke)


***

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/mistral-rs.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
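
For example, a query can be issued like this (standard library only; the question must be URL-encoded):

```python
import urllib.parse
import urllib.request

question = "Which ISQ levels does mistral.rs support for 70B models?"
url = (
    "https://docs.clore.ai/guides/language-models/mistral-rs.md?ask="
    + urllib.parse.quote(question)
)

# The response contains a direct answer plus relevant excerpts and sources.
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))
```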
