Mistral.rs

Blazing-fast LLM inference written in Rust — production-ready server with GGUF, GGML, SafeTensors support and OpenAI-compatible API.

🦀 Built in Rust for maximum performance | GGUF & vision model support | Apache-2.0 License


What is Mistral.rs?

Mistral.rs is a high-performance LLM inference engine written entirely in Rust. Originally focused on Mistral models, it now supports the full landscape of modern LLMs. The Rust foundation provides:

  • Zero runtime overhead — no interpreter and no garbage-collection pauses during inference

  • Memory safety — no null-pointer dereferences or use-after-free bugs

  • Deterministic performance — consistent latency without JVM/Python overhead

  • Compile-time optimizations — SIMD, threading, and GPU kernels optimized at build time

Key Features

  • GGUF support — run any quantized model (Q4_K_M, Q8_0, etc.)

  • ISQ (In-Situ Quantization) — quantize on the fly at load time

  • PagedAttention — efficient KV cache with continuous batching

  • Vision Language Models — LLaVA, Phi-3 Vision, Idefics support

  • Speculative decoding — faster inference with draft models

  • X-LoRA — scalable fine-tuned adapter support

  • OpenAI-compatible REST API — drop-in replacement

Supported Model Families

| Family | Format | Engine |
| --- | --- | --- |
| Llama 2/3 | GGUF, SafeTensors | Rust CUDA |
| Mistral/Mixtral | GGUF, SafeTensors | Rust CUDA |
| Phi-2/3 | GGUF, SafeTensors | Rust CUDA |
| Gemma | GGUF, SafeTensors | Rust CUDA |
| Qwen 2 | GGUF, SafeTensors | Rust CUDA |
| Starcoder 2 | GGUF | Rust CUDA |
| LLaVA 1.5/1.6 | SafeTensors | Vision |
| Phi-3 Vision | SafeTensors | Vision |



Quick Start on Clore.ai

Step 1: Find a GPU Server

On the clore.ai marketplace:

  • Minimum: 8GB VRAM (for 7B Q4 models)

  • Recommended: RTX 3090/4090 (24GB) for larger models

  • CUDA 11.8+ required

Step 2: Deploy Mistral.rs Docker

Port mappings:

| Container Port | Purpose |
| --- | --- |
| 22 | SSH access |
| 8080 | REST API server |

Available image variants:
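On Clore.ai you normally select the image and port mappings in the marketplace UI, but a manual launch looks roughly like this. The image name is a placeholder for whichever variant you pick, and the `gguf` subcommand flags follow recent mistral.rs releases; confirm with `mistralrs-server gguf --help`:

```shell
# <mistralrs-image> is a placeholder for your chosen image variant.
docker run -d --gpus all \
  -p 8080:8080 \
  -e HF_TOKEN=<your-huggingface-token> \
  <mistralrs-image> \
  mistralrs-server --port 8080 gguf \
    -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
    -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
```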

Step 3: Connect and Verify
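Use the host/port mapping shown in the Clore.ai dashboard for the two container ports above:

```shell
# SSH in via the host port mapped to container port 22
ssh -p <mapped-ssh-port> root@<server-ip>

# Verify the API via the host port mapped to container port 8080
curl http://<server-ip>:<mapped-api-port>/health
```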


Running the Server

Quick Start with GGUF Model
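A minimal launch of a pre-quantized GGUF model. Flag names follow recent mistral.rs releases (older versions also required a separate tokenizer-model flag); verify with `mistralrs-server gguf --help`:

```shell
mistralrs-server --port 8080 gguf \
  -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
```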

Serve Mistral 7B (SafeTensors)
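Full-precision SafeTensors models use the `plain` subcommand. The `-a` (architecture) flag is auto-detected in newer releases, so it may be optional on your version:

```shell
mistralrs-server --port 8080 plain \
  -m mistralai/Mistral-7B-Instruct-v0.2 \
  -a mistral
```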

Serve with In-Situ Quantization (ISQ)

ISQ quantizes the model at load time — no pre-quantized model needed:
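A sketch assuming the `--isq` flag from recent mistral.rs releases (see the ISQ table below for valid type names):

```shell
# Download full-precision weights, quantize to Q4K while loading
mistralrs-server --port 8080 --isq Q4K plain \
  -m mistralai/Mistral-7B-Instruct-v0.2 -a mistral
```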

Vision Language Model
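Vision models load through the `vision-plain` subcommand; the Phi-3 Vision example below assumes recent release flag names:

```shell
mistralrs-server --port 8080 vision-plain \
  -m microsoft/Phi-3-vision-128k-instruct -a phi3v
```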

Speculative Decoding
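Speculative decoding pairs the target model with a smaller draft model. The CLI shape for this has changed between mistral.rs releases, so treat every name below as a hypothetical sketch and consult the repository docs for the current syntax:

```shell
# Hypothetical sketch only: subcommand and flag names vary by release.
# Concept: serve a large target model, drafting tokens with a small model.
mistralrs-server --port 8080 \
  speculative --gamma 32 \
  gguf -m <target-34b-gguf-repo> -f <target>.gguf \
  gguf -m <draft-3b-gguf-repo> -f <draft>.gguf
```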


API Usage

OpenAI-Compatible Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List models |
| /v1/images/generations | POST | Image generation (VLMs) |
| /v1/re_isq | POST | Re-quantize loaded model |
| /health | GET | Health check |

Python Example
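A stdlib-only sketch against the OpenAI-compatible endpoint above. The model name "mistral" is a placeholder; mistral.rs serves whichever model it was started with:

```python
import json
import urllib.request

# Adjust host/port to the Clore.ai mapping for container port 8080.
API = "http://localhost:8080/v1/chat/completions"

def build_request(prompt, model="mistral", max_tokens=128):
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        API,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# print(chat("Explain PagedAttention in one sentence."))
```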

Streaming Response
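With `"stream": true` in the request body, the server emits OpenAI-style server-sent events (`data: {...}` lines ending in `data: [DONE]`). A small parser for those lines, usable with any HTTP client that yields the response line by line:

```python
import json

def iter_stream_deltas(lines):
    """Yield content deltas from an OpenAI-style SSE chat stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Usage with e.g. requests:
#   resp = requests.post(url, json={**payload, "stream": True}, stream=True)
#   for text in iter_stream_deltas(resp.iter_lines(decode_unicode=True)):
#       print(text, end="", flush=True)
```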

Vision/Image Input
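Vision models accept the OpenAI multimodal message format: a list of content parts mixing `image_url` and `text`. The model name is a placeholder for whatever VLM the server loaded:

```python
def build_vision_request(image_url, question, model="phi3v"):
    """Build an OpenAI-style multimodal chat payload: image plus text."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 256,
    }

# POST this to /v1/chat/completions exactly like a text-only request.
```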

cURL Examples
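The same endpoints from the table above, exercised with curl against a local server:

```shell
# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "Hello!"}]}'

# List loaded models
curl http://localhost:8080/v1/models

# Health check
curl http://localhost:8080/health
```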


Configuration Options

Server Flags
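The flags below are the ones this guide relies on; names follow recent mistral.rs releases, so confirm against your installed version:

```shell
# Commonly used flags (verify with `mistralrs-server --help`):
#   --port <PORT>          REST API port (8080 in this guide)
#   --isq <TYPE>           in-situ quantization target, e.g. Q4K
#   --max-seqs <N>         max concurrent sequences for batching
#   --token-source <SRC>   Hugging Face token source, e.g. env:HF_TOKEN
mistralrs-server --help
```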

ISQ Quantization Reference

| ISQ Option | Bits | Quality | VRAM (7B) |
| --- | --- | --- | --- |
| Q2K | 2 | ★★☆☆☆ | ~2.5 GB |
| Q3K | 3 | ★★★☆☆ | ~3.5 GB |
| Q4_0 | 4 | ★★★★☆ | ~4.5 GB |
| Q4K | 4 | ★★★★☆ | ~4.5 GB |
| Q5K | 5 | ★★★★★ | ~5.5 GB |
| Q6K | 6 | ★★★★★ | ~6.5 GB |
| Q8_0 | 8 | ★★★★★ | ~8 GB |
| HQQ4 | 4 | ★★★★☆ | ~4.5 GB |
| HQQ8 | 8 | ★★★★★ | ~8 GB |


HQQ (Half-Quadratic Quantization) often achieves better quality than GGUF Q4 at the same bit level, especially for instruction-following tasks.


Advanced Features

X-LoRA (Mixture of LoRA Adapters)

Run multiple fine-tuned adapters dynamically selected per token:
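A hedged sketch: X-LoRA flags have shifted between mistral.rs releases, so treat the short flag names as placeholders and check `mistralrs-server x-lora --help`:

```shell
# Flag names are placeholders; confirm against your installed version.
mistralrs-server --port 8080 x-lora \
  -m <base-model-id> \
  -x <xlora-adapter-repo> \
  -o ordering.json   # adapter ordering file produced by X-LoRA training
```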

Re-Quantize at Runtime
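The `/v1/re_isq` endpoint re-quantizes the loaded model without a restart. The request-body shape below is an assumption; confirm it against the mistral.rs API docs:

```shell
# Body shape is an assumption, not confirmed by this guide.
curl -X POST http://localhost:8080/v1/re_isq \
  -H "Content-Type: application/json" \
  -d '{"ggml_type": "Q4K"}'
```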

Request Logging
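A hedged sketch: recent releases expose a logging option, but the flag name below is an assumption; confirm with `mistralrs-server --help`:

```shell
# --log is assumed here; verify the flag name on your version.
mistralrs-server --port 8080 --log output.log gguf \
  -m <gguf-repo> -f <file>.gguf
```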


Performance Tuning

Optimize for Throughput
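Throughput mostly comes down to letting PagedAttention batch aggressively. The `--max-seqs` name follows recent releases; confirm with `--help`:

```shell
# Allow many concurrent sequences so continuous batching stays full.
mistralrs-server --port 8080 --max-seqs 32 gguf \
  -m <gguf-repo> -f <file>.gguf
```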

Optimize for Low Latency
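For latency-sensitive single-user workloads, keep batches minimal so a request is never queued behind others (same flag-name caveat as above):

```shell
mistralrs-server --port 8080 --max-seqs 1 gguf \
  -m <gguf-repo> -f <file>.gguf
```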

Monitor Performance
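Watch GPU utilization on the host and liveness via the API; recent mistral.rs releases also print per-request generation speed (tok/s) in the server logs:

```shell
watch -n 1 nvidia-smi                # GPU utilization and VRAM use
curl http://localhost:8080/health    # server liveness
```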


Docker Compose
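A compose sketch equivalent to the `docker run` deployment above; the image name is a placeholder for the variant you deployed:

```yaml
# Sketch: substitute <mistralrs-image> with your chosen image variant.
services:
  mistralrs:
    image: <mistralrs-image>
    command: >
      mistralrs-server --port 8080 gguf
      -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF
      -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
    ports:
      - "8080:8080"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```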


Building from Source

If the Docker image doesn't match your CUDA version:
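Building needs a recent Rust toolchain plus the CUDA toolkit on the server. Feature names follow the repository README; drop `flash-attn` if your GPU does not support it:

```shell
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo build --release --features "cuda flash-attn"
./target/release/mistralrs-server --help
```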


Troubleshooting

CUDA Library Not Found
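Usually the dynamic loader cannot see the CUDA toolkit libraries; the path below is an example, so match it to your installed CUDA version:

```shell
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ldconfig -p | grep libcublas   # verify the libraries are now visible
```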

Model Download Fails
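Gated or private Hugging Face repos need a token. The `--token-source` syntax follows recent releases; confirm with `--help`:

```shell
export HF_TOKEN=<your-huggingface-token>
mistralrs-server --token-source env:HF_TOKEN --port 8080 plain \
  -m mistralai/Mistral-7B-Instruct-v0.2
```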

Port 8080 In Use
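Find what holds the port, then either stop it or serve on another port:

```shell
lsof -i :8080                        # identify the conflicting process
mistralrs-server --port 8081 gguf \
  -m <gguf-repo> -f <file>.gguf      # or just move to a free port
```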

Out of Memory During Quantization
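If ISQ runs out of VRAM while loading, pick a smaller quantization target from the ISQ table (or a pre-quantized GGUF, which skips the load-time working set):

```shell
mistralrs-server --port 8080 --isq Q3K plain \
  -m mistralai/Mistral-7B-Instruct-v0.2
```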


Clore.ai GPU Recommendations

Mistral.rs is a Rust-native engine — its low overhead means you get more throughput per GPU dollar vs Python-based servers.

| GPU | VRAM | Clore.ai Price | Recommended Use | Throughput (Mistral 7B Q4) |
| --- | --- | --- | --- | --- |
| RTX 3090 | 24 GB | ~$0.12/hr | Best budget option — 7B Q4/Q8, vision models | ~120 tok/s |
| RTX 4090 | 24 GB | ~$0.70/hr | High-throughput 7B–34B, speculative decoding | ~200 tok/s |
| A100 40GB | 40 GB | ~$1.20/hr | Production 34B–70B Q4 serving | ~160 tok/s |
| A100 80GB | 80 GB | ~$2.00/hr | Full-precision 70B, multi-model | ~185 tok/s |

Why RTX 3090 excels here: Mistral.rs's Rust CUDA kernels avoid Python GIL overhead and garbage collection pauses that hurt Python servers. An RTX 3090 running Mistral 7B Q4_K_M delivers 120 tok/s — comparable to vLLM on the same hardware at a fraction of the cost ($0.12/hr vs cloud providers charging $1–2/hr).

Speculative decoding: Pair a large model (34B) with a small draft model (3B) for 2–3× speedup with no quality loss. RTX 4090 is ideal for this pattern.


Resources
