Mistral Small 3.1

Deploy Mistral Small 3.1 (24B) on Clore.ai — the ideal single-GPU production model

Mistral Small 3.1, released March 2025 by Mistral AI, is a 24-billion parameter dense model that punches well above its weight. With a 128K context window, native vision capabilities, best-in-class function calling, and an Apache 2.0 license, it's arguably the best model you can run on a single RTX 4090. It outperforms GPT-4o Mini and Claude 3.5 Haiku on most benchmarks while fitting comfortably on consumer hardware when quantized.

Key Features

  • 24B dense parameters — no MoE complexity, straightforward deployment

  • 128K context window — RULER 128K score of 81.2%, beats GPT-4o Mini (65.8%)

  • Native vision — analyze images, charts, documents, and screenshots

  • Apache 2.0 license — fully open for commercial and personal use

  • Elite function calling — native tool use with JSON output, ideal for agentic workflows

  • Multilingual — 25+ languages including CJK, Arabic, Hindi, and European languages

Requirements

| Component | Quantized (Q4) | Full Precision (BF16) |
|---|---|---|
| GPU | 1× RTX 4090 24GB | 2× RTX 4090 or 1× H100 |
| VRAM | ~16GB | ~55GB |
| RAM | 32GB | 64GB |
| Disk | 20GB | 50GB |
| CUDA | 11.8+ | 12.0+ |

Clore.ai recommendation: RTX 4090 (~$0.5–2/day) for quantized inference — best price/performance ratio

Quick Start with Ollama

The fastest way to get Mistral Small 3.1 running:
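A minimal sketch on a fresh Linux instance, using Ollama's official install script and the `mistral-small3.1` library name from the troubleshooting table in this guide:

```shell
# Install Ollama via the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Pull and chat with Mistral Small 3.1 (official Ollama library name);
# Ollama downloads a Q4 quantized build that fits in ~16GB VRAM
ollama run mistral-small3.1
```

The first run downloads the quantized weights; subsequent runs start immediately.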

Ollama as OpenAI-Compatible API
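Ollama exposes an OpenAI-compatible endpoint under /v1 on its default port, so existing OpenAI client code can point at it unchanged. A sketch (the prompt is illustrative):

```shell
# Chat completion against Ollama's OpenAI-compatible API (default port 11434)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small3.1",
    "temperature": 0.15,
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}
    ]
  }'
```

Note the low temperature, per Mistral AI's recommendation for this model.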

Ollama with Vision
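Two hedged sketches for image input: referencing a local file in a CLI prompt, and Ollama's native /api/generate endpoint with a base64-encoded image. The filenames are placeholders, and `base64 -w0` is the GNU coreutils flag (macOS uses `-b 0`):

```shell
# Multimodal prompt via the CLI: include a local image path in the prompt
ollama run mistral-small3.1 "What does this chart show? ./sales_chart.png"

# Or via the native API, passing the image as base64 in the "images" field
curl http://localhost:11434/api/generate -d "{
  \"model\": \"mistral-small3.1\",
  \"prompt\": \"Describe this screenshot.\",
  \"images\": [\"$(base64 -w0 screenshot.png)\"]
}"
```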

vLLM Setup (Production)

For production workloads with high throughput and concurrent requests:
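A minimal install sketch; a recent vLLM release with Mistral Small 3.1 support is assumed, and `mistral_common` provides the Mistral-format tokenizer that the serve flags below rely on:

```shell
# Install vLLM and the Mistral tokenizer package
pip install -U vllm
pip install -U mistral_common
```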

Serve on Single GPU (Text Only)
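A serve sketch assuming an 80GB-class GPU (e.g. 1× H100) for the BF16 weights; on a 24GB card you would need a pre-quantized checkpoint instead. The model ID is the Hugging Face repo for Mistral Small 3.1, and the three mistral-format flags match the troubleshooting table in this guide:

```shell
# Serve text-only on one GPU; cap context length to limit KV-cache memory
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --max-model-len 32768
```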

Query the Server
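vLLM serves an OpenAI-compatible API on port 8000 by default. A sketch (the prompt is illustrative):

```shell
# Chat completion against the vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    "temperature": 0.15,
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]
  }'
```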

HuggingFace Transformers

For direct Python integration and experimentation:
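A text-only sketch loading the model in 4-bit so it fits in ~16GB VRAM, assuming `transformers`, `accelerate`, and `bitsandbytes` are installed. The model ID is the Hugging Face repo; depending on your `transformers` version, this multimodal checkpoint may require the dedicated Mistral 3 model class rather than `AutoModelForCausalLM`:

```python
# Minimal sketch: 4-bit quantized load of Mistral Small 3.1 (assumption:
# a GPU with ~16GB+ VRAM and bitsandbytes installed)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV caching in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Low temperature, per Mistral AI's recommendation for this model
out = model.generate(inputs, max_new_tokens=128, temperature=0.15, do_sample=True)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```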

Function Calling Example

Mistral Small 3.1 is one of the best small models for tool use:
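A self-contained sketch of the agent-side plumbing: an OpenAI-style tool schema (the format both Ollama's and vLLM's /v1 endpoints accept) plus a dispatcher that executes a returned tool call. The tool name, stub implementation, and the `assistant_tool_call` payload are hand-written illustrations of the shape the server returns, not a live response:

```python
import json

# OpenAI-style tool schema sent in the request's "tools" field
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Local implementations the model's tool calls dispatch to
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"  # stub: a real app would call a weather API

REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> str:
    """Execute one entry from an assistant message's tool_calls list."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    return REGISTRY[fn["name"]](**args)

# Hand-written example of the tool-call shape the model returns
assistant_tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
}
print(run_tool_call(assistant_tool_call))  # → 22C and sunny in Berlin
```

In a real loop you would append the tool result as a `role: "tool"` message and call the model again so it can compose the final answer.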

Docker Quick Start
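A sketch using Ollama's official Docker image with GPU passthrough (assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# Start Ollama in a container with GPU access and a persistent model volume
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Pull and run Mistral Small 3.1 inside the container
docker exec -it ollama ollama run mistral-small3.1
```

The API is then reachable at http://localhost:11434 exactly as in the bare-metal setup.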

Tips for Clore.ai Users

  • RTX 4090 is the sweet spot: At $0.5–2/day, a single RTX 4090 runs Mistral Small 3.1 quantized with room to spare. Best cost/performance ratio on Clore.ai for a general-purpose LLM.

  • Use low temperature: Mistral AI recommends temperature=0.15 for most tasks. Higher temperatures produce inconsistent output with this model.

  • RTX 3090 works too: At $0.3–1/day, RTX 3090 (24GB) runs Q4 quantized with Ollama just fine. Slightly slower than 4090 but half the price.

  • Ollama for quick setups, vLLM for production: Ollama gives you a working model in 60 seconds. For concurrent API requests and higher throughput, switch to vLLM.

  • Function calling makes it special: Many 24B models can chat — few can reliably call tools. Mistral Small 3.1's function calling is on par with GPT-4o Mini. Build agents, API backends, and automation pipelines with confidence.

Troubleshooting

| Issue | Solution |
|---|---|
| OutOfMemoryError on RTX 4090 | Use a quantized model via Ollama, or load_in_4bit=True in Transformers. Full BF16 needs ~55GB. |
| Ollama model not found | Use ollama run mistral-small3.1 (the official library name). |
| vLLM tokenizer errors | Always pass --tokenizer-mode mistral --config-format mistral --load-format mistral. |
| Poor output quality | Set temperature=0.15 and add a system prompt; Mistral Small is sensitive to temperature. |
| Vision not working on 1 GPU | Vision features need more VRAM. Use --tensor-parallel-size 2 or reduce --max-model-len. |
| Function calls return empty | Add --tool-call-parser mistral --enable-auto-tool-choice to vllm serve. |

Further Reading
