# Mistral Small 3.1

Mistral Small 3.1, released in March 2025 by Mistral AI, is a **24-billion-parameter dense model** that punches well above its weight. With a 128K context window, native vision capabilities, best-in-class function calling, and an **Apache 2.0 license**, it is arguably the strongest model you can run on a single RTX 4090. It outperforms GPT-4o Mini and Claude 3.5 Haiku on most benchmarks while fitting comfortably on consumer hardware once quantized.

## Key Features

* **24B dense parameters** — no MoE complexity, straightforward deployment
* **128K context window** — RULER 128K score of 81.2%, beats GPT-4o Mini (65.8%)
* **Native vision** — analyze images, charts, documents, and screenshots
* **Apache 2.0 license** — fully open for commercial and personal use
* **Elite function calling** — native tool use with JSON output, ideal for agentic workflows
* **Multilingual** — 25+ languages including CJK, Arabic, Hindi, and European languages

## Requirements

| Component | Quantized (Q4)   | Full Precision (BF16)  |
| --------- | ---------------- | ---------------------- |
| GPU       | 1× RTX 4090 24GB | 2× RTX 4090 or 1× H100 |
| VRAM      | \~16GB           | \~55GB                 |
| RAM       | 32GB             | 64GB                   |
| Disk      | 20GB             | 50GB                   |
| CUDA      | 11.8+            | 12.0+                  |

**Clore.ai recommendation**: RTX 4090 (\~$0.5–2/day) for quantized inference — best price/performance ratio
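The VRAM figures in the table follow from simple arithmetic: bytes per parameter times parameter count, plus runtime overhead for the KV cache, activations, and framework buffers. A back-of-envelope sketch (the flat 20% overhead factor is an assumption for illustration; real usage varies with context length and batch size):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: model weights plus a flat overhead factor
    for KV cache, activations, and framework buffers."""
    weights_gb = params_billion * bytes_per_param  # 1B params * 1 byte = 1 GB
    return round(weights_gb * (1 + overhead), 1)

# Mistral Small 3.1: 24B parameters
print(estimate_vram_gb(24, 2.0))   # BF16 (2 bytes/param)  -> 57.6
print(estimate_vram_gb(24, 0.5))   # Q4 (~0.5 bytes/param) -> 14.4
```

These estimates land close to the table's \~55GB (BF16) and \~16GB (Q4) figures, which is why a single 24GB card handles Q4 comfortably while BF16 needs two cards or an H100.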

## Quick Start with Ollama

The fastest way to get Mistral Small 3.1 running:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Mistral Small 3.1 (auto-downloads ~14GB Q4 quantization)
ollama run mistral-small3.1

# Or specify a specific quantization
ollama run mistral-small3.1:24b-instruct-2503-q4_K_M
```

### Ollama as OpenAI-Compatible API

```bash
# Start Ollama server
ollama serve &

# Pull the model
ollama pull mistral-small3.1

# Query via API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small3.1",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a Python decorator for rate limiting"}
    ],
    "temperature": 0.15
  }'
```

### Ollama with Vision

```bash
# Send an image for analysis. The HTTP API expects base64-encoded
# image data in "images", not file paths.
# (base64 -w0 is GNU coreutils; on macOS use `base64 -i image.jpg`)
curl http://localhost:11434/api/chat -d "{
  \"model\": \"mistral-small3.1\",
  \"stream\": false,
  \"messages\": [{
    \"role\": \"user\",
    \"content\": \"What does this image show?\",
    \"images\": [\"$(base64 -w0 /path/to/image.jpg)\"]
  }]
}"
```
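The same request body can be assembled from Python with only the standard library. Ollama's `/api/chat` endpoint expects each entry in `images` to be the base64-encoded bytes of the image, so the file must be read and encoded first. A minimal sketch:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_path: str) -> str:
    """Build the JSON body for Ollama's /api/chat endpoint with one image.
    Images are sent as base64-encoded strings, not file paths."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "stream": False,  # return a single JSON response, not a stream
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [image_b64],
        }],
    })

# POST the returned string to http://localhost:11434/api/chat
```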

## vLLM Setup (Production)

For production workloads with high throughput and concurrent requests:

```bash
# Install vLLM (v0.8.1+ required)
pip install -U vllm

# Verify mistral_common is installed (should be automatic)
python -c "import mistral_common; print(mistral_common.__version__)"
```

### Serve on Single GPU (Text Only)

```bash
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

### Serve with Vision (2 GPUs Recommended)

```bash
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt 'image=10' \
  --tensor-parallel-size 2 \
  --max-model-len 65536
```
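With vision enabled, the OpenAI-compatible endpoint accepts multimodal messages whose content mixes `text` and `image_url` parts, and a local file can be inlined as a base64 data URL. A small helper for building such a message (the default JPEG MIME type is an assumption; adjust it for PNG etc.):

```python
import base64

def image_message(prompt: str, image_path: str,
                  mime: str = "image/jpeg") -> dict:
    """Build one user message combining text and an inline base64
    data-URL image, in the OpenAI chat-completions content format."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Pass [image_message("Describe this chart", "chart.jpg")] as `messages`
# to client.chat.completions.create(...) against the vLLM server above.
```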

### Query the Server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Today is 2026-02-20."},
        {"role": "user", "content": "Write a complete REST API in FastAPI with CRUD operations for a blog"}
    ],
    temperature=0.15,
    max_tokens=4096
)
print(response.choices[0].message.content)
```

## HuggingFace Transformers

For direct Python integration and experimentation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit quantization via bitsandbytes: fits on a 24GB GPU
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Implement a binary search tree in Python with insert, delete, and search methods"}
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=2048,
    temperature=0.15,
    do_sample=True
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

## Function Calling Example

Mistral Small 3.1 is one of the best small models for tool use:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a given ticker symbol",
            "parameters": {
                "type": "object",
                "required": ["ticker"],
                "properties": {
                    "ticker": {"type": "string", "description": "Stock ticker symbol (e.g., AAPL)"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_portfolio_value",
            "description": "Calculate total portfolio value given holdings",
            "parameters": {
                "type": "object",
                "required": ["holdings"],
                "properties": {
                    "holdings": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "ticker": {"type": "string"},
                                "shares": {"type": "number"}
                            }
                        }
                    }
                }
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[{"role": "user", "content": "What's the current price of AAPL and MSFT?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.15
)

for tool_call in response.choices[0].message.tool_calls:
    print(f"Call: {tool_call.function.name}({tool_call.function.arguments})")
```
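After the model emits tool calls, your code executes them and sends the results back as `tool` role messages (appended after the assistant message that contained the calls) so the model can compose a final answer. A sketch of the dispatch step, using hypothetical stub implementations in place of real data sources:

```python
import json

# Hypothetical stubs standing in for real market-data lookups
def get_stock_price(ticker: str) -> dict:
    return {"ticker": ticker, "price": 123.45}

def calculate_portfolio_value(holdings: list) -> dict:
    return {"total": sum(h["shares"] * 123.45 for h in holdings)}

TOOL_REGISTRY = {
    "get_stock_price": get_stock_price,
    "calculate_portfolio_value": calculate_portfolio_value,
}

def run_tool_calls(tool_calls) -> list:
    """Execute each tool call and return `tool` role messages,
    matched to their calls by tool_call_id, for the follow-up request."""
    results = []
    for call in tool_calls:
        fn = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(fn(**args)),
        })
    return results

# Append run_tool_calls(response.choices[0].message.tool_calls) to the
# conversation and call client.chat.completions.create(...) again.
```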

## Docker Quick Start

```bash
# Single GPU deployment
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --max-model-len 32768

# With vision support (2 GPUs)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt 'image=10' \
  --tensor-parallel-size 2
```
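For deployments that should survive reboots, the same flags can live in a Compose file. A sketch of the single-GPU text-only setup above (the service name and GPU reservation syntax assume Docker Compose with the NVIDIA container runtime installed):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model mistralai/Mistral-Small-3.1-24B-Instruct-2503
      --tokenizer-mode mistral
      --config-format mistral
      --load-format mistral
      --tool-call-parser mistral
      --enable-auto-tool-choice
      --max-model-len 32768
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
```

Start it with `docker compose up -d`; the cached model weights in `~/.cache/huggingface` are reused across container restarts.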

## Tips for Clore.ai Users

* **RTX 4090 is the sweet spot**: At $0.5–2/day, a single RTX 4090 runs Mistral Small 3.1 quantized with room to spare. Best cost/performance ratio on Clore.ai for a general-purpose LLM.
* **Use low temperature**: Mistral AI recommends `temperature=0.15` for most tasks. Higher temps cause inconsistent output with this model.
* **RTX 3090 works too**: At $0.3–1/day, RTX 3090 (24GB) runs Q4 quantized with Ollama just fine. Slightly slower than 4090 but half the price.
* **Ollama for quick setups, vLLM for production**: Ollama gives you a working model in 60 seconds. For concurrent API requests and higher throughput, switch to vLLM.
* **Function calling makes it special**: Many 24B models can chat — few can reliably call tools. Mistral Small 3.1's function calling is on par with GPT-4o Mini. Build agents, API backends, and automation pipelines with confidence.

## Troubleshooting

| Issue                          | Solution                                                                                       |
| ------------------------------ | ---------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on RTX 4090 | Use quantized model via Ollama or `load_in_4bit=True` in Transformers. Full BF16 needs \~55GB. |
| Ollama model not found         | Use `ollama run mistral-small3.1` (official library name).                                     |
| vLLM tokenizer errors          | Always pass `--tokenizer-mode mistral --config-format mistral --load-format mistral`.          |
| Poor output quality            | Set `temperature=0.15`. Add a system prompt. Mistral Small is sensitive to temperature.        |
| Vision not working on 1 GPU    | Vision features need more VRAM. Use `--tensor-parallel-size 2` or reduce `--max-model-len`.    |
| Function calls return empty    | Add `--tool-call-parser mistral --enable-auto-tool-choice` to vLLM serve.                      |

## Further Reading

* [Mistral Small 3.1 on HuggingFace](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
* [Mistral AI Blog Post](https://mistral.ai/news/mistral-small-3-1/)
* [Ollama Model Page](https://ollama.com/library/mistral-small3.1)
* [vLLM Documentation](https://docs.vllm.ai/)
* [Mistral Common Library](https://github.com/mistralai/mistral-common)
* [Mistral AI Platform](https://console.mistral.ai/)
