# Mistral Small 3.1

Mistral Small 3.1, released March 2025 by Mistral AI, is a **24-billion parameter dense model** that punches way above its weight. With a 128K context window, native vision capabilities, best-in-class function calling, and an **Apache 2.0 license**, it's arguably the best model you can run on a single RTX 4090. It outperforms GPT-4o Mini and Claude 3.5 Haiku on most benchmarks while fitting comfortably on consumer hardware when quantized.

## Key Features

* **24B dense parameters** — no MoE complexity, straightforward deployment
* **128K context window** — RULER 128K score of 81.2%, beats GPT-4o Mini (65.8%)
* **Native vision** — analyze images, charts, documents, and screenshots
* **Apache 2.0 license** — fully open for commercial and personal use
* **Elite function calling** — native tool use with JSON output, ideal for agentic workflows
* **Multilingual** — 25+ languages including CJK, Arabic, Hindi, and European languages

## Requirements

| Component | Quantized (Q4)   | Full Precision (BF16)  |
| --------- | ---------------- | ---------------------- |
| GPU       | 1× RTX 4090 24GB | 2× RTX 4090 or 1× H100 |
| VRAM      | \~16GB           | \~55GB                 |
| RAM       | 32GB             | 64GB                   |
| Disk      | 20GB             | 50GB                   |
| CUDA      | 11.8+            | 12.0+                  |

**Clore.ai recommendation**: RTX 4090 (\~$0.5–2/day) for quantized inference — best price/performance ratio
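
The VRAM figures above follow from simple back-of-envelope arithmetic: weight memory is parameter count times bytes per parameter, plus headroom for the KV cache, activations, and CUDA overhead. A rough sketch of that math (the ~4.5 bits/param for Q4 and the 15% headroom factor are assumptions, not measurements):

```python
# Rough VRAM estimate: weights = params * bits_per_param / 8, plus headroom
# for KV cache, activations, and CUDA overhead (15% is an assumed fudge factor).
def estimate_vram_gb(params_billion: float, bits_per_param: float, headroom: float = 1.15) -> float:
    weights_gb = params_billion * bits_per_param / 8
    return weights_gb * headroom

print(f"Q4  (~4.5 bits/param): {estimate_vram_gb(24, 4.5):.0f} GB")  # ~16 GB
print(f"BF16 (16 bits/param):  {estimate_vram_gb(24, 16):.0f} GB")   # ~55 GB
```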

## Quick Start with Ollama

The fastest way to get Mistral Small 3.1 running:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Mistral Small 3.1 (auto-downloads ~14GB Q4 quantization)
ollama run mistral-small3.1

# Or specify a specific quantization
ollama run mistral-small3.1:24b-instruct-2503-q4_K_M
```

### Ollama as OpenAI-Compatible API

```bash
# Start Ollama server
ollama serve &

# Pull the model
ollama pull mistral-small3.1

# Query via API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small3.1",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a Python decorator for rate limiting"}
    ],
    "temperature": 0.15
  }'
```
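
Because Ollama's `/v1` endpoint is OpenAI-compatible, the same request works from Python with the official `openai` client; a minimal sketch (the API key is a placeholder, Ollama ignores it):

```python
from openai import OpenAI

# Point the OpenAI client at Ollama's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored

response = client.chat.completions.create(
    model="mistral-small3.1",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python decorator for rate limiting"},
    ],
    temperature=0.15,
)
print(response.choices[0].message.content)
```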

### Ollama with Vision

```bash
# Send an image for analysis; the images field expects base64-encoded data, not a file path
curl http://localhost:11434/api/chat -d '{
  "model": "mistral-small3.1",
  "messages": [{
    "role": "user",
    "content": "What does this image show?",
    "images": ["'"$(base64 -w0 /path/to/image.jpg)"'"]
  }]
}'
```
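
From Python, the image can be read and base64-encoded before sending; a minimal sketch using `requests` (the image path is a placeholder):

```python
import base64
import requests

# Ollama's chat API expects images as base64-encoded strings, not file paths
with open("/path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral-small3.1",
        "messages": [
            {"role": "user", "content": "What does this image show?", "images": [image_b64]}
        ],
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["message"]["content"])
```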

## vLLM Setup (Production)

For production workloads with high throughput and concurrent requests:

```bash
# Install vLLM (v0.8.1+ required)
pip install -U vllm

# Verify mistral_common is installed (should be automatic)
python -c "import mistral_common; print(mistral_common.__version__)"
```

### Serve on Single GPU (Text Only)

```bash
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

### Serve with Vision (2 GPUs Recommended)

```bash
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt 'image=10' \
  --tensor-parallel-size 2 \
  --max-model-len 65536
```

### Query the Server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Today is 2026-02-20."},
        {"role": "user", "content": "Write a complete REST API in FastAPI with CRUD operations for a blog"}
    ],
    temperature=0.15,
    max_tokens=4096
)
print(response.choices[0].message.content)
```
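
If the server was started with the vision flags above, images go in the standard OpenAI multimodal message format; a sketch using a base64 data URL (the file path is a placeholder):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Embed a local image as a base64 data URL (placeholder path)
with open("/path/to/chart.png", "rb") as f:
    image_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key trend shown in this chart."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    temperature=0.15,
)
print(response.choices[0].message.content)
```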

## HuggingFace Transformers

For direct Python integration and experimentation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit quantization so the model fits on a 24GB GPU
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Implement a binary search tree in Python with insert, delete, and search methods"}
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=2048,
    temperature=0.15,
    do_sample=True
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
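
For interactive use you can stream tokens to the terminal as they are generated instead of waiting for the full completion; a minimal sketch with `TextStreamer`, reusing `model`, `tokenizer`, and `input_ids` from the block above:

```python
from transformers import TextStreamer

# Print decoded tokens as they are generated; skip echoing the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=2048,
    temperature=0.15,
    do_sample=True,
    streamer=streamer,
)
```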

## Function Calling Example

Mistral Small 3.1 is one of the best small models for tool use:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a given ticker symbol",
            "parameters": {
                "type": "object",
                "required": ["ticker"],
                "properties": {
                    "ticker": {"type": "string", "description": "Stock ticker symbol (e.g., AAPL)"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_portfolio_value",
            "description": "Calculate total portfolio value given holdings",
            "parameters": {
                "type": "object",
                "required": ["holdings"],
                "properties": {
                    "holdings": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "ticker": {"type": "string"},
                                "shares": {"type": "number"}
                            }
                        }
                    }
                }
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[{"role": "user", "content": "What's the current price of AAPL and MSFT?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.15
)

for tool_call in response.choices[0].message.tool_calls:
    print(f"Call: {tool_call.function.name}({tool_call.function.arguments})")
```
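
In an agent loop, each returned tool call is executed locally and its result is sent back as a `tool` message so the model can compose the final answer. A minimal sketch of that second round trip, with a hypothetical `get_stock_price` stub standing in for a real market-data API:

```python
# Hypothetical stub; in practice this would call a real market-data API
def get_stock_price(ticker: str) -> dict:
    return {"ticker": ticker, "price": 123.45}

messages = [{"role": "user", "content": "What's the current price of AAPL and MSFT?"}]
messages.append(response.choices[0].message)  # assistant turn with the tool calls

for tool_call in response.choices[0].message.tool_calls:
    args = json.loads(tool_call.function.arguments)
    result = get_stock_price(**args)  # only one tool is dispatched in this sketch
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    })

final = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=messages,
    tools=tools,
    temperature=0.15,
)
print(final.choices[0].message.content)
```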

## Docker Quick Start

```bash
# Single GPU deployment
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --max-model-len 32768

# With vision support (2 GPUs)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt 'image=10' \
  --tensor-parallel-size 2
```

## Tips for Clore.ai Users

* **RTX 4090 is the sweet spot**: At $0.5–2/day, a single RTX 4090 runs Mistral Small 3.1 quantized with room to spare. Best cost/performance ratio on Clore.ai for a general-purpose LLM.
* **Use low temperature**: Mistral AI recommends `temperature=0.15` for most tasks. Higher temps cause inconsistent output with this model.
* **RTX 3090 works too**: At $0.3–1/day, RTX 3090 (24GB) runs Q4 quantized with Ollama just fine. Slightly slower than 4090 but half the price.
* **Ollama for quick setups, vLLM for production**: Ollama gives you a working model in 60 seconds. For concurrent API requests and higher throughput, switch to vLLM.
* **Function calling makes it special**: Many 24B models can chat — few can reliably call tools. Mistral Small 3.1's function calling is on par with GPT-4o Mini. Build agents, API backends, and automation pipelines with confidence.

## Troubleshooting

| Issue                          | Solution                                                                                       |
| ------------------------------ | ---------------------------------------------------------------------------------------------- |
| `OutOfMemoryError` on RTX 4090 | Use quantized model via Ollama or `load_in_4bit=True` in Transformers. Full BF16 needs \~55GB. |
| Ollama model not found         | Use `ollama run mistral-small3.1` (official library name).                                     |
| vLLM tokenizer errors          | Always pass `--tokenizer-mode mistral --config-format mistral --load-format mistral`.          |
| Poor output quality            | Set `temperature=0.15`. Add a system prompt. Mistral Small is sensitive to temperature.        |
| Vision not working on 1 GPU    | Vision features need more VRAM. Use `--tensor-parallel-size 2` or reduce `--max-model-len`.    |
| Function calls return empty    | Add `--tool-call-parser mistral --enable-auto-tool-choice` to vLLM serve.                      |

## Further Reading

* [Mistral Small 3.1 on HuggingFace](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
* [Mistral AI Blog Post](https://mistral.ai/news/mistral-small-3-1/)
* [Ollama Model Page](https://ollama.com/library/mistral-small3.1)
* [vLLM Documentation](https://docs.vllm.ai/)
* [Mistral Common Library](https://github.com/mistralai/mistral-common)
* [Mistral AI Platform](https://console.mistral.ai/)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/mistral-small.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
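
For example, a sketch of such a request from Python (the question itself is hypothetical):

```python
import requests

# requests URL-encodes the ask parameter automatically
resp = requests.get(
    "https://docs.clore.ai/guides/language-models/mistral-small.md",
    params={"ask": "Which quantization does Ollama download by default for mistral-small3.1?"},
)
print(resp.text)
```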
