# Mistral Medium 3.5 (128B Dense, 256K)

{% hint style="info" %}
**Status (April 2026):** Mistral Medium 3.5 was released on **April 29, 2026** by Mistral AI as the successor to Mistral Medium 3. Weights live at [huggingface.co/mistralai/Mistral-Medium-3.5](https://huggingface.co/mistralai/Mistral-Medium-3.5) under the **Mistral Research License (MRL)** for research; the **Mistral Commercial License** is required for production use beyond evaluation. vLLM (≥ 0.8.x) and SGLang ship day-0 support.
{% endhint %}

Mistral Medium 3.5 is a **128B dense transformer** with a **256K-token context window** and a **native reasoning toggle** that swaps between fast "instant" replies and longer chain-of-thought "deep" traces in the same checkpoint. The release consolidates three previously separate Mistral lines — **Medium 3** (general instruction), **Codestral** (code), and Mistral's reasoning preview — into a single toggleable model, which is the headline change for engineering teams who were juggling multiple weights.

For Clore.ai users, the practical implication is sizing. A 128B dense model weighs roughly **128 GB** in FP8 (about **256 GB** in BF16) before KV cache, so it does **not** fit on a single 80 GB GPU even at FP8. You need **4× H100 80 GB** (FP8) or **2× H200 141 GB** to serve it cleanly via vLLM. On the marketplace that lands around **$24–48/day** for the 4× H100 setup, which is the sweet spot for most teams, or **$30–50/day** for 2× H200. Single-H100 deployments only work with aggressive Q4 GGUF quantization (\~50–70 tok/s via llama.cpp), and the 256K context is the first thing to evaporate when you compress.
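
The sizing arithmetic above can be reproduced in a few lines. This is a back-of-envelope sketch, not a vendor formula: it counts weight bytes only and ignores KV cache, activations, and runtime overhead. The helper name `weight_gb` and the nominal 4-bit figure for Q4 are illustrative assumptions (real GGUF Q4 variants store extra scale metadata and come out somewhat larger).

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in decimal GB (weights only)."""
    # 1e9 params * (bits / 8) bytes each = params_billion * bits / 8 GB
    return params_billion * bits_per_param / 8


# Assumed precisions for illustration; Q4 is treated as a flat 4 bits.
for label, bits in [("BF16", 16), ("FP8", 8), ("Q4 (nominal)", 4)]:
    print(f"{label:>14}: ~{weight_gb(128, bits):.0f} GB")
```

Anything the estimate leaves out (KV cache in particular) has to fit in whatever VRAM remains after the weights are loaded.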

## Key Features

* **128B dense parameters** — no MoE routing tricks, predictable VRAM and latency profile, easier to fine-tune than sparse models
* **256K context window** — full-codebase analysis, long-document RAG, multi-turn agent loops without truncation
* **Dual-mode reasoning** — toggle `reasoning_mode=instant` for \~chat latency or `reasoning_mode=deep` to surface a `<think>` trace before the answer
* **Unified instruction + code + reasoning** — one set of weights replaces Medium 3 + Codestral + the reasoning preview
* **Function calling and structured outputs** — native JSON schema enforcement, OpenAI-compatible tool-call format
* **Open weights** — MRL for research, commercial license available; weights stay on your box and never round-trip to a vendor API
* **Day-0 vLLM and SGLang support** — production-ready FP8 paths, tensor parallelism, chunked prefill, continuous batching

## Reasoning Modes

Medium 3.5 is the first Mistral model to ship a single checkpoint that serves both "fast" and "thinking" answers. The toggle is controlled at request time, not at load time, so one vLLM process handles both modes for the same caller.

| Mode                | When to use                                                                 | Typical TTFT                    | Output shape                                  |
| ------------------- | --------------------------------------------------------------------------- | ------------------------------- | --------------------------------------------- |
| `instant` (default) | Chat, autocomplete, classification, function calls where latency matters    | 50–250 ms                       | Answer only                                   |
| `deep`              | Code review, multi-step planning, math, hard debugging, agent planning step | 1–6 s before first answer token | `<think>...</think>` trace, then final answer |

In `deep` mode the model emits a hidden reasoning span (wrapped in `<think>...</think>` by the chat template) before the visible response. This costs anywhere from a few hundred to a few thousand extra tokens per turn, so **don't enable it for every request** — reserve it for tasks where you'd otherwise prompt a smaller model with "think step by step." A reasonable pattern is to keep `instant` as the default and only escalate to `deep` for tool-call planning steps or final-answer synthesis.

{% hint style="warning" %}
**Vendor-suggested sampling.** Mistral recommends `temperature=0.15` for `instant` and `temperature=0.7` with `top_p=0.95` for `deep` mode. Zero-temperature sampling tends to truncate reasoning traces early.
{% endhint %}
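
One way to apply the "default to `instant`, escalate to `deep`" pattern is a small routing helper on the client side. The sketch below is an assumption about how you might organize it: the task labels and the `request_params` helper are hypothetical, while the sampling values are the vendor suggestions quoted above.

```python
# Sampling presets from the vendor guidance quoted in this guide.
PRESETS = {
    "instant": {"temperature": 0.15},
    "deep": {"temperature": 0.7, "top_p": 0.95},
}

# Hypothetical task labels; replace with your own routing taxonomy.
DEEP_TASKS = {"code_review", "planning", "math", "debugging"}


def request_params(task: str) -> dict:
    """Return keyword arguments for an OpenAI-client chat completion call."""
    mode = "deep" if task in DEEP_TASKS else "instant"
    params = dict(PRESETS[mode])  # copy so callers cannot mutate the preset
    params["extra_body"] = {"reasoning_mode": mode}
    return params


print(request_params("planning"))
print(request_params("classification"))
```

The returned dict can be splatted into a call such as `client.chat.completions.create(model=..., messages=..., **request_params("planning"))`.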

## Choose Your Deployment

Three realistic configurations on the Clore.ai marketplace. Pick by VRAM budget first, throughput second.

| Setup                                                                                                               | Precision           | Total VRAM | Context (practical) | Throughput     | Recommended Clore tier             | Notes                                                   |
| ------------------------------------------------------------------------------------------------------------------- | ------------------- | ---------- | ------------------- | -------------- | ---------------------------------- | ------------------------------------------------------- |
| 1× H100 80 GB                                                                                                       | Q4 GGUF (llama.cpp) | 80 GB      | 32K–64K             | \~50–70 tok/s  | Single-GPU, evaluation/dev         | Aggressive quantization; lose some quality on long code |
| 4× [H100](https://clore.ai/rent-h100.html?utm_source=docs\&utm_medium=guide\&utm_campaign=mistral-medium-35) 80 GB  | FP8 (vLLM)          | 320 GB     | Full 256K           | \~80–140 tok/s | **Production sweet spot**          | TP=4, best tok/$ for sustained traffic                  |
| 2× [H200](https://clore.ai/rent-h200.html?utm_source=docs\&utm_medium=guide\&utm_campaign=mistral-medium-35) 141 GB | FP8 (vLLM)          | 282 GB     | Full 256K           | \~90–130 tok/s | High-context, fewer GPUs to manage | Simpler topology, headroom for KV cache on 256K         |

{% hint style="success" %}
**Default pick:** **4× H100 80 GB FP8** via vLLM. You get the full 256K context, \~100 tok/s sustained, an OpenAI-compatible API, and clean tensor-parallel scaling, at roughly **$24–48/day** on the marketplace.
{% endhint %}
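
To compare a rented setup against per-token API pricing, it helps to convert the daily rental cost into dollars per million generated tokens. The sketch below uses the 4× H100 price range quoted earlier in this guide and the \~100 tok/s figure above. It assumes the server is saturated around the clock and that the throughput figure is aggregate, neither of which this guide states, so treat the result as a lower bound on real cost.

```python
def usd_per_million_tokens(usd_per_day: float, tokens_per_second: float) -> float:
    """Cost per 1M generated tokens at 100% utilization (optimistic)."""
    tokens_per_day = tokens_per_second * 86_400  # seconds in a day
    return usd_per_day / tokens_per_day * 1_000_000


for daily in (24, 48):
    cost = usd_per_million_tokens(daily, 100)
    print(f"${daily}/day at 100 tok/s -> ${cost:.2f} per 1M tokens")
```

At 25% utilization the effective cost is four times higher, so the rental only beats an API when traffic is sustained.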

## Server Requirements

| Component    | Minimum (Q4 single-GPU) | Recommended (FP8, 4× H100)     | High-context (2× H200) |
| ------------ | ----------------------- | ------------------------------ | ---------------------- |
| GPU VRAM     | 80 GB (1× H100)         | 4× 80 GB = 320 GB              | 2× 141 GB = 282 GB     |
| System RAM   | 128 GB                  | 256 GB                         | 256 GB                 |
| Disk (NVMe)  | 200 GB                  | 400 GB                         | 400 GB                 |
| Network      | 1 Gbps+ for HF download | 1 Gbps+                        | 1 Gbps+                |
| CUDA         | 12.4+                   | 12.4+                          | 12.6+                  |
| Driver       | ≥ 555                   | ≥ 555                          | ≥ 555                  |
| Startup time | 3–6 min (cold pull)     | 6–12 min (cold pull, 4 shards) | 5–10 min               |

The first cold start is dominated by the HuggingFace download — FP8 weights are roughly **128 GB**, BF16 closer to **256 GB**. Mount a persistent volume on `/root/.cache/huggingface` so you only pay that bandwidth cost once per server.
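
A quick estimate of the download portion of a cold start, as a sketch: it assumes the link runs at its full rated speed, which real HuggingFace pulls often do not reach. The helper name `download_minutes` is illustrative.

```python
def download_minutes(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time in minutes for size_gb (decimal GB) at link_gbps."""
    seconds = size_gb * 8 / link_gbps  # GB -> gigabits, then divide by rate
    return seconds / 60


for size, label in [(128, "FP8"), (256, "BF16")]:
    for gbps in (1, 10):
        print(f"{label} ({size} GB) at {gbps} Gbps: ~{download_minutes(size, gbps):.1f} min")
```

At exactly 1 Gbps the FP8 checkpoint alone takes about 17 minutes, so the startup times in the table above are only reachable on a faster link or with a warm cache volume.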

## Quick Deploy on CLORE.AI

The fastest path is the official `vllm/vllm-openai` image with tensor parallelism set to your GPU count. The example below assumes a 4× H100 instance.

**Docker image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Startup command (4× H100, FP8):**

```bash
vllm serve mistralai/Mistral-Medium-3.5-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --served-model-name mistral-medium-3.5 \
    --host 0.0.0.0 \
    --port 8000
```

**Alternative (2× H200, FP8):**

```bash
vllm serve mistralai/Mistral-Medium-3.5-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.92 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --served-model-name mistral-medium-3.5 \
    --host 0.0.0.0 \
    --port 8000
```

{% hint style="info" %}
Start with `--max-model-len 65536` even on hardware that could fit more. KV cache memory grows linearly with context, and most workloads never hit 256K. Raise it once you've confirmed the request mix.
{% endhint %}
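
The linear growth of KV cache with context can be made concrete with the standard per-token formula (keys plus values, per layer, per KV head). The architecture numbers below are **placeholders chosen for round arithmetic, not published Medium 3.5 values**; read the real layer count, KV-head count, and head dimension from the model's config before relying on the output.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for one sequence of `tokens` tokens."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens / 2**30


# PLACEHOLDER architecture, for illustration only.
ARCH = dict(layers=64, kv_heads=8, head_dim=128)

for ctx in (32_768, 65_536, 262_144):
    fp16 = kv_cache_gib(ctx, **ARCH)
    fp8 = kv_cache_gib(ctx, **ARCH, bytes_per_elem=1)
    print(f"{ctx:>7} tokens: {fp16:5.1f} GiB (16-bit KV), {fp8:5.1f} GiB (8-bit KV)")
```

Whatever the real constants are, quadrupling `--max-model-len` quadruples the worst-case cache per sequence, which is why capping it frees room for more concurrent requests.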

**SGLang alternative** (often faster on Hopper for long prefills):

```bash
python3 -m sglang.launch_server \
    --model-path mistralai/Mistral-Medium-3.5-FP8 \
    --tp-size 4 \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --mem-fraction-static 0.88 \
    --context-length 65536 \
    --served-model-name mistral-medium-3.5 \
    --host 0.0.0.0 \
    --port 8000
```
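
Because a cold start can take many minutes, client scripts usually need to wait for the server before sending traffic. The sketch below polls the OpenAI-compatible `/v1/models` endpoint using only the standard library. The function names, the 15-minute default timeout, and the 10-second interval are assumptions to adjust for your setup.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx: server is not ready yet.
        return False


def wait_until_ready(base_url: str, timeout_s: float = 900,
                     interval_s: float = 10, probe=http_ok,
                     sleep=time.sleep) -> bool:
    """Poll `<base_url>/v1/models` until it responds or the timeout expires."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval_s)
    return False
```

Typical use is `wait_until_ready("http://localhost:8000")` before the first request; the `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.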

## Usage Examples

After deployment, find your `http_pub` URL in **My Orders** on Clore.ai (e.g. `abc123.clorecloud.net`). Replace `localhost:8000` with `https://YOUR_HTTP_PUB_URL` in the examples below when calling from outside the server.

### 1. Chat — Instant Mode (default)

Low-latency reply, no visible reasoning trace. Good for chat UIs, autocomplete, classification.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "system", "content": "You are a senior backend engineer."},
      {"role": "user", "content": "Write a Go HTTP middleware that rate-limits per API key with a token bucket."}
    ],
    "temperature": 0.15,
    "max_tokens": 1024,
    "reasoning_mode": "instant"
  }'
```

### 2. Chat — Deep Mode (reasoning toggle)

Enables the `<think>` trace before the final answer. Use for hard debugging, planning, math.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "user", "content": "A user reports our payment webhook fires twice for 1% of orders. Walk through the most likely root causes in order of probability and propose a diagnostic plan."}
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 4096,
    "reasoning_mode": "deep"
  }'
```

The response will include a `reasoning_content` field (vLLM parses the `<think>...</think>` span out of the visible message) alongside `content`. Strip or surface the trace depending on your product.
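
If the server was started without a reasoning parser, the trace arrives inline in `content` instead of in `reasoning_content`. A client-side fallback can separate the two. This is a minimal sketch that assumes the trace is delimited by literal `<think>` and `</think>` tags as described above; `split_reasoning` is a hypothetical helper, not part of vLLM or the OpenAI client.

```python
import re

# Non-greedy match so multiple <think> spans are handled separately.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning_trace, visible_answer)."""
    traces = [span.strip() for span in _THINK.findall(text)]
    answer = _THINK.sub("", text).strip()
    return "\n".join(traces), answer


raw = "<think>Check idempotency keys first.</think>\nStart with the retry policy."
trace, answer = split_reasoning(raw)
print("TRACE:", trace)
print("ANSWER:", answer)
```

Output without any tags passes through unchanged with an empty trace, so the helper is safe to apply to `instant` responses too.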

### 3. Python — OpenAI-Compatible Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Instant mode — chat
response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Refactor this Python function for readability."}
    ],
    temperature=0.15,
    max_tokens=1024,
    extra_body={"reasoning_mode": "instant"}
)
print(response.choices[0].message.content)

# Deep mode — planning step
plan = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "user", "content": "Plan a migration from MongoDB to PostgreSQL for a 2TB orders table with zero downtime."}
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"reasoning_mode": "deep"}
)

msg = plan.choices[0].message
print("THINKING:\n", getattr(msg, "reasoning_content", ""))
print("\nANSWER:\n", msg.content)
```

### 4. Structured Outputs — JSON Schema

Medium 3.5 supports JSON-schema-guided decoding via vLLM's `response_format`. Useful when the downstream consumer is a parser, not a human.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

schema = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "categories": {
            "type": "array",
            "items": {"type": "string", "enum": ["auth", "payments", "db", "ui", "infra"]}
        },
        "summary": {"type": "string", "maxLength": 240},
        "next_action": {"type": "string"}
    },
    "required": ["severity", "categories", "summary", "next_action"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "Classify the incoming bug report. Return strict JSON."},
        {"role": "user", "content": "Login fails for users with apostrophes in their email, returning 500 from /webapi/login."}
    ],
    temperature=0.0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "triage", "schema": schema, "strict": True}
    },
    extra_body={"reasoning_mode": "instant"}
)

import json
print(json.loads(response.choices[0].message.content))
```

### 5. Function Calling

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

tools = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search the orders database by user ID and optional date range",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string"},
                "start_date": {"type": "string", "format": "date"},
                "end_date": {"type": "string", "format": "date"}
            },
            "required": ["user_id"]
        }
    }
}]

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Find all orders for user u_4821 in April 2026."}],
    tools=tools,
    tool_choice="auto",
    temperature=0.1
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```
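
The example above stops at printing the tool call. To close the loop, the arguments (a JSON string) have to be parsed, the matching local function executed, and the result sent back as a `tool` message. The sketch below shows that step in isolation. The `search_orders` body is a hypothetical stub standing in for a real database query, and `run_tool_call` is an illustrative helper name.

```python
import json


def search_orders(user_id, start_date=None, end_date=None):
    """Hypothetical stub; a real version would query the orders database."""
    return {"user_id": user_id, "start_date": start_date,
            "end_date": end_date, "orders": []}


# Map tool names advertised to the model onto local callables.
REGISTRY = {"search_orders": search_orders}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and return an OpenAI-style `tool` message."""
    func = REGISTRY.get(name)
    if func is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = func(**json.loads(arguments))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not match the signature.
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call_id,
            "content": json.dumps(result)}
```

In a real loop you would append the assistant message containing the tool calls, then one `run_tool_call(call.id, call.function.name, call.function.arguments)` result per call, and send the extended `messages` list back for the final answer.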

## Performance Tips

1. **Prefer the FP8 checkpoint on Hopper.** `Mistral-Medium-3.5-FP8` is the vendor-provided FP8 build. It is roughly half the size of BF16 (\~128 GB vs \~256 GB) with negligible quality loss on Hopper-class hardware, which makes it the right default for both 4× H100 and 2× H200.
2. **Tensor parallelism = GPU count.** For 4× H100 use `--tensor-parallel-size 4`; for 2× H200 use `--tensor-parallel-size 2`. Pipeline parallelism on a single node usually hurts throughput for a 128B dense model.
3. **Cap `max-model-len` to what you actually use.** KV cache at 256K is enormous — a single sequence at full context can eat 30–50 GB. Set `--max-model-len 65536` (or 32768) unless you have a verified need for more, and bump only after profiling.
4. **Enable chunked prefill.** `--enable-chunked-prefill` keeps decode tokens flowing while large prompts are still being processed. For 100K+ prompts this is the difference between "responsive" and "timed out."
5. **Cache the weights.** Mount a Docker volume on `/root/.cache/huggingface` and reuse it across restarts. Re-downloading 128 GB on every cold boot is the most common cause of "vLLM seems slow to start."
6. **KV-cache quantization for marginal headroom.** On 4× H100 you can squeeze more concurrent sessions with `--kv-cache-dtype fp8`. Vendor reports near-lossless quality; verify on your eval set before flipping in production.
7. **Don't use `deep` mode for every request.** Reasoning traces cost real tokens and real latency. Route by task type: classification, autocomplete, and tool-arg generation stay in `instant`; planning and verification steps escalate to `deep`.
8. **Speculative decoding helps.** vLLM and SGLang both support draft-model speculative decoding (e.g. with a Ministral 3B draft). On long code completions this typically buys 1.3–1.7× throughput at no quality cost.

## Benchmarks

{% hint style="warning" %}
**Vendor-published numbers — verify independently.** The table below comes from Mistral AI's April 29, 2026 announcement. Independent third-party reproductions (LMSys, EQ-Bench, the SWE-Bench leaderboard) are still rolling in. Treat as directional, not authoritative.
{% endhint %}

| Benchmark                  | Mistral Medium 3.5 (vendor) | Reference points (vendor-cited)       |
| -------------------------- | --------------------------- | ------------------------------------- |
| MMLU-Pro                   | \~78%                       | Llama 4 Maverick \~76%, GPT-5.4 \~81% |
| HumanEval                  | \~92%                       | Codestral 25.01 \~88%, GLM-5.1 \~94%  |
| LiveCodeBench (Apr 2026)   | \~68%                       | GLM-5.1 \~72%, Llama 4 Maverick \~64% |
| AIME 2025 (deep mode)      | \~62%                       | GPT-5.4 \~73%, GLM-5.1 \~58%          |
| GPQA Diamond (deep mode)   | \~59%                       | Claude Opus 4.6 \~63%, GLM-5.1 \~57%  |
| Long-context recall (128K) | \~95%                       | Llama 4 Maverick \~93%                |

The positioning Mistral is targeting: **roughly Llama 4 Maverick / GLM-5.1 tier on general tasks, narrower coding gap, distinct reasoning toggle**. It is not pitched as a GPT-5.4 / Claude Opus 4.6 challenger.

## Troubleshooting

| Issue                                              | Solution                                                                                                                                           |
| -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory` on load (4× H100)             | You're probably loading BF16 by mistake. Use the FP8 checkpoint (`Mistral-Medium-3.5-FP8`) or drop to `--max-model-len 32768`.                     |
| `CUDA out of memory` mid-request with 256K context | KV cache exploded. Lower `--max-model-len`, enable `--kv-cache-dtype fp8`, or cap `--max-num-seqs` (try 8).                                        |
| Deep mode produces empty `reasoning_content`       | Confirm `--reasoning-parser mistral` is set, the request sends `reasoning_mode` as `deep`, and `temperature ≥ 0.5`. Zero-temp truncates the trace. |
| Slow time-to-first-token in deep mode              | Expected — deep mode emits a `<think>` span before any visible output. Stream to the client with `stream=true` and surface a "thinking…" UI state. |
| `403 Forbidden` from HuggingFace download          | Mistral Medium 3.5 is **gated**. Accept the MRL on the model card and set `HF_TOKEN` in the container env.                                         |
| `tokenizer_mode mistral` errors                    | All three flags are required together: `--tokenizer-mode mistral --config-format mistral --load-format mistral`.                                   |
| Tool calls silently dropped                        | Set both `--enable-auto-tool-choice` and `--tool-call-parser mistral`. Without the parser, vLLM returns tool args as plain text.                   |
| Throughput collapses past \~32 concurrent sessions | You hit KV-cache eviction. Lower `--max-model-len`, raise `--gpu-memory-utilization` to 0.92, or scale out to a second replica.                    |
| License error blocking commercial use              | MRL is research-only. Contact Mistral sales for the commercial license before serving paying users.                                                |
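
For the slow time-to-first-token case, streaming lets the UI show a "thinking" state while the trace is generated. The sketch below consumes a stream of chat-completion chunks and separates trace tokens from answer tokens. It assumes each chunk's delta exposes the trace under `reasoning_content`, mirroring the non-streaming field described earlier; check the attribute name against your vLLM version. `collect_stream` is a hypothetical helper.

```python
def collect_stream(chunks, on_thinking=None, on_answer=None):
    """Accumulate (reasoning, answer) from streamed chat-completion chunks."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (e.g. usage-only) carry no choices
        delta = chunk.choices[0].delta
        thought = getattr(delta, "reasoning_content", None)
        if thought:
            reasoning.append(thought)
            if on_thinking:
                on_thinking(thought)  # e.g. keep a "thinking…" spinner alive
        text = getattr(delta, "content", None)
        if text:
            answer.append(text)
            if on_answer:
                on_answer(text)  # e.g. append to the visible message
    return "".join(reasoning), "".join(answer)
```

With the OpenAI client, pass the iterator returned by `client.chat.completions.create(..., stream=True, extra_body={"reasoning_mode": "deep"})` as `chunks`.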

## FAQ

**Q: Mistral Medium 3.5 vs Llama 4 Maverick — which should I pick?**

Both are in a similar weight class (Maverick is a 17B-active MoE at 400B total; Medium 3.5 is 128B dense). Pick **Medium 3.5** if you want predictable VRAM/latency, the dual-mode reasoning toggle in one checkpoint, and stronger code performance. Pick **Llama 4 Maverick** if you need commercial use without a separately negotiated license (Llama 4 ships under the Llama 4 Community License, which permits commercial use subject to its own terms, while Medium 3.5 needs the Mistral Commercial License for production) or if you want the lower per-token inference cost that a 17B-active MoE buys you.

**Q: How do I enable reasoning mode?**

Pass `extra_body={"reasoning_mode": "deep"}` in the OpenAI Python client, or include `"reasoning_mode": "deep"` at the top level of your HTTP JSON body. The default is `"instant"`. On the server side, make sure vLLM was launched with `--reasoning-parser mistral` so the `<think>` span gets parsed into the `reasoning_content` field instead of leaking into `content`.

**Q: Why 4× H100 instead of 2× H100?**

FP8 weights are \~128 GB before KV cache. 2× H100 80 GB gives you 160 GB total — enough to load the weights but with almost no headroom for KV cache, activations, or even a moderate context window. In practice 2× H100 OOMs immediately past 8K context. **4× H100 is the minimum for a usable 256K-capable deployment**; 2× H200 (282 GB) is the alternative if you'd rather manage fewer GPUs at slightly higher per-GPU cost.
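
The headroom argument can be checked numerically. The sketch below subtracts the \~128 GB FP8 weights from the VRAM that vLLM is allowed to use (`--gpu-memory-utilization`). It is a rough upper bound: activations, CUDA graphs, and fragmentation take a further share that is not modeled here, and `headroom_gb` is an illustrative helper name.

```python
def headroom_gb(num_gpus: int, gb_per_gpu: float, utilization: float,
                weights_gb: float = 128) -> float:
    """VRAM left for KV cache and activations after loading the weights."""
    return num_gpus * gb_per_gpu * utilization - weights_gb


for label, gpus, size, util in [
    ("2x H100", 2, 80, 0.90),
    ("4x H100", 4, 80, 0.90),
    ("2x H200", 2, 141, 0.92),
]:
    print(f"{label}: ~{headroom_gb(gpus, size, util):.0f} GB left after FP8 weights")
```

Two H100s leave roughly 16 GB across both cards, which is why even modest contexts run out of memory there.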

**Q: Can I use Mistral Medium 3.5 commercially?**

The default Mistral Research License (MRL) allows research and internal evaluation but **not** commercial production. For paying-customer-facing deployments you need the **Mistral Commercial License** — contact Mistral sales. This is the same gating that applied to Medium 3 and Codestral previously. If commercial-friendly licensing is a hard requirement, look at [Mistral Small 3.1](/guides/language-models/mistral-small.md) (Apache 2.0) or [Llama 4](/guides/language-models/llama4.md) (Llama community license).

**Q: Does Medium 3.5 support vision or audio?**

No. Medium 3.5 is text-only. For multimodal Mistral, use [Mistral Large 3](/guides/language-models/mistral-large3.md), which ships a 2.5B vision encoder. For other multimodal options on Clore.ai, see Qwen3.5-Omni or Gemma 3.

## Related Guides

* [Mistral Large 3](/guides/language-models/mistral-large3.md) — 675B MoE multimodal frontier model, Apache 2.0, when you need vision and maximum quality
* [Mistral & Mixtral](/guides/language-models/mistral-mixtral.md) — older Mistral 7B and Mixtral 8x7B/8x22B for single-GPU deployments
* [vLLM](/guides/language-models/vllm.md) — production serving framework, the recommended backend for Medium 3.5
* [Llama 4](/guides/language-models/llama4.md) — closest open-weight peer at this scale, permissively licensed alternative

### External Links

* [Mistral Medium 3.5 on HuggingFace](https://huggingface.co/mistralai/Mistral-Medium-3.5)
* [Mistral Medium 3.5 FP8 checkpoint](https://huggingface.co/mistralai/Mistral-Medium-3.5-FP8)
* [Mistral AI announcement (April 29, 2026)](https://mistral.ai/news/mistral-medium-3-5)
* [Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)
* [vLLM documentation](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [Clore.ai Marketplace](https://clore.ai/marketplace) — rent H100 / H200 from $0.50/day


