# Mistral Medium 3.5 (128B Dense, 256K)

{% hint style="info" %}
**Status (April 2026):** Mistral Medium 3.5 was released on **April 29, 2026** by Mistral AI as the successor to Mistral Medium 3. Weights are available at [huggingface.co/mistralai/Mistral-Medium-3.5](https://huggingface.co/mistralai/Mistral-Medium-3.5) under the **Mistral Research License (MRL)** for research; a **Mistral Commercial License** is required for production use beyond evaluation. vLLM (≥ 0.8.x) and SGLang ship with day-one support.
{% endhint %}

Mistral Medium 3.5 is a **128B dense transformer** with a **256K-token context window** and a **native reasoning toggle** that switches between fast "instant" answers and long chain-of-thought "deep" traces in the same checkpoint. The release unifies three previously separate Mistral lines, **Medium 3** (general instruction), **Codestral** (code), and Mistral's reasoning preview, into a single toggleable model. For engineering teams that were juggling several sets of weights, this is the main change.

For Clore.ai users, the practical impact is sizing. In FP8, a 128B dense model holds roughly **128 GB** of weights before any KV cache, so it does **not** fit on a single 80 GB GPU at FP8, let alone at full precision: you need **4× H100 80 GB** (FP8) or **2× H200 141 GB** to serve it cleanly through vLLM. On the marketplace that comes to roughly **$24–48/day** for a 4× H100 setup or **$30–50/day** for 2× H200, a suitable midpoint for most teams. Single-H100 deployments only work with aggressive Q4 GGUF quantization (\~70 tok/s via llama.cpp), and the 256K context is the first thing to shrink once you compress.
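The sizing figures above follow from simple arithmetic: parameters times bytes per parameter. The sketch below reproduces it. The bytes-per-parameter values are nominal assumptions (real checkpoints carry some overhead, and GGUF Q4 variants use slightly more than 4 bits per weight), so treat the results as rough estimates rather than exact download sizes.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
PARAMS = 128e9  # 128B dense parameters, as stated in this guide

# Nominal storage cost per parameter (assumption; ignores scales and metadata).
BYTES_PER_PARAM = {
    "bf16": 2.0,
    "fp8": 1.0,
    "q4": 0.5,  # nominal 4-bit; actual GGUF Q4 files are somewhat larger
}


def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Return the approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for name in ("bf16", "fp8", "q4"):
    print(f"{name}: ~{weight_gb(name):.0f} GB of weights")
```

Anything the GPU has left after the weights is what remains for KV cache and activations, which is why the single-GPU Q4 setup has to give up context length first.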

## Key Features

* **128B dense parameters**: no MoE routing tricks, predictable VRAM and latency profile, easier to fine-tune than sparse models
* **256K context window**: whole-codebase analysis, RAG over long documents, multi-turn agent loops without truncation
* **Dual-mode reasoning**: toggle `reasoning_mode=instant` for \~chat latency, or use `reasoning_mode=deep` to emit a `<think>` trace before the answer
* **Unified instruction + code + reasoning**: one set of weights replaces Medium 3 + Codestral + the reasoning preview
* **Function calling and structured outputs**: native JSON schema enforcement, OpenAI-compatible tool-call format
* **Open weights**: MRL for research, commercial license available; the weights stay on your box and nothing is sent back to a vendor API
* **Day-0 vLLM and SGLang support**: production-ready FP8 paths, tensor parallelism, chunked prefill, continuous batching

## Reasoning Modes

Medium 3.5 is the first Mistral model to ship a single checkpoint that serves both "fast" and "thinking" answers. The toggle is controlled at request time, not load time, so one vLLM process handles both modes for the same caller.

| Mode                | When to use                                                                  | Typical TTFT                        | Output format                                     |
| ------------------- | ---------------------------------------------------------------------------- | ----------------------------------- | ------------------------------------------------- |
| `instant` (default) | Chat, autocomplete, classification, function calls where latency matters     | 50–250 ms                           | Answer only                                       |
| `deep`              | Code review, multi-step planning, math, hard debugging, agent planning steps | 1–6 s before the first answer token | `<think>...</think>` trace, then the final answer |

In `deep` mode the model emits a hidden reasoning span (wrapped in `<think>...</think>` by the chat template) before the visible response. This can cost anywhere from a few hundred to a few thousand extra tokens per turn, so **do not enable it for every request**; reserve it for tasks where you would otherwise prompt a smaller model to "think step by step." A reasonable pattern is to keep `instant` as the default and escalate to `deep` only for tool-call planning steps or final-answer synthesis.

{% hint style="warning" %}
**Vendor-recommended sampling.** Mistral recommends `temperature=0.15` for `instant` mode and `temperature=0.7` with `top_p=0.95` for `deep` mode. Zero-temperature sampling often truncates reasoning traces early.
{% endhint %}
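The "default to `instant`, escalate to `deep`" pattern and the recommended sampling values can be kept in one place. The helper below is a minimal sketch: the function name and the task-type labels are hypothetical, while the `reasoning_mode` values and sampling numbers are the ones quoted in this section.

```python
# Hypothetical task labels that should escalate to deep mode (illustrative only).
DEEP_TASKS = {"planning", "verification", "final_synthesis", "code_review", "math", "debugging"}


def request_params(task_type: str) -> dict:
    """Return keyword arguments for an OpenAI-compatible chat completion call."""
    if task_type in DEEP_TASKS:
        # Vendor-recommended sampling for deep mode.
        return {
            "temperature": 0.7,
            "top_p": 0.95,
            "extra_body": {"reasoning_mode": "deep"},
        }
    # Everything else stays on the low-latency path.
    return {
        "temperature": 0.15,
        "extra_body": {"reasoning_mode": "instant"},
    }


print(request_params("classification"))
print(request_params("planning"))
```

With the OpenAI Python client, the result can be splatted into a call such as `client.chat.completions.create(model=..., messages=..., **request_params("planning"))`.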

## Choose Your Deployment

Three realistic configurations on the Clore.ai marketplace. Choose by VRAM budget first, then by throughput.

| Setup                                                                                                               | Precision           | Total VRAM | Context (practical) | Throughput     | Suggested Clore tier               | Notes                                                          |
| ------------------------------------------------------------------------------------------------------------------- | ------------------- | ---------- | ------------------- | -------------- | ---------------------------------- | -------------------------------------------------------------- |
| 1× H100 80 GB                                                                                                       | Q4 GGUF (llama.cpp) | 80 GB      | 32K–64K             | \~50–70 tok/s  | Single-GPU, evaluation/dev         | Aggressive quantization; expect some quality loss on long code |
| 4× [H100](https://clore.ai/rent-h100.html?utm_source=docs\&utm_medium=guide\&utm_campaign=mistral-medium-35) 80 GB  | FP8 (vLLM)          | 320 GB     | Full 256K           | \~80–140 tok/s | **Production sweet spot**          | TP=4, best tok/$ for sustained traffic                         |
| 2× [H200](https://clore.ai/rent-h200.html?utm_source=docs\&utm_medium=guide\&utm_campaign=mistral-medium-35) 141 GB | FP8 or BF16         | 282 GB     | Full 256K           | \~90–130 tok/s | High-context, fewer GPUs to manage | Simpler topology, headroom for KV cache at 256K                |

{% hint style="success" %}
**Default choice:** **4× H100 80 GB FP8** via vLLM. You get the full 256K context, \~100 tok/s sustained, an OpenAI-compatible API, and clean tensor-parallel scaling, for roughly the daily cost of one Claude Opus heavy-use seat.
{% endhint %}

## Server Requirements

| Component    | Minimum (Q4 single-GPU) | Recommended (FP8, 4× H100)     | High-context (2× H200) |
| ------------ | ----------------------- | ------------------------------ | ---------------------- |
| GPU VRAM     | 80 GB (1× H100)         | 4× 80 GB = 320 GB              | 2× 141 GB = 282 GB     |
| System RAM   | 128 GB                  | 256 GB                         | 256 GB                 |
| Disk (NVMe)  | 200 GB                  | 400 GB                         | 400 GB                 |
| Network      | 1 Gbps+ for HF download | 1 Gbps+                        | 1 Gbps+                |
| CUDA         | 12.4+                   | 12.4+                          | 12.6+                  |
| Driver       | ≥ 555                   | ≥ 555                          | ≥ 555                  |
| Startup time | 3–6 min (cold pull)     | 6–12 min (cold pull, 4 shards) | 5–10 min               |

The first cold start is dominated by the HuggingFace download: FP8 weights are about **128 GB**, and BF16 is close to **256 GB**. Mount a persistent volume at `/root/.cache/huggingface` so you pay this bandwidth cost only once per server.

## Quick Deploy on CLORE.AI

The fastest route is the official `vllm/vllm-openai` image, with tensor parallelism set to match your GPU count. The example below assumes a 4× H100 instance.

**Docker image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Startup command (4× H100, FP8):**

```bash
vllm serve mistralai/Mistral-Medium-3.5-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --served-model-name mistral-medium-3.5 \
    --host 0.0.0.0 \
    --port 8000
```

**Alternative: 2× H200 BF16:**

```bash
vllm serve mistralai/Mistral-Medium-3.5 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.92 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --served-model-name mistral-medium-3.5 \
    --host 0.0.0.0 \
    --port 8000
```

{% hint style="info" %}
Start with `--max-model-len 65536` even on hardware that could fit more. KV cache memory grows linearly with context, and most workloads never reach 256K. Raise it only after confirming your request mix.
{% endhint %}
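To make the linear growth concrete, the sketch below uses the standard per-sequence KV-cache formula for a transformer with grouped-query attention. The architecture numbers (`layers`, `kv_heads`, `head_dim`) are placeholders chosen for illustration, not the published Mistral Medium 3.5 configuration; substitute the values from the model's config before relying on the output.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size for ONE sequence, in GB (1 GB = 1e9 bytes).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem is 2 for a 16-bit cache, 1 for an FP8 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


# Placeholder architecture (assumption, NOT the real model config).
ARCH = dict(layers=80, kv_heads=8, head_dim=128)

for ctx in (32_768, 65_536, 131_072, 262_144):
    fp16 = kv_cache_gb(ctx, **ARCH)
    fp8 = kv_cache_gb(ctx, bytes_per_elem=1, **ARCH)
    print(f"{ctx:>7} tokens: {fp16:6.1f} GB (16-bit cache), {fp8:6.1f} GB (FP8 cache)")
```

Whatever the real constants are, doubling `--max-model-len` doubles the worst-case cache per sequence, and an FP8 cache halves it.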

**SGLang alternative** (often faster on Hopper for long prefills):

```bash
python3 -m sglang.launch_server \
    --model-path mistralai/Mistral-Medium-3.5-FP8 \
    --tp-size 4 \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --mem-fraction-static 0.88 \
    --context-length 65536 \
    --served-model-name mistral-medium-3.5 \
    --host 0.0.0.0 \
    --port 8000
```

## Usage Examples

After deployment, find your `http_pub` URL under **My Orders** on Clore.ai (e.g. `abc123.clorecloud.net`). When calling from outside the server, replace `localhost:8000` with `https://YOUR_HTTP_PUB_URL` in the examples below.
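One way to avoid hand-editing the URL in every snippet is to derive the base URL from a single setting. The helper below is a small sketch; the function name and the `CLORE_HTTP_PUB` environment variable are hypothetical and not something Clore.ai defines.

```python
import os


def api_base(http_pub: str = "") -> str:
    """Build the OpenAI-compatible base URL.

    An empty value means "on the server itself"; otherwise the public
    Clore.ai hostname is used over HTTPS.
    """
    host = http_pub.strip().rstrip("/")
    if not host:
        return "http://localhost:8000/v1"
    if not host.startswith(("http://", "https://")):
        host = "https://" + host
    return host + "/v1"


# CLORE_HTTP_PUB is a hypothetical variable name chosen for this example.
BASE_URL = api_base(os.environ.get("CLORE_HTTP_PUB", ""))
print(BASE_URL)
```

The resulting string can be passed as `base_url` to the OpenAI Python client used in the examples that follow.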

### 1. Chat: Instant Mode (default)

Low-latency answer with no visible reasoning trace. Good for chat UIs, autocomplete, and classification.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "system", "content": "You are a senior backend engineer."},
      {"role": "user", "content": "Write a Go HTTP middleware with token-bucket rate limiting per API key."}
    ],
    "temperature": 0.15,
    "max_tokens": 1024,
    "reasoning_mode": "instant"
  }'
```

### 2. Chat: Deep Mode (reasoning toggle)

Enables the `<think>` trace before the final answer. Use it for hard debugging, planning, and math.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-medium-3.5",
    "messages": [
      {"role": "user", "content": "A user reports that our payment webhook fires twice for 1% of orders. List the most likely root causes in order of likelihood and propose a diagnostic plan."}
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 4096,
    "reasoning_mode": "deep"
  }'
```

The response will include a `reasoning_content` field (vLLM parses the `<think>...</think>` span out of the visible message) alongside `content`. Strip or surface the trace as your product requires.

### 3. Python — OpenAI-Compatible Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Instant mode — chat
response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Refactor this Python function for readability."}
    ],
    temperature=0.15,
    max_tokens=1024,
    extra_body={"reasoning_mode": "instant"}
)
print(response.choices[0].message.content)

# Deep mode: planning step, with the vendor-recommended sampling settings
plan = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "user", "content": "Plan a zero-downtime migration from MongoDB to PostgreSQL for a 2TB orders table."}
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"reasoning_mode": "deep"}
)

msg = plan.choices[0].message
print("THINKING:\n", getattr(msg, "reasoning_content", ""))
print("\nANSWER:\n", msg.content)
```
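Because deep mode delays the first visible token, streaming is usually the friendlier way to consume it. The sketch below accumulates a streamed response into a reasoning trace and an answer. It assumes that streamed deltas expose the trace under `reasoning_content`, mirroring the non-streaming field; that attribute name is an assumption and may differ between vLLM versions. A fake stream stands in for the server so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed chat-completion chunks into (reasoning, answer)."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta
        # Assumed field name for the streamed reasoning trace.
        piece = getattr(delta, "reasoning_content", None)
        if piece:
            reasoning.append(piece)
        piece = getattr(delta, "content", None)
        if piece:
            answer.append(piece)
    return "".join(reasoning), "".join(answer)


def _fake_chunk(reasoning=None, content=None):
    """Offline stand-in shaped like one streamed chunk."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk(reasoning="Check idempotency keys. "),
    _fake_chunk(reasoning="Then retries."),
    _fake_chunk(content="Likely cause: "),
    _fake_chunk(content="webhook retries."),
]

thinking, answer = collect_stream(fake_stream)
print("THINKING:", thinking)
print("ANSWER:", answer)
```

Against a live server, pass the iterator returned by `client.chat.completions.create(..., stream=True, extra_body={"reasoning_mode": "deep"})` instead of `fake_stream`, and show a "thinking" state in the UI until the first `content` piece arrives.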

### 4. Structured Outputs — JSON Schema

Medium 3.5 supports JSON-schema-guided decoding through vLLM's `response_format`. This is useful when the downstream consumer is a parser, not a human.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

schema = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "categories": {
            "type": "array",
            "items": {"type": "string", "enum": ["auth", "payments", "db", "ui", "infra"]}
        },
        "summary": {"type": "string", "maxLength": 240},
        "next_action": {"type": "string"}
    },
    "required": ["severity", "categories", "summary", "next_action"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "Classify the incoming bug report. Return strict JSON."},
        {"role": "user", "content": "Login fails for users whose email contains apostrophes, and /webapi/login returns 500."}
    ],
    temperature=0.0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "triage", "schema": schema, "strict": True}
    },
    extra_body={"reasoning_mode": "instant"}
)

# The content is a JSON string that conforms to the schema above
print(json.loads(response.choices[0].message.content))
```
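Guided decoding constrains the output shape, but a response can still be cut short (for example by `max_tokens`), so a defensive parse on the client side is cheap insurance. The sketch below uses only the standard library and checks the fields from the triage schema; the function name is hypothetical, and a full JSON Schema validator would be the more thorough choice.

```python
import json

REQUIRED_KEYS = ("severity", "categories", "summary", "next_action")
SEVERITIES = {"low", "medium", "high", "critical"}


def parse_triage(raw: str) -> dict:
    """Parse and sanity-check a triage response; raise ValueError if unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Typical cause: the output was truncated before the closing brace.
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if data["severity"] not in SEVERITIES:
        raise ValueError(f"unexpected severity: {data['severity']!r}")
    return data


sample = '{"severity": "high", "categories": ["auth"], "summary": "Login 500 on apostrophes", "next_action": "Escape input"}'
print(parse_triage(sample))
```

In the example above, the raw string would be `response.choices[0].message.content`.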

### 5. Function Calling

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

tools = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search the orders database by user ID and an optional date range",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string"},
                "start_date": {"type": "string", "format": "date"},
                "end_date": {"type": "string", "format": "date"}
            },
            "required": ["user_id"]
        }
    }
}]

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Find all orders for user u_4821 in April 2026."}],
    tools=tools,
    tool_choice="auto",
    temperature=0.1
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```
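Printing the tool call is only half of the loop: the call has to be executed and its result sent back as a `tool` message. The sketch below shows that dispatch step in isolation. The `search_orders` implementation and the registry are hypothetical stand-ins, and a fake call object replaces the server response so the example runs offline.

```python
import json
from types import SimpleNamespace


def search_orders(user_id, start_date=None, end_date=None):
    """Hypothetical local implementation; replace with a real database query."""
    return {"user_id": user_id, "start_date": start_date, "end_date": end_date, "orders": []}


# Maps tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"search_orders": search_orders}


def run_tool_call(call) -> dict:
    """Execute one tool call and return the `tool` message to append."""
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    result = TOOL_REGISTRY[call.function.name](**args)
    return {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}


# Offline stand-in shaped like one entry of message.tool_calls.
fake_call = SimpleNamespace(
    id="call00001",
    function=SimpleNamespace(
        name="search_orders",
        arguments='{"user_id": "u_4821", "start_date": "2026-04-01", "end_date": "2026-04-30"}',
    ),
)

tool_message = run_tool_call(fake_call)
print(tool_message)
```

With a live response, append the assistant message that contains the tool calls to `messages`, append one `run_tool_call(call)` result per call, and send the conversation back through `client.chat.completions.create` to get the final answer.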

## Performance Tips

1. **Prefer the FP8 checkpoint on Hopper.** `Mistral-Medium-3.5-FP8` is the vendor-provided FP8 build and is roughly 2× lighter than BF16 with negligible quality loss on Hopper-class hardware. It is the right default for both 4× H100 and 2× H200.
2. **Tensor parallelism = GPU count.** Use `--tensor-parallel-size 4` for 4× H100 and `--tensor-parallel-size 2` for 2× H200. On a single node, pipeline parallelism generally hurts throughput for a 128B dense model.
3. **Cap `max-model-len` at what you actually use.** The KV cache at 256K is very large: a single sequence at full context can consume 30–50 GB. Set `--max-model-len 65536` (or 32768) unless you have a verified need for more, and raise it only after profiling.
4. **Enable chunked prefill.** `--enable-chunked-prefill` keeps decode tokens flowing while large prompts are still being processed. For 100K+ prompts this is the difference between "responsive" and "timed out."
5. **Cache the weights.** Mount a Docker volume at `/root/.cache/huggingface` and reuse it across restarts. Re-downloading 128 GB on every cold boot is the most common cause of "vLLM seems slow to start."
6. **KV-cache quantization for modest headroom.** On 4× H100 you can fit more concurrent sessions with `--kv-cache-dtype fp8`. The vendor reports near-lossless quality; validate on your own eval set before switching in production.
7. **Do not use `deep` mode for every request.** Reasoning traces cost real tokens and real latency. Route by task type: classification, autocomplete, and tool-arg generation stay in `instant`; planning and verification steps go to `deep`.
8. **Speculative decoding helps.** Both vLLM and SGLang support draft-model speculative decoding (e.g. with a Ministral 3B draft). On long code completions it typically gives 1.3–1.7× throughput at no quality cost.

## Benchmarks

{% hint style="warning" %}
**Vendor-published numbers: verify independently.** The table below is taken from Mistral AI's April 29, 2026 announcement. Independent third-party reproductions (LMSys, EQ-Bench, SWE-Bench leaderboard) are still coming in. Treat it as directional, not authoritative.
{% endhint %}

| Benchmark                  | Mistral Medium 3.5 (vendor) | Reference points (vendor-cited)       |
| -------------------------- | --------------------------- | ------------------------------------- |
| MMLU-Pro                   | \~78%                       | Llama 4 Maverick \~76%, GPT-5.4 \~81% |
| HumanEval                  | \~92%                       | Codestral 25.01 \~88%, GLM-5.1 \~94%  |
| LiveCodeBench (Apr 2026)   | \~68%                       | GLM-5.1 \~72%, Llama 4 Maverick \~64% |
| AIME 2025 (deep mode)      | \~62%                       | GPT-5.4 \~73%, GLM-5.1 \~58%          |
| GPQA Diamond (deep mode)   | \~59%                       | Claude Opus 4.6 \~63%, GLM-5.1 \~57%  |
| Long-context recall (128K) | \~95%                       | Llama 4 Maverick \~93%                |

The positioning Mistral is targeting: **roughly the Llama 4 Maverick / GLM-5.1 tier on general tasks, a smaller coding gap, and a distinct reasoning toggle**. It is not pitched as a rival to GPT-5.4 / Claude Opus 4.6.

## Troubleshooting

| Problem                                              | Solution                                                                                                                                          |
| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `CUDA out of memory` on load (4× H100)               | You are probably loading BF16 by mistake. Use the FP8 checkpoint (`Mistral-Medium-3.5-FP8`) or lower to `--max-model-len 32768`.                  |
| `CUDA out of memory` mid-request with 256K context   | The KV cache overflowed. Lower `--max-model-len`, enable `--kv-cache-dtype fp8`, or cap `--max-num-seqs` (try 8).                                 |
| Deep mode produces an empty `reasoning_content`      | Make sure `--reasoning-parser mistral` is set in vLLM and `temperature ≥ 0.5`. Zero-temp sampling truncates the trace.                            |
| Slow time-to-first-token in deep mode                | Expected: deep mode emits a `<think>` span before any visible output. Stream to the client with `stream=true` and show a "thinking…" UI state.    |
| `403 Forbidden` from the HuggingFace download        | Mistral Medium 3.5 is **gated**. Accept the MRL on the model card and set `HF_TOKEN` in the container env.                                        |
| `tokenizer_mode mistral` errors                      | All three flags are required together: `--tokenizer-mode mistral --config-format mistral --load-format mistral`.                                  |
| Tool calls are silently dropped                      | Set both `--enable-auto-tool-choice` and `--tool-call-parser mistral`. Without the parser, vLLM returns tool args as plain text.                  |
| Throughput collapses after \~32 concurrent sessions  | You hit KV-cache eviction. Lower `--max-model-len`, raise `--gpu-memory-utilization` to 0.92, or scale out to a second replica.                   |
| License error blocking commercial use                | The MRL is research-only. Contact Mistral sales for a commercial license before serving paying users.                                             |

## FAQ

**Q: Mistral Medium 3.5 vs. Llama 4 Maverick: which should I choose?**

Both are in a similar weight class (Maverick is a 17B-active MoE at 400B total; Medium 3.5 is 128B dense). Choose **Medium 3.5** if you want predictable VRAM/latency, a dual-mode reasoning toggle in a single checkpoint, and better code performance. Choose **Llama 4 Maverick** if you need permissive licensing for unconditional commercial use (Llama 4 is community-licensed, while Medium 3.5 needs a Mistral commercial license for production) or if you want the cheaper per-request inference cost that MoE provides.

**Q: How do I enable reasoning mode?**

Pass `extra_body={"reasoning_mode": "deep"}` in the OpenAI Python client, or include `"reasoning_mode": "deep"` at the top level of your HTTP JSON body. The default is `"instant"`. On the server side, make sure vLLM was launched with `--reasoning-parser mistral` so that the `<think>` span is parsed into the `reasoning_content` field instead of leaking into `content`.

**Q: Why 4× H100 instead of 2× H100?**

FP8 weights are about 128 GB before KV cache. 2× H100 80 GB gives you 160 GB total: enough to load the weights, but with almost no headroom for KV cache, activations, or a moderate context window. In practice 2× H100 goes OOM soon after 8K context. **4× H100 is the minimum for a useful 256K-capable deployment**; 2× H200 (282 GB) is the alternative if you prefer to manage fewer GPUs at a slightly higher per-GPU cost.
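The headroom argument can be checked with a few lines of arithmetic. The sketch below uses this guide's round weight figures (128 GB for FP8, 256 GB for BF16) and assumes vLLM is allowed 90% of each GPU via `--gpu-memory-utilization 0.90`; it ignores activation memory and framework overhead, so real headroom is somewhat lower.

```python
# Approximate weight sizes quoted in this guide, in GB.
WEIGHTS_GB = {"fp8": 128, "bf16": 256}


def headroom_gb(gpus: int, vram_per_gpu_gb: float, precision: str,
                utilization: float = 0.90) -> float:
    """VRAM left for KV cache and activations after loading weights, in GB."""
    usable = gpus * vram_per_gpu_gb * utilization
    return usable - WEIGHTS_GB[precision]


CONFIGS = [
    ("2x H100 FP8", 2, 80, "fp8"),
    ("4x H100 FP8", 4, 80, "fp8"),
    ("2x H200 FP8", 2, 141, "fp8"),
    ("2x H200 BF16", 2, 141, "bf16"),
]

for label, gpus, vram, precision in CONFIGS:
    print(f"{label}: ~{headroom_gb(gpus, vram, precision):.0f} GB for KV cache")
```

Under these assumptions 2× H100 is left with only a sliver of memory for the cache, while 4× H100 and 2× H200 in FP8 both have well over 100 GB. The same arithmetic puts BF16 on 2× H200 at or below zero, which fits the recommendation elsewhere in this guide to default to the FP8 checkpoint.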

**Q: Can I use Mistral Medium 3.5 commercially?**

The default Mistral Research License (MRL) permits research and internal evaluation, but **not** commercial production. For deployments that face paying customers you need a **Mistral Commercial License**; contact Mistral sales. The same gating previously applied to Medium 3 and Codestral. If commercial-friendly licensing is a hard requirement, see [Mistral Small 3.1](/guides/guides_v2-hi/language-models/mistral-small.md) (Apache 2.0) or [Llama 4](/guides/guides_v2-hi/language-models/llama4.md) (Llama community license).

**Q: Does Medium 3.5 support vision or audio?**

No. Medium 3.5 is text-only. For a multimodal Mistral model, use [Mistral Large 3](/guides/guides_v2-hi/language-models/mistral-large3.md), which ships with a 2.5B vision encoder. For other multimodal options on Clore.ai, see Qwen3.5-Omni or Gemma 3.

## Related Guides

* [Mistral Large 3](/guides/guides_v2-hi/language-models/mistral-large3.md): 675B MoE multimodal frontier model, Apache 2.0, for when you need vision and maximum quality
* [Mistral and Mixtral](/guides/guides_v2-hi/language-models/mistral-mixtral.md): the older Mistral 7B and Mixtral 8x7B/8x22B, for single-GPU deployments
* [vLLM](/guides/guides_v2-hi/language-models/vllm.md): production serving framework, the recommended backend for Medium 3.5
* [Llama 4](/guides/guides_v2-hi/language-models/llama4.md): the closest open-weight counterpart at this scale, the permissively licensed alternative

### External Links

* [Mistral Medium 3.5 on HuggingFace](https://huggingface.co/mistralai/Mistral-Medium-3.5)
* [Mistral Medium 3.5 FP8 checkpoint](https://huggingface.co/mistralai/Mistral-Medium-3.5-FP8)
* [Mistral AI announcement (April 29, 2026)](https://mistral.ai/news/mistral-medium-3-5)
* [Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)
* [vLLM documentation](https://docs.vllm.ai)
* [SGLang repo](https://github.com/sgl-project/sglang)
* [Clore.ai marketplace](https://clore.ai/marketplace): rent H100 / H200 from $0.50/day

