# Mistral & Mixtral

{% hint style="info" %}
**Newer versions available!** Check out [**Mistral Small 3.1**](https://docs.clore.ai/guides/language-models/mistral-small) (24B, Apache 2.0, fits on RTX 4090) and [**Mistral Large 3**](https://docs.clore.ai/guides/language-models/mistral-large3) (675B MoE, frontier-class).
{% endhint %}

Run Mistral and Mixtral models for high-quality text generation.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## Model Overview

| Model               | Parameters           | Min VRAM | Specialty         |
| ------------------- | -------------------- | -------- | ----------------- |
| Mistral-7B          | 7B                   | 8GB   | General purpose   |
| Mistral-7B-Instruct | 7B                   | 8GB   | Chat/instruction  |
| Mixtral-8x7B        | 46.7B (12.9B active) | 24GB  | MoE, best quality |
| Mixtral-8x22B       | 141B                 | 80GB+ | Largest MoE       |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
pip install vllm && \
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
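For example, a small helper (the hostname is a placeholder; use your own `http_pub` value) that turns it into the base URL expected by the OpenAI-compatible clients below:

```python
# "abc123.clorecloud.net" is a placeholder -- substitute the http_pub
# value shown in My Orders for your order.
def api_base(http_pub: str) -> str:
    """Build the OpenAI-compatible base URL for a deployed vLLM server."""
    return f"https://{http_pub}/v1"

print(api_base("abc123.clorecloud.net"))  # https://abc123.clorecloud.net/v1
```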

## Installation Options

### Using Ollama (Easiest)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Mistral
ollama run mistral

# Run Mixtral
ollama run mixtral
```
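Ollama also serves a local REST API on port 11434 (`/api/generate`). A minimal sketch of calling it from Python; the `requests.post` lines are commented out so the snippet runs without a live server:

```python
import json

# Ollama listens on http://localhost:11434 by default.
# /api/generate takes a model name, a prompt, and a stream flag.
def ollama_payload(model: str, prompt: str) -> str:
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

payload = ollama_payload("mistral", "Explain MoE routing in one sentence")
# import requests
# r = requests.post("http://localhost:11434/api/generate", data=payload)
# print(r.json()["response"])
print(payload)
```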

### Using vLLM

```bash
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype float16
```

### Using Transformers

```bash
pip install transformers accelerate
```

## Mistral-7B with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)  # decode only the newly generated tokens
print(response)
```
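Under the hood, `apply_chat_template` renders Mistral's `[INST] … [/INST]` instruct format. A simplified sketch of the equivalent manual formatting, useful when debugging prompts (the real template also adds `<s>`/`</s>` special tokens, which this sketch omits):

```python
def to_mistral_prompt(messages):
    """Render a user/assistant message list in Mistral's instruct format.
    Simplified: omits the <s>/</s> special tokens the tokenizer adds."""
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        else:  # assistant turns are appended verbatim
            parts.append(m["content"])
    return "".join(parts)

print(to_mistral_prompt([{"role": "user", "content": "Hi"}]))
# [INST] Hi [/INST]
```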

## Mixtral-8x7B

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=1000,
    do_sample=True,
    temperature=0.7
)

print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))  # print only the newly generated tokens
```

## Quantized Models (Lower VRAM)

### 4-bit Quantization

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=quantization_config,
    device_map="auto"
)
```
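You can sanity-check the savings with `model.get_memory_footprint()`, which reports weight memory in bytes. The rough arithmetic behind the expected numbers (weights only; quantization block overhead, activations, and KV cache come on top):

```python
# Rough weight-memory estimate: params (in billions) * bits / 8 gives GB.
def weight_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8  # 1B params at 8 bits = 1 GB

print(weight_gb(7, 16))    # Mistral-7B FP16  -> 14.0 GB
print(weight_gb(46.7, 4))  # Mixtral-8x7B nf4 -> ~23.35 GB
```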

### GGUF with llama.cpp

```bash
# Download a GGUF model
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

# Run with llama.cpp (recent builds name the binary llama-cli; older builds used ./main)
./llama-cli -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    -p "Explain machine learning" \
    -n 500
```

## vLLM Server (Production)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```

### OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```
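The same client pattern extends to multi-turn chat by resending the accumulated history with each request. A small illustrative helper (class and method names are our own, not part of the OpenAI SDK):

```python
class Conversation:
    """Accumulates messages so each request carries the full history."""

    def __init__(self, system=None):
        self.messages = [{"role": "system", "content": system}] if system else []

    def ask(self, client, model, user, **kwargs):
        self.messages.append({"role": "user", "content": user})
        reply = client.chat.completions.create(
            model=model, messages=self.messages, **kwargs
        ).choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Usage: `conv = Conversation(system="You are terse."); conv.ask(client, "mistralai/Mistral-7B-Instruct-v0.2", "Hi")`.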

## Streaming

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a story about a robot"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
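If you also need the complete text after streaming, a small helper (illustrative, not part of the SDK) can print deltas live and accumulate them:

```python
def collect_stream(stream) -> str:
    """Print streamed deltas as they arrive and return the joined text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # live output
            parts.append(delta)
    return "".join(parts)
```

Call it as `full_text = collect_stream(stream)` in place of the loop above.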

## Function Calling

Mistral supports function calling from Mistral-7B-Instruct-v0.3 onward (with vLLM, start the server with `--enable-auto-tool-choice --tool-call-parser mistral`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)

print(response.choices[0].message.tool_calls)
```
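When the model returns `tool_calls`, your code executes them and sends the results back as `tool` messages. A minimal dispatcher sketch; the weather stub and its return value are fabricated for illustration:

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    """Stub matching the schema above; a real version would call a weather API."""
    return f"18 degrees {unit} in {location}"

# Map tool names from the schema to local implementations.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call) -> str:
    """Execute one tool call object returned by the model."""
    args = json.loads(tool_call.function.arguments)
    return TOOLS[tool_call.function.name](**args)
```

Each result is then appended as `{"role": "tool", "tool_call_id": tool_call.id, "content": result}` and the conversation is re-sent for the final answer.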

## Gradio Interface

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def chat(message, history, temperature, max_tokens):
    messages = []
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

    outputs = model.generate(
        inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=True
    )

    response = tokenizer.decode(
        outputs[0][inputs.shape[-1]:],  # decode only the newly generated tokens
        skip_special_tokens=True
    )
    return response.strip()

demo = gr.ChatInterface(
    fn=chat,
    additional_inputs=[
        gr.Slider(0.1, 2.0, value=0.7, label="Temperature"),
        gr.Slider(100, 2000, value=500, step=100, label="Max Tokens")
    ],
    title="Mistral-7B Chat"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance Comparison

### Throughput (tokens/sec)

| Model             | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| ----------------- | -------- | -------- | -------- | --------- |
| Mistral-7B FP16   | 45       | 80       | 120      | 150       |
| Mistral-7B Q4     | 70       | 110      | 160      | 200       |
| Mixtral-8x7B FP16 | -        | -        | 30       | 60        |
| Mixtral-8x7B Q4   | -        | 25       | 50       | 80        |
| Mixtral-8x22B Q4  | -        | -        | -        | 25        |

### Time to First Token (TTFT)

| Model         | RTX 3090 | RTX 4090 | A100  |
| ------------- | -------- | -------- | ----- |
| Mistral-7B    | 80ms     | 50ms     | 35ms  |
| Mixtral-8x7B  | -        | 150ms    | 90ms  |
| Mixtral-8x22B | -        | -        | 200ms |

### Context Length vs VRAM (Mistral-7B)

| Context | FP16 | Q8   | Q4   |
| ------- | ---- | ---- | ---- |
| 4K      | 15GB | 9GB  | 5GB  |
| 8K      | 18GB | 11GB | 7GB  |
| 16K     | 24GB | 15GB | 9GB  |
| 32K     | 36GB | 22GB | 14GB |
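The context columns are driven mostly by the KV cache. A back-of-envelope formula for a single sequence, using Mistral-7B's published architecture (32 layers, grouped-query attention with 8 KV heads, head dimension 128); note that serving frameworks like vLLM preallocate cache for many concurrent sequences, so real usage is higher:

```python
def kv_cache_gb(tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per: int = 2) -> float:
    """KV cache for one sequence: a K and a V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1024**3

print(kv_cache_gb(32768))  # 4.0 -> ~4 GB at FP16 for a single 32K sequence
```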

## VRAM Requirements

| Model         | FP16  | 8-bit | 4-bit |
| ------------- | ----- | ----- | ----- |
| Mistral-7B    | 14GB  | 8GB   | 5GB   |
| Mixtral-8x7B  | 90GB  | 45GB  | 24GB  |
| Mixtral-8x22B | 180GB | 90GB  | 48GB  |
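A worked example of reading the table: given a GPU's VRAM, list which precisions fit (weights only; leave headroom for KV cache and activations):

```python
# Weight-only VRAM needs from the table above, in GB.
REQUIREMENTS = {
    "Mistral-7B":    {"fp16": 14,  "8bit": 8,  "4bit": 5},
    "Mixtral-8x7B":  {"fp16": 90,  "8bit": 45, "4bit": 24},
    "Mixtral-8x22B": {"fp16": 180, "8bit": 90, "4bit": 48},
}

def options(model: str, vram_gb: int) -> list:
    """Precisions whose weights fit in the given VRAM."""
    return [p for p, need in REQUIREMENTS[model].items() if need <= vram_gb]

print(options("Mixtral-8x7B", 24))  # ['4bit'] -- e.g. an RTX 3090/4090
```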

## Use Cases

### Code Generation

```python
prompt = """
Write a Python class for a REST API client with:
- Authentication handling
- Retry logic
- Error handling
"""
```

### Data Analysis

```python
prompt = """
Analyze this data and provide insights:
Sales Q1: $100K
Sales Q2: $150K
Sales Q3: $120K
Sales Q4: $200K
"""
```

### Creative Writing

```python
prompt = """
Write a short story about an AI that becomes self-aware,
in the style of Isaac Asimov.
"""
```

## Troubleshooting

### Out of Memory

* Use 4-bit quantization
* Use Mistral-7B instead of Mixtral
* Reduce max\_model\_len

### Slow Generation

* Use vLLM for production
* Enable flash attention
* Use tensor parallelism for multi-GPU

### Poor Output Quality

* Adjust temperature (0.1-0.9)
* Use instruct variant
* Better system prompts

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
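To budget a session, the table rates multiply out simply (rates are the approximate figures above; actual marketplace prices vary):

```python
# Approximate USD/hour rates from the table above.
RATES = {"RTX 3060": 0.03, "RTX 3090": 0.06, "RTX 4090": 0.10,
         "A100 40GB": 0.17, "A100 80GB": 0.25}

def session_cost(gpu: str, hours: float, spot_discount: float = 0.0) -> float:
    """Estimated cost; spot_discount=0.4 models a 40% cheaper spot price."""
    return RATES[gpu] * hours * (1 - spot_discount)

print(round(session_cost("RTX 4090", 4), 2))       # 0.4  -> on-demand
print(round(session_cost("RTX 4090", 4, 0.4), 2))  # 0.24 -> spot at 40% off
```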

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production serving
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy deployment
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best reasoning model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual alternative
