# Mistral & Mixtral

{% hint style="info" %}
**Newer versions available!** Check out [**Mistral Small 3.1**](https://docs.clore.ai/guides/language-models/mistral-small) (24B, Apache 2.0, fits on RTX 4090) and [**Mistral Large 3**](https://docs.clore.ai/guides/language-models/mistral-large3) (675B MoE, frontier-class).
{% endhint %}

Run Mistral and Mixtral models for high-quality text generation.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## Model Overview

| Model               | Parameters           | Min VRAM | Specialty         |
| ------------------- | -------------------- | -------- | ----------------- |
| Mistral-7B          | 7B                   | 8GB   | General purpose   |
| Mistral-7B-Instruct | 7B                   | 8GB   | Chat/instruction  |
| Mixtral-8x7B        | 46.7B (12.9B active) | 24GB  | MoE, best quality |
| Mixtral-8x22B       | 141B                 | 80GB+ | Largest MoE       |

## Quick Deploy

**Docker Image:**

```
pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
```

**Ports:**

```
22/tcp
8000/http
```

**Command:**

```bash
pip install vllm && \
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
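For example, a small helper (the hostname is a placeholder; use your own `http_pub` value) that turns it into the base URL expected by the OpenAI-compatible clients below:

```python
# "abc123.clorecloud.net" is a placeholder -- substitute the http_pub
# value shown in My Orders for your order.
def api_base(http_pub: str) -> str:
    """Build the OpenAI-compatible base URL for a deployed vLLM server."""
    return f"https://{http_pub}/v1"

print(api_base("abc123.clorecloud.net"))  # https://abc123.clorecloud.net/v1
```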

## Installation Options

### Using Ollama (Easiest)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run Mistral
ollama run mistral

# Run Mixtral
ollama run mixtral
```
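Ollama also serves a local REST API on port 11434 (`/api/generate`). A minimal sketch of calling it from Python; the `requests.post` lines are commented out so the snippet runs without a live server:

```python
import json

# Ollama listens on http://localhost:11434 by default.
# /api/generate takes a model name, a prompt, and a stream flag.
def ollama_payload(model: str, prompt: str) -> str:
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

payload = ollama_payload("mistral", "Explain MoE routing in one sentence")
# import requests
# r = requests.post("http://localhost:11434/api/generate", data=payload)
# print(r.json()["response"])
print(payload)
```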

### Using vLLM

```bash
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype float16
```

### Using Transformers

```bash
pip install transformers accelerate
```

## Mistral-7B with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)  # decode only the newly generated tokens
print(response)
```
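Under the hood, `apply_chat_template` renders Mistral's `[INST] … [/INST]` instruct format. A simplified sketch of the equivalent manual formatting, useful when debugging prompts (the real template also adds `<s>`/`</s>` special tokens, which this sketch omits):

```python
def to_mistral_prompt(messages):
    """Render a user/assistant message list in Mistral's instruct format.
    Simplified: omits the <s>/</s> special tokens the tokenizer adds."""
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        else:  # assistant turns are appended verbatim
            parts.append(m["content"])
    return "".join(parts)

print(to_mistral_prompt([{"role": "user", "content": "Hi"}]))
# [INST] Hi [/INST]
```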

## Mixtral-8x7B

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

outputs = model.generate(
    inputs,
    max_new_tokens=1000,
    do_sample=True,
    temperature=0.7
)

print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))  # print only the newly generated tokens
```

## Quantized Models (Lower VRAM)

### 4-bit Quantization

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=quantization_config,
    device_map="auto"
)
```
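You can sanity-check the savings with `model.get_memory_footprint()`, which reports weight memory in bytes. The rough arithmetic behind the expected numbers (weights only; quantization block overhead, activations, and KV cache come on top):

```python
# Rough weight-memory estimate: params (in billions) * bits / 8 gives GB.
def weight_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8  # 1B params at 8 bits = 1 GB

print(weight_gb(7, 16))    # Mistral-7B FP16  -> 14.0 GB
print(weight_gb(46.7, 4))  # Mixtral-8x7B nf4 -> ~23.35 GB
```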

### GGUF with llama.cpp

```bash
# Download a GGUF model
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

# Run with llama.cpp (recent builds name the binary llama-cli; older builds used ./main)
./llama-cli -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    -p "Explain machine learning" \
    -n 500
```

## vLLM Server (Production)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```

### OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```
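The same client pattern extends to multi-turn chat by resending the accumulated history with each request. A small illustrative helper (class and method names are our own, not part of the OpenAI SDK):

```python
class Conversation:
    """Accumulates messages so each request carries the full history."""

    def __init__(self, system=None):
        self.messages = [{"role": "system", "content": system}] if system else []

    def ask(self, client, model, user, **kwargs):
        self.messages.append({"role": "user", "content": user})
        reply = client.chat.completions.create(
            model=model, messages=self.messages, **kwargs
        ).choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Usage: `conv = Conversation(system="You are terse."); conv.ask(client, "mistralai/Mistral-7B-Instruct-v0.2", "Hi")`.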

## Streaming

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Write a story about a robot"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
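If you also need the complete text after streaming, a small helper (illustrative, not part of the SDK) can print deltas live and accumulate them:

```python
def collect_stream(stream) -> str:
    """Print streamed deltas as they arrive and return the joined text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # live output
            parts.append(delta)
    return "".join(parts)
```

Call it as `full_text = collect_stream(stream)` in place of the loop above.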

## Function Calling

Mistral supports function calling from Mistral-7B-Instruct-v0.3 onward (with vLLM, start the server with `--enable-auto-tool-choice --tool-call-parser mistral`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)

print(response.choices[0].message.tool_calls)
```
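When the model returns `tool_calls`, your code executes them and sends the results back as `tool` messages. A minimal dispatcher sketch; the weather stub and its return value are fabricated for illustration:

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    """Stub matching the schema above; a real version would call a weather API."""
    return f"18 degrees {unit} in {location}"

# Map tool names from the schema to local implementations.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call) -> str:
    """Execute one tool call object returned by the model."""
    args = json.loads(tool_call.function.arguments)
    return TOOLS[tool_call.function.name](**args)
```

Each result is then appended as `{"role": "tool", "tool_call_id": tool_call.id, "content": result}` and the conversation is re-sent for the final answer.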

## Gradio Interface

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def chat(message, history, temperature, max_tokens):
    messages = []
    for h in history:
        messages.append({"role": "user", "content": h[0]})
        messages.append({"role": "assistant", "content": h[1]})
    messages.append({"role": "user", "content": message})

    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

    outputs = model.generate(
        inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=True
    )

    response = tokenizer.decode(
        outputs[0][inputs.shape[-1]:],  # decode only the newly generated tokens
        skip_special_tokens=True
    )
    return response.strip()

demo = gr.ChatInterface(
    fn=chat,
    additional_inputs=[
        gr.Slider(0.1, 2.0, value=0.7, label="Temperature"),
        gr.Slider(100, 2000, value=500, step=100, label="Max Tokens")
    ],
    title="Mistral-7B Chat"
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```

## Performance Comparison

### Throughput (tokens/sec)

| Model             | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| ----------------- | -------- | -------- | -------- | --------- |
| Mistral-7B FP16   | 45       | 80       | 120      | 150       |
| Mistral-7B Q4     | 70       | 110      | 160      | 200       |
| Mixtral-8x7B FP16 | -        | -        | 30       | 60        |
| Mixtral-8x7B Q4   | -        | 25       | 50       | 80        |
| Mixtral-8x22B Q4  | -        | -        | -        | 25        |

### Time to First Token (TTFT)

| Model         | RTX 3090 | RTX 4090 | A100  |
| ------------- | -------- | -------- | ----- |
| Mistral-7B    | 80ms     | 50ms     | 35ms  |
| Mixtral-8x7B  | -        | 150ms    | 90ms  |
| Mixtral-8x22B | -        | -        | 200ms |

### Context Length vs VRAM (Mistral-7B)

| Context | FP16 | Q8   | Q4   |
| ------- | ---- | ---- | ---- |
| 4K      | 15GB | 9GB  | 5GB  |
| 8K      | 18GB | 11GB | 7GB  |
| 16K     | 24GB | 15GB | 9GB  |
| 32K     | 36GB | 22GB | 14GB |
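The context columns are driven mostly by the KV cache. A back-of-envelope formula for a single sequence, using Mistral-7B's published architecture (32 layers, grouped-query attention with 8 KV heads, head dimension 128); note that serving frameworks like vLLM preallocate cache for many concurrent sequences, so real usage is higher:

```python
def kv_cache_gb(tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per: int = 2) -> float:
    """KV cache for one sequence: a K and a V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1024**3

print(kv_cache_gb(32768))  # 4.0 -> ~4 GB at FP16 for a single 32K sequence
```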

## VRAM Requirements

| Model         | FP16  | 8-bit | 4-bit |
| ------------- | ----- | ----- | ----- |
| Mistral-7B    | 14GB  | 8GB   | 5GB   |
| Mixtral-8x7B  | 90GB  | 45GB  | 24GB  |
| Mixtral-8x22B | 180GB | 90GB  | 48GB  |
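A worked example of reading the table: given a GPU's VRAM, list which precisions fit (weights only; leave headroom for KV cache and activations):

```python
# Weight-only VRAM needs from the table above, in GB.
REQUIREMENTS = {
    "Mistral-7B":    {"fp16": 14,  "8bit": 8,  "4bit": 5},
    "Mixtral-8x7B":  {"fp16": 90,  "8bit": 45, "4bit": 24},
    "Mixtral-8x22B": {"fp16": 180, "8bit": 90, "4bit": 48},
}

def options(model: str, vram_gb: int) -> list:
    """Precisions whose weights fit in the given VRAM."""
    return [p for p, need in REQUIREMENTS[model].items() if need <= vram_gb]

print(options("Mixtral-8x7B", 24))  # ['4bit'] -- e.g. an RTX 3090/4090
```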

## Use Cases

### Code Generation

```python
prompt = """
Write a Python class for a REST API client with:
- Authentication handling
- Retry logic
- Error handling
"""
```

### Data Analysis

```python
prompt = """
Analyze this data and provide insights:
Sales Q1: $100K
Sales Q2: $150K
Sales Q3: $120K
Sales Q4: $200K
"""
```

### Creative Writing

```python
prompt = """
Write a short story about an AI that becomes self-aware,
in the style of Isaac Asimov.
"""
```

## Troubleshooting

### Out of Memory

* Use 4-bit quantization
* Use Mistral-7B instead of Mixtral
* Reduce max\_model\_len

### Slow Generation

* Use vLLM for production
* Enable flash attention
* Use tensor parallelism for multi-GPU

### Poor Output Quality

* Adjust temperature (0.1-0.9)
* Use instruct variant
* Better system prompts

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
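To budget a session, the table rates multiply out simply (rates are the approximate figures above; actual marketplace prices vary):

```python
# Approximate USD/hour rates from the table above.
RATES = {"RTX 3060": 0.03, "RTX 3090": 0.06, "RTX 4090": 0.10,
         "A100 40GB": 0.17, "A100 80GB": 0.25}

def session_cost(gpu: str, hours: float, spot_discount: float = 0.0) -> float:
    """Estimated cost; spot_discount=0.4 models a 40% cheaper spot price."""
    return RATES[gpu] * hours * (1 - spot_discount)

print(round(session_cost("RTX 4090", 4), 2))       # 0.4  -> on-demand
print(round(session_cost("RTX 4090", 4, 0.4), 2))  # 0.24 -> spot at 40% off
```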

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Production serving
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Easy deployment
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best reasoning model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual alternative
