# DeepSeek-R1 Reasoning Model

{% hint style="success" %}
All examples run on GPU servers rented via the [CLORE.AI Marketplace](https://clore.ai/marketplace). RTX 4090 instances start at \~$0.50/day.
{% endhint %}

## Overview

DeepSeek-R1 is a 671B-parameter open-weight reasoning model released in January 2025 by DeepSeek under the **MIT** license. It was the first open-weight model to match OpenAI o1 across math, coding, and scientific benchmarks, while exposing its entire chain-of-thought through explicit `<think>` tags.

The full model uses **Mixture-of-Experts (MoE)** with 37B active parameters per token, making inference tractable despite the headline parameter count. For most practitioners, the **distilled variants** (1.5B → 70B) are more practical: they inherit R1's reasoning patterns through knowledge distillation into Qwen-2.5 and Llama-3 base architectures and run on commodity GPUs.

## Key Features

* **Explicit chain-of-thought** — every response begins with a `<think>` block where the model reasons, backtracks, and self-corrects before producing a final answer
* **Reinforcement-learning trained** — reasoning ability emerges from RL reward signals rather than hand-authored chain-of-thought data
* **Six distilled variants** — 1.5B, 7B, 8B, 14B, 32B, 70B parameter models distilled from the full 671B into Qwen and Llama architectures
* **MIT license** — weights are free for commercial use and distillation, no royalties; note the Llama-based distills also inherit the Llama 3 license terms
* **Wide framework support** — Ollama, vLLM, llama.cpp, SGLang, Transformers, TGI all work out of the box
* **AIME 2024 Pass\@1: 79.8%** — ties with OpenAI o1 on competition math
* **Codeforces rating 2029** — 96.3rd-percentile performance, on par with OpenAI o1 on competitive programming

## Model Variants

| Variant                | Parameters        | Architecture | FP16 VRAM | Q4 VRAM  | Q4 Disk  |
| ---------------------- | ----------------- | ------------ | --------- | -------- | -------- |
| DeepSeek-R1 (full MoE) | 671B (37B active) | DeepSeek MoE | \~1.3 TB  | \~350 GB | \~340 GB |
| R1-Distill-Llama-70B   | 70B               | Llama 3      | 140 GB    | 40 GB    | 42 GB    |
| R1-Distill-Qwen-32B    | 32B               | Qwen 2.5     | 64 GB     | 22 GB    | 20 GB    |
| R1-Distill-Qwen-14B    | 14B               | Qwen 2.5     | 28 GB     | 10 GB    | 9 GB     |
| R1-Distill-Llama-8B    | 8B                | Llama 3      | 16 GB     | 6 GB     | 5.5 GB   |
| R1-Distill-Qwen-7B     | 7B                | Qwen 2.5     | 14 GB     | 5 GB     | 4.5 GB   |
| R1-Distill-Qwen-1.5B   | 1.5B              | Qwen 2.5     | 3 GB      | 2 GB     | 1.2 GB   |
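
The FP16 and Q4 columns above follow a simple rule of thumb: roughly 2 bytes per parameter at FP16, and around 4.5 bits per weight at Q4 once quantization scales are included, before any KV cache. A quick sanity-check sketch (the constants are approximations, not vendor figures):

```python
def estimate_weight_vram_gb(params_b: float, bits_per_weight: float = 16.0,
                            overhead: float = 1.1) -> float:
    """Approximate VRAM for model weights alone (no KV cache).

    bits_per_weight: 16 for FP16/BF16, ~4.5 for Q4 (scales/zero-points add overhead).
    overhead: multiplier for runtime buffers and fragmentation.
    """
    return params_b * 1e9 * (bits_per_weight / 8) * overhead / 1e9

# 14B: roughly 31 GB at FP16 and 9 GB at Q4 -- in the same ballpark as the table
print(f"{estimate_weight_vram_gb(14):.0f} GB FP16, {estimate_weight_vram_gb(14, 4.5):.0f} GB Q4")
```

This is also why 32B Q4 is the single-24-GB-GPU sweet spot: its weights land just under 20 GB, leaving a few gigabytes for KV cache.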

### Choosing a Variant

| Use Case                              | Recommended Variant    | GPU on Clore                 |
| ------------------------------------- | ---------------------- | ---------------------------- |
| Quick experiments, edge testing       | R1-Distill-Qwen-1.5B   | Any GPU                      |
| Budget deployment, fast inference     | R1-Distill-Qwen-7B     | RTX 3090 (\~$0.30–1/day)     |
| Single-GPU production sweet spot      | R1-Distill-Qwen-14B Q4 | RTX 4090 (\~$0.50–2/day)     |
| Best quality-per-dollar (recommended) | R1-Distill-Qwen-32B Q4 | RTX 4090 24 GB or A100 40 GB |
| Maximum distilled quality             | R1-Distill-Llama-70B   | 2× A100 80 GB                |
| Research, full-fidelity reasoning     | DeepSeek-R1 671B       | 8× H100 cluster              |

### HuggingFace Repositories

| Variant           | Repository                                                                                                    |
| ----------------- | ------------------------------------------------------------------------------------------------------------- |
| Full R1           | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)                                     |
| Llama-70B distill | [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) |
| Qwen-32B distill  | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)   |
| Qwen-14B distill  | [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)   |
| Llama-8B distill  | [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)   |
| Qwen-7B distill   | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)     |
| Qwen-1.5B distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |

## Requirements

| Component  | Minimum (7B Q4) | Recommended (32B Q4) |
| ---------- | --------------- | -------------------- |
| GPU VRAM   | 6 GB            | 24 GB                |
| System RAM | 16 GB           | 32 GB                |
| Disk       | 10 GB           | 30 GB                |
| CUDA       | 12.1+           | 12.4+                |
| Docker     | 24.0+           | 25.0+                |
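
Before committing to an instance, check the versions it reports (from `nvidia-smi` and `docker --version`) against these minimums. A small helper for that comparison; note that plain string comparison gets dotted versions wrong (`"12.10"` sorts before `"12.9"` lexically):

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Numerically compare dotted version strings, e.g. '12.4' vs '12.1'."""
    def parts(v: str) -> tuple:
        return tuple(int(p) for p in v.split("."))
    return parts(installed) >= parts(required)

print(meets_minimum("12.4", "12.1"))  # True  -- meets the 32B recommendation
print(meets_minimum("11.8", "12.1"))  # False -- too old for these setups
```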

## Ollama Quick Start

Ollama handles quantization, downloading, and serving automatically — the fastest path to a running DeepSeek-R1.

### Install and run

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run (pick your variant):
ollama run deepseek-r1:1.5b    # Edge / testing
ollama run deepseek-r1:7b      # Budget — fits any 8 GB+ GPU
ollama run deepseek-r1:14b     # RTX 4090 sweet spot
ollama run deepseek-r1:32b     # A100 / 2× RTX 4090
ollama run deepseek-r1:70b     # 2× A100 80 GB
```

### Example interactive session

```
>>> Prove that √2 is irrational.

<think>
I'll use proof by contradiction.
Assume √2 is rational, so √2 = p/q where p,q are integers with gcd(p,q) = 1.
Then 2 = p²/q², which gives p² = 2q².
This means p² is even, so p must be even. Write p = 2k.
Substituting: (2k)² = 2q² → 4k² = 2q² → q² = 2k².
So q² is also even, meaning q is even.
But if both p and q are even, gcd(p,q) ≥ 2, contradicting our assumption.
</think>

**Proof that √2 is irrational (by contradiction):**

Assume √2 = p/q in lowest terms (gcd(p,q) = 1). Squaring gives p² = 2q².
Since p² is even, p is even — write p = 2k. Then 4k² = 2q², so q² = 2k²,
meaning q is also even. This contradicts gcd(p,q) = 1. ∎
```

### Use the OpenAI-compatible API

```bash
# Start Ollama as a server (if not already running)
ollama serve &

# Query via curl
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:32b",
    "messages": [{"role": "user", "content": "Factor x^4 - 1 completely over the integers."}],
    "temperature": 0.6
  }' | python3 -m json.tool
```

### Python client (via OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:32b",
    messages=[
        {"role": "user", "content": "Write a Python function to find the longest palindromic substring."}
    ],
    temperature=0.6,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

## vLLM Production Setup

vLLM delivers the highest throughput for multi-user serving with continuous batching, PagedAttention, and prefix caching.

### Single GPU — 7B / 14B

```bash
pip install vllm

# 7B at FP16 uses ~15 GB for weights alone, so budget a 20 GB+ GPU
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 16384

# 14B at FP16 needs ~28 GB and will not fit a 24 GB RTX 4090 unquantized;
# use a 40 GB GPU, or a quantized checkpoint on the 4090
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92
```
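
`--max-model-len` trades directly against KV-cache memory, which is what `--gpu-memory-utilization` is budgeting. A back-of-envelope sizing sketch — the default config values below (48 layers, 8 grouped-query KV heads, head dim 128) are assumptions for the 14B, so check the model's `config.json` for the real numbers:

```python
def kv_cache_gb(seq_len: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Per-sequence KV cache: 2 tensors (K and V) per layer, FP16 by default."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_token / 1e9

# A single 16K-token sequence costs ~3.2 GB under these assumptions,
# which is why long contexts and high concurrency compete for the same VRAM
print(f"{kv_cache_gb(16384):.1f} GB")
```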

### Multi-GPU — 32B (recommended)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching
```

> **Tip:** A 4-bit AWQ or GPTQ 32B checkpoint fits on a single RTX 4090 (24 GB). vLLM loads pre-quantized weights rather than quantizing on the fly, and the official repo ships FP16 weights only, so point it at a community-quantized checkpoint from the Hub:
>
> ```bash
> vllm serve <awq-quantized-32b-checkpoint> \
>     --quantization awq --host 0.0.0.0 --port 8000 \
>     --max-model-len 16384
> ```

### Multi-GPU — 70B

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

### Query the vLLM endpoint

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "Solve: find all primes p such that p^2 + 2 is also prime."}],
    "temperature": 0.6,
    "max_tokens": 4096
  }'
```

## Transformers / Python (with `<think>` Tag Parsing)

Use HuggingFace Transformers when you need fine-grained control over generation or want to integrate R1 into a Python pipeline.

### Basic generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, re

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        do_sample=True,
    )

full_response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(full_response)
```

### Parsing `<think>` tags

```python
def parse_r1_response(text: str) -> dict:
    """Split a DeepSeek-R1 response into thinking and answer parts.

    Some servers pre-fill the opening <think> tag via the chat template,
    so the output may contain only the closing </think>.
    """
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if think_match:
        thinking = think_match.group(1).strip()
        answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    elif "</think>" in text:
        thinking, _, answer = (part.strip() for part in text.partition("</think>"))
    else:
        thinking, answer = "", text.strip()
    return {
        "thinking": thinking,
        "answer": answer,
        "thinking_words": len(thinking.split()),  # whitespace word count, not tokenizer tokens
    }

result = parse_r1_response(full_response)
print(f"Model reasoned for {result['thinking_words']} words")
print(f"Answer: {result['answer']}")
```

### Streaming with `<think>` state tracking

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "Derive the quadratic formula from ax² + bx + c = 0."}],
    stream=True,
    max_tokens=4096,
    temperature=0.6,
)

in_think = False
# Note: a tag can be split across streamed chunks; buffer deltas if you need robust detection
for chunk in stream:
    token = chunk.choices[0].delta.content or ""
    if "<think>" in token:
        in_think = True
        print("[Reasoning] ", end="", flush=True)
        continue
    if "</think>" in token:
        in_think = False
        print("\n[Answer] ", end="", flush=True)
        continue
    if not in_think:
        print(token, end="", flush=True)
print()
```

## Docker Deployment on Clore.ai

### Ollama Docker (simplest)

**Docker image:** `ollama/ollama` **Ports:** `22/tcp, 11434/http`

```bash
# On the Clore instance
docker run -d --gpus all \
    -v ollama_data:/root/.ollama \
    -p 11434:11434 \
    --name deepseek-r1 \
    ollama/ollama

# Pull and serve the model
docker exec deepseek-r1 ollama pull deepseek-r1:32b
```

### vLLM Docker (production)

**Docker image:** `vllm/vllm-openai:latest` **Ports:** `22/tcp, 8000/http`

```yaml
# docker-compose.yml
version: "3.8"
services:
  deepseek-r1:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
      --host 0.0.0.0 --port 8000
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s
volumes:
  hf_cache:
```

Deploy on Clore.ai:

1. Open [clore.ai/marketplace](https://clore.ai/marketplace)
2. Filter by **2× GPU, 80 GB+ VRAM total** (e.g. 2× A100 40 GB; the unquantized 32B needs ~64 GB for weights alone)
3. Set the Docker image to `vllm/vllm-openai:latest`
4. Map port **8000** as HTTP
5. Paste the command from the compose file above into the startup command
6. Connect via the HTTP endpoint once the health check passes
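
Step 6 can be scripted: poll vLLM's `/health` endpoint until the model finishes downloading and loading, which can take several minutes on a cold start. A minimal standard-library poller — substitute the HTTP endpoint Clore assigns to your instance:

```python
import time
import urllib.request

def wait_for_health(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or timed out -- server still starting
        time.sleep(interval_s)
    return False

# Usage once the instance is up:
#   wait_for_health("http://<your-clore-endpoint>:8000/health")
```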

## Tips for Clore.ai Deployments

### Choosing the right GPU

| Budget       | GPU              | Daily Cost   | Best Variant                       |
| ------------ | ---------------- | ------------ | ---------------------------------- |
| Minimal      | RTX 3090 (24 GB) | $0.30 – 1.00 | R1-Distill-Qwen-7B or 14B Q4       |
| Standard     | RTX 4090 (24 GB) | $0.50 – 2.00 | R1-Distill-Qwen-14B Q4/Q8 or 32B Q4 |
| Production   | A100 80 GB       | $3 – 8       | R1-Distill-Qwen-32B FP16           |
| High quality | 2× A100 80 GB    | $6 – 16      | R1-Distill-Llama-70B FP16          |

### Performance tuning

* **Temperature 0.6** is the recommended default for reasoning tasks — DeepSeek's own papers use this value
* **Set `max_tokens` generously** — reasoning models produce long `<think>` blocks; 4096+ for non-trivial problems
* **Enable prefix caching** (`--enable-prefix-caching` in vLLM) when using a shared system prompt
* **Limit concurrency** (`--max-num-seqs 16`) for reasoning workloads — each request uses more compute than a standard chat
* **Use Q4 quantization** to fit 32B on a single 24 GB GPU with minimal quality loss (the distill already compresses R1's knowledge)

### Context length considerations

Reasoning models consume more context than standard chat models because of the `<think>` block:

| Task Complexity              | Typical Thinking Length | Total Context Needed |
| ---------------------------- | ----------------------- | -------------------- |
| Simple arithmetic            | \~100 tokens            | \~300 tokens         |
| Code generation              | \~500–1000 tokens       | \~2000 tokens        |
| Competition math (AIME)      | \~2000–4000 tokens      | \~5000 tokens        |
| Multi-step research analysis | \~4000–8000 tokens      | \~10000 tokens       |
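
These figures can be turned into a simple provisioning rule: total context ≈ prompt + thinking budget + final answer, plus headroom. The per-tier budgets below just mirror the rough table above, so treat them as ballpark planning numbers rather than measurements:

```python
# Ballpark thinking-token budgets per task tier (upper ends of the table above)
THINKING_BUDGET = {
    "simple": 100,
    "code": 1000,
    "competition_math": 4000,
    "research": 8000,
}

def context_budget(prompt_tokens: int, task: str,
                   answer_tokens: int = 1000, margin: float = 1.25) -> int:
    """Tokens to provision for one request: prompt + thinking + answer + headroom."""
    total = prompt_tokens + THINKING_BUDGET[task] + answer_tokens
    return int(total * margin)

# A 200-token AIME-style prompt wants ~6500 tokens of context
print(context_budget(200, "competition_math"))
```

Use the result to sanity-check both the request's `max_tokens` and the server's `--max-model-len`.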

## Troubleshooting

### Out of memory (OOM)

```bash
# Reduce context length
--max-model-len 8192    # instead of 32768

# Limit concurrent sequences
--max-num-seqs 8

# Use quantization
--quantization awq      # or gptq
```

### Model produces no `<think>` block

Some system prompts suppress thinking. Avoid instructions like "be concise" or "don't explain your reasoning." Use a minimal system prompt or none at all:

```python
# Good — preserves reasoning
messages = [{"role": "user", "content": "..."}]

# Bad — may suppress thinking
messages = [
    {"role": "system", "content": "Be extremely brief. No explanations."},
    {"role": "user", "content": "..."}
]
```

### Repetitive or looping `<think>` output

Counterintuitively, do not drop the temperature to zero: DeepSeek's model card warns that greedy decoding causes endless repetition with R1. Stay in the recommended 0.5–0.7 range and add a mild repetition penalty if loops persist:

```python
temperature = 0.6           # recommended range: 0.5-0.7; avoid 0.0 (greedy looping)
repetition_penalty = 1.1    # supported by vLLM and Transformers sampling params
```

### Slow first token (high TTFT)

This is expected — the model generates `<think>` tokens before the visible answer. For latency-sensitive applications where reasoning is not needed, use [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) instead.

### Download stalls on Clore instance

HuggingFace downloads can be slow on some providers. Pre-cache the model into a persistent volume:

```bash
# Download once into a volume
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --local-dir /data/models/deepseek-r1-32b

# Point vLLM at local path
vllm serve /data/models/deepseek-r1-32b --host 0.0.0.0 --port 8000
```

## Further Reading

* [DeepSeek-R1 Paper](https://arxiv.org/abs/2501.12948) — *Incentivizing Reasoning Capability in LLMs via Reinforcement Learning*
* [DeepSeek-R1 GitHub](https://github.com/deepseek-ai/DeepSeek-R1) — Official repository with model cards
* [DeepSeek-V3 Guide](https://docs.clore.ai/guides/language-models/deepseek-v3) — Non-reasoning general-purpose model from the same lab
* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) — Comprehensive production serving setup
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) — Simple local deployment for any model
* [Open WebUI Guide](https://docs.clore.ai/guides/language-models/open-webui) — Chat UI with native `<think>` tag rendering
* [Qwen 2.5 Guide](https://docs.clore.ai/guides/language-models/qwen25) — The base architecture used by most R1 distills
