# DeepSeek-R1 Reasoning Model

{% hint style="success" %}
All examples run on GPU servers rented via the [CLORE.AI Marketplace](https://clore.ai/marketplace). RTX 4090 instances start at \~$0.50/day.
{% endhint %}

## Overview

DeepSeek-R1 is a 671B-parameter open-weight reasoning model released in January 2025 by DeepSeek under the **MIT** license. It was the first open-weight model to match OpenAI o1 across math, coding, and scientific benchmarks, while exposing its entire chain-of-thought through explicit `<think>` tags.

The full model uses **Mixture-of-Experts (MoE)** with 37B active parameters per token, making inference tractable despite the headline parameter count. For most practitioners, the **distilled variants** (1.5B → 70B) are more practical: they inherit R1's reasoning patterns through knowledge distillation into Qwen-2.5 and Llama-3 base architectures and run on commodity GPUs.

## Key Features

* **Explicit chain-of-thought** — every response begins with a `<think>` block where the model reasons, backtracks, and self-corrects before producing a final answer
* **Reinforcement-learning trained** — reasoning ability emerges from RL reward signals rather than hand-authored chain-of-thought data
* **Six distilled variants** — 1.5B, 7B, 8B, 14B, 32B, 70B parameter models distilled from the full 671B into Qwen and Llama architectures
* **MIT license** — weights are free for commercial use and distillation, no royalties; note the Llama-based distills also inherit the Llama 3 license terms
* **Wide framework support** — Ollama, vLLM, llama.cpp, SGLang, Transformers, TGI all work out of the box
* **AIME 2024 Pass\@1: 79.8%** — ties with OpenAI o1 on competition math
* **Codeforces rating 2029** — 96.3rd-percentile performance, on par with OpenAI o1 on competitive programming

## Model Variants

| Variant                | Parameters        | Architecture | FP16 VRAM | Q4 VRAM  | Q4 Disk  |
| ---------------------- | ----------------- | ------------ | --------- | -------- | -------- |
| DeepSeek-R1 (full MoE) | 671B (37B active) | DeepSeek MoE | \~1.3 TB  | \~350 GB | \~340 GB |
| R1-Distill-Llama-70B   | 70B               | Llama 3      | 140 GB    | 40 GB    | 42 GB    |
| R1-Distill-Qwen-32B    | 32B               | Qwen 2.5     | 64 GB     | 22 GB    | 20 GB    |
| R1-Distill-Qwen-14B    | 14B               | Qwen 2.5     | 28 GB     | 10 GB    | 9 GB     |
| R1-Distill-Llama-8B    | 8B                | Llama 3      | 16 GB     | 6 GB     | 5.5 GB   |
| R1-Distill-Qwen-7B     | 7B                | Qwen 2.5     | 14 GB     | 5 GB     | 4.5 GB   |
| R1-Distill-Qwen-1.5B   | 1.5B              | Qwen 2.5     | 3 GB      | 2 GB     | 1.2 GB   |
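
The FP16 and Q4 columns above follow a simple rule of thumb: roughly 2 bytes per parameter at FP16, and around 4.5 bits per weight at Q4 once quantization scales are included, before any KV cache. A quick sanity-check sketch (the constants are approximations, not vendor figures):

```python
def estimate_weight_vram_gb(params_b: float, bits_per_weight: float = 16.0,
                            overhead: float = 1.1) -> float:
    """Approximate VRAM for model weights alone (no KV cache).

    bits_per_weight: 16 for FP16/BF16, ~4.5 for Q4 (scales/zero-points add overhead).
    overhead: multiplier for runtime buffers and fragmentation.
    """
    return params_b * 1e9 * (bits_per_weight / 8) * overhead / 1e9

# 14B: roughly 31 GB at FP16 and 9 GB at Q4 -- in the same ballpark as the table
print(f"{estimate_weight_vram_gb(14):.0f} GB FP16, {estimate_weight_vram_gb(14, 4.5):.0f} GB Q4")
```

This is also why 32B Q4 is the single-24-GB-GPU sweet spot: its weights land just under 20 GB, leaving a few gigabytes for KV cache.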

### Choosing a Variant

| Use Case                              | Recommended Variant    | GPU on Clore                 |
| ------------------------------------- | ---------------------- | ---------------------------- |
| Quick experiments, edge testing       | R1-Distill-Qwen-1.5B   | Any GPU                      |
| Budget deployment, fast inference     | R1-Distill-Qwen-7B     | RTX 3090 (\~$0.30–1/day)     |
| Single-GPU production sweet spot      | R1-Distill-Qwen-14B Q4 | RTX 4090 (\~$0.50–2/day)     |
| Best quality-per-dollar (recommended) | R1-Distill-Qwen-32B Q4 | RTX 4090 24 GB or A100 40 GB |
| Maximum distilled quality             | R1-Distill-Llama-70B   | 2× A100 80 GB                |
| Research, full-fidelity reasoning     | DeepSeek-R1 671B       | 8× H100 cluster              |

### HuggingFace Repositories

| Variant           | Repository                                                                                                    |
| ----------------- | ------------------------------------------------------------------------------------------------------------- |
| Full R1           | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)                                     |
| Llama-70B distill | [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) |
| Qwen-32B distill  | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)   |
| Qwen-14B distill  | [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)   |
| Llama-8B distill  | [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)   |
| Qwen-7B distill   | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)     |
| Qwen-1.5B distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |

## Requirements

| Component  | Minimum (7B Q4) | Recommended (32B Q4) |
| ---------- | --------------- | -------------------- |
| GPU VRAM   | 6 GB            | 24 GB                |
| System RAM | 16 GB           | 32 GB                |
| Disk       | 10 GB           | 30 GB                |
| CUDA       | 12.1+           | 12.4+                |
| Docker     | 24.0+           | 25.0+                |
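
Before committing to an instance, check the versions it reports (from `nvidia-smi` and `docker --version`) against these minimums. A small helper for that comparison; note that plain string comparison gets dotted versions wrong (`"12.10"` sorts before `"12.9"` lexically):

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Numerically compare dotted version strings, e.g. '12.4' vs '12.1'."""
    def parts(v: str) -> tuple:
        return tuple(int(p) for p in v.split("."))
    return parts(installed) >= parts(required)

print(meets_minimum("12.4", "12.1"))  # True  -- meets the 32B recommendation
print(meets_minimum("11.8", "12.1"))  # False -- too old for these setups
```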

## Ollama Quick Start

Ollama handles quantization, downloading, and serving automatically — the fastest path to a running DeepSeek-R1.

### Install and run

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run (pick your variant):
ollama run deepseek-r1:1.5b    # Edge / testing
ollama run deepseek-r1:7b      # Budget — fits any 8 GB+ GPU
ollama run deepseek-r1:14b     # RTX 4090 sweet spot
ollama run deepseek-r1:32b     # A100 / 2× RTX 4090
ollama run deepseek-r1:70b     # 2× A100 80 GB
```

### Example interactive session

```
>>> Prove that √2 is irrational.

<think>
I'll use proof by contradiction.
Assume √2 is rational, so √2 = p/q where p,q are integers with gcd(p,q) = 1.
Then 2 = p²/q², which gives p² = 2q².
This means p² is even, so p must be even. Write p = 2k.
Substituting: (2k)² = 2q² → 4k² = 2q² → q² = 2k².
So q² is also even, meaning q is even.
But if both p and q are even, gcd(p,q) ≥ 2, contradicting our assumption.
</think>

**Proof that √2 is irrational (by contradiction):**

Assume √2 = p/q in lowest terms (gcd(p,q) = 1). Squaring gives p² = 2q².
Since p² is even, p is even — write p = 2k. Then 4k² = 2q², so q² = 2k²,
meaning q is also even. This contradicts gcd(p,q) = 1. ∎
```

### Use the OpenAI-compatible API

```bash
# Start Ollama as a server (if not already running)
ollama serve &

# Query via curl
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:32b",
    "messages": [{"role": "user", "content": "Factor x^4 - 1 completely over the integers."}],
    "temperature": 0.6
  }' | python3 -m json.tool
```

### Python client (via OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:32b",
    messages=[
        {"role": "user", "content": "Write a Python function to find the longest palindromic substring."}
    ],
    temperature=0.6,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

## vLLM Production Setup

vLLM delivers the highest throughput for multi-user serving with continuous batching, PagedAttention, and prefix caching.

### Single GPU — 7B / 14B

```bash
pip install vllm

# 7B at FP16 uses ~15 GB for weights alone, so budget a 20 GB+ GPU
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 16384

# 14B at FP16 needs ~28 GB and will not fit a 24 GB RTX 4090 unquantized;
# use a 40 GB GPU, or a quantized checkpoint on the 4090
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92
```
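
`--max-model-len` trades directly against KV-cache memory, which is what `--gpu-memory-utilization` is budgeting. A back-of-envelope sizing sketch — the default config values below (48 layers, 8 grouped-query KV heads, head dim 128) are assumptions for the 14B, so check the model's `config.json` for the real numbers:

```python
def kv_cache_gb(seq_len: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Per-sequence KV cache: 2 tensors (K and V) per layer, FP16 by default."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_token / 1e9

# A single 16K-token sequence costs ~3.2 GB under these assumptions,
# which is why long contexts and high concurrency compete for the same VRAM
print(f"{kv_cache_gb(16384):.1f} GB")
```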

### Multi-GPU — 32B (recommended)

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching
```

> **Tip:** A 4-bit AWQ or GPTQ 32B checkpoint fits on a single RTX 4090 (24 GB). vLLM loads pre-quantized weights rather than quantizing on the fly, and the official repo ships FP16 weights only, so point it at a community-quantized checkpoint from the Hub:
>
> ```bash
> vllm serve <awq-quantized-32b-checkpoint> \
>     --quantization awq --host 0.0.0.0 --port 8000 \
>     --max-model-len 16384
> ```

### Multi-GPU — 70B

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
```

### Query the vLLM endpoint

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "Solve: find all primes p such that p^2 + 2 is also prime."}],
    "temperature": 0.6,
    "max_tokens": 4096
  }'
```

## Transformers / Python (with `<think>` Tag Parsing)

Use HuggingFace Transformers when you need fine-grained control over generation or want to integrate R1 into a Python pipeline.

### Basic generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, re

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        do_sample=True,
    )

full_response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(full_response)
```

### Parsing `<think>` tags

```python
def parse_r1_response(text: str) -> dict:
    """Split a DeepSeek-R1 response into thinking and answer parts.

    Some servers pre-fill the opening <think> tag via the chat template,
    so the output may contain only the closing </think>.
    """
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if think_match:
        thinking = think_match.group(1).strip()
        answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    elif "</think>" in text:
        thinking, _, answer = (part.strip() for part in text.partition("</think>"))
    else:
        thinking, answer = "", text.strip()
    return {
        "thinking": thinking,
        "answer": answer,
        "thinking_words": len(thinking.split()),  # whitespace word count, not tokenizer tokens
    }

result = parse_r1_response(full_response)
print(f"Model reasoned for {result['thinking_words']} words")
print(f"Answer: {result['answer']}")
```

### Streaming with `<think>` state tracking

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "Derive the quadratic formula from ax² + bx + c = 0."}],
    stream=True,
    max_tokens=4096,
    temperature=0.6,
)

in_think = False
# Note: a tag can be split across streamed chunks; buffer deltas if you need robust detection
for chunk in stream:
    token = chunk.choices[0].delta.content or ""
    if "<think>" in token:
        in_think = True
        print("[Reasoning] ", end="", flush=True)
        continue
    if "</think>" in token:
        in_think = False
        print("\n[Answer] ", end="", flush=True)
        continue
    if not in_think:
        print(token, end="", flush=True)
print()
```

## Docker Deployment on Clore.ai

### Ollama Docker (simplest)

**Docker image:** `ollama/ollama` **Ports:** `22/tcp, 11434/http`

```bash
# On the Clore instance
docker run -d --gpus all \
    -v ollama_data:/root/.ollama \
    -p 11434:11434 \
    --name deepseek-r1 \
    ollama/ollama

# Pull and serve the model
docker exec deepseek-r1 ollama pull deepseek-r1:32b
```

### vLLM Docker (production)

**Docker image:** `vllm/vllm-openai:latest` **Ports:** `22/tcp, 8000/http`

```yaml
# docker-compose.yml
version: "3.8"
services:
  deepseek-r1:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
      --host 0.0.0.0 --port 8000
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s
volumes:
  hf_cache:
```

Deploy on Clore.ai:

1. Open [clore.ai/marketplace](https://clore.ai/marketplace)
2. Filter by **2× GPU, 80 GB+ VRAM total** (e.g. 2× A100 40 GB; the unquantized 32B needs ~64 GB for weights alone)
3. Set the Docker image to `vllm/vllm-openai:latest`
4. Map port **8000** as HTTP
5. Paste the command from the compose file above into the startup command
6. Connect via the HTTP endpoint once the health check passes
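
Step 6 can be scripted: poll vLLM's `/health` endpoint until the model finishes downloading and loading, which can take several minutes on a cold start. A minimal standard-library poller — substitute the HTTP endpoint Clore assigns to your instance:

```python
import time
import urllib.request

def wait_for_health(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or timed out -- server still starting
        time.sleep(interval_s)
    return False

# Usage once the instance is up:
#   wait_for_health("http://<your-clore-endpoint>:8000/health")
```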

## Tips for Clore.ai Deployments

### Choosing the right GPU

| Budget       | GPU              | Daily Cost   | Best Variant                       |
| ------------ | ---------------- | ------------ | ---------------------------------- |
| Minimal      | RTX 3090 (24 GB) | $0.30 – 1.00 | R1-Distill-Qwen-7B or 14B Q4       |
| Standard     | RTX 4090 (24 GB) | $0.50 – 2.00 | R1-Distill-Qwen-14B Q4/Q8 or 32B Q4 |
| Production   | A100 80 GB       | $3 – 8       | R1-Distill-Qwen-32B FP16           |
| High quality | 2× A100 80 GB    | $6 – 16      | R1-Distill-Llama-70B FP16          |

### Performance tuning

* **Temperature 0.6** is the recommended default for reasoning tasks — DeepSeek's own papers use this value
* **Set `max_tokens` generously** — reasoning models produce long `<think>` blocks; 4096+ for non-trivial problems
* **Enable prefix caching** (`--enable-prefix-caching` in vLLM) when using a shared system prompt
* **Limit concurrency** (`--max-num-seqs 16`) for reasoning workloads — each request uses more compute than a standard chat
* **Use Q4 quantization** to fit 32B on a single 24 GB GPU with minimal quality loss (the distill already compresses R1's knowledge)

### Context length considerations

Reasoning models consume more context than standard chat models because of the `<think>` block:

| Task Complexity              | Typical Thinking Length | Total Context Needed |
| ---------------------------- | ----------------------- | -------------------- |
| Simple arithmetic            | \~100 tokens            | \~300 tokens         |
| Code generation              | \~500–1000 tokens       | \~2000 tokens        |
| Competition math (AIME)      | \~2000–4000 tokens      | \~5000 tokens        |
| Multi-step research analysis | \~4000–8000 tokens      | \~10000 tokens       |
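
These figures can be turned into a simple provisioning rule: total context ≈ prompt + thinking budget + final answer, plus headroom. The per-tier budgets below just mirror the rough table above, so treat them as ballpark planning numbers rather than measurements:

```python
# Ballpark thinking-token budgets per task tier (upper ends of the table above)
THINKING_BUDGET = {
    "simple": 100,
    "code": 1000,
    "competition_math": 4000,
    "research": 8000,
}

def context_budget(prompt_tokens: int, task: str,
                   answer_tokens: int = 1000, margin: float = 1.25) -> int:
    """Tokens to provision for one request: prompt + thinking + answer + headroom."""
    total = prompt_tokens + THINKING_BUDGET[task] + answer_tokens
    return int(total * margin)

# A 200-token AIME-style prompt wants ~6500 tokens of context
print(context_budget(200, "competition_math"))
```

Use the result to sanity-check both the request's `max_tokens` and the server's `--max-model-len`.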

## Troubleshooting

### Out of memory (OOM)

```bash
# Reduce context length
--max-model-len 8192    # instead of 32768

# Limit concurrent sequences
--max-num-seqs 8

# Use quantization
--quantization awq      # or gptq
```

### Model produces no `<think>` block

Some system prompts suppress thinking. Avoid instructions like "be concise" or "don't explain your reasoning." Use a minimal system prompt or none at all:

```python
# Good — preserves reasoning
messages = [{"role": "user", "content": "..."}]

# Bad — may suppress thinking
messages = [
    {"role": "system", "content": "Be extremely brief. No explanations."},
    {"role": "user", "content": "..."}
]
```

### Repetitive or looping `<think>` output

Counterintuitively, do not drop the temperature to zero: DeepSeek's model card warns that greedy decoding causes endless repetition with R1. Stay in the recommended 0.5–0.7 range and add a mild repetition penalty if loops persist:

```python
temperature = 0.6           # recommended range: 0.5-0.7; avoid 0.0 (greedy looping)
repetition_penalty = 1.1    # supported by vLLM and Transformers sampling params
```

### Slow first token (high TTFT)

This is expected — the model generates `<think>` tokens before the visible answer. For latency-sensitive applications where reasoning is not needed, use [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) instead.

### Download stalls on Clore instance

HuggingFace downloads can be slow on some providers. Pre-cache the model into a persistent volume:

```bash
# Download once into a volume
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --local-dir /data/models/deepseek-r1-32b

# Point vLLM at local path
vllm serve /data/models/deepseek-r1-32b --host 0.0.0.0 --port 8000
```

## Further Reading

* [DeepSeek-R1 Paper](https://arxiv.org/abs/2501.12948) — *Incentivizing Reasoning Capability in LLMs via Reinforcement Learning*
* [DeepSeek-R1 GitHub](https://github.com/deepseek-ai/DeepSeek-R1) — Official repository with model cards
* [DeepSeek-V3 Guide](https://docs.clore.ai/guides/language-models/deepseek-v3) — Non-reasoning general-purpose model from the same lab
* [vLLM Guide](https://docs.clore.ai/guides/language-models/vllm) — Comprehensive production serving setup
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) — Simple local deployment for any model
* [Open WebUI Guide](https://docs.clore.ai/guides/language-models/open-webui) — Chat UI with native `<think>` tag rendering
* [Qwen 2.5 Guide](https://docs.clore.ai/guides/language-models/qwen25) — The base architecture used by most R1 distills
