DeepSeek-R1 Reasoning Model

Run DeepSeek-R1 open-source reasoning model on Clore.ai GPUs

Overview

DeepSeek-R1 is a 671B-parameter open-weight reasoning model released in January 2025 by DeepSeek under the MIT license. It is the first open model to match OpenAI o1 across math, coding, and scientific benchmarks, while exposing its entire chain-of-thought through explicit <think> tags.

The full model uses Mixture-of-Experts (MoE) with 37B active parameters per token, making inference tractable despite the headline parameter count. For most practitioners, the distilled variants (1.5B → 70B) are more practical: they inherit R1's reasoning patterns through knowledge distillation into Qwen-2.5 and Llama-3 base architectures and run on commodity GPUs.

Key Features

  • Explicit chain-of-thought — every response begins with a <think> block where the model reasons, backtracks, and self-corrects before producing a final answer

  • Reinforcement-learning trained — reasoning ability emerges from RL reward signals rather than hand-authored chain-of-thought data

  • Six distilled variants — 1.5B, 7B, 8B, 14B, 32B, 70B parameter models distilled from the full 671B into Qwen and Llama architectures

  • MIT license — fully commercial, no royalties, no usage restrictions (Llama-based distills additionally inherit the Llama license)

  • Wide framework support — Ollama, vLLM, llama.cpp, SGLang, Transformers, TGI all work out of the box

  • AIME 2024 Pass@1: 79.8% — ties with OpenAI o1 on competition math

  • Codeforces rating 2029 (96.3rd percentile) — o1-level performance on competitive programming

Model Variants

| Variant | Parameters | Architecture | FP16 VRAM | Q4 VRAM | Q4 Disk |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 (full MoE) | 671B (37B active) | DeepSeek MoE | ~1.3 TB | ~350 GB | ~340 GB |
| R1-Distill-Llama-70B | 70B | Llama 3 | 140 GB | 40 GB | 42 GB |
| R1-Distill-Qwen-32B | 32B | Qwen 2.5 | 64 GB | 22 GB | 20 GB |
| R1-Distill-Qwen-14B | 14B | Qwen 2.5 | 28 GB | 10 GB | 9 GB |
| R1-Distill-Llama-8B | 8B | Llama 3 | 16 GB | 6 GB | 5.5 GB |
| R1-Distill-Qwen-7B | 7B | Qwen 2.5 | 14 GB | 5 GB | 4.5 GB |
| R1-Distill-Qwen-1.5B | 1.5B | Qwen 2.5 | 3 GB | 2 GB | 1.2 GB |

Choosing a Variant

| Use Case | Recommended Variant | GPU on Clore |
| --- | --- | --- |
| Quick experiments, edge testing | R1-Distill-Qwen-1.5B | Any GPU |
| Budget deployment, fast inference | R1-Distill-Qwen-7B | RTX 3090 (~$0.30–1/day) |
| Single-GPU production sweet spot | R1-Distill-Qwen-14B Q4 | RTX 4090 (~$0.50–2/day) |
| Best quality-per-dollar (recommended) | R1-Distill-Qwen-32B Q4 | RTX 4090 24 GB or A100 40 GB |
| Maximum distilled quality | R1-Distill-Llama-70B | 2× A100 80 GB |
| Research, full-fidelity reasoning | DeepSeek-R1 671B | 8× H100 cluster |

HuggingFace Repositories

All weights are published under the deepseek-ai organization on HuggingFace:

  • deepseek-ai/DeepSeek-R1 (full 671B MoE)

  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, -7B, -14B, -32B

  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B, -70B

Requirements

| Component | Minimum (7B Q4) | Recommended (32B Q4) |
| --- | --- | --- |
| GPU VRAM | 6 GB | 24 GB |
| System RAM | 16 GB | 32 GB |
| Disk | 10 GB | 30 GB |
| CUDA | 12.1+ | 12.4+ |
| Docker | 24.0+ | 25.0+ |

Ollama Quick Start

Ollama handles quantization, downloading, and serving automatically — the fastest path to a running DeepSeek-R1.

Install and run
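
A typical sequence on a Linux instance, using Ollama's official `deepseek-r1` model library tags (the tag is the parameter count):

```shell
# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a distilled R1 variant interactively
ollama run deepseek-r1:7b

# Larger variants, if your GPU has the VRAM
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
```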

Example interactive session
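
An illustrative exchange; the exact wording of the model's reasoning will differ from run to run:

```
>>> How many prime numbers are there between 10 and 30?
<think>
List the numbers from 11 to 29 and check each for divisibility:
11, 13, 17, 19, 23, 29 are prime; 15, 21, 25, 27 are composite...
</think>
There are 6 prime numbers between 10 and 30: 11, 13, 17, 19, 23, and 29.
```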

Use the OpenAI-compatible API
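
Ollama exposes an OpenAI-compatible endpoint under `/v1`; for example:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "temperature": 0.6,
    "max_tokens": 2048
  }'
```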

Python client (via OpenAI SDK)
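
A minimal sketch assuming the `openai` package is installed and Ollama is serving on its default port:

```python
from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server; the api_key is
# required by the SDK but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Is 1001 divisible by 7?"}],
    temperature=0.6,
    max_tokens=2048,
)

# The <think> block arrives inline in the message content.
print(response.choices[0].message.content)
```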

vLLM Production Setup

vLLM delivers the highest throughput for multi-user serving with continuous batching, PagedAttention, and prefix caching.

Single GPU — 7B / 14B
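
A representative launch for the 14B distill on one 24 GB card:

```shell
# Serve the 14B distill on a single 24 GB GPU
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching
```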

Tip: The 32B Q4 GPTQ or AWQ checkpoint fits on a single RTX 4090 (24 GB):
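
A sketch of such a launch; the checkpoint name below is a placeholder, substitute whichever AWQ or GPTQ export of the 32B distill you trust:

```shell
# <quant-org> is a placeholder -- use a real AWQ/GPTQ export of the 32B distill
vllm serve <quant-org>/DeepSeek-R1-Distill-Qwen-32B-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```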

Multi-GPU — 70B
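
Tensor parallelism shards the model across cards; for example, across two 80 GB GPUs:

```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --tensor-parallel-size 2 \
  --max-model-len 16384
```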

Query the vLLM endpoint
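
vLLM serves the same OpenAI-compatible API on port 8000; the model field must match the served checkpoint name:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    "messages": [{"role": "user", "content": "Solve: x^2 - 5x + 6 = 0"}],
    "temperature": 0.6,
    "max_tokens": 4096
  }'
```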

Transformers / Python (with <think> Tag Parsing)

Use HuggingFace Transformers when you need fine-grained control over generation or want to integrate R1 into a Python pipeline.

Basic generation
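
A minimal sketch using the 7B distill (assumes `transformers` and a CUDA build of PyTorch; the first run downloads roughly 15 GB of weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How many times does the digit 7 appear from 1 to 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning output is long: leave generous headroom for the <think> block.
outputs = model.generate(inputs, max_new_tokens=4096, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```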

Parsing <think> tags
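
R1 emits its chain-of-thought in a single leading <think>...</think> block, so a small regex split is enough to separate reasoning from the final answer. The helper name below is illustrative:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split an R1 response into (thinking, answer).

    Everything inside the leading <think>...</think> block is the
    chain-of-thought; everything after the closing tag is the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no thinking block was emitted
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()
    return thinking, answer

raw = "<think>2 + 2 is 4.</think>The answer is 4."
thinking, answer = split_reasoning(raw)
print(answer)  # -> The answer is 4.
```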

Streaming with <think> state tracking
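
In a real deployment the chunks come from a streaming API (for example the OpenAI SDK with `stream=True`). The sketch below isolates just the state tracking, and assumes the tags arrive as standalone chunks, which is how token-level streaming typically delivers them:

```python
def track_think_state(chunks):
    """Label each streamed text chunk as 'thinking' or 'answer'."""
    in_think = False
    labeled = []
    for chunk in chunks:
        if chunk == "<think>":
            in_think = True      # entering the reasoning block
            continue
        if chunk == "</think>":
            in_think = False     # reasoning done; answer follows
            continue
        labeled.append(("thinking" if in_think else "answer", chunk))
    return labeled

stream = ["<think>", "Check ", "parity.", "</think>", "42 is even."]
for state, text in track_think_state(stream):
    print(f"[{state}] {text}")
```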

Docker Deployment on Clore.ai

Ollama Docker (simplest)

Docker image: ollama/ollama
Ports: 22/tcp, 11434/http
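
A typical launch, keeping pulled models in a named volume so they survive container restarts:

```shell
# Start Ollama with GPU access; models persist in the named volume
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Pull the model inside the running container
docker exec -it ollama ollama pull deepseek-r1:14b
```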

vLLM Docker (production)

Docker image: vllm/vllm-openai:latest
Ports: 22/tcp, 8000/http
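
The vLLM image runs the OpenAI-compatible server as its entrypoint, so serving arguments are passed directly to `docker run`; for example, the 70B distill across two GPUs:

```shell
docker run -d --gpus all \
  -v hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --tensor-parallel-size 2 \
  --max-model-len 16384
```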

Deploy on Clore.ai:

  1. Filter by 2× GPU, 48 GB+ VRAM total (e.g. 2× RTX 4090 or A100 80 GB)

  2. Set the Docker image to vllm/vllm-openai:latest

  3. Map port 8000 as HTTP

  4. Paste the vLLM serving command into the startup command field

  5. Connect via the HTTP endpoint once the health check passes

Tips for Clore.ai Deployments

Choosing the right GPU

| Budget | GPU | Daily Cost | Best Variant |
| --- | --- | --- | --- |
| Minimal | RTX 3090 (24 GB) | $0.30 – 1.00 | R1-Distill-Qwen-7B or 14B Q4 |
| Standard | RTX 4090 (24 GB) | $0.50 – 2.00 | R1-Distill-Qwen-14B Q8 or 32B Q4 |
| Production | A100 80 GB | $3 – 8 | R1-Distill-Qwen-32B FP16 |
| High quality | 2× A100 80 GB | $6 – 16 | R1-Distill-Llama-70B FP16 |

Performance tuning

  • Temperature 0.6 is the recommended default for reasoning tasks — DeepSeek's own papers use this value

  • Set max_tokens generously — reasoning models produce long <think> blocks; 4096+ for non-trivial problems

  • Enable prefix caching (--enable-prefix-caching in vLLM) when using a shared system prompt

  • Limit concurrency (--max-num-seqs 16) for reasoning workloads — each request uses more compute than a standard chat

  • Use Q4 quantization to fit 32B on a single 24 GB GPU with minimal quality loss (the distill already compresses R1's knowledge)
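
A vLLM launch applying the tuning advice above, sketched for the 14B distill on a single 24 GB GPU:

```shell
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --enable-prefix-caching \
  --max-num-seqs 16 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95
```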

Context length considerations

Reasoning models consume more context than standard chat models because of the <think> block:

| Task Complexity | Typical Thinking Length | Total Context Needed |
| --- | --- | --- |
| Simple arithmetic | ~100 tokens | ~300 tokens |
| Code generation | ~500–1000 tokens | ~2000 tokens |
| Competition math (AIME) | ~2000–4000 tokens | ~5000 tokens |
| Multi-step research analysis | ~4000–8000 tokens | ~10,000 tokens |

Troubleshooting

Out of memory (OOM)
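
Common mitigations, sketched for vLLM and Ollama: shrink the KV cache, or step down to a smaller or more aggressively quantized variant:

```shell
# Cap the context window -- the KV cache is often the culprit
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --max-model-len 8192

# Or drop to a smaller variant (Ollama tags are Q4 by default)
ollama run deepseek-r1:7b
```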

Model produces no <think> block

Some system prompts suppress thinking. Avoid instructions like "be concise" or "don't explain your reasoning." Use a minimal system prompt or none at all:
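
For example, a request that sends only the user message, with no system prompt at all:

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:14b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "temperature": 0.6
  }'
```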

Repetitive or looping <think> output

Lower the temperature to reduce randomness in the reasoning chain:
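
For example, dropping from the default 0.6 to 0.5, either interactively or per request:

```shell
# In an interactive `ollama run` session:
#   >>> /set parameter temperature 0.5

# Or per request via Ollama's native API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Prove that sqrt(2) is irrational.",
  "options": {"temperature": 0.5}
}'
```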

Slow first token (high TTFT)

This is expected — the model generates <think> tokens before the visible answer. For latency-sensitive applications where reasoning is not needed, use DeepSeek-V3 instead.

Download stalls on Clore instance

HuggingFace downloads can be slow on some providers. Pre-cache the model into a persistent volume:
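
One way to do this, assuming `/data` is the persistent volume mount:

```shell
# Download once into the persistent volume, then point servers at the cache
export HF_HOME=/data/hf-cache
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

# Ollama equivalent: pulls persist in the volume mounted at /root/.ollama
docker exec -it ollama ollama pull deepseek-r1:14b
```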

Further Reading

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek, 2025) — the technical report behind the model

  • DeepSeek-R1 model cards on HuggingFace (deepseek-ai organization)
