# DeepSeek-V3

Run DeepSeek-V3, a state-of-the-art open-source LLM with strong reasoning capabilities, on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Updated: DeepSeek-V3-0324 (March 2025).** The latest revision of DeepSeek-V3 brings significant improvements in code generation, mathematical reasoning, and general problem-solving. See the [changelog](#whats-new-in-deepseek-v3-0324) section for details.
{% endhint %}

## Why DeepSeek-V3?

* **State-of-the-art** - Competes with GPT-4o and Claude 3.5 Sonnet
* **671B MoE** - 671B total params, 37B active per token (efficient inference)
* **Improved reasoning** - DeepSeek-V3-0324 is significantly better at math and code
* **Efficient** - MoE architecture reduces compute costs vs dense models
* **Open source** - Fully open weights under MIT license
* **Long context** - 128K token context window

## What's New in DeepSeek-V3-0324

DeepSeek-V3-0324 (March 2025 revision) introduces meaningful improvements across key domains:

### Code Generation

* **+8-12% on HumanEval** compared to original V3
* Better at multi-file codebases and complex refactoring tasks
* Improved understanding of modern frameworks (FastAPI, Pydantic v2, LangChain v0.3)
* More reliable at generating complete, runnable code without omissions

### Mathematical Reasoning

* **+5.5 percentage points on MATH-500** compared to original V3
* Better step-by-step proof construction
* Improved numerical accuracy for multi-step problems
* Enhanced ability to identify and correct mistakes mid-solution

### General Reasoning

* Stronger logical deduction and causal inference
* Better at multi-step planning tasks
* More consistent performance on edge cases and ambiguous prompts
* Improved instruction following on complex, multi-constraint requests

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command (Multi-GPU Required):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
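To keep that substitution in one place, a small helper can build the base URL from the `http_pub` value. This is only a sketch: `to_base_url` is a hypothetical helper name, and it assumes the server exposes its OpenAI-compatible routes under `/v1`, as the API examples on this page do.

```python
def to_base_url(http_pub: str) -> str:
    """Turn an http_pub host (or pasted URL) from My Orders into an OpenAI-style base URL."""
    host = http_pub.strip().rstrip("/")
    # Drop any scheme the user pasted so HTTPS is always used.
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    return f"https://{host}/v1"


print(to_base_url("abc123.clorecloud.net"))  # https://abc123.clorecloud.net/v1
```

The result can be passed as `base_url` wherever the examples use `http://localhost:8000/v1`.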

### Verify It's Working

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

# Get version
curl https://your-http-pub.clorecloud.net/version
```

{% hint style="warning" %}
**Important:** DeepSeek-V3 requires **8x A100 80GB** GPUs and significant download time. HTTP 502 may persist for 15-30 minutes while the model downloads.
{% endhint %}
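Rather than re-running `curl` by hand while the model downloads, the health endpoint can be polled until it answers with HTTP 200. The sketch below uses only the standard library. `http_status` and `wait_until_ready` are hypothetical helper names, and the 45-minute ceiling and 30-second interval are assumptions chosen to sit above the 15-30 minute window mentioned above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 while the model is still loading
    except OSError:
        return 0  # DNS failure, refused connection, or timeout


def wait_until_ready(
    health_url: str,
    max_wait_s: float = 45 * 60,
    interval_s: float = 30.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll health_url until it returns 200 or max_wait_s has elapsed."""
    waited = 0.0
    while waited <= max_wait_s:
        if probe(health_url) == 200:
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

A call such as `wait_until_ready("https://your-http-pub.clorecloud.net/health")` blocks until the service responds; the `probe` and `sleep` parameters exist only so the loop can be exercised without a network.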

## Model Variants

| Model             | Parameters | Active | VRAM Required | HuggingFace                                                                                             |
| ----------------- | ---------- | ------ | ------------- | ------------------------------------------------------------------------------------------------------- |
| DeepSeek-V3-0324  | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324)                     |
| DeepSeek-V3       | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)                               |
| DeepSeek-V3-Base  | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base)                     |
| DeepSeek-V2.5     | 236B       | 21B    | 4x80GB        | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)                           |
| DeepSeek-V2-Lite  | 16B        | 2.4B   | 16GB          | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)                     |
| DeepSeek-Coder-V2 | 236B       | 21B    | 4x80GB        | [deepseek-ai/DeepSeek-Coder-V2-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct) |

## Hardware Requirements

### Full Precision

| Model            | Minimum       | Recommended  |
| ---------------- | ------------- | ------------ |
| DeepSeek-V3-0324 | 8x A100 80GB  | 8x H100 80GB |
| DeepSeek-V2.5    | 4x A100 80GB  | 4x H100 80GB |
| DeepSeek-V2-Lite | RTX 4090 24GB | A100 40GB    |

### Quantized (AWQ/GPTQ)

| Model            | Quantization | VRAM   |
| ---------------- | ------------ | ------ |
| DeepSeek-V3-0324 | INT4         | 4x80GB |
| DeepSeek-V2.5    | INT4         | 2x80GB |
| DeepSeek-V2-Lite | INT4         | 8GB    |

## Installation

### Using vLLM (Recommended)

```bash
pip install vllm==0.7.3

# DeepSeek-V3-0324 (latest, 8 GPUs)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

# Original V3 (still available)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
```

### Using Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V3-0324"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

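messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
# add_generation_prompt=True appends the assistant turn marker so the model starts a reply
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))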
```

### Using Ollama

```bash
# Pull DeepSeek-V3 (requires significant resources)
ollama pull deepseek-v3

# Or lighter variant
ollama pull deepseek-coder-v2:16b

# Run
ollama run deepseek-v3
```

## API Usage

### OpenAI-Compatible API (vLLM)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
```

### Streaming

```python
# Reuses the `client` created in the previous example
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Explain machine learning"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
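When the full reply is needed after streaming (for logging or post-processing), the deltas can be accumulated instead of printed. A minimal sketch: `collect_stream` and `_fake_chunk` are hypothetical names, and the stand-in chunks only mimic the attribute shape (`choices[0].delta.content`) used in the loop above so that the example runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content may be None or empty on some chunks
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (illustrative only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [_fake_chunk("Machine "), _fake_chunk("learning"), _fake_chunk(None)]
print(collect_stream(demo))  # Machine learning
```

With a live server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` can be passed in place of `demo`.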

### cURL

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3-0324",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7
    }'
```

## DeepSeek-V2-Lite (Single GPU)

For users with limited hardware:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --host 0.0.0.0

# Using Ollama
ollama run deepseek-coder-v2:16b
```

```python
# Using Transformers on single GPU
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V2-Lite",
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
```

## Code Generation

DeepSeek-V3-0324 is among the strongest open-weight models for code generation:

```python
prompt = """Write a Python class for a binary search tree with:
- insert
- search
- delete
- in-order traversal
Include type hints and docstrings."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2  # Lower for code
)

print(response.choices[0].message.content)
```

Advanced code tasks where V3-0324 excels:

```python
# Multi-file refactoring
prompt = """I have a Flask application with all code in app.py (500 lines).
Refactor it to use the application factory pattern with blueprints for:
- auth (login, register, logout)
- api (REST endpoints)
- admin (dashboard)
Show the complete file structure and all files."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=4000
)

print(response.choices[0].message.content)
```
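Replies to multi-file prompts like the one above usually arrive as Markdown with several fenced code blocks. The sketch below pulls those blocks out so they can be written to disk or inspected. `extract_code_blocks` is a hypothetical helper, and it assumes the model uses standard triple-backtick fences, which is common but not guaranteed.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example

# Matches an opening fence, an optional language tag, the body, then a closing fence.
_BLOCK_RE = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for each fenced block in a model reply."""
    return [(lang, body.rstrip("\n")) for lang, body in _BLOCK_RE.findall(markdown)]


reply = f"Here is the file:\n{FENCE}python\nprint('hi')\n{FENCE}\nDone."
print(extract_code_blocks(reply))  # [('python', "print('hi')")]
```

In practice, `response.choices[0].message.content` would be passed in place of `reply`.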

## Math & Reasoning

```python
# Complex math problem
prompt = """Prove that for any integer n >= 1, the sum 1^2 + 2^2 + ... + n^2 = n(n+1)(2n+1)/6.
Use mathematical induction and show all steps clearly."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1  # Very low for math
)

print(response.choices[0].message.content)
```

## Multi-GPU Configuration

### 8x GPU (Full Model — V3-0324)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code
```

### 4x GPU (V2.5)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --trust-remote-code
```
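If the same launch command is generated for several GPU configurations, it can help to assemble it programmatically. The sketch below uses only the flags shown in this guide; `build_vllm_command` is a hypothetical helper, not part of vLLM.

```python
import shlex
from typing import List, Optional


def build_vllm_command(
    model: str,
    tensor_parallel_size: int = 1,
    max_model_len: Optional[int] = None,
    gpu_memory_utilization: Optional[float] = None,
) -> List[str]:
    """Assemble the vLLM OpenAI-server command line as an argument list."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
    ]
    # Optional flags are appended only when a value is supplied.
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    if gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


# Reproduces the 4x GPU (V2.5) command as a single shell-safe string.
print(shlex.join(build_vllm_command("deepseek-ai/DeepSeek-V2.5", 4, 16384)))
```

The returned list can also be handed directly to `subprocess.run` without going through a shell.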

## Performance

### Throughput (tokens/sec)

| Model                 | GPUs         | Context | Tokens/sec |
| --------------------- | ------------ | ------- | ---------- |
| DeepSeek-V3-0324      | 8x H100      | 32K     | \~85       |
| DeepSeek-V3-0324      | 8x A100 80GB | 32K     | \~52       |
| DeepSeek-V3-0324 INT4 | 4x A100 80GB | 16K     | \~38       |
| DeepSeek-V2.5         | 4x A100 80GB | 16K     | \~70       |
| DeepSeek-V2.5         | 2x A100 80GB | 8K      | \~45       |
| DeepSeek-V2-Lite      | RTX 4090     | 8K      | \~40       |
| DeepSeek-V2-Lite      | RTX 3090     | 4K      | \~25       |

### Time to First Token (TTFT)

| Model            | Configuration | TTFT     |
| ---------------- | ------------- | -------- |
| DeepSeek-V3-0324 | 8x H100       | \~750ms  |
| DeepSeek-V3-0324 | 8x A100       | \~1100ms |
| DeepSeek-V2.5    | 4x A100       | \~500ms  |
| DeepSeek-V2-Lite | RTX 4090      | \~150ms  |

### Memory Usage

| Model            | Precision | VRAM Required |
| ---------------- | --------- | ------------- |
| DeepSeek-V3-0324 | FP16      | 8x 80GB       |
| DeepSeek-V3-0324 | INT4      | 4x 80GB       |
| DeepSeek-V2.5    | FP16      | 4x 80GB       |
| DeepSeek-V2.5    | INT4      | 2x 80GB       |
| DeepSeek-V2-Lite | FP16      | 20GB          |
| DeepSeek-V2-Lite | INT4      | 10GB          |

## Benchmarks

### DeepSeek-V3-0324 vs Competition

| Benchmark         | V3-0324 | V3 (original) | GPT-4o | Claude 3.5 Sonnet |
| ----------------- | ------- | ------------- | ------ | ----------------- |
| MMLU              | 88.5%   | 87.1%         | 88.7%  | 88.3%             |
| HumanEval         | 90.2%   | 82.6%         | 90.2%  | 92.0%             |
| MATH-500          | 67.1%   | 61.6%         | 76.6%  | 71.1%             |
| GSM8K             | 92.1%   | 89.3%         | 95.8%  | 96.4%             |
| LiveCodeBench     | 72.4%   | 65.9%         | 71.3%  | 73.8%             |
| Codeforces Rating | 1850    | 1720          | 1780   | 1790              |

*Note: MATH-500 improvement from V3 → V3-0324 is +5.5 percentage points.*
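Percentage-point gains and relative gains are easy to confuse, so here is the arithmetic spelled out for two rows of the table above. This is a sketch using only the table's own numbers; `improvement` is a hypothetical helper name.

```python
def improvement(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    points = new - old
    return round(points, 1), round(points / old * 100, 1)


# (V3 original, V3-0324) scores taken from the benchmark table
scores = {"MATH-500": (61.6, 67.1), "HumanEval": (82.6, 90.2)}

for name, (v3, v3_0324) in scores.items():
    pts, rel = improvement(v3, v3_0324)
    print(f"{name}: +{pts} points, +{rel}% relative")
# MATH-500: +5.5 points, +8.9% relative
# HumanEval: +7.6 points, +9.2% relative
```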

## Docker Compose

```yaml
version: '3.8'

services:
  deepseek:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-V2-Lite
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## GPU Requirements Summary

| Use Case              | Recommended Setup  | Cost/Hour |
| --------------------- | ------------------ | --------- |
| Full DeepSeek-V3-0324 | 8x A100 80GB       | \~$2.00   |
| DeepSeek-V2.5         | 4x A100 80GB       | \~$1.00   |
| Development/Testing   | RTX 4090 (V2-Lite) | \~$0.10   |
| Production API        | 8x H100 80GB       | \~$3.00   |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU Configuration | Hourly Rate | Daily Rate |
| ----------------- | ----------- | ---------- |
| RTX 4090 24GB     | \~$0.10     | \~$2.30    |
| A100 40GB         | \~$0.17     | \~$4.00    |
| A100 80GB         | \~$0.25     | \~$6.00    |
| 4x A100 80GB      | \~$1.00     | \~$24.00   |
| 8x A100 80GB      | \~$2.00     | \~$48.00   |

*Prices vary by provider. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for development (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Use DeepSeek-V2-Lite for testing before scaling up
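To budget a run, multiply the hourly rate by the expected duration and apply any Spot discount. The sketch below uses the approximate rates from the table above; `rental_cost` is a hypothetical helper, and the 30% discount is just the low end of the 30-50% range quoted for the Spot market.

```python
# Approximate hourly rates in USD, copied from the table above
HOURLY_USD = {
    "RTX 4090 24GB": 0.10,
    "A100 80GB": 0.25,
    "4x A100 80GB": 1.00,
    "8x A100 80GB": 2.00,
}


def rental_cost(config: str, hours: float, spot_discount: float = 0.0) -> float:
    """Estimate rental cost in USD; spot_discount is a fraction such as 0.30."""
    rate = HOURLY_USD[config] * (1.0 - spot_discount)
    return round(rate * hours, 2)


print(rental_cost("8x A100 80GB", 24))                      # 48.0
print(rental_cost("8x A100 80GB", 24, spot_discount=0.30))  # 33.6
```

Actual marketplace prices vary, so the dictionary should be updated with current rates before relying on the estimate.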

## Troubleshooting

### Out of Memory

```bash
# Reduce context length
--max-model-len 8192

# Or use quantization (requires an AWQ-quantized checkpoint of the model)
--quantization awq

# For V2-Lite on 12GB GPU
--gpu-memory-utilization 0.85
--max-model-len 4096
```

### Model Download Slow

```bash
# Pre-download
huggingface-cli download deepseek-ai/DeepSeek-V3-0324

# Or use mirror
export HF_ENDPOINT=https://hf-mirror.com
```

### trust\_remote\_code Error

```bash
# Always include this flag for DeepSeek models
--trust-remote-code
```

### Multi-GPU Not Working

```bash
# Inspect the GPU interconnect topology (NVLink vs. PCIe)
nvidia-smi topo -m

# Enable NCCL debug logging and keep peer-to-peer transfers enabled
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=0
```

## DeepSeek vs Others

| Feature    | DeepSeek-V3-0324  | Llama 3.1 405B | Mixtral 8x22B     |
| ---------- | ----------------- | -------------- | ----------------- |
| Parameters | 671B (37B active) | 405B           | 141B (39B active) |
| Context    | 128K              | 128K           | 64K               |
| Code       | **Excellent**     | Great          | Good              |
| Math       | **Excellent**     | Good           | Good              |
| Min VRAM   | 8x80GB            | 8x80GB         | 2x80GB            |
| License    | MIT               | Llama 3.1      | Apache 2.0        |

**Use DeepSeek-V3 when:**

* You need the best reasoning performance
* Code generation is the primary use case
* Math and logic tasks are important
* A multi-GPU setup is available
* You want fully open weights (MIT license)

## Next Steps

* [vLLM](/guides/language-models/vllm.md) - Deployment server
* [DeepSeek-R1](/guides/language-models/deepseek-r1.md) - Reasoning-specialized variant
* [DeepSeek Coder](/guides/language-models/deepseek-coder.md) - Code-specific variant
* [Ollama](/guides/language-models/ollama.md) - Simpler deployment
* [Fine-tune LLM](/guides/training/finetune-llm.md) - Custom training

