# DeepSeek-V3

Run DeepSeek-V3, a state-of-the-art open-source LLM with exceptional reasoning capabilities, on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Updated: DeepSeek-V3-0324 (March 2025)** — The latest revision of DeepSeek-V3 brings significant improvements in code generation, mathematical reasoning, and general problem-solving. See the [changelog](#whats-new-in-deepseek-v3-0324) section for details.
{% endhint %}

## Why DeepSeek-V3?

* **State-of-the-art** - Competes with GPT-4o and Claude 3.5 Sonnet
* **671B MoE** - 671B total params, 37B active per token (efficient inference)
* **Improved reasoning** - DeepSeek-V3-0324 is significantly better at math and code
* **Efficient** - MoE architecture reduces compute costs vs dense models
* **Open source** - Fully open weights under MIT license
* **Long context** - 128K token context window

## What's New in DeepSeek-V3-0324

DeepSeek-V3-0324 (March 2025 revision) introduces meaningful improvements across key domains:

### Code Generation

* **+8-12% on HumanEval** compared to original V3
* Better at multi-file codebases and complex refactoring tasks
* Improved understanding of modern frameworks (FastAPI, Pydantic v2, LangChain v0.3)
* More reliable at generating complete, runnable code without omissions

### Mathematical Reasoning

* **+5.5 points on MATH-500** compared to original V3
* Better step-by-step proof construction
* Improved numerical accuracy for multi-step problems
* Enhanced ability to identify and correct mistakes mid-solution

### General Reasoning

* Stronger logical deduction and causal inference
* Better at multi-step planning tasks
* More consistent performance on edge cases and ambiguous prompts
* Improved instruction following on complex, multi-constraint requests

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command (Multi-GPU Required):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

### Verify It's Working

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

# Get version
curl https://your-http-pub.clorecloud.net/version
```

{% hint style="warning" %}
**Important:** DeepSeek-V3 requires **8x A100 80GB** GPUs, and the initial weight download is large. Expect HTTP 502 errors for 15-30 minutes after deployment while the model downloads and loads.
{% endhint %}
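
Rather than refreshing manually during that 15-30 minute window, you can poll the health endpoint from a script. The sketch below uses only the Python standard library; the hostname is a placeholder for your own `http_pub` URL.

```python
# Poll the vLLM /health endpoint until the server responds, then return.
import time
import urllib.error
import urllib.request


def health_url(host: str) -> str:
    """Build the health-check URL for a CLORE http_pub hostname."""
    return f"https://{host}/health"


def wait_for_ready(host: str, timeout_s: int = 1800, interval_s: int = 30) -> bool:
    """Return True once /health answers 200, or False after timeout_s."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(health_url(host), timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # 502 / connection refused while the model is still loading
        time.sleep(interval_s)
    return False


# Example (blocks until the server is up or the timeout expires):
# wait_for_ready("your-http-pub.clorecloud.net")
```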

## Model Variants

| Model             | Parameters | Active | VRAM Required | HuggingFace                                                                                             |
| ----------------- | ---------- | ------ | ------------- | ------------------------------------------------------------------------------------------------------- |
| DeepSeek-V3-0324  | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324)                     |
| DeepSeek-V3       | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)                               |
| DeepSeek-V3-Base  | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base)                     |
| DeepSeek-V2.5     | 236B       | 21B    | 4x80GB        | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)                           |
| DeepSeek-V2-Lite  | 16B        | 2.4B   | 16GB          | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)                     |
| DeepSeek-Coder-V2 | 236B       | 21B    | 4x80GB        | [deepseek-ai/DeepSeek-Coder-V2-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct) |

## Hardware Requirements

### Full Precision

| Model            | Minimum       | Recommended  |
| ---------------- | ------------- | ------------ |
| DeepSeek-V3-0324 | 8x A100 80GB  | 8x H100 80GB |
| DeepSeek-V2.5    | 4x A100 80GB  | 4x H100 80GB |
| DeepSeek-V2-Lite | RTX 4090 24GB | A100 40GB    |

### Quantized (AWQ/GPTQ)

| Model            | Quantization | VRAM   |
| ---------------- | ------------ | ------ |
| DeepSeek-V3-0324 | INT4         | 4x80GB |
| DeepSeek-V2.5    | INT4         | 2x80GB |
| DeepSeek-V2-Lite | INT4         | 8GB    |

## Installation

### Using vLLM (Recommended)

```bash
pip install vllm==0.7.3

# DeepSeek-V3-0324 (latest, 8 GPUs)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

# Original V3 (still available)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
```

### Using Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V3-0324"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using Ollama

```bash
# Pull DeepSeek-V3 (requires significant resources)
ollama pull deepseek-v3

# Or lighter variant
ollama pull deepseek-coder-v2:16b

# Run
ollama run deepseek-v3
```

## API Usage

### OpenAI-Compatible API (vLLM)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Explain machine learning"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3-0324",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7
    }'
```

## DeepSeek-V2-Lite (Single GPU)

For users with limited hardware:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --host 0.0.0.0

# Using Ollama
ollama run deepseek-coder-v2:16b
```

```python
# Using Transformers on a single GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V2-Lite"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python one-liner that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Code Generation

DeepSeek-V3-0324 is particularly strong at code generation:

```python
prompt = """Write a Python class for a binary search tree with:
- insert
- search
- delete
- in-order traversal
Include type hints and docstrings."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2  # Lower for code
)

print(response.choices[0].message.content)
```

Advanced code tasks where V3-0324 excels:

```python
# Multi-file refactoring
prompt = """I have a Flask application with all code in app.py (500 lines).
Refactor it to use the application factory pattern with blueprints for:
- auth (login, register, logout)
- api (REST endpoints)
- admin (dashboard)
Show the complete file structure and all files."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=4000
)
```

## Math & Reasoning

```python
# Complex math problem
prompt = """Prove that for any integer n >= 1, the sum 1^2 + 2^2 + ... + n^2 = n(n+1)(2n+1)/6.
Use mathematical induction and show all steps clearly."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1  # Very low for math
)

print(response.choices[0].message.content)
```

## Multi-GPU Configuration

### 8x GPU (Full Model — V3-0324)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code
```

### 4x GPU (V2.5)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --trust-remote-code
```

## Performance

### Throughput (tokens/sec)

| Model                 | GPUs         | Context | Tokens/sec |
| --------------------- | ------------ | ------- | ---------- |
| DeepSeek-V3-0324      | 8x H100      | 32K     | \~85       |
| DeepSeek-V3-0324      | 8x A100 80GB | 32K     | \~52       |
| DeepSeek-V3-0324 INT4 | 4x A100 80GB | 16K     | \~38       |
| DeepSeek-V2.5         | 4x A100 80GB | 16K     | \~70       |
| DeepSeek-V2.5         | 2x A100 80GB | 8K      | \~45       |
| DeepSeek-V2-Lite      | RTX 4090     | 8K      | \~40       |
| DeepSeek-V2-Lite      | RTX 3090     | 4K      | \~25       |
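
These throughput numbers translate directly into serving cost: dollars per million generated tokens is roughly the hourly rate divided by tokens generated per hour. A minimal sketch, using the approximate figures from the tables in this guide:

```python
# Back-of-envelope: convert an hourly GPU rate and a sustained decode
# throughput into an approximate cost per million generated tokens.
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Approximate figures from this guide: 8x A100 80GB at ~$2.00/hr, ~52 tok/s
print(round(cost_per_million_tokens(2.00, 52), 2))  # 10.68 -- roughly $10.68 per 1M tokens
```

This assumes the server is saturated; at low utilization the effective cost per token is higher, since you pay for idle time.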

### Time to First Token (TTFT)

| Model            | Configuration | TTFT     |
| ---------------- | ------------- | -------- |
| DeepSeek-V3-0324 | 8x H100       | \~750ms  |
| DeepSeek-V3-0324 | 8x A100       | \~1100ms |
| DeepSeek-V2.5    | 4x A100       | \~500ms  |
| DeepSeek-V2-Lite | RTX 4090      | \~150ms  |

### Memory Usage

| Model            | Precision | VRAM Required |
| ---------------- | --------- | ------------- |
| DeepSeek-V3-0324 | FP16      | 8x 80GB       |
| DeepSeek-V3-0324 | INT4      | 4x 80GB       |
| DeepSeek-V2.5    | FP16      | 4x 80GB       |
| DeepSeek-V2.5    | INT4      | 2x 80GB       |
| DeepSeek-V2-Lite | FP16      | 20GB          |
| DeepSeek-V2-Lite | INT4      | 10GB          |

## Benchmarks

### DeepSeek-V3-0324 vs Competition

| Benchmark         | V3-0324 | V3 (original) | GPT-4o | Claude 3.5 Sonnet |
| ----------------- | ------- | ------------- | ------ | ----------------- |
| MMLU              | 88.5%   | 87.1%         | 88.7%  | 88.3%             |
| HumanEval         | 90.2%   | 82.6%         | 90.2%  | 92.0%             |
| MATH-500          | 67.1%   | 61.6%         | 76.6%  | 71.1%             |
| GSM8K             | 92.1%   | 89.3%         | 95.8%  | 96.4%             |
| LiveCodeBench     | 72.4%   | 65.9%         | 71.3%  | 73.8%             |
| Codeforces Rating | 1850    | 1720          | 1780   | 1790              |

*Note: MATH-500 improvement from V3 → V3-0324 is +5.5 percentage points.*

## Docker Compose

```yaml
version: '3.8'

services:
  deepseek:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-V2-Lite
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## GPU Requirements Summary

| Use Case              | Recommended Setup  | Cost/Hour |
| --------------------- | ------------------ | --------- |
| Full DeepSeek-V3-0324 | 8x A100 80GB       | \~$2.00   |
| DeepSeek-V2.5         | 4x A100 80GB       | \~$1.00   |
| Development/Testing   | RTX 4090 (V2-Lite) | \~$0.10   |
| Production API        | 8x H100 80GB       | \~$3.00   |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU Configuration | Hourly Rate | Daily Rate |
| ----------------- | ----------- | ---------- |
| RTX 4090 24GB     | \~$0.10     | \~$2.30    |
| A100 40GB         | \~$0.17     | \~$4.00    |
| A100 80GB         | \~$0.25     | \~$6.00    |
| 4x A100 80GB      | \~$1.00     | \~$24.00   |
| 8x A100 80GB      | \~$2.00     | \~$48.00   |

*Prices vary by provider. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for development (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Use DeepSeek-V2-Lite for testing before scaling up
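
To budget a run up front, the rates above reduce to simple arithmetic. A small illustrative estimator (the rates and the spot discount here are examples, not quoted prices):

```python
# Rough rental-cost estimator for planning a run on the marketplace.
def rental_cost(hourly_rate_usd: float, hours: float, spot_discount: float = 0.0) -> float:
    """Total cost for `hours` of rental; spot_discount is e.g. 0.3 for 30% off."""
    return hourly_rate_usd * hours * (1.0 - spot_discount)


# 8x A100 80GB for a day at the on-demand rate:
print(rental_cost(2.00, 24))                    # 48.0 -- matches the ~$48/day figure
# Same machine on the spot market at an assumed 40% discount:
print(round(rental_cost(2.00, 24, 0.4), 2))     # 28.8
```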

## Troubleshooting

### Out of Memory

```bash
# Reduce context length
--max-model-len 8192

# Or use quantization
--quantization awq

# For V2-Lite on 12GB GPU
--gpu-memory-utilization 0.85
--max-model-len 4096
```

### Model Download Slow

```bash
# Pre-download
huggingface-cli download deepseek-ai/DeepSeek-V3-0324

# Or use mirror
export HF_ENDPOINT=https://hf-mirror.com
```

### trust\_remote\_code Error

```bash
# Always include this flag for DeepSeek models
--trust-remote-code
```

### Multi-GPU Not Working

```bash
# Check NCCL
nvidia-smi topo -m

# Set NCCL variables
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=0
```
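
Before digging into NCCL settings, confirm the driver actually sees every GPU — `--tensor-parallel-size 8` fails if fewer than 8 devices are visible. A small sketch that counts devices by parsing `nvidia-smi -L` output (one `GPU <index>:` line per device):

```python
# Count GPUs visible to the driver by parsing `nvidia-smi -L` output.
import subprocess


def count_gpus(listing: str) -> int:
    """Count lines of `nvidia-smi -L` output that describe a GPU."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def visible_gpu_count() -> int:
    """Run nvidia-smi and return the number of visible devices."""
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    return count_gpus(out.stdout)


sample = (
    "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxx)\n"
    "GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-yyyy)"
)
print(count_gpus(sample))  # 2

# On the server itself:
# assert visible_gpu_count() >= 8  # required for --tensor-parallel-size 8
```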

## DeepSeek vs Others

| Feature    | DeepSeek-V3-0324  | Llama 3.1 405B | Mixtral 8x22B     |
| ---------- | ----------------- | -------------- | ----------------- |
| Parameters | 671B (37B active) | 405B           | 176B (44B active) |
| Context    | 128K              | 128K           | 64K               |
| Code       | **Excellent**     | Great          | Good              |
| Math       | **Excellent**     | Good           | Good              |
| Min VRAM   | 8x80GB            | 8x80GB         | 2x80GB            |
| License    | MIT               | Llama 3.1      | Apache 2.0        |

**Use DeepSeek-V3 when:**

* Best reasoning performance needed
* Code generation is primary use
* Math/logic tasks are important
* Have multi-GPU setup available
* Want fully open-source weights (MIT license)

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Deployment server
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning-specialized variant
* [DeepSeek Coder](https://docs.clore.ai/guides/language-models/deepseek-coder) - Code-specific variant
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Simpler deployment
* [Fine-tune LLM](https://docs.clore.ai/guides/training/finetune-llm) - Custom training
