# DeepSeek-V3

Run DeepSeek-V3, a state-of-the-art open-source LLM with exceptional reasoning capabilities, on CLORE.AI GPUs.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Updated: DeepSeek-V3-0324 (March 2025)** — The latest revision of DeepSeek-V3 brings significant improvements in code generation, mathematical reasoning, and general problem-solving. See the [changelog](#whats-new-in-deepseek-v3-0324) section for details.
{% endhint %}

## Why DeepSeek-V3?

* **State-of-the-art** - Competes with GPT-4o and Claude 3.5 Sonnet
* **671B MoE** - 671B total params, 37B active per token (efficient inference)
* **Improved reasoning** - DeepSeek-V3-0324 is significantly better at math and code
* **Efficient** - MoE architecture reduces compute costs vs dense models
* **Open source** - Fully open weights under MIT license
* **Long context** - 128K token context window

## What's New in DeepSeek-V3-0324

DeepSeek-V3-0324 (March 2025 revision) introduces meaningful improvements across key domains:

### Code Generation

* **+8-12% on HumanEval** compared to original V3
* Better at multi-file codebases and complex refactoring tasks
* Improved understanding of modern frameworks (FastAPI, Pydantic v2, LangChain v0.3)
* More reliable at generating complete, runnable code without omissions

### Mathematical Reasoning

* **+5.5 points on MATH-500** compared to original V3
* Better step-by-step proof construction
* Improved numerical accuracy for multi-step problems
* Enhanced ability to identify and correct mistakes mid-solution

### General Reasoning

* Stronger logical deduction and causal inference
* Better at multi-step planning tasks
* More consistent performance on edge cases and ambiguous prompts
* Improved instruction following on complex, multi-constraint requests

## Quick Deploy on CLORE.AI

**Docker Image:**

```
vllm/vllm-openai:latest
```

**Ports:**

```
22/tcp
8000/http
```

**Command (Multi-GPU Required):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

### Verify It's Working

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/health

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

# Get version
curl https://your-http-pub.clorecloud.net/version
```

{% hint style="warning" %}
**Important:** DeepSeek-V3 requires **8x A100 80GB** GPUs, and the initial weight download is large. Expect HTTP 502 errors for 15-30 minutes after deployment while the model downloads and loads.
{% endhint %}
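
Rather than refreshing manually during that 15-30 minute window, you can poll the health endpoint from a script. The sketch below uses only the Python standard library; the hostname is a placeholder for your own `http_pub` URL.

```python
# Poll the vLLM /health endpoint until the server responds, then return.
import time
import urllib.error
import urllib.request


def health_url(host: str) -> str:
    """Build the health-check URL for a CLORE http_pub hostname."""
    return f"https://{host}/health"


def wait_for_ready(host: str, timeout_s: int = 1800, interval_s: int = 30) -> bool:
    """Return True once /health answers 200, or False after timeout_s."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(health_url(host), timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # 502 / connection refused while the model is still loading
        time.sleep(interval_s)
    return False


# Example (blocks until the server is up or the timeout expires):
# wait_for_ready("your-http-pub.clorecloud.net")
```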

## Model Variants

| Model             | Parameters | Active | VRAM Required | HuggingFace                                                                                             |
| ----------------- | ---------- | ------ | ------------- | ------------------------------------------------------------------------------------------------------- |
| DeepSeek-V3-0324  | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324)                     |
| DeepSeek-V3       | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)                               |
| DeepSeek-V3-Base  | 671B       | 37B    | 8x80GB        | [deepseek-ai/DeepSeek-V3-Base](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base)                     |
| DeepSeek-V2.5     | 236B       | 21B    | 4x80GB        | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5)                           |
| DeepSeek-V2-Lite  | 16B        | 2.4B   | 16GB          | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)                     |
| DeepSeek-Coder-V2 | 236B       | 21B    | 4x80GB        | [deepseek-ai/DeepSeek-Coder-V2-Instruct](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct) |

## Hardware Requirements

### Full Precision

| Model            | Minimum       | Recommended  |
| ---------------- | ------------- | ------------ |
| DeepSeek-V3-0324 | 8x A100 80GB  | 8x H100 80GB |
| DeepSeek-V2.5    | 4x A100 80GB  | 4x H100 80GB |
| DeepSeek-V2-Lite | RTX 4090 24GB | A100 40GB    |

### Quantized (AWQ/GPTQ)

| Model            | Quantization | VRAM   |
| ---------------- | ------------ | ------ |
| DeepSeek-V3-0324 | INT4         | 4x80GB |
| DeepSeek-V2.5    | INT4         | 2x80GB |
| DeepSeek-V2-Lite | INT4         | 8GB    |

## Installation

### Using vLLM (Recommended)

```bash
pip install vllm==0.7.3

# DeepSeek-V3-0324 (latest, 8 GPUs)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

# Original V3 (still available)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
```

### Using Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V3-0324"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using Ollama

```bash
# Pull DeepSeek-V3 (requires significant resources)
ollama pull deepseek-v3

# Or lighter variant
ollama pull deepseek-coder-v2:16b

# Run
ollama run deepseek-v3
```

## API Usage

### OpenAI-Compatible API (vLLM)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Explain machine learning"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### cURL

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3-0324",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7
    }'
```

## DeepSeek-V2-Lite (Single GPU)

For users with limited hardware:

```bash
# Using vLLM
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2-Lite \
    --trust-remote-code \
    --host 0.0.0.0

# Using Ollama
ollama run deepseek-coder-v2:16b
```

```python
# Using Transformers on a single GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V2-Lite"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python one-liner that reverses a string."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Code Generation

DeepSeek-V3-0324 is particularly strong at code generation:

```python
prompt = """Write a Python class for a binary search tree with:
- insert
- search
- delete
- in-order traversal
Include type hints and docstrings."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2  # Lower for code
)

print(response.choices[0].message.content)
```

Advanced code tasks where V3-0324 excels:

```python
# Multi-file refactoring
prompt = """I have a Flask application with all code in app.py (500 lines).
Refactor it to use the application factory pattern with blueprints for:
- auth (login, register, logout)
- api (REST endpoints)
- admin (dashboard)
Show the complete file structure and all files."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=4000
)
```

## Math & Reasoning

```python
# Complex math problem
prompt = """Prove that for any integer n >= 1, the sum 1^2 + 2^2 + ... + n^2 = n(n+1)(2n+1)/6.
Use mathematical induction and show all steps clearly."""

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1  # Very low for math
)

print(response.choices[0].message.content)
```

## Multi-GPU Configuration

### 8x GPU (Full Model — V3-0324)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code
```

### 4x GPU (V2.5)

```bash
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --trust-remote-code
```

## Performance

### Throughput (tokens/sec)

| Model                 | GPUs         | Context | Tokens/sec |
| --------------------- | ------------ | ------- | ---------- |
| DeepSeek-V3-0324      | 8x H100      | 32K     | \~85       |
| DeepSeek-V3-0324      | 8x A100 80GB | 32K     | \~52       |
| DeepSeek-V3-0324 INT4 | 4x A100 80GB | 16K     | \~38       |
| DeepSeek-V2.5         | 4x A100 80GB | 16K     | \~70       |
| DeepSeek-V2.5         | 2x A100 80GB | 8K      | \~45       |
| DeepSeek-V2-Lite      | RTX 4090     | 8K      | \~40       |
| DeepSeek-V2-Lite      | RTX 3090     | 4K      | \~25       |
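
These throughput numbers translate directly into serving cost: dollars per million generated tokens is roughly the hourly rate divided by tokens generated per hour. A minimal sketch, using the approximate figures from the tables in this guide:

```python
# Back-of-envelope: convert an hourly GPU rate and a sustained decode
# throughput into an approximate cost per million generated tokens.
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Approximate figures from this guide: 8x A100 80GB at ~$2.00/hr, ~52 tok/s
print(round(cost_per_million_tokens(2.00, 52), 2))  # 10.68 -- roughly $10.68 per 1M tokens
```

This assumes the server is saturated; at low utilization the effective cost per token is higher, since you pay for idle time.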

### Time to First Token (TTFT)

| Model            | Configuration | TTFT     |
| ---------------- | ------------- | -------- |
| DeepSeek-V3-0324 | 8x H100       | \~750ms  |
| DeepSeek-V3-0324 | 8x A100       | \~1100ms |
| DeepSeek-V2.5    | 4x A100       | \~500ms  |
| DeepSeek-V2-Lite | RTX 4090      | \~150ms  |

### Memory Usage

| Model            | Precision | VRAM Required |
| ---------------- | --------- | ------------- |
| DeepSeek-V3-0324 | FP16      | 8x 80GB       |
| DeepSeek-V3-0324 | INT4      | 4x 80GB       |
| DeepSeek-V2.5    | FP16      | 4x 80GB       |
| DeepSeek-V2.5    | INT4      | 2x 80GB       |
| DeepSeek-V2-Lite | FP16      | 20GB          |
| DeepSeek-V2-Lite | INT4      | 10GB          |

## Benchmarks

### DeepSeek-V3-0324 vs Competition

| Benchmark         | V3-0324 | V3 (original) | GPT-4o | Claude 3.5 Sonnet |
| ----------------- | ------- | ------------- | ------ | ----------------- |
| MMLU              | 88.5%   | 87.1%         | 88.7%  | 88.3%             |
| HumanEval         | 90.2%   | 82.6%         | 90.2%  | 92.0%             |
| MATH-500          | 67.1%   | 61.6%         | 76.6%  | 71.1%             |
| GSM8K             | 92.1%   | 89.3%         | 95.8%  | 96.4%             |
| LiveCodeBench     | 72.4%   | 65.9%         | 71.3%  | 73.8%             |
| Codeforces Rating | 1850    | 1720          | 1780   | 1790              |

*Note: MATH-500 improvement from V3 → V3-0324 is +5.5 percentage points.*

## Docker Compose

```yaml
version: '3.8'

services:
  deepseek:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-V2-Lite
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## GPU Requirements Summary

| Use Case              | Recommended Setup  | Cost/Hour |
| --------------------- | ------------------ | --------- |
| Full DeepSeek-V3-0324 | 8x A100 80GB       | \~$2.00   |
| DeepSeek-V2.5         | 4x A100 80GB       | \~$1.00   |
| Development/Testing   | RTX 4090 (V2-Lite) | \~$0.10   |
| Production API        | 8x H100 80GB       | \~$3.00   |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU Configuration | Hourly Rate | Daily Rate |
| ----------------- | ----------- | ---------- |
| RTX 4090 24GB     | \~$0.10     | \~$2.30    |
| A100 40GB         | \~$0.17     | \~$4.00    |
| A100 80GB         | \~$0.25     | \~$6.00    |
| 4x A100 80GB      | \~$1.00     | \~$24.00   |
| 8x A100 80GB      | \~$2.00     | \~$48.00   |

*Prices vary by provider. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for development (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Use DeepSeek-V2-Lite for testing before scaling up
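
To budget a run up front, the rates above reduce to simple arithmetic. A small illustrative estimator (the rates and the spot discount here are examples, not quoted prices):

```python
# Rough rental-cost estimator for planning a run on the marketplace.
def rental_cost(hourly_rate_usd: float, hours: float, spot_discount: float = 0.0) -> float:
    """Total cost for `hours` of rental; spot_discount is e.g. 0.3 for 30% off."""
    return hourly_rate_usd * hours * (1.0 - spot_discount)


# 8x A100 80GB for a day at the on-demand rate:
print(rental_cost(2.00, 24))                    # 48.0 -- matches the ~$48/day figure
# Same machine on the spot market at an assumed 40% discount:
print(round(rental_cost(2.00, 24, 0.4), 2))     # 28.8
```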

## Troubleshooting

### Out of Memory

```bash
# Reduce context length
--max-model-len 8192

# Or use quantization
--quantization awq

# For V2-Lite on 12GB GPU
--gpu-memory-utilization 0.85
--max-model-len 4096
```

### Model Download Slow

```bash
# Pre-download
huggingface-cli download deepseek-ai/DeepSeek-V3-0324

# Or use mirror
export HF_ENDPOINT=https://hf-mirror.com
```

### trust\_remote\_code Error

```bash
# Always include this flag for DeepSeek models
--trust-remote-code
```

### Multi-GPU Not Working

```bash
# Check NCCL
nvidia-smi topo -m

# Set NCCL variables
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=0
```
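
Before digging into NCCL settings, confirm the driver actually sees every GPU — `--tensor-parallel-size 8` fails if fewer than 8 devices are visible. A small sketch that counts devices by parsing `nvidia-smi -L` output (one `GPU <index>:` line per device):

```python
# Count GPUs visible to the driver by parsing `nvidia-smi -L` output.
import subprocess


def count_gpus(listing: str) -> int:
    """Count lines of `nvidia-smi -L` output that describe a GPU."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def visible_gpu_count() -> int:
    """Run nvidia-smi and return the number of visible devices."""
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    return count_gpus(out.stdout)


sample = (
    "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxx)\n"
    "GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-yyyy)"
)
print(count_gpus(sample))  # 2

# On the server itself:
# assert visible_gpu_count() >= 8  # required for --tensor-parallel-size 8
```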

## DeepSeek vs Others

| Feature    | DeepSeek-V3-0324  | Llama 3.1 405B | Mixtral 8x22B     |
| ---------- | ----------------- | -------------- | ----------------- |
| Parameters | 671B (37B active) | 405B           | 176B (44B active) |
| Context    | 128K              | 128K           | 64K               |
| Code       | **Excellent**     | Great          | Good              |
| Math       | **Excellent**     | Good           | Good              |
| Min VRAM   | 8x80GB            | 8x80GB         | 2x80GB            |
| License    | MIT               | Llama 3.1      | Apache 2.0        |

**Use DeepSeek-V3 when:**

* Best reasoning performance needed
* Code generation is primary use
* Math/logic tasks are important
* Have multi-GPU setup available
* Want fully open-source weights (MIT license)

## Next Steps

* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - Deployment server
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning-specialized variant
* [DeepSeek Coder](https://docs.clore.ai/guides/language-models/deepseek-coder) - Code-specific variant
* [Ollama](https://docs.clore.ai/guides/language-models/ollama) - Simpler deployment
* [Fine-tune LLM](https://docs.clore.ai/guides/training/finetune-llm) - Custom training
