# DeepSeek V4 (1T MoE, Multimodal)

{% hint style="info" %}
**Status (March 4, 2026):** DeepSeek V4 release is imminent — expected first week of March 2026. This guide covers setup using vLLM/Ollama once weights drop on HuggingFace. Check [huggingface.co/deepseek-ai](https://huggingface.co/deepseek-ai) for the latest release.
{% endhint %}

DeepSeek V4 is the most anticipated open-weight model of early 2026 — a **\~1 trillion parameter multimodal MoE** from DeepSeek AI, reportedly trained on NVIDIA hardware and optimized to also run on Huawei Ascend chips. With only \~32B parameters active per token, it is expected to deliver frontier-class performance at a fraction of the compute cost of a dense model of the same size.

### Key Specs

| Property          | Value                                    |
| ----------------- | ---------------------------------------- |
| Total Parameters  | \~1 Trillion (MoE)                       |
| Active Parameters | \~32B per forward pass                   |
| Context Window    | 1M tokens                                |
| Modalities        | Text + Image + Video                     |
| License           | Expected MIT (like V3)                   |
| Benchmark         | Expected to top open-source leaderboards |

### Why DeepSeek V4?

* **Aiming for #1 open-weight** — designed to surpass V3 and rival GPT-4.5/Claude Opus
* **Multimodal** — natively handles text, image, and video inputs
* **1M context** — long-document RAG, entire codebases in context
* **Expected MIT license** — commercial use allowed, as with V3
* **Massive efficiency** — only \~32B active params despite \~1T total

***

## Requirements

| Component | Minimum                                     | Recommended                       |
| --------- | ------------------------------------------- | --------------------------------- |
| GPU VRAM  | 2× RTX 4090 (48GB), Q4 with CPU offload     | 8× H100 80GB (640GB), Q4 in VRAM  |
| RAM       | 64GB (512GB+ if offloading weights to RAM)  | 128GB                             |
| Disk      | 500GB (quantized)                           | 2TB (FP16)                        |
| CUDA      | 12.4+                                       | 12.6+                             |

{% hint style="warning" %}
**Large model alert:** DeepSeek V4 at FP16 requires \~2TB of VRAM across multiple A100/H100 nodes. For practical use, wait for GGUF quantizations (expected within days of release). Even Q4\_K\_M at \~1T params is \~500GB, beyond any consumer rig's VRAM: on RTX 4090 setups expect heavy CPU/RAM offload, or rent a multi-GPU H100 node to keep the weights entirely in VRAM.
{% endhint %}
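The sizes in the warning above follow from simple arithmetic: parameter count times bits per parameter. A quick sketch (weights only; KV cache and activations add more on top):

```python
def model_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only footprint in GB; ignores KV cache and activations."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

TOTAL_B = 1000  # ~1T total parameters (MoE)

# Q8_0 is ~8.5 bits/param effective, Q4_K_M ~4.5 bits/param effective
for name, bits in [("FP16/BF16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{name:>9}: {model_size_gb(TOTAL_B, bits):,.0f} GB")
```

FP16 comes out at \~2,000 GB and Q4\_K\_M at \~560 GB, matching the \~2TB and \~500GB figures above. Note that only \~32B params are *active* per token; all \~1T still have to live somewhere in memory.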

***

## Option A — Quantized via Ollama (Easiest, once available)

Ollama usually picks up major releases quickly; expect DeepSeek V4 tags shortly after the weights land on HuggingFace.

```yaml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_MAX_LOADED_MODELS=1

volumes:
  ollama_data:
```

```bash
# Pull and run DeepSeek V4 (update tag once released)
docker exec ollama ollama pull deepseek-v4:32b-q4_K_M
docker exec ollama ollama run deepseek-v4:32b-q4_K_M

# Or via Open WebUI for a full chat interface
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
```
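Once a tag exists, you can also call Ollama's native `/api/chat` endpoint from Python. A minimal stdlib sketch — the model tag below is a placeholder until the official one appears:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # or your Clore.ai forwarded host:port

def chat_body(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/chat: a single user turn, non-streaming."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def chat(model: str, prompt: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=chat_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# print(chat("deepseek-v4:32b-q4_K_M", "Hello"))  # placeholder tag, update on release
```

Ollama's non-streaming chat response puts the reply under `message.content`, which is what `chat()` extracts.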

***

## Option B — vLLM (Production API, high throughput)

```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
    command: >
      --model deepseek-ai/DeepSeek-V4
      --tensor-parallel-size 4
      --max-model-len 32768
      --dtype bfloat16
      --gpu-memory-utilization 0.92
      --served-model-name deepseek-v4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "16gb"

volumes:
  hf_cache:
```

```bash
# Test the API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply"}],
    "max_tokens": 512
  }'
```
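The same endpoint works from Python with nothing but the standard library. A sketch against vLLM's OpenAI-compatible route (no API key by default; the model name matches `--served-model-name` in the compose file above):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def completion_body(model: str, prompt: str, max_tokens: int = 512) -> bytes:
    """OpenAI-style chat.completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def complete(model: str, prompt: str) -> str:
    """Return the first choice's message content from the OpenAI-style response."""
    req = urllib.request.Request(
        VLLM_URL,
        data=completion_body(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# print(complete("deepseek-v4", "Explain quantum entanglement simply"))
```

Unlike Ollama's native API, the OpenAI-compatible schema nests the reply under `choices[0].message.content`, so any OpenAI SDK client pointed at `http://localhost:8000/v1` works unchanged.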

***

## Option C — llama.cpp (CPU+GPU, quantized)

```bash
# Once GGUF files are available on HuggingFace
docker run --gpus all -it --rm \
  -p 8080:8080 \
  -v $(pwd)/models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/deepseek-v4-q4_k_m.gguf \
  --n-gpu-layers 80 \
  --threads 8 \
  --ctx-size 8192 \
  --port 8080 \
  --host 0.0.0.0
```

***

## GPU Recommendations on Clore.ai

| Setup        | VRAM  | Expected Performance                  | Clore.ai Cost |
| ------------ | ----- | ------------------------------------- | ------------- |
| 2× RTX 4090  | 48GB  | Q4 with heavy CPU offload, \~15 tok/s | \~$4–5/day    |
| 4× RTX 4090  | 96GB  | Q4/Q5 with less offload, \~25 tok/s   | \~$8–10/day   |
| 4× A100 80GB | 320GB | Low-bit quant sharded in VRAM, fast   | \~$15–20/day  |
| 8× H100 80GB | 640GB | Q4/Q5 fully in VRAM, maximum speed    | \~$50+/day    |

{% hint style="success" %}
**Best value on Clore.ai:** Rent 2× RTX 4090 (available from \~$4/day) for Q4 quantized DeepSeek V4. Expect 10–20 tokens/second — perfect for personal use and development.
{% endhint %}
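Those daily rates translate into a rough cost per million generated tokens, assuming the rig decodes continuously (real utilization will be lower, so treat this as a floor):

```python
def usd_per_million_tokens(tokens_per_sec: float, usd_per_day: float) -> float:
    """Cost per 1M output tokens at full, continuous utilization."""
    tokens_per_day = tokens_per_sec * 86_400  # seconds in a day
    return usd_per_day / tokens_per_day * 1_000_000

# 2x RTX 4090 at ~15 tok/s and ~$4.50/day
print(f"${usd_per_million_tokens(15, 4.5):.2f} per 1M tokens")  # → $3.47 per 1M tokens
```

A few dollars per million tokens at the speculative throughput above — the self-hosting economics only work if you actually keep the rental busy.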

***

## Clore.ai Port Forwarding

Add these to your Clore.ai container port configuration:

| Port  | Service                       |
| ----- | ----------------------------- |
| 11434 | Ollama API                    |
| 8000  | vLLM OpenAI-compatible API    |
| 8080  | llama.cpp server / Open WebUI |
| 3000  | Open WebUI chat interface     |

***

## Performance Tips

1. **Use Q4\_K\_M quantization** for the best quality/VRAM tradeoff — expected to still beat most dense 70B models
2. **Enable chunked prefill**: add `--enable-chunked-prefill` in vLLM so long-context prefills don't stall ongoing decodes
3. **Tensor parallelism**: vLLM's `--tensor-parallel-size N` shards the model across N GPUs with a single flag
4. **Context length**: start with 8192 ctx on 2× 4090 and increase only if VRAM allows — the KV cache grows linearly with context
5. **BF16 over FP16** for MoE models — BF16's wider exponent range loses less precision on sparse expert activations
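Tip 4's advice comes down to KV-cache growth. A naive per-sequence estimator — the layer/head numbers below are hypothetical (V4's architecture isn't published), and DeepSeek's Multi-head Latent Attention compresses the cache well below this figure:

```python
def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Naive KV cache size for one sequence: K and V stored per layer per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical V3-like shape: 61 layers, 128 KV heads, head_dim 128, FP16 cache
print(f"{kv_cache_gb(8192, 61, 128, 128):.1f} GB at 8K context")
```

Even at 8K context the naive cache runs into the tens of GB per sequence, which is why compressed-KV attention and conservative `--max-model-len` settings matter long before you approach the advertised 1M window.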

***

## What to Expect

Based on DeepSeek V3's track record and pre-release reporting:

* **Coding:** Expected top-tier on SWE-bench (rivaling Claude 3.7 Sonnet)
* **Math/Reasoning:** MATH-500 and AIME scores above all open-weight predecessors
* **Multimodal:** Image and video understanding comparable to GPT-4V
* **Long context:** 1M token window for entire codebase analysis

***

## Links

* **HuggingFace:** [huggingface.co/deepseek-ai](https://huggingface.co/deepseek-ai) (weights will appear here)
* **GitHub:** [github.com/deepseek-ai](https://github.com/deepseek-ai)
* **DeepSeek V3 guide (current):** [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3)
* **DeepSeek-R1 guide:** [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1)
* **Clore.ai Marketplace:** [clore.ai/marketplace](https://clore.ai/marketplace)
