# Continue.dev AI Coding

Continue.dev is an open-source AI coding assistant for VS Code and JetBrains with 25K+ GitHub stars. The **extension runs locally in your IDE**, but it delegates inference to a backend model server. By pointing Continue.dev at a powerful GPU rented from Clore.ai, you get:

* **Top-tier coding models** (34B+ parameters) that won't fit on your laptop
* **Full privacy** — code stays on infrastructure you control
* **Flexible costs** — pay only while you're coding (\~$0.20–0.50/hr vs. $19/mo for Copilot)
* **OpenAI-compatible API** — Continue.dev connects to Ollama, vLLM, or TabbyML seamlessly

This guide focuses on setting up the **Clore.ai GPU backend** (Ollama or vLLM) that your local Continue.dev extension connects to.

{% hint style="success" %}
All GPU server examples use servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Architecture**: Your IDE (with Continue.dev extension) → Internet → Clore.ai GPU server (running Ollama / vLLM / TabbyML) → local model inference. No code ever touches a third-party API.
{% endhint %}

## Overview

| Property            | Details                                                                      |
| ------------------- | ---------------------------------------------------------------------------- |
| **Project**         | [continuedev/continue](https://github.com/continuedev/continue)              |
| **License**         | Apache 2.0                                                                   |
| **GitHub Stars**    | 25K+                                                                         |
| **IDE Support**     | VS Code, JetBrains (IntelliJ, PyCharm, WebStorm, GoLand, etc.)               |
| **Config File**     | `~/.continue/config.json`                                                    |
| **Backend Options** | Ollama, vLLM, TabbyML, LM Studio, llama.cpp, OpenAI-compatible APIs          |
| **Difficulty**      | Easy (extension install) / Medium (self-hosted backend)                      |
| **GPU Required?**   | On the Clore.ai server (yes); on your laptop (no)                            |
| **Key Features**    | Autocomplete, chat, edit mode, codebase context (RAG), custom slash commands |

### Recommended Models for Coding

| Model                 | VRAM    | Strength                   | Notes                               |
| --------------------- | ------- | -------------------------- | ----------------------------------- |
| `codellama:7b`        | \~6 GB  | Fast autocomplete          | Good starting point                 |
| `codellama:13b`       | \~10 GB | Balanced                   | Best quality/speed for autocomplete |
| `codellama:34b`       | \~22 GB | Best CodeLlama quality     | Needs RTX 3090 / A100               |
| `deepseek-coder:6.7b` | \~5 GB  | Python/JS specialist       | Excellent for web dev               |
| `deepseek-coder:33b`  | \~22 GB | Top-tier open source       | Rivals GPT-4 on code                |
| `qwen2.5-coder:7b`    | \~6 GB  | Multilingual code          | Strong on 40+ languages             |
| `qwen2.5-coder:32b`   | \~22 GB | State-of-the-art           | Best open coding model 2024         |
| `starcoder2:15b`      | \~12 GB | Code completion specialist | FIM (fill-in-the-middle) support    |
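
The VRAM figures above follow a rough rule of thumb: a 4-bit quantized model occupies roughly 0.6 GB per billion parameters, plus a couple of GB for the KV cache and runtime. A quick sketch (the 0.6 GB/B constant and 2 GB overhead are approximations, not exact Ollama numbers):

```python
def estimate_vram_gb(params_billion: float,
                     gb_per_billion: float = 0.6,   # approx. for Q4 quantization
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM needed to serve a quantized model: weights + runtime overhead."""
    return params_billion * gb_per_billion + overhead_gb

# Sanity-check against the table above (both land near the listed values)
print(f"7B  -> ~{estimate_vram_gb(7):.0f} GB")    # close to the ~6 GB listed
print(f"33B -> ~{estimate_vram_gb(33):.0f} GB")   # close to the ~22 GB listed
```

If an estimate lands within a gigabyte or two of your card's VRAM, expect a tight fit once context length grows.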

## Requirements

### Clore.ai Server Requirements

| Tier            | GPU       | VRAM  | RAM   | Disk   | Price      | Models                                         |
| --------------- | --------- | ----- | ----- | ------ | ---------- | ---------------------------------------------- |
| **Budget**      | RTX 3060  | 12 GB | 16 GB | 40 GB  | \~$0.10/hr | CodeLlama 7B, DeepSeek 6.7B, Qwen2.5-Coder 7B  |
| **Recommended** | RTX 3090  | 24 GB | 32 GB | 80 GB  | \~$0.20/hr | CodeLlama 34B, DeepSeek 33B, Qwen2.5-Coder 32B |
| **Performance** | RTX 4090  | 24 GB | 32 GB | 80 GB  | \~$0.35/hr | Same models as above, faster inference         |
| **Power**       | A100 40GB | 40 GB | 64 GB | 120 GB | \~$0.60/hr | Multiple 34B models concurrently               |
| **Maximum**     | A100 80GB | 80 GB | 80 GB | 200 GB | \~$1.10/hr | 70B models (CodeLlama 70B)                     |

### Local Requirements (Your Machine)

* VS Code or any JetBrains IDE
* Continue.dev extension installed
* Stable internet connection to your Clore.ai server
* **No local GPU needed** — all inference happens on Clore.ai

## Quick Start

### Part 1: Set Up the Clore.ai Backend

#### Option A — Ollama Backend (Recommended for Most Users)

Ollama is the easiest backend for Continue.dev — simple setup, excellent model management, OpenAI-compatible API.

```bash
# 1. SSH into your Clore.ai server
ssh root@<clore-server-ip> -p <port>

# 2. Start Ollama with GPU support
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  --restart unless-stopped \
  ollama/ollama

# 3. Verify Ollama is running
curl http://localhost:11434/

# 4. Pull your coding model (choose based on your VRAM)
# For 12GB VRAM (RTX 3060):
docker exec ollama ollama pull codellama:13b

# For 24GB VRAM (RTX 3090 / RTX 4090):
docker exec ollama ollama pull qwen2.5-coder:32b
# or:
docker exec ollama ollama pull deepseek-coder:33b

# 5. Pull a fast autocomplete model (separate from chat model)
docker exec ollama ollama pull starcoder2:3b   # Very fast, great for FIM autocomplete

# 6. Verify models are available
docker exec ollama ollama list

# 7. Test inference
docker exec ollama ollama run qwen2.5-coder:32b "Write a Python function to binary search a sorted list"
```

To expose Ollama externally (so your local IDE can connect):

```bash
# Restart Ollama with external access enabled
docker stop ollama && docker rm ollama

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0 \
  --restart unless-stopped \
  ollama/ollama

# Test from your LOCAL machine:
curl http://<clore-server-ip>:11434/api/tags
```
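
The `/api/tags` response is JSON with a top-level `models` array; each entry's `name` is the tag you reference later in `config.json`. A small stdlib-only helper for scripting against it (the `sample` payload below is illustrative and trimmed to the one field used):

```python
def installed_models(tags_response: dict) -> list[str]:
    """Extract model tags from Ollama's GET /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]

# Trimmed example of the /api/tags payload shape:
sample = {"models": [{"name": "qwen2.5-coder:32b"}, {"name": "starcoder2:3b"}]}
print(installed_models(sample))   # ['qwen2.5-coder:32b', 'starcoder2:3b']
```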

{% hint style="warning" %}
Ollama has no built-in authentication, so exposing port 11434 publicly lets anyone on the internet run (or delete) your models. For anything beyond a quick test, use an SSH tunnel instead (see [SSH Tunnel Setup](#ssh-tunnel-setup-secure-remote-access)).
{% endhint %}

#### Option B — vLLM Backend (High-Throughput / OpenAI-Compatible)

vLLM offers faster inference and multi-user support. Ideal if multiple developers share one Clore.ai server.

```bash
# Start vLLM with OpenAI-compatible API
docker run -d \
  --name vllm \
  --gpus all \
  -p 8000:8000 \
  -v /workspace/hf-models:/root/.cache/huggingface \
  -e HF_TOKEN="your-huggingface-token" \
  --restart unless-stopped \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --dtype auto \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --served-model-name qwen2.5-coder-32b

# For multi-GPU (e.g., two RTX 3090s):
docker run -d \
  --name vllm \
  --gpus all \
  -p 8000:8000 \
  -v /workspace/hf-models:/root/.cache/huggingface \
  -e HF_TOKEN="your-huggingface-token" \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --tensor-parallel-size 2 \
  --dtype auto \
  --max-model-len 65536 \
  --served-model-name deepseek-coder-v2

# Test the API
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [{"role": "user", "content": "Write a hello world in Rust"}],
    "max_tokens": 200
  }'
```
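
The same request can be issued from any script. A minimal sketch that builds the body the `curl` call above sends; actually posting it (e.g. with `urllib.request`) requires the vLLM container to be running:

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Body for vLLM's OpenAI-compatible POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_request("qwen2.5-coder-32b", "Write a hello world in Rust")
print(json.dumps(body, indent=2))
```

Note that `model` must match the `--served-model-name` you passed to vLLM, not the Hugging Face repo name.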

#### Option C — TabbyML Backend (FIM Autocomplete Specialist)

TabbyML provides superior fill-in-the-middle (FIM) autocomplete — the inline ghost-text suggestions. See the [TabbyML documentation](https://tabby.tabbyml.com/) for full setup details.

```bash
# Quick TabbyML setup for Continue.dev autocomplete
docker run -d \
  --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v /workspace/tabby-data:/data \
  --restart unless-stopped \
  tabbyml/tabby serve \
  --model StarCoder2-7B \
  --chat-model Mistral-7B \
  --device cuda

# Verify
curl http://localhost:8080/v1/health
```

### Part 2: Install Continue.dev Extension

**VS Code:**

1. Open the Extensions panel (`Ctrl+Shift+X` / `Cmd+Shift+X`)
2. Search **"Continue"** — install the official extension by Continue (continuedev)
3. Click the Continue icon in the sidebar (or `Ctrl+Shift+I`)

**JetBrains (IntelliJ, PyCharm, WebStorm, GoLand):**

1. `File → Settings → Plugins → Marketplace`
2. Search **"Continue"** and install
3. Restart the IDE; the Continue panel appears on the right sidebar

### Part 3: Configure Continue.dev to Use Clore.ai

Edit `~/.continue/config.json` on your **local machine**:

```json
{
  "models": [
    {
      "title": "Clore.ai — Qwen2.5-Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://<clore-server-ip>:11434",
      "contextLength": 32768,
      "completionOptions": {
        "temperature": 0.1,
        "topP": 0.95,
        "maxTokens": 4096
      }
    },
    {
      "title": "Clore.ai — CodeLlama 13B (fast)",
      "provider": "ollama",
      "model": "codellama:13b",
      "apiBase": "http://<clore-server-ip>:11434",
      "contextLength": 16384
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder2 3B (autocomplete)",
    "provider": "ollama",
    "model": "starcoder2:3b",
    "apiBase": "http://<clore-server-ip>:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "apiBase": "http://<clore-server-ip>:11434"
  },
  "contextProviders": [
    { "name": "code" },
    { "name": "docs" },
    { "name": "diff" },
    { "name": "terminal" },
    { "name": "problems" },
    { "name": "folder" },
    { "name": "codebase" }
  ],
  "slashCommands": [
    { "name": "edit", "description": "Edit selected code" },
    { "name": "comment", "description": "Add comments to code" },
    { "name": "share", "description": "Export conversation as markdown" },
    { "name": "cmd", "description": "Generate terminal command" },
    { "name": "commit", "description": "Generate git commit message" }
  ]
}
```
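
If you rotate between several Clore.ai servers, regenerating the config beats hand-editing the IP in four places. A sketch that emits just the core fields from the example above (`203.0.113.10` is a placeholder address; the full schema supports many more options):

```python
import json

def make_continue_config(server_ip: str, chat_model: str,
                         autocomplete_model: str) -> dict:
    """Build a minimal ~/.continue/config.json for an Ollama backend."""
    api_base = f"http://{server_ip}:11434"
    return {
        "models": [{
            "title": f"Clore.ai — {chat_model}",
            "provider": "ollama",
            "model": chat_model,
            "apiBase": api_base,
        }],
        "tabAutocompleteModel": {
            "title": f"{autocomplete_model} (autocomplete)",
            "provider": "ollama",
            "model": autocomplete_model,
            "apiBase": api_base,
        },
        "embeddingsProvider": {
            "provider": "ollama",
            "model": "nomic-embed-text",
            "apiBase": api_base,
        },
    }

config = make_continue_config("203.0.113.10", "qwen2.5-coder:32b", "starcoder2:3b")
print(json.dumps(config, indent=2))   # redirect into ~/.continue/config.json
```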

For **vLLM backend** instead of Ollama:

```json
{
  "models": [
    {
      "title": "Clore.ai — DeepSeek Coder 33B (vLLM)",
      "provider": "openai",
      "model": "deepseek-coder-v2",
      "apiBase": "http://<clore-server-ip>:8000/v1",
      "apiKey": "not-required",
      "contextLength": 65536,
      "completionOptions": {
        "temperature": 0.0,
        "maxTokens": 8192
      }
    }
  ]
}
```

For **TabbyML backend** (autocomplete only):

```json
{
  "tabAutocompleteModel": {
    "title": "Clore.ai — TabbyML StarCoder2",
    "provider": "openai",
    "model": "StarCoder2-7B",
    "apiBase": "http://<clore-server-ip>:8080/v1",
    "apiKey": "auth-token-if-set"
  }
}
```

## Configuration

### SSH Tunnel Setup (Secure Remote Access)

Instead of exposing ports publicly, use an SSH tunnel from your local machine:

```bash
# Open SSH tunnel: local port 11434 → Clore.ai server port 11434
ssh -N -L 11434:localhost:11434 root@<clore-server-ip> -p <clore-ssh-port>

# Keep the tunnel alive (add to ~/.ssh/config):
Host clore-coding
  HostName <clore-server-ip>
  Port <clore-ssh-port>
  User root
  LocalForward 11434 localhost:11434
  LocalForward 8000 localhost:8000
  ServerAliveInterval 60
  ServerAliveCountMax 3

# Connect with:
ssh -N clore-coding

# Then in config.json use localhost:
# "apiBase": "http://localhost:11434"
```
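
Before pointing Continue.dev at `localhost`, confirm the tunnel is actually forwarding. A stdlib-only check (assumes the `~/.ssh/config` entry above with both `LocalForward` lines):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (tunnel is up)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After `ssh -N clore-coding`, both forwarded ports should answer locally:
for port in (11434, 8000):
    status = "up" if port_open("localhost", port) else "DOWN, check the tunnel"
    print(f"localhost:{port} {status}")
```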

### Persistent Tunnel with autossh

```bash
# Install autossh on your local machine (Linux/macOS)
sudo apt install autossh   # Ubuntu/Debian
brew install autossh       # macOS

# Run persistent tunnel that auto-reconnects
autossh -M 0 -N \
  -o "ServerAliveInterval 30" \
  -o "ServerAliveCountMax 3" \
  -L 11434:localhost:11434 \
  root@<clore-server-ip> -p <clore-ssh-port>

# Add to systemd for automatic start on boot (Linux)
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/clore-tunnel.service << 'EOF'
[Unit]
Description=SSH tunnel to Clore.ai coding server
After=network.target

[Service]
# systemd requires an absolute path; adjust if `which autossh` says otherwise
ExecStart=/usr/bin/autossh -M 0 -N \
  -o StrictHostKeyChecking=accept-new \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  -L 11434:localhost:11434 \
  root@CLORE_IP -p CLORE_PORT
Restart=always
RestartSec=10

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now clore-tunnel
```

### Load Multiple Models for Different Tasks

For an RTX 3090 (24 GB), you can run a large chat model and a small autocomplete model simultaneously:

```bash
# On the Clore.ai server:

# Pull the models
docker exec ollama ollama pull qwen2.5-coder:32b      # Chat (22 GB)
docker exec ollama ollama pull starcoder2:3b           # Autocomplete (2 GB)
docker exec ollama ollama pull nomic-embed-text        # Embeddings (0.5 GB)

# Ollama loads and unloads models on demand. Combined, these three weigh
# roughly 24.5 GB, so on a 24 GB card the 32B chat model may briefly swap
# out when an autocomplete or embedding request arrives.

# Monitor VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 5
```
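
Whether a set of models stays resident together is simple arithmetic: loaded sizes plus some KV-cache headroom against total VRAM. A sketch using the approximate sizes noted above (the 1 GB headroom figure is an assumption, not an Ollama constant):

```python
MODEL_SIZES_GB = {                  # approximate loaded sizes, per the notes above
    "qwen2.5-coder:32b": 22.0,
    "starcoder2:3b": 2.0,
    "nomic-embed-text": 0.5,
}

def fits_in_vram(models: list[str], vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if all models can stay loaded at once with KV-cache headroom to spare."""
    total = sum(MODEL_SIZES_GB[m] for m in models) + headroom_gb
    return total <= vram_gb

print(fits_in_vram(["qwen2.5-coder:32b"], 24.0))   # True
print(fits_in_vram(list(MODEL_SIZES_GB), 24.0))    # False (Ollama swaps on demand)
```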

### Codebase Indexing (RAG for Your Repo)

Continue.dev can index your codebase for context-aware suggestions. Pull an embedding model:

```bash
# On Clore.ai server — add embedding model to Ollama
docker exec ollama ollama pull nomic-embed-text

# In config.json (local), embeddings are already configured above.
# Continue.dev will index your open workspace automatically.
# Trigger manual re-index: Ctrl+Shift+P → "Continue: Index Codebase"
```

## GPU Acceleration

### Monitor Inference Performance

```bash
# On your Clore.ai server — watch GPU during coding sessions
watch -n 1 nvidia-smi

# Check tokens per second (Ollama logs)
docker logs ollama --tail 20 -f

# Detailed GPU stats
nvidia-smi dmon -s u -d 2

# Memory breakdown
nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu \
  --format=csv,noheader -l 5
```

### Expected Performance by GPU

| GPU           | Model                    | Context | Tokens/sec (approx.) |
| ------------- | ------------------------ | ------- | -------------------- |
| RTX 3060 12GB | CodeLlama 7B             | 8K      | \~40–60 t/s          |
| RTX 3060 12GB | DeepSeek-Coder 6.7B      | 8K      | \~45–65 t/s          |
| RTX 3090 24GB | Qwen2.5-Coder 32B (Q4)   | 16K     | \~15–25 t/s          |
| RTX 3090 24GB | DeepSeek-Coder 33B (Q4)  | 16K     | \~15–22 t/s          |
| RTX 4090 24GB | Qwen2.5-Coder 32B (Q4)   | 16K     | \~25–40 t/s          |
| A100 40GB     | Qwen2.5-Coder 32B (FP16) | 32K     | \~35–50 t/s          |
| A100 80GB     | CodeLlama 70B (Q4)       | 32K     | \~20–30 t/s          |

For autocomplete (fill-in-the-middle), **starcoder2:3b** or **codellama:7b** achieve 50–100 t/s — fast enough to feel instant in the IDE.
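
These throughput numbers translate directly into perceived latency: wall-clock time is roughly time-to-first-token plus token count divided by decode speed. A back-of-the-envelope sketch (the 0.5 s first-token figure is an assumption for a model already loaded in VRAM):

```python
def generation_time_s(tokens: int, tokens_per_sec: float,
                      first_token_latency_s: float = 0.5) -> float:
    """Rough wall-clock time for a completion: prefill latency + decode time."""
    return first_token_latency_s + tokens / tokens_per_sec

# A 200-token chat answer from a 32B model on an RTX 3090 (~20 t/s):
print(round(generation_time_s(200, 20), 1))   # 10.5 seconds
# A 30-token autocomplete suggestion from starcoder2:3b (~80 t/s):
print(round(generation_time_s(30, 80), 2))    # well under a second
```

This is why the guide pairs a large chat model with a small dedicated autocomplete model: chat can tolerate ten seconds, inline suggestions cannot.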

### Tune Ollama for Better Performance

```bash
# On the Clore.ai server — optimize Ollama settings
docker stop ollama && docker rm ollama

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0 \
  -e OLLAMA_NUM_PARALLEL=2 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  --restart unless-stopped \
  ollama/ollama

# OLLAMA_NUM_PARALLEL=2: serve 2 requests simultaneously
# OLLAMA_MAX_LOADED_MODELS=2: keep 2 models in GPU memory
# OLLAMA_FLASH_ATTENTION=1: enable flash attention (faster, less memory)
```

## Tips & Best Practices

### Use Different Models for Different Tasks

Configure Continue.dev with specialized models per task type — the UI lets you switch models mid-conversation:

```json
{
  "models": [
    {
      "title": "Chat — Qwen2.5-Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434",
      "contextLength": 32768,
      "description": "Best for complex questions, code review, architecture decisions"
    },
    {
      "title": "Fast — CodeLlama 7B",
      "provider": "ollama",
      "model": "codellama:7b",
      "apiBase": "http://localhost:11434",
      "contextLength": 8192,
      "description": "Quick answers, simple completions, low latency"
    },
    {
      "title": "Autocomplete — StarCoder2 3B",
      "provider": "ollama",
      "model": "starcoder2:3b",
      "apiBase": "http://localhost:11434",
      "contextLength": 4096,
      "description": "Inline ghost-text suggestions"
    }
  ]
}
```

### Cost Comparison

| Solution              | Monthly Cost (8hr/day usage) | Privacy           | Model Quality       |
| --------------------- | ---------------------------- | ----------------- | ------------------- |
| GitHub Copilot        | $19/user/mo                  | ❌ Microsoft cloud | GPT-4o (closed)     |
| Cursor Pro            | $20/user/mo                  | ❌ Cursor cloud    | Claude 3.5 (closed) |
| RTX 3060 on Clore.ai  | \~$24/mo                     | ✅ Your server     | CodeLlama 13B       |
| RTX 3090 on Clore.ai  | \~$48/mo                     | ✅ Your server     | Qwen2.5-Coder 32B   |
| RTX 4090 on Clore.ai  | \~$84/mo                     | ✅ Your server     | Qwen2.5-Coder 32B   |
| A100 80GB on Clore.ai | \~$264/mo                    | ✅ Your server     | CodeLlama 70B       |

For a team of 3+ developers sharing one Clore.ai RTX 3090 (\~$48/mo total), the per-user cost beats Copilot while providing a larger, private model.
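
The monthly figures above are just the hourly rate times an 8-hour day over 30 days, which is why stopping the order when you are not coding matters so much. The arithmetic:

```python
def monthly_cost(hourly_rate: float, hours_per_day: float = 8, days: int = 30) -> float:
    """On-demand GPU cost, assuming the order is stopped outside coding hours."""
    return hourly_rate * hours_per_day * days

for gpu, rate in [("RTX 3060", 0.10), ("RTX 3090", 0.20), ("RTX 4090", 0.35)]:
    print(f"{gpu}: ${monthly_cost(rate):.0f}/mo at 8 h/day")
```

Leave the same RTX 3090 running 24/7 and the bill triples to ~$144/mo, which is why the shutdown scripts below are worth the minute they take to set up.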

### Shut Down When Not Coding

Clore.ai charges per hour. Use a simple script to start/stop the server:

```bash
# Save these as local scripts

# start-coding-server.sh
#!/bin/bash
echo "Opening SSH tunnel to Clore.ai..."
ssh -N -f -L 11434:localhost:11434 clore-coding
echo "Tunnel open. Continue.dev is ready."

# stop-coding-server.sh
#!/bin/bash
echo "Closing SSH tunnel..."
pkill -f "ssh.*clore-coding"
echo "Tunnel closed. Remember to stop your Clore.ai order to stop billing!"
```

### Use Continue.dev Custom Commands

Add custom slash commands to `config.json` for common coding workflows:

```json
{
  "customCommands": [
    {
      "name": "review",
      "prompt": "Review this code for bugs, security issues, and performance problems. Be specific and actionable.",
      "description": "Code review"
    },
    {
      "name": "test",
      "prompt": "Write comprehensive unit tests for this code. Include edge cases. Use the same language/framework as the code.",
      "description": "Generate tests"
    },
    {
      "name": "docstring",
      "prompt": "Add clear, comprehensive docstrings/comments to this code following best practices for the language.",
      "description": "Add documentation"
    },
    {
      "name": "optimize",
      "prompt": "Optimize this code for performance. Explain what you changed and why.",
      "description": "Optimize code"
    }
  ]
}
```

## Troubleshooting

| Problem                                 | Likely Cause                   | Solution                                                                       |
| --------------------------------------- | ------------------------------ | ------------------------------------------------------------------------------ |
| Continue.dev shows "Connection refused" | Ollama not reachable           | Check SSH tunnel is active; verify `curl http://localhost:11434/` works        |
| Autocomplete not triggering             | Tab autocomplete model not set | Add `tabAutocompleteModel` to config.json; enable in Continue settings         |
| Very slow responses (>30s first token)  | Model loading from disk        | First request loads model into VRAM — subsequent requests are fast             |
| "Model not found" error                 | Model not pulled               | Run `docker exec ollama ollama pull <model-name>` on Clore.ai server           |
| High latency between tokens             | Network lag or model too large | Use SSH tunnel; switch to smaller model; check server GPU utilization          |
| Codebase context not working            | Embeddings model missing       | Pull `nomic-embed-text` via Ollama; check `embeddingsProvider` in config.json  |
| SSH tunnel drops frequently             | Unstable connection            | Use `autossh` for persistent reconnection; add `ServerAliveInterval 30`        |
| Context window exceeded                 | Long files/conversations       | Reduce `contextLength` in config.json; use a model with longer context         |
| JetBrains plugin not loading            | IDE version incompatibility    | Update JetBrains IDE to latest; check Continue.dev plugin compatibility matrix |
| vLLM OOM during loading                 | Not enough VRAM                | Add `--gpu-memory-utilization 0.85`; use smaller model or quantized version    |

### Debug Commands

```bash
# On your LOCAL machine — test connectivity
curl http://localhost:11434/api/tags          # if using SSH tunnel
curl http://<clore-ip>:11434/api/tags        # if port is open directly

# On the CLORE.AI server — check Ollama
docker logs ollama --tail 30 -f
docker exec ollama ollama list
docker exec ollama ollama ps                  # show currently loaded models

# Test model response time
time curl http://localhost:11434/api/generate \
  -d '{"model": "codellama:7b", "prompt": "def hello():", "stream": false}'

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Check vLLM logs
docker logs vllm --tail 50 -f

# Restart Ollama without losing models
docker restart ollama
```

### Continue.dev Config Validation

```bash
# Validate config.json syntax on your local machine
python3 -c "
import json, sys
try:
    config = json.load(open(sys.argv[1]))
    print('✅ Config is valid JSON')
    print(f'Models: {[m[\"title\"] for m in config.get(\"models\", [])]}')
except Exception as e:
    print(f'❌ Error: {e}')
" ~/.continue/config.json
```
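
Beyond syntax, a few structural checks catch the most common mistakes. A sketch that validates only the fields this guide uses (the real Continue.dev schema is larger; the required-key list below reflects this guide's examples, not the official spec):

```python
def check_continue_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for i, model in enumerate(config.get("models", [])):
        for key in ("title", "provider", "model", "apiBase"):
            if key not in model:
                problems.append(f"models[{i}] missing '{key}'")
    if "tabAutocompleteModel" not in config:
        problems.append("no tabAutocompleteModel: inline autocomplete will be disabled")
    return problems

# Example of a config with two common omissions:
sample = {"models": [{"title": "Chat", "provider": "ollama", "model": "codellama:13b"}]}
for p in check_continue_config(sample):
    print("⚠️", p)
```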

## Further Reading

* [Continue.dev Documentation](https://docs.continue.dev/) — official docs for all IDE integrations and config options
* [Continue.dev GitHub](https://github.com/continuedev/continue) — source code, issues, model compatibility
* [Continue.dev Config Reference](https://docs.continue.dev/reference) — full `config.json` schema
* [Ollama on Clore.ai](https://docs.clore.ai/guides/language-models/ollama) — detailed Ollama setup guide (recommended backend)
* [vLLM on Clore.ai](https://docs.clore.ai/guides/language-models/vllm) — high-performance alternative backend for teams
* [TabbyML](https://tabby.tabbyml.com/) — specialized autocomplete backend with FIM optimization
* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) — choose the right GPU for your coding workload
* [Model Compatibility](https://docs.clore.ai/guides/getting-started/model-compatibility) — which models fit in which VRAM sizes
* [Qwen2.5-Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) — currently the best open coding model
* [DeepSeek-Coder-V2](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct) — strong alternative with long context
* [CLORE.AI Marketplace](https://clore.ai/marketplace) — rent GPU servers
