# Continue.dev AI Coding

Continue.dev is an open-source AI coding assistant for VS Code and JetBrains with 25K+ GitHub stars. The **extension runs locally inside your IDE**, but it connects to a backend model server for inference. By pointing Continue.dev at a powerful GPU rented from Clore.ai, you get:

* **Top-tier coding models** (34B+ parameters) that won't fit on your laptop
* **Full privacy** — code stays on infrastructure you control
* **Flexible costs** — pay only while you're coding (\~$0.20–0.50/hr vs. $19/mo for Copilot)
* **OpenAI-compatible API** — Continue.dev connects to Ollama, vLLM, or TabbyML seamlessly

This guide focuses on setting up the **Clore.ai GPU backend** (Ollama or vLLM) that your local Continue.dev extension connects to.

{% hint style="success" %}
All GPU server examples use servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

{% hint style="info" %}
**Architecture**: Your IDE (with Continue.dev extension) → Internet → Clore.ai GPU server (running Ollama / vLLM / TabbyML) → local model inference. No code ever touches a third-party API.
{% endhint %}

## Overview

| Property            | Details                                                                      |
| ------------------- | ---------------------------------------------------------------------------- |
| **Project**         | [continuedev/continue](https://github.com/continuedev/continue)              |
| **License**         | Apache 2.0                                                                   |
| **GitHub Stars**    | 25K+                                                                         |
| **IDE Support**     | VS Code, JetBrains (IntelliJ, PyCharm, WebStorm, GoLand, etc.)               |
| **Config File**     | `~/.continue/config.json`                                                    |
| **Backend Options** | Ollama, vLLM, TabbyML, LM Studio, llama.cpp, OpenAI-compatible APIs          |
| **Difficulty**      | Easy (extension install) / Medium (self-hosted backend)                      |
| **GPU Required?**   | On the Clore.ai server (yes); on your laptop (no)                            |
| **Key Features**    | Autocomplete, chat, edit mode, codebase context (RAG), custom slash commands |

### Recommended Models for Coding

| Model                 | VRAM    | Strength                   | Notes                               |
| --------------------- | ------- | -------------------------- | ----------------------------------- |
| `codellama:7b`        | \~6 GB  | Fast autocomplete          | Good starting point                 |
| `codellama:13b`       | \~10 GB | Balanced                   | Best quality/speed for autocomplete |
| `codellama:34b`       | \~22 GB | Best CodeLlama quality     | Needs RTX 3090 / A100               |
| `deepseek-coder:6.7b` | \~5 GB  | Python/JS specialist       | Excellent for web dev               |
| `deepseek-coder:33b`  | \~22 GB | Top-tier open source       | Rivals GPT-4 on code                |
| `qwen2.5-coder:7b`    | \~6 GB  | Multilingual code          | Strong on 40+ languages             |
| `qwen2.5-coder:32b`   | \~22 GB | State-of-the-art           | Best open coding model 2024         |
| `starcoder2:15b`      | \~12 GB | Code completion specialist | FIM (fill-in-the-middle) support    |

## Requirements

### Clore.ai Server Requirements

| Tier            | GPU       | VRAM  | RAM   | Disk   | Price      | Models                                         |
| --------------- | --------- | ----- | ----- | ------ | ---------- | ---------------------------------------------- |
| **Budget**      | RTX 3060  | 12 GB | 16 GB | 40 GB  | \~$0.10/hr | CodeLlama 7B, DeepSeek 6.7B, Qwen2.5-Coder 7B  |
| **Recommended** | RTX 3090  | 24 GB | 32 GB | 80 GB  | \~$0.20/hr | CodeLlama 34B, DeepSeek 33B, Qwen2.5-Coder 32B |
| **Performance** | RTX 4090  | 24 GB | 32 GB | 80 GB  | \~$0.35/hr | Same models as above, faster inference         |
| **Power**       | A100 40GB | 40 GB | 64 GB | 120 GB | \~$0.60/hr | Multiple 34B models concurrently               |
| **Maximum**     | A100 80GB | 80 GB | 80 GB | 200 GB | \~$1.10/hr | 70B models (CodeLlama 70B)                     |

### Local Requirements (Your Machine)

* VS Code or any JetBrains IDE
* Continue.dev extension installed
* Stable internet connection to your Clore.ai server (a quick latency check is shown below)
* **No local GPU needed** — all inference happens on Clore.ai
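
Latency matters most for inline autocomplete. A quick sanity check from your machine (substitute the IP from your Clore.ai order):

```bash
# Rough round-trip latency to the GPU server; under ~100 ms feels responsive for autocomplete
ping -c 5 <clore-server-ip>
```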

## Quick Start

### Part 1: Set Up the Clore.ai Backend

#### Option A — Ollama Backend (Recommended for Most Users)

Ollama is the easiest backend for Continue.dev — simple setup, excellent model management, OpenAI-compatible API.

```bash
# 1. SSH into your Clore.ai server
ssh root@<clore-server-ip> -p <port>

# 2. Start Ollama with GPU support
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  --restart unless-stopped \
  ollama/ollama

# 3. Verify Ollama is running
curl http://localhost:11434/

# 4. Pull your coding model (choose based on your VRAM)
# For 12GB VRAM (RTX 3060):
docker exec ollama ollama pull codellama:13b

# For 24GB VRAM (RTX 3090 / RTX 4090):
docker exec ollama ollama pull qwen2.5-coder:32b
# or:
docker exec ollama ollama pull deepseek-coder:33b

# 5. Pull a fast autocomplete model (separate from chat model)
docker exec ollama ollama pull starcoder2:3b   # Very fast, great for FIM autocomplete

# 6. Verify models are available
docker exec ollama ollama list

# 7. Test inference
docker exec ollama ollama run qwen2.5-coder:32b "Write a Python function to binary search a sorted list"
```

To expose Ollama externally (so your local IDE can connect):

```bash
# Restart Ollama with external access enabled
docker stop ollama && docker rm ollama

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0 \
  --restart unless-stopped \
  ollama/ollama

# Test from your LOCAL machine:
curl http://<clore-server-ip>:11434/api/tags
```

{% hint style="warning" %}
Ollama's API has no authentication, so exposing port 11434 publicly lets anyone who finds it run inference on your GPU. For anything beyond quick testing, set up an SSH tunnel instead (see [Tips & Best Practices](#tips--best-practices)).
{% endhint %}

#### Option B — vLLM Backend (High-Throughput / OpenAI-Compatible)

vLLM offers faster inference and multi-user support. Ideal if multiple developers share one Clore.ai server.

```bash
# Start vLLM with OpenAI-compatible API
docker run -d \
  --name vllm \
  --gpus all \
  -p 8000:8000 \
  -v /workspace/hf-models:/root/.cache/huggingface \
  -e HF_TOKEN="your-huggingface-token" \
  --restart unless-stopped \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --dtype auto \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --served-model-name qwen2.5-coder-32b
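
# NOTE: Qwen2.5-Coder-32B-Instruct at FP16 needs roughly 65 GB for weights alone,
# so the command above only fits on an 80 GB GPU or across several GPUs via
# --tensor-parallel-size. On a single 24 GB card, point --model at a quantized
# build instead (for example an AWQ variant together with --quantization awq).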

# For multi-GPU (e.g., two RTX 3090s):
docker run -d \
  --name vllm \
  --gpus all \
  -p 8000:8000 \
  -v /workspace/hf-models:/root/.cache/huggingface \
  -e HF_TOKEN="your-huggingface-token" \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --tensor-parallel-size 2 \
  --dtype auto \
  --max-model-len 65536 \
  --served-model-name deepseek-coder-v2

# Test the API
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [{"role": "user", "content": "Write a hello world in Rust"}],
    "max_tokens": 200
  }'
```

#### Option C — TabbyML Backend (FIM Autocomplete Specialist)

TabbyML provides superior fill-in-the-middle (FIM) autocomplete — the inline ghost-text suggestions. See the [TabbyML documentation](https://tabby.tabbyml.com/) for full setup details.

```bash
# Quick TabbyML setup for Continue.dev autocomplete
docker run -d \
  --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v /workspace/tabby-data:/data \
  --restart unless-stopped \
  tabbyml/tabby serve \
  --model StarCoder2-7B \
  --chat-model Mistral-7B \
  --device cuda

# Verify
curl http://localhost:8080/v1/health
```

### Part 2: Install Continue.dev Extension

**VS Code:**

1. Open the Extensions panel (`Ctrl+Shift+X` / `Cmd+Shift+X`)
2. Search **"Continue"** — install the official extension by Continue (continuedev)
3. Click the Continue icon in the sidebar (or `Ctrl+Shift+I`)

**JetBrains (IntelliJ, PyCharm, WebStorm, GoLand):**

1. `File → Settings → Plugins → Marketplace`
2. Search **"Continue"** and install
3. Restart the IDE; the Continue panel appears on the right sidebar

### Part 3: Configure Continue.dev to Use Clore.ai

Edit `~/.continue/config.json` on your **local machine**:

```json
{
  "models": [
    {
      "title": "Clore.ai — Qwen2.5-Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://<clore-server-ip>:11434",
      "contextLength": 32768,
      "completionOptions": {
        "temperature": 0.1,
        "topP": 0.95,
        "maxTokens": 4096
      }
    },
    {
      "title": "Clore.ai — CodeLlama 13B (fast)",
      "provider": "ollama",
      "model": "codellama:13b",
      "apiBase": "http://<clore-server-ip>:11434",
      "contextLength": 16384
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder2 3B (autocomplete)",
    "provider": "ollama",
    "model": "starcoder2:3b",
    "apiBase": "http://<clore-server-ip>:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "apiBase": "http://<clore-server-ip>:11434"
  },
  "contextProviders": [
    { "name": "code" },
    { "name": "docs" },
    { "name": "diff" },
    { "name": "terminal" },
    { "name": "problems" },
    { "name": "folder" },
    { "name": "codebase" }
  ],
  "slashCommands": [
    { "name": "edit", "description": "Edit selected code" },
    { "name": "comment", "description": "Add comments to code" },
    { "name": "share", "description": "Export conversation as markdown" },
    { "name": "cmd", "description": "Generate terminal command" },
    { "name": "commit", "description": "Generate git commit message" }
  ]
}
```
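
To confirm the backend is reachable before opening the IDE, you can replay the kind of request Continue.dev sends straight from your local machine (a quick sketch; use `localhost` instead of the IP if you connect through the SSH tunnel described below):

```bash
# Minimal chat request against the same Ollama endpoint Continue.dev uses
curl http://<clore-server-ip>:11434/api/chat -d '{
  "model": "qwen2.5-coder:32b",
  "messages": [{"role": "user", "content": "Write a one-line Python lambda that squares a number"}],
  "stream": false
}'
```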

For **vLLM backend** instead of Ollama:

```json
{
  "models": [
    {
      "title": "Clore.ai — DeepSeek Coder 33B (vLLM)",
      "provider": "openai",
      "model": "deepseek-coder-v2",
      "apiBase": "http://<clore-server-ip>:8000/v1",
      "apiKey": "not-required",
      "contextLength": 65536,
      "completionOptions": {
        "temperature": 0.0,
        "maxTokens": 8192
      }
    }
  ]
}
```

For **TabbyML backend** (autocomplete only):

```json
{
  "tabAutocompleteModel": {
    "title": "Clore.ai — TabbyML StarCoder2",
    "provider": "openai",
    "model": "StarCoder2-7B",
    "apiBase": "http://<clore-server-ip>:8080/v1",
    "apiKey": "auth-token-if-set"
  }
}
```

## Configuration

### SSH Tunnel Setup (Secure Remote Access)

Instead of exposing ports publicly, use an SSH tunnel from your local machine:

```bash
# Open SSH tunnel: local port 11434 → Clore.ai server port 11434
ssh -N -L 11434:localhost:11434 root@<clore-server-ip> -p <clore-ssh-port>

# Keep the tunnel alive (add to ~/.ssh/config):
Host clore-coding
  HostName <clore-server-ip>
  Port <clore-ssh-port>
  User root
  LocalForward 11434 localhost:11434
  LocalForward 8000 localhost:8000
  ServerAliveInterval 60
  ServerAliveCountMax 3

# Connect with:
ssh -N clore-coding

# Then in config.json use localhost:
# "apiBase": "http://localhost:11434"
```

### Persistent Tunnel with autossh

```bash
# Install autossh on your local machine (Linux/macOS)
sudo apt install autossh   # Ubuntu/Debian
brew install autossh       # macOS

# Run persistent tunnel that auto-reconnects
autossh -M 0 -N \
  -o "ServerAliveInterval 30" \
  -o "ServerAliveCountMax 3" \
  -L 11434:localhost:11434 \
  root@<clore-server-ip> -p <clore-ssh-port>

# Add to systemd for automatic start on boot (Linux)
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/clore-tunnel.service << 'EOF'
[Unit]
Description=SSH tunnel to Clore.ai coding server
After=network.target

[Service]
ExecStart=/usr/bin/autossh -M 0 -N \
  -o StrictHostKeyChecking=accept-new \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  -L 11434:localhost:11434 \
  root@CLORE_IP -p CLORE_PORT
Restart=always
RestartSec=10

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable clore-tunnel
systemctl --user start clore-tunnel
```
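
Confirm the tunnel service is running and the port is forwarded:

```bash
# Check the tunnel service, then hit the forwarded Ollama port
systemctl --user status clore-tunnel
curl http://localhost:11434/api/tags   # should list the models pulled on the server
```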

### Load Multiple Models for Different Tasks

For an RTX 3090 (24 GB), you can run a large chat model and a small autocomplete model simultaneously:

```bash
# On the Clore.ai server:

# Pull the models
docker exec ollama ollama pull qwen2.5-coder:32b      # Chat (22 GB)
docker exec ollama ollama pull starcoder2:3b           # Autocomplete (2 GB)
docker exec ollama ollama pull nomic-embed-text        # Embeddings (0.5 GB)

# Ollama loads and swaps models on demand, so all three do not need to
# sit in VRAM at the same time; switching models takes a few seconds

# Monitor VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 5
```

### Codebase Indexing (RAG for Your Repo)

Continue.dev can index your codebase for context-aware suggestions. Pull an embedding model:

```bash
# On Clore.ai server — add embedding model to Ollama
docker exec ollama ollama pull nomic-embed-text

# In config.json (local), embeddings are already configured above.
# Continue.dev will index your open workspace automatically.
# Trigger manual re-index: Ctrl+Shift+P → "Continue: Index Codebase"
```
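
Once indexing finishes, pull repository-wide context into a chat message by typing `@codebase` (or `@folder`, `@docs`) before your question.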

## GPU Acceleration

### Monitor Inference Performance

```bash
# On your Clore.ai server — watch GPU during coding sessions
watch -n 1 nvidia-smi

# Check tokens per second (Ollama logs)
docker logs ollama --tail 20 -f

# Detailed GPU stats
nvidia-smi dmon -s u -d 2

# Memory breakdown
nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu \
  --format=csv,noheader -l 5
```

### Expected Performance by GPU

| GPU           | Model                    | Context | Tokens/sec (approx.) |
| ------------- | ------------------------ | ------- | -------------------- |
| RTX 3060 12GB | CodeLlama 7B             | 8K      | \~40–60 t/s          |
| RTX 3060 12GB | DeepSeek-Coder 6.7B      | 8K      | \~45–65 t/s          |
| RTX 3090 24GB | Qwen2.5-Coder 32B (Q4)   | 16K     | \~15–25 t/s          |
| RTX 3090 24GB | DeepSeek-Coder 33B (Q4)  | 16K     | \~15–22 t/s          |
| RTX 4090 24GB | Qwen2.5-Coder 32B (Q4)   | 16K     | \~25–40 t/s          |
| A100 40GB     | Qwen2.5-Coder 32B (FP16) | 32K     | \~35–50 t/s          |
| A100 80GB     | CodeLlama 70B (Q4)       | 32K     | \~20–30 t/s          |

For autocomplete (fill-in-the-middle), **starcoder2:3b** or **codellama:7b** achieve 50–100 t/s — fast enough to feel instant in the IDE.
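
If autocomplete feels sluggish, a rough way to separate model speed from IDE overhead is to time a short generation against Ollama directly (a sketch; run it over the SSH tunnel or on the server itself):

```bash
# Time a short completion with the autocomplete model (assumes the tunnel on port 11434)
time curl -s http://localhost:11434/api/generate -d '{
  "model": "starcoder2:3b",
  "prompt": "def binary_search(arr, target):",
  "stream": false,
  "options": {"num_predict": 64}
}' > /dev/null
```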

### Tune Ollama for Better Performance

```bash
# On the Clore.ai server — optimize Ollama settings
docker stop ollama && docker rm ollama

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v /workspace/ollama:/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0 \
  -e OLLAMA_NUM_PARALLEL=2 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  --restart unless-stopped \
  ollama/ollama

# OLLAMA_NUM_PARALLEL=2: serve 2 requests simultaneously
# OLLAMA_MAX_LOADED_MODELS=2: keep 2 models in GPU memory
# OLLAMA_FLASH_ATTENTION=1: enable flash attention (faster, less memory)
```
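
To double-check that the variables were picked up by the running container:

```bash
# The tuning variables should appear in the container environment
docker exec ollama env | grep OLLAMA
```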

## Tips & Best Practices

### Use Different Models for Different Tasks

Configure Continue.dev with specialized models per task type — the UI lets you switch models mid-conversation:

```json
{
  "models": [
    {
      "title": "Chat — Qwen2.5-Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434",
      "contextLength": 32768,
      "description": "Best for complex questions, code review, architecture decisions"
    },
    {
      "title": "Fast — CodeLlama 7B",
      "provider": "ollama",
      "model": "codellama:7b",
      "apiBase": "http://localhost:11434",
      "contextLength": 8192,
      "description": "Quick answers, simple completions, low latency"
    },
    {
      "title": "Autocomplete — StarCoder2 3B",
      "provider": "ollama",
      "model": "starcoder2:3b",
      "apiBase": "http://localhost:11434",
      "contextLength": 4096,
      "description": "Inline ghost-text suggestions"
    }
  ]
}
```

### Cost Comparison

| Solution              | Monthly Cost (8hr/day usage) | Privacy           | Model Quality       |
| --------------------- | ---------------------------- | ----------------- | ------------------- |
| GitHub Copilot        | $19/user/mo                  | ❌ Microsoft cloud | GPT-4o (closed)     |
| Cursor Pro            | $20/user/mo                  | ❌ Cursor cloud    | Claude 3.5 (closed) |
| RTX 3060 on Clore.ai  | \~$24/mo                     | ✅ Your server     | CodeLlama 13B       |
| RTX 3090 on Clore.ai  | \~$48/mo                     | ✅ Your server     | Qwen2.5-Coder 32B   |
| RTX 4090 on Clore.ai  | \~$84/mo                     | ✅ Your server     | Qwen2.5-Coder 32B   |
| A100 80GB on Clore.ai | \~$264/mo                    | ✅ Your server     | CodeLlama 70B       |

For a team of 3+ developers sharing one Clore.ai RTX 3090 (\~$48/mo total), the per-user cost comes in below Copilot while keeping all code on a private, self-hosted model.
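
The monthly figures are simply hourly rate × hours per day × \~30 days; a quick way to recompute them for your own schedule (the values below are placeholders):

```bash
# Back-of-the-envelope monthly cost: RTX 3090 at ~$0.20/hr, 8 hours/day, 30 days
rate=0.20; hours_per_day=8; days=30
echo "~\$$(echo "$rate * $hours_per_day * $days" | bc) per month"   # prints ~$48.00 per month
```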

### Shut Down When Not Coding

Clore.ai bills by the hour. Use simple helper scripts to open and close your connection, and stop the Clore.ai order itself when you finish coding:

```bash
# Save these as local scripts

# start-coding-server.sh
#!/bin/bash
echo "Opening SSH tunnel to Clore.ai..."
ssh -N -f -L 11434:localhost:11434 clore-coding
echo "Tunnel open. Continue.dev is ready."

# stop-coding-server.sh
#!/bin/bash
echo "Closing SSH tunnel..."
pkill -f "ssh.*clore-coding"
echo "Tunnel closed. Remember to stop your Clore.ai order to stop billing!"
```

### Use Continue.dev Custom Commands

Add custom slash commands to `config.json` for common coding workflows:

```json
{
  "customCommands": [
    {
      "name": "review",
      "prompt": "Review this code for bugs, security issues, and performance problems. Be specific and actionable.",
      "description": "Code review"
    },
    {
      "name": "test",
      "prompt": "Write comprehensive unit tests for this code. Include edge cases. Use the same language/framework as the code.",
      "description": "Generate tests"
    },
    {
      "name": "docstring",
      "prompt": "Add clear, comprehensive docstrings/comments to this code following best practices for the language.",
      "description": "Add documentation"
    },
    {
      "name": "optimize",
      "prompt": "Optimize this code for performance. Explain what you changed and why.",
      "description": "Optimize code"
    }
  ]
}
```
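
Invoke them from the chat input by typing `/review`, `/test`, `/docstring`, or `/optimize` with the relevant code selected in the editor.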

## Troubleshooting

| Problem                                 | Likely Cause                   | Solution                                                                       |
| --------------------------------------- | ------------------------------ | ------------------------------------------------------------------------------ |
| Continue.dev shows "Connection refused" | Ollama not reachable           | Check SSH tunnel is active; verify `curl http://localhost:11434/` works        |
| Autocomplete not triggering             | Tab autocomplete model not set | Add `tabAutocompleteModel` to config.json; enable in Continue settings         |
| Very slow responses (>30s first token)  | Model loading from disk        | First request loads model into VRAM — subsequent requests are fast             |
| "Model not found" error                 | Model not pulled               | Run `docker exec ollama ollama pull <model-name>` on Clore.ai server           |
| High latency between tokens             | Network lag or model too large | Use SSH tunnel; switch to smaller model; check server GPU utilization          |
| Codebase context not working            | Embeddings model missing       | Pull `nomic-embed-text` via Ollama; check `embeddingsProvider` in config.json  |
| SSH tunnel drops frequently             | Unstable connection            | Use `autossh` for persistent reconnection; add `ServerAliveInterval 30`        |
| Context window exceeded                 | Long files/conversations       | Reduce `contextLength` in config.json; use a model with longer context         |
| JetBrains plugin not loading            | IDE version incompatibility    | Update JetBrains IDE to latest; check Continue.dev plugin compatibility matrix |
| vLLM OOM during loading                 | Not enough VRAM                | Add `--gpu-memory-utilization 0.85`; use smaller model or quantized version    |

### Debug Commands

```bash
# On your LOCAL machine — test connectivity
curl http://localhost:11434/api/tags          # if using SSH tunnel
curl http://<clore-ip>:11434/api/tags        # if port is open directly

# On the CLORE.AI server — check Ollama
docker logs ollama --tail 30 -f
docker exec ollama ollama list
docker exec ollama ollama ps                  # show currently loaded models

# Test model response time
time curl http://localhost:11434/api/generate \
  -d '{"model": "codellama:7b", "prompt": "def hello():", "stream": false}'

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Check vLLM logs
docker logs vllm --tail 50 -f

# Restart Ollama without losing models
docker restart ollama
```

### Continue.dev Config Validation

```bash
# Validate config.json syntax on your local machine
python3 -c "
import json, sys
try:
    config = json.load(open(sys.argv[1]))
    print('✅ Config is valid JSON')
    print(f'Models: {[m[\"title\"] for m in config.get(\"models\", [])]}')
except Exception as e:
    print(f'❌ Error: {e}')
" ~/.continue/config.json
```

## Further Reading

* [Continue.dev Documentation](https://docs.continue.dev/) — official docs for all IDE integrations and config options
* [Continue.dev GitHub](https://github.com/continuedev/continue) — source code, issues, model compatibility
* [Continue.dev Config Reference](https://docs.continue.dev/reference) — full `config.json` schema
* [Ollama on Clore.ai](/guides/language-models/ollama.md) — detailed Ollama setup guide (recommended backend)
* [vLLM on Clore.ai](/guides/language-models/vllm.md) — high-performance alternative backend for teams
* [TabbyML](https://tabby.tabbyml.com/) — specialized autocomplete backend with FIM optimization
* [GPU Comparison Guide](/guides/getting-started/gpu-comparison.md) — choose the right GPU for your coding workload
* [Model Compatibility](/guides/getting-started/model-compatibility.md) — which models fit in which VRAM sizes
* [Qwen2.5-Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) — currently the best open coding model
* [DeepSeek-Coder-V2](https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct) — strong alternative with long context
* [CLORE.AI Marketplace](https://clore.ai/marketplace) — rent GPU servers


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/ai-platforms-and-agents/continue-dev.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
