# Jan.ai Offline Assistant

## Overview

[Jan.ai](https://github.com/janhq/jan) is an open-source, privacy-first ChatGPT alternative with tens of thousands of GitHub stars. While Jan is best known as a desktop application, its server component — **Jan Server** — exposes a fully OpenAI-compatible REST API that can be deployed on cloud GPU infrastructure like Clore.ai.

Jan Server is built on the [Cortex.cpp](https://github.com/janhq/cortex.cpp) inference engine, a high-performance runtime that supports `llama.cpp`, `TensorRT-LLM`, and ONNX backends. On Clore.ai you can rent a GPU server for as little as **~$0.10/hr**, run Jan Server with Docker Compose, load any GGUF model, and serve it over an OpenAI-compatible API — all without your data leaving the machine.

**Key features:**

* 🔒 100% offline — no data ever leaves your server
* 🤖 OpenAI-compatible API (`/v1/chat/completions`, `/v1/models`, etc.)
* 📦 Model hub with one-command model downloads
* 🚀 GPU acceleration via CUDA (llama.cpp + TensorRT-LLM backends)
* 💬 Built-in conversation management and thread history
* 🔌 Drop-in replacement for OpenAI in existing applications

***

## Requirements

### Hardware Requirements

| Tier             | GPU           | VRAM  | RAM    | Storage    | Clore.ai Price |
| ---------------- | ------------- | ----- | ------ | ---------- | -------------- |
| **Minimum**      | RTX 3060 12GB | 12 GB | 16 GB  | 50 GB SSD  | \~$0.10/hr     |
| **Recommended**  | RTX 3090      | 24 GB | 32 GB  | 100 GB SSD | \~$0.20/hr     |
| **High-end**     | RTX 4090      | 24 GB | 64 GB  | 200 GB SSD | \~$0.35/hr     |
| **Large models** | A100 80GB     | 80 GB | 128 GB | 500 GB SSD | \~$1.10/hr     |

### Model VRAM Reference

| Model               | VRAM Required | Recommended GPU |
| ------------------- | ------------- | --------------- |
| Llama 3.1 8B (Q4)   | \~5 GB        | RTX 3060        |
| Llama 3.1 8B (FP16) | \~16 GB       | RTX 3090        |
| Llama 3.3 70B (Q4)  | \~40 GB       | A100 80GB       |
| Llama 3.1 405B (Q4) | \~220 GB      | 4× A100 80GB    |
| Mistral 7B (Q4)     | \~4 GB        | RTX 3060        |
| Qwen2.5 72B (Q4)    | \~45 GB       | A100 80GB       |

### Software Prerequisites

* Clore.ai account with funded wallet
* Basic Docker knowledge
* (Optional) OpenSSH client for port forwarding

***

## Quick Start

### Step 1 — Rent a GPU Server on Clore.ai

1. Browse to [clore.ai](https://clore.ai) and log in
2. Filter servers: **GPU Type** → RTX 3090 or better, **Docker** → enabled
3. Select a server and choose the **Docker** deployment option
4. Use the official `nvidia/cuda:12.1.0-devel-ubuntu22.04` base image or any CUDA image
5. Open ports: **1337** (Jan Server API), **39281** (Cortex API), **22** (SSH)

### Step 2 — Connect to Your Server

```bash
# SSH into your Clore.ai server
ssh -p <CLORE_SSH_PORT> root@<CLORE_SERVER_IP>

# Verify GPU is available
nvidia-smi
```

### Step 3 — Install Docker Compose (if not present)

```bash
# Check if Docker Compose is available
docker compose version

# Install if missing (Ubuntu/Debian)
apt-get update && apt-get install -y docker-compose-plugin

# Verify
docker compose version
```

### Step 4 — Deploy Jan Server with Docker Compose

```bash
# Create working directory
mkdir -p /workspace/jan-server && cd /workspace/jan-server

# Download the official Jan Server docker-compose.yml
curl -fsSL https://raw.githubusercontent.com/janhq/jan-server/main/docker-compose.yml \
  -o docker-compose.yml

# Review and edit configuration
cat docker-compose.yml
```

If the upstream compose file is unavailable or you want full control, create it manually:

```yaml
# /workspace/jan-server/docker-compose.yml
version: '3.8'

services:
  jan-server:
    image: ghcr.io/janhq/cortex:latest
    container_name: jan-server
    restart: unless-stopped
    ports:
      - "1337:1337"
      - "39281:39281"
    volumes:
      - jan-data:/root/jan
      - jan-models:/root/cortex/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - JAN_API_HOST=0.0.0.0
      - JAN_API_PORT=1337
      - CORTEX_API_PORT=39281
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

volumes:
  jan-data:
    driver: local
  jan-models:
    driver: local
```

```bash
# Start Jan Server
docker compose up -d

# Follow startup logs (wait for "Server started" message)
docker compose logs -f jan-server
```

### Step 5 — Verify the Server is Running

```bash
# Check server health
curl http://localhost:1337/health

# List available models (initially empty)
curl http://localhost:1337/v1/models

# Expected response:
# {"object":"list","data":[]}
```

### Step 6 — Pull Your First Model

```bash
# Pull Llama 3.2 3B (good starter, ~2GB)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Or pull Mistral 7B Instruct Q4
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Monitor download progress
curl http://localhost:1337/v1/models
```
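Pulls are asynchronous, so polling `/v1/models` is the simplest way to see when a download has landed. A small helper for checking that response (`model_available` is a hypothetical name; the payload shape is the OpenAI-style `{"object": "list", "data": [...]}` shown in Step 5):

```python
import json

def model_available(models_json: str, model_id: str) -> bool:
    """Return True if model_id appears in a /v1/models response body."""
    body = json.loads(models_json)
    return any(m.get("id") == model_id for m in body.get("data", []))

# Example against a canned response:
resp = '{"object":"list","data":[{"id":"llama3.2:3b-gguf-q4-km","object":"model"}]}'
print(model_available(resp, "llama3.2:3b-gguf-q4-km"))  # True
```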

### Step 7 — Start the Model & Chat

```bash
# Start the model (loads it into GPU VRAM)
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Send your first chat request
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello! What can you help me with?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
    "stream": false
  }'
```

***

## Configuration

### Environment Variables

| Variable               | Default               | Description                                    |
| ---------------------- | --------------------- | ---------------------------------------------- |
| `JAN_API_HOST`         | `0.0.0.0`             | Host to bind the API server                    |
| `JAN_API_PORT`         | `1337`                | Jan Server API port                            |
| `CORTEX_API_PORT`      | `39281`               | Internal Cortex engine port                    |
| `CUDA_VISIBLE_DEVICES` | `all`                 | Which GPUs to expose (comma-separated indices) |
| `JAN_DATA_FOLDER`      | `/root/jan`           | Path to Jan data folder                        |
| `CORTEX_MODELS_PATH`   | `/root/cortex/models` | Path to model storage                          |

### Multi-GPU Configuration

For servers with multiple GPUs (e.g., 2× RTX 3090 on Clore.ai):

```yaml
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # Use both GPUs
```

Or to dedicate specific GPUs:

```bash
# Run Jan Server on GPU 0 only
docker run -d \
  --name jan-server \
  --gpus '"device=0"' \
  -p 1337:1337 \
  -v jan-data:/root/jan \
  -v jan-models:/root/cortex/models \
  ghcr.io/janhq/cortex:latest
```

### Custom Model Configuration

```bash
# List all pulled models
curl http://localhost:1337/v1/models | jq '.data[].id'

# Get model details
curl http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km

# Stop a running model (free VRAM)
curl -X POST http://localhost:1337/v1/models/stop \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Delete a model (free disk space)
curl -X DELETE http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km
```

### Securing the API with a Token

Jan Server ships without authentication, so anyone who can reach port 1337 can use (and unload) your models. Put Nginx in front as a reverse proxy with HTTP basic auth:

```bash
apt-get install -y nginx apache2-utils

# Create password file
htpasswd -c /etc/nginx/.htpasswd admin

# Configure Nginx
cat > /etc/nginx/sites-available/jan-server << 'EOF'
server {
    listen 80;
    server_name _;

    location / {
        auth_basic "Jan Server";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:1337;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
    }
}
EOF

ln -s /etc/nginx/sites-available/jan-server /etc/nginx/sites-enabled/
nginx -t && systemctl restart nginx
```

***

## GPU Acceleration

### Verifying CUDA Acceleration

Jan Server's Cortex engine auto-detects CUDA. Verify it's using the GPU:

```bash
# Check GPU memory usage after loading a model
nvidia-smi

# Should show the cortex process consuming VRAM
# Example output:
# | Processes:                                                            |
# |  GPU   GI   CI        PID   Type   Process name            GPU Memory |
# |    0    N/A  N/A    12345    C   /usr/local/bin/cortex    8192MiB |
```

### Switching Inference Backends

Cortex supports multiple backends:

```bash
# Check which backends are available inside the container
docker exec jan-server cortex engines list

# Use TensorRT-LLM backend for NVIDIA GPUs (faster, requires more setup)
docker exec jan-server cortex engines install tensorrt-llm

# Use llama.cpp backend (default, most compatible)
docker exec jan-server cortex engines install llama-cpp
```

### Context Window and Batch Size Tuning

```bash
# Customize model parameters for GPU performance
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "ctx_len": 8192,
    "ngl": 99,
    "n_batch": 512,
    "n_parallel": 4,
    "cpu_threads": 8
  }'
```

| Parameter    | Description                          | Recommendation                    |
| ------------ | ------------------------------------ | --------------------------------- |
| `ngl`        | GPU layers (higher = more GPU usage) | Set to `99` to max out GPU        |
| `ctx_len`    | Context window size                  | 4096–32768 depending on VRAM      |
| `n_batch`    | Batch size for prompt processing     | 512 for RTX 3090, 256 for smaller |
| `n_parallel` | Concurrent request slots             | 4–8 for API server use            |
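Beyond the weights, `ctx_len` is what eats VRAM: the KV cache grows linearly with context as `2 × layers × ctx_len × kv_heads × head_dim × bytes` (the 2 covers keys and values). A quick estimator — the Llama 3.1 8B constants below (32 layers, 8 KV heads, head dimension 128) come from its public model config and are used purely as an illustration:

```python
def kv_cache_gib(ctx_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV-cache size in GiB for one sequence at full context length."""
    total = 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem
    return total / 2**30

print(kv_cache_gib(8192, 32, 8, 128))   # ~1 GiB at ctx_len=8192
print(kv_cache_gib(32768, 32, 8, 128))  # ~4 GiB at ctx_len=32768
```

Multiply by `n_parallel` if each slot holds its own full-length context, and budget accordingly before raising `ctx_len` on a 12 GB card.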

***

## Tips & Best Practices

### 🎯 Model Selection for Clore.ai Budgets

```bash
# Budget tier (~$0.10/hr, RTX 3060 12GB):
# Q4_K_M quants of 7B models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Standard tier (~$0.20/hr, RTX 3090 24GB):
# Q5/Q6 quants of 7-8B models, or Q4 quants up to ~30B
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b-instruct-gguf-q5-km"}'

# High-end tier (~$1.10/hr, A100 80GB):
# 70B models at Q4 and above
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b-instruct-gguf-q4-km"}'
```

### 💾 Persistent Model Storage

Since Clore.ai instances are ephemeral, consider mounting external storage:

```bash
# Use a named volume (persists with Docker)
docker compose down
# Models survive in the 'jan-models' named volume

# For truly persistent storage across instances,
# upload models to object storage and pull on startup:
cat > /workspace/startup.sh << 'EOF'
#!/bin/bash
docker compose up -d
sleep 30
# Pre-pull your frequently used models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
EOF
chmod +x /workspace/startup.sh
```

### 🔗 Using Jan Server as OpenAI Drop-in

```python
# Python — use existing OpenAI client libraries
from openai import OpenAI

client = OpenAI(
    base_url="http://<CLORE_IP>:1337/v1",
    api_key="not-required"  # Jan Server has no auth by default
)

response = client.chat.completions.create(
    model="llama3.2:3b-gguf-q4-km",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```

```bash
# Streaming support
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": true
  }'
```
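With `"stream": true` the server returns Server-Sent Events: `data: {json}` lines carrying `choices[0].delta.content` fragments, terminated by `data: [DONE]`. A minimal parser for those lines (a sketch against the standard OpenAI streaming format, which an OpenAI-compatible server is expected to follow):

```python
import json

def join_stream(sse_lines) -> str:
    """Concatenate delta.content fragments from OpenAI-style SSE lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)

chunks = [
    'data: {"choices":[{"delta":{"content":"GPU "}}]}',
    'data: {"choices":[{"delta":{"content":"haiku"}}]}',
    'data: [DONE]',
]
print(join_stream(chunks))  # GPU haiku
```

In practice you would feed this the response body line by line; the official `openai` client does the same parsing for you when you pass `stream=True`.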

### 📊 Monitoring Resource Usage

```bash
# Watch GPU utilization in real-time
watch -n 1 nvidia-smi

# Check container resource usage
docker stats jan-server

# View detailed logs
docker compose logs --tail=100 jan-server

# Check model load times
docker compose logs jan-server | grep -E "(loaded|started|error)"
```

***

## Troubleshooting

### Container fails to start — GPU not found

```bash
# Verify NVIDIA Docker runtime is configured
docker info | grep -i nvidia

# Test GPU access directly
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If this fails, check Docker daemon config
cat /etc/docker/daemon.json
# Should contain: {"runtimes": {"nvidia": {...}}}
```

### Model download stuck or fails

```bash
# Check disk space
df -h /root

# Check container logs for error
docker compose logs jan-server | tail -50

# Retry the pull
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
```

### Out of VRAM (CUDA out of memory)

```bash
# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Stop all running models first
curl http://localhost:1337/v1/models | jq -r '.data[].id' | while read model; do
  curl -X POST http://localhost:1337/v1/models/stop \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}"
done

# Use a more heavily quantized model (Q3 or Q4 instead of Q8)
# Q4_K_M typically uses ~50% of the Q8 VRAM requirement
```

### Cannot connect to API from outside the container

```bash
# Ensure port 1337 is bound on all interfaces
docker ps --format "table {{.Names}}\t{{.Ports}}"
# Should show: 0.0.0.0:1337->1337/tcp

# Check Clore.ai firewall rules — open port 1337 in the server settings
# Test locally first:
curl http://127.0.0.1:1337/health

# Then test from outside:
curl http://<CLORE_SERVER_IP>:<MAPPED_PORT>/health
```
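If the local curl succeeds but the remote one fails, the problem is almost always Clore's port mapping rather than Jan Server itself. A quick TCP reachability check you can run from your own machine (plain sockets, no Jan-specific assumptions):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host unresolvable
        return False

# Substitute your Clore server IP and the mapped port from the dashboard:
print(port_open("127.0.0.1", 1337))
```

A `False` here with a working local curl means the port is not exposed; fix the mapping in the Clore.ai server settings rather than debugging the container.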

### Slow inference (CPU fallback)

```bash
# Confirm CUDA is being used (not CPU)
docker exec jan-server cortex ps
# Should show GPU memory allocated

# Force GPU layers in model start
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km", "ngl": 99}'
```

***

## Further Reading

* [Jan.ai Official Documentation](https://jan.ai/docs) — Full platform docs
* [Jan GitHub Repository](https://github.com/janhq/jan) — Source code and issues
* [Jan Server / Jan API](https://github.com/janhq/jan-server) — Server-specific docs
* [Cortex.cpp Engine](https://github.com/janhq/cortex.cpp) — The underlying inference engine
* [Clore.ai Getting Started](https://docs.clore.ai/guides/getting-started/getting-started) — Platform basics
* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) — Choose the right GPU
* [Running Ollama on Clore.ai](https://docs.clore.ai/guides/language-models/ollama) — Alternative LLM server
* [Running vLLM on Clore.ai](https://docs.clore.ai/guides/language-models/vllm) — High-throughput inference server
* [Hugging Face Model Hub](https://huggingface.co/models?library=gguf) — Find GGUF models

> 💡 **Cost tip:** An RTX 3090 on Clore.ai (\~$0.20/hr) can run Llama 3.1 8B at **\~50 tokens/second** — enough for personal use or low-traffic APIs. For production workloads, consider vLLM (see [vLLM guide](https://docs.clore.ai/guides/language-models/vllm)) on an A100.
