# Jan.ai Offline Assistant

## Overview

[Jan.ai](https://github.com/janhq/jan) is an open-source, privacy-first ChatGPT alternative with tens of thousands of GitHub stars. While Jan is best known as a desktop application, its server component — **Jan Server** — exposes a fully OpenAI-compatible REST API that can be deployed on cloud GPU infrastructure like Clore.ai.

Jan Server is built on the [Cortex.cpp](https://github.com/janhq/cortex.cpp) inference engine, a high-performance runtime that supports `llama.cpp`, `TensorRT-LLM`, and ONNX backends. On Clore.ai you can rent a GPU server for as little as **~$0.10/hr**, run Jan Server with Docker Compose, load any GGUF model, and serve it over an OpenAI-compatible API — all without your data leaving the machine.

**Key features:**

* 🔒 100% offline — no data ever leaves your server
* 🤖 OpenAI-compatible API (`/v1/chat/completions`, `/v1/models`, etc.)
* 📦 Model hub with one-command model downloads
* 🚀 GPU acceleration via CUDA (llama.cpp + TensorRT-LLM backends)
* 💬 Built-in conversation management and thread history
* 🔌 Drop-in replacement for OpenAI in existing applications

***

## Requirements

### Hardware Requirements

| Tier             | GPU           | VRAM  | RAM    | Storage    | Clore.ai Price |
| ---------------- | ------------- | ----- | ------ | ---------- | -------------- |
| **Minimum**      | RTX 3060 12GB | 12 GB | 16 GB  | 50 GB SSD  | \~$0.10/hr     |
| **Recommended**  | RTX 3090      | 24 GB | 32 GB  | 100 GB SSD | \~$0.20/hr     |
| **High-end**     | RTX 4090      | 24 GB | 64 GB  | 200 GB SSD | \~$0.35/hr     |
| **Large models** | A100 80GB     | 80 GB | 128 GB | 500 GB SSD | \~$1.10/hr     |

### Model VRAM Reference

| Model               | VRAM Required | Recommended GPU |
| ------------------- | ------------- | --------------- |
| Llama 3.1 8B (Q4)   | \~5 GB        | RTX 3060        |
| Llama 3.1 8B (FP16) | \~16 GB       | RTX 3090        |
| Llama 3.3 70B (Q4)  | \~40 GB       | A100 80GB       |
| Llama 3.1 405B (Q4) | \~220 GB      | 4× A100 80GB    |
| Mistral 7B (Q4)     | \~4 GB        | RTX 3060        |
| Qwen2.5 72B (Q4)    | \~45 GB       | A100 80GB       |

### Software Prerequisites

* Clore.ai account with funded wallet
* Basic Docker knowledge
* (Optional) OpenSSH client for port forwarding

***

## Quick Start

### Step 1 — Rent a GPU Server on Clore.ai

1. Browse to [clore.ai](https://clore.ai) and log in
2. Filter servers: **GPU Type** → RTX 3090 or better, **Docker** → enabled
3. Select a server and choose the **Docker** deployment option
4. Use the official `nvidia/cuda:12.1.0-devel-ubuntu22.04` base image or any CUDA image
5. Open ports: **1337** (Jan Server API), **39281** (Cortex API), **22** (SSH)

### Step 2 — Connect to Your Server

```bash
# SSH into your Clore.ai server
ssh -p <CLORE_SSH_PORT> root@<CLORE_SERVER_IP>

# Verify GPU is available
nvidia-smi
```

### Step 3 — Install Docker Compose (if not present)

```bash
# Check if Docker Compose is available
docker compose version

# Install if missing (Ubuntu/Debian)
apt-get update && apt-get install -y docker-compose-plugin

# Verify
docker compose version
```

### Step 4 — Deploy Jan Server with Docker Compose

```bash
# Create working directory
mkdir -p /workspace/jan-server && cd /workspace/jan-server

# Download the official Jan Server docker-compose.yml
curl -fsSL https://raw.githubusercontent.com/janhq/jan-server/main/docker-compose.yml \
  -o docker-compose.yml

# Review and edit configuration
cat docker-compose.yml
```

If the upstream compose file is unavailable or you want full control, create it manually:

```yaml
# /workspace/jan-server/docker-compose.yml
version: '3.8'

services:
  jan-server:
    image: ghcr.io/janhq/cortex:latest
    container_name: jan-server
    restart: unless-stopped
    ports:
      - "1337:1337"
      - "39281:39281"
    volumes:
      - jan-data:/root/jan
      - jan-models:/root/cortex/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - JAN_API_HOST=0.0.0.0
      - JAN_API_PORT=1337
      - CORTEX_API_PORT=39281
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

volumes:
  jan-data:
    driver: local
  jan-models:
    driver: local
```

```bash
# Start Jan Server
docker compose up -d

# Follow startup logs (wait for "Server started" message)
docker compose logs -f jan-server
```

### Step 5 — Verify the Server is Running

```bash
# Check server health
curl http://localhost:1337/health

# List available models (initially empty)
curl http://localhost:1337/v1/models

# Expected response:
# {"object":"list","data":[]}
```

### Step 6 — Pull Your First Model

```bash
# Pull Llama 3.2 3B (good starter, ~2GB)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Or pull Mistral 7B Instruct Q4
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Monitor download progress
curl http://localhost:1337/v1/models
```
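Pulls are asynchronous, so polling `/v1/models` is the simplest way to see when a download has landed. A small helper for checking that response (`model_available` is a hypothetical name; the payload shape is the OpenAI-style `{"object": "list", "data": [...]}` shown in Step 5):

```python
import json

def model_available(models_json: str, model_id: str) -> bool:
    """Return True if model_id appears in a /v1/models response body."""
    body = json.loads(models_json)
    return any(m.get("id") == model_id for m in body.get("data", []))

# Example against a canned response:
resp = '{"object":"list","data":[{"id":"llama3.2:3b-gguf-q4-km","object":"model"}]}'
print(model_available(resp, "llama3.2:3b-gguf-q4-km"))  # True
```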

### Step 7 — Start the Model & Chat

```bash
# Start the model (loads it into GPU VRAM)
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Send your first chat request
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello! What can you help me with?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
    "stream": false
  }'
```

***

## Configuration

### Environment Variables

| Variable               | Default               | Description                                    |
| ---------------------- | --------------------- | ---------------------------------------------- |
| `JAN_API_HOST`         | `0.0.0.0`             | Host to bind the API server                    |
| `JAN_API_PORT`         | `1337`                | Jan Server API port                            |
| `CORTEX_API_PORT`      | `39281`               | Internal Cortex engine port                    |
| `CUDA_VISIBLE_DEVICES` | `all`                 | Which GPUs to expose (comma-separated indices) |
| `JAN_DATA_FOLDER`      | `/root/jan`           | Path to Jan data folder                        |
| `CORTEX_MODELS_PATH`   | `/root/cortex/models` | Path to model storage                          |

### Multi-GPU Configuration

For servers with multiple GPUs (e.g., 2× RTX 3090 on Clore.ai):

```yaml
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # Use both GPUs
```

Or to dedicate specific GPUs:

```bash
# Run Jan Server on GPU 0 only
docker run -d \
  --name jan-server \
  --gpus '"device=0"' \
  -p 1337:1337 \
  -v jan-data:/root/jan \
  -v jan-models:/root/cortex/models \
  ghcr.io/janhq/cortex:latest
```

### Custom Model Configuration

```bash
# List all pulled models
curl http://localhost:1337/v1/models | jq '.data[].id'

# Get model details
curl http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km

# Stop a running model (free VRAM)
curl -X POST http://localhost:1337/v1/models/stop \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Delete a model (free disk space)
curl -X DELETE http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km
```

### Securing the API with a Token

Jan Server ships without authentication, so anyone who can reach port 1337 can use (and unload) your models. Put Nginx in front as a reverse proxy with HTTP basic auth:

```bash
apt-get install -y nginx apache2-utils

# Create password file
htpasswd -c /etc/nginx/.htpasswd admin

# Configure Nginx
cat > /etc/nginx/sites-available/jan-server << 'EOF'
server {
    listen 80;
    server_name _;

    location / {
        auth_basic "Jan Server";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:1337;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
    }
}
EOF

ln -s /etc/nginx/sites-available/jan-server /etc/nginx/sites-enabled/
nginx -t && systemctl restart nginx
```

***

## GPU Acceleration

### Verifying CUDA Acceleration

Jan Server's Cortex engine auto-detects CUDA. Verify it's using the GPU:

```bash
# Check GPU memory usage after loading a model
nvidia-smi

# Should show the cortex process consuming VRAM
# Example output:
# | Processes:                                                            |
# |  GPU   GI   CI        PID   Type   Process name            GPU Memory |
# |    0    N/A  N/A    12345    C   /usr/local/bin/cortex    8192MiB |
```

### Switching Inference Backends

Cortex supports multiple backends:

```bash
# Check which backends are available inside the container
docker exec jan-server cortex engines list

# Use TensorRT-LLM backend for NVIDIA GPUs (faster, requires more setup)
docker exec jan-server cortex engines install tensorrt-llm

# Use llama.cpp backend (default, most compatible)
docker exec jan-server cortex engines install llama-cpp
```

### Context Window and Batch Size Tuning

```bash
# Customize model parameters for GPU performance
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "ctx_len": 8192,
    "ngl": 99,
    "n_batch": 512,
    "n_parallel": 4,
    "cpu_threads": 8
  }'
```

| Parameter    | Description                          | Recommendation                    |
| ------------ | ------------------------------------ | --------------------------------- |
| `ngl`        | GPU layers (higher = more GPU usage) | Set to `99` to max out GPU        |
| `ctx_len`    | Context window size                  | 4096–32768 depending on VRAM      |
| `n_batch`    | Batch size for prompt processing     | 512 for RTX 3090, 256 for smaller |
| `n_parallel` | Concurrent request slots             | 4–8 for API server use            |
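Beyond the weights, `ctx_len` is what eats VRAM: the KV cache grows linearly with context as `2 × layers × ctx_len × kv_heads × head_dim × bytes` (the 2 covers keys and values). A quick estimator — the Llama 3.1 8B constants below (32 layers, 8 KV heads, head dimension 128) come from its public model config and are used purely as an illustration:

```python
def kv_cache_gib(ctx_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV-cache size in GiB for one sequence at full context length."""
    total = 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem
    return total / 2**30

print(kv_cache_gib(8192, 32, 8, 128))   # ~1 GiB at ctx_len=8192
print(kv_cache_gib(32768, 32, 8, 128))  # ~4 GiB at ctx_len=32768
```

Multiply by `n_parallel` if each slot holds its own full-length context, and budget accordingly before raising `ctx_len` on a 12 GB card.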

***

## Tips & Best Practices

### 🎯 Model Selection for Clore.ai Budgets

```bash
# Budget tier (~$0.10/hr, RTX 3060 12GB):
# Q4_K_M quants of 7B models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Standard tier (~$0.20/hr, RTX 3090 24GB):
# Q5/Q6 quants of 7-8B models, or Q4 quants up to ~30B
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b-instruct-gguf-q5-km"}'

# High-end tier (~$1.10/hr, A100 80GB):
# 70B models at Q4 and above
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b-instruct-gguf-q4-km"}'
```

### 💾 Persistent Model Storage

Since Clore.ai instances are ephemeral, consider mounting external storage:

```bash
# Use a named volume (persists with Docker)
docker compose down
# Models survive in the 'jan-models' named volume

# For truly persistent storage across instances,
# upload models to object storage and pull on startup:
cat > /workspace/startup.sh << 'EOF'
#!/bin/bash
docker compose up -d
sleep 30
# Pre-pull your frequently used models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
EOF
chmod +x /workspace/startup.sh
```

### 🔗 Using Jan Server as OpenAI Drop-in

```python
# Python — use existing OpenAI client libraries
from openai import OpenAI

client = OpenAI(
    base_url="http://<CLORE_IP>:1337/v1",
    api_key="not-required"  # Jan Server has no auth by default
)

response = client.chat.completions.create(
    model="llama3.2:3b-gguf-q4-km",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```

```bash
# Streaming support
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": true
  }'
```
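With `"stream": true` the server returns Server-Sent Events: `data: {json}` lines carrying `choices[0].delta.content` fragments, terminated by `data: [DONE]`. A minimal parser for those lines (a sketch against the standard OpenAI streaming format, which an OpenAI-compatible server is expected to follow):

```python
import json

def join_stream(sse_lines) -> str:
    """Concatenate delta.content fragments from OpenAI-style SSE lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)

chunks = [
    'data: {"choices":[{"delta":{"content":"GPU "}}]}',
    'data: {"choices":[{"delta":{"content":"haiku"}}]}',
    'data: [DONE]',
]
print(join_stream(chunks))  # GPU haiku
```

In practice you would feed this the response body line by line; the official `openai` client does the same parsing for you when you pass `stream=True`.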

### 📊 Monitoring Resource Usage

```bash
# Watch GPU utilization in real-time
watch -n 1 nvidia-smi

# Check container resource usage
docker stats jan-server

# View detailed logs
docker compose logs --tail=100 jan-server

# Check model load times
docker compose logs jan-server | grep -E "(loaded|started|error)"
```

***

## Troubleshooting

### Container fails to start — GPU not found

```bash
# Verify NVIDIA Docker runtime is configured
docker info | grep -i nvidia

# Test GPU access directly
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If this fails, check Docker daemon config
cat /etc/docker/daemon.json
# Should contain: {"runtimes": {"nvidia": {...}}}
```

### Model download stuck or fails

```bash
# Check disk space
df -h /root

# Check container logs for error
docker compose logs jan-server | tail -50

# Retry the pull
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
```

### Out of VRAM (CUDA out of memory)

```bash
# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Stop all running models first
curl http://localhost:1337/v1/models | jq -r '.data[].id' | while read model; do
  curl -X POST http://localhost:1337/v1/models/stop \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}"
done

# Use a more heavily quantized model (Q3 or Q4 instead of Q8)
# Q4_K_M typically uses ~50% of the Q8 VRAM requirement
```

### Cannot connect to API from outside the container

```bash
# Ensure port 1337 is bound on all interfaces
docker ps --format "table {{.Names}}\t{{.Ports}}"
# Should show: 0.0.0.0:1337->1337/tcp

# Check Clore.ai firewall rules — open port 1337 in the server settings
# Test locally first:
curl http://127.0.0.1:1337/health

# Then test from outside:
curl http://<CLORE_SERVER_IP>:<MAPPED_PORT>/health
```
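If the local curl succeeds but the remote one fails, the problem is almost always Clore's port mapping rather than Jan Server itself. A quick TCP reachability check you can run from your own machine (plain sockets, no Jan-specific assumptions):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host unresolvable
        return False

# Substitute your Clore server IP and the mapped port from the dashboard:
print(port_open("127.0.0.1", 1337))
```

A `False` here with a working local curl means the port is not exposed; fix the mapping in the Clore.ai server settings rather than debugging the container.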

### Slow inference (CPU fallback)

```bash
# Confirm CUDA is being used (not CPU)
docker exec jan-server cortex ps
# Should show GPU memory allocated

# Force GPU layers in model start
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km", "ngl": 99}'
```

***

## Further Reading

* [Jan.ai Official Documentation](https://jan.ai/docs) — Full platform docs
* [Jan GitHub Repository](https://github.com/janhq/jan) — Source code and issues
* [Jan Server / Jan API](https://github.com/janhq/jan-server) — Server-specific docs
* [Cortex.cpp Engine](https://github.com/janhq/cortex.cpp) — The underlying inference engine
* [Clore.ai Getting Started](https://docs.clore.ai/guides/getting-started/getting-started) — Platform basics
* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) — Choose the right GPU
* [Running Ollama on Clore.ai](https://docs.clore.ai/guides/language-models/ollama) — Alternative LLM server
* [Running vLLM on Clore.ai](https://docs.clore.ai/guides/language-models/vllm) — High-throughput inference server
* [Hugging Face Model Hub](https://huggingface.co/models?library=gguf) — Find GGUF models

> 💡 **Cost tip:** An RTX 3090 on Clore.ai (\~$0.20/hr) can run Llama 3.1 8B at **\~50 tokens/second** — enough for personal use or low-traffic APIs. For production workloads, consider vLLM (see [vLLM guide](https://docs.clore.ai/guides/language-models/vllm)) on an A100.
