# Jan.ai Offline Assistant

## Overview

[Jan.ai](https://github.com/janhq/jan) is an open-source, privacy-first ChatGPT alternative with over 40,000 GitHub stars. While Jan is best known as a desktop application, its server component — **Jan Server** — exposes a fully OpenAI-compatible REST API that can be deployed on cloud GPU infrastructure like Clore.ai.

Jan Server is built on the [Cortex.cpp](https://github.com/janhq/cortex.cpp) inference engine, a high-performance runtime that supports `llama.cpp`, `TensorRT-LLM`, and ONNX backends. On Clore.ai you can rent a GPU server for as little as **\~$0.10/hr** (RTX 3060) or **\~$0.20/hr** (RTX 3090, the recommended tier), run Jan Server with Docker Compose, load any GGUF or GPTQ model, and serve it over an OpenAI-compatible API, all without your data leaving the machine.

**Key features:**

* 🔒 100% offline — no data ever leaves your server
* 🤖 OpenAI-compatible API (`/v1/chat/completions`, `/v1/models`, etc.)
* 📦 Model hub with one-command model downloads
* 🚀 GPU acceleration via CUDA (llama.cpp + TensorRT-LLM backends)
* 💬 Built-in conversation management and thread history
* 🔌 Drop-in replacement for OpenAI in existing applications

***

## Requirements

### Hardware Requirements

| Tier             | GPU           | VRAM  | RAM    | Storage    | Clore.ai Price |
| ---------------- | ------------- | ----- | ------ | ---------- | -------------- |
| **Minimum**      | RTX 3060 12GB | 12 GB | 16 GB  | 50 GB SSD  | \~$0.10/hr     |
| **Recommended**  | RTX 3090      | 24 GB | 32 GB  | 100 GB SSD | \~$0.20/hr     |
| **High-end**     | RTX 4090      | 24 GB | 64 GB  | 200 GB SSD | \~$0.35/hr     |
| **Large models** | A100 80GB     | 80 GB | 128 GB | 500 GB SSD | \~$1.10/hr     |

### Model VRAM Reference

| Model               | VRAM Required | Recommended GPU |
| ------------------- | ------------- | --------------- |
| Llama 3.1 8B (Q4)   | \~5 GB        | RTX 3060        |
| Llama 3.1 8B (FP16) | \~16 GB       | RTX 3090        |
| Llama 3.3 70B (Q4)  | \~40 GB       | A100 80GB       |
| Llama 3.1 405B (Q4) | \~220 GB      | 4× A100 80GB    |
| Mistral 7B (Q4)     | \~4 GB        | RTX 3060        |
| Qwen2.5 72B (Q4)    | \~45 GB       | A100 80GB       |
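
The figures in this table roughly follow a common rule of thumb: weight memory is about the parameter count times the bits per weight, divided by eight. The sketch below applies that rule. The function name and the 4.5 bits-per-weight figure for Q4-style quants are assumptions for illustration, and the result excludes the KV cache and runtime overhead, so treat it as a lower bound rather than an exact requirement.

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GB (10^9 bytes)."""
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billion and bits_per_weight must be positive")
    # 1 billion parameters at 8 bits each is 1 GB, so scale by bits / 8.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Assumed bits-per-weight values; real GGUF quants vary per tensor.
    for name, params, bits in [
        ("Llama 3.1 8B (FP16)", 8, 16),
        ("Llama 3.1 8B (Q4)", 8, 4.5),
        ("Llama 3.3 70B (Q4)", 70, 4.5),
    ]:
        print(f"{name}: ~{estimate_weight_vram_gb(params, bits):.1f} GB for weights")
```

Leave a few GB of headroom on top of this estimate for the context window and CUDA buffers when choosing a GPU.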

### Software Prerequisites

* Clore.ai account with funded wallet
* Basic Docker knowledge
* (Optional) OpenSSH client for port forwarding

***

## Quick Start

### Step 1 — Rent a GPU Server on Clore.ai

1. Browse to [clore.ai](https://clore.ai) and log in
2. Filter servers: **GPU Type** → RTX 3090 or better, **Docker** → enabled
3. Select a server and choose the **Docker** deployment option
4. Use the official `nvidia/cuda:12.1.0-devel-ubuntu22.04` base image or any CUDA image
5. Open ports: **1337** (Jan Server API), **39281** (Cortex API), **22** (SSH)

### Step 2 — Connect to Your Server

```bash
# SSH into your Clore.ai server
ssh -p <CLORE_SSH_PORT> root@<CLORE_SERVER_IP>

# Verify GPU is available
nvidia-smi
```

### Step 3 — Install Docker Compose (if not present)

```bash
# Check if Docker Compose is available
docker compose version

# Install if missing (Ubuntu/Debian)
apt-get update && apt-get install -y docker-compose-plugin

# Verify
docker compose version
```

### Step 4 — Deploy Jan Server with Docker Compose

```bash
# Create working directory
mkdir -p /workspace/jan-server && cd /workspace/jan-server

# Download the official Jan Server docker-compose.yml
curl -fsSL https://raw.githubusercontent.com/janhq/jan-server/main/docker-compose.yml \
  -o docker-compose.yml

# Review and edit configuration
cat docker-compose.yml
```

If the upstream compose file is unavailable or you want full control, create it manually:

```yaml
# /workspace/jan-server/docker-compose.yml
version: '3.8'

services:
  jan-server:
    image: ghcr.io/janhq/cortex:latest
    container_name: jan-server
    restart: unless-stopped
    ports:
      - "1337:1337"
      - "39281:39281"
    volumes:
      - jan-data:/root/jan
      - jan-models:/root/cortex/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - JAN_API_HOST=0.0.0.0
      - JAN_API_PORT=1337
      - CORTEX_API_PORT=39281
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

volumes:
  jan-data:
    driver: local
  jan-models:
    driver: local
```

```bash
# Start Jan Server
docker compose up -d

# Follow startup logs (wait for "Server started" message)
docker compose logs -f jan-server
```

### Step 5 — Verify the Server is Running

```bash
# Check server health
curl http://localhost:1337/health

# List available models (initially empty)
curl http://localhost:1337/v1/models

# Expected response:
# {"object":"list","data":[]}
```
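
In scripts, a fixed wait before the first request tends to be fragile. The following sketch polls the `/health` endpoint used above until it answers or a retry budget runs out. The function names, retry counts, and the injectable `probe` argument are illustrative choices, not part of Jan Server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(
    url: str = "http://localhost:1337/health",
    attempts: int = 30,
    delay_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling `wait_until_healthy()` with the defaults waits for up to about a minute, which matches the `start_period` in the compose file's health check.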

### Step 6 — Pull Your First Model

```bash
# Pull Llama 3.2 3B (good starter, ~2GB)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Or pull Mistral 7B Instruct Q4
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Monitor download progress
curl http://localhost:1337/v1/models
```
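
To check for a model programmatically, you can parse the `/v1/models` response. This sketch assumes the response has the `{"object": "list", "data": [...]}` shape shown in Step 5, with an `id` field per entry. Whether a partially downloaded model already appears in that list is not confirmed here, so treat `is_pulled` as a heuristic.

```python
import json
from typing import List


def model_ids(models_response: str) -> List[str]:
    """Extract model IDs from the JSON body returned by GET /v1/models."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def is_pulled(models_response: str, model_id: str) -> bool:
    """True once `model_id` shows up in the model list."""
    return model_id in model_ids(models_response)
```

For example, `is_pulled(body, "llama3.2:3b-gguf-q4-km")` returns `False` for the empty list that a fresh server reports.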

### Step 7 — Start the Model & Chat

```bash
# Start the model (loads it into GPU VRAM)
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Send your first chat request
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello! What can you help me with?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
    "stream": false
  }'
```

***

## Configuration

### Environment Variables

| Variable               | Default               | Description                                    |
| ---------------------- | --------------------- | ---------------------------------------------- |
| `JAN_API_HOST`         | `0.0.0.0`             | Host to bind the API server                    |
| `JAN_API_PORT`         | `1337`                | Jan Server API port                            |
| `CORTEX_API_PORT`      | `39281`               | Internal Cortex engine port                    |
| `CUDA_VISIBLE_DEVICES` | unset (all GPUs)      | Which GPUs to expose (comma-separated indices) |
| `JAN_DATA_FOLDER`      | `/root/jan`           | Path to Jan data folder                        |
| `CORTEX_MODELS_PATH`   | `/root/cortex/models` | Path to model storage                          |

### Multi-GPU Configuration

For servers with multiple GPUs (e.g., 2× RTX 3090 on Clore.ai):

```yaml
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # Use both GPUs
```

Or to dedicate specific GPUs:

```bash
# Run Jan Server on GPU 0 only
docker run -d \
  --name jan-server \
  --gpus '"device=0"' \
  -p 1337:1337 \
  -v jan-data:/root/jan \
  -v jan-models:/root/cortex/models \
  ghcr.io/janhq/cortex:latest
```

### Custom Model Configuration

```bash
# List all pulled models
curl http://localhost:1337/v1/models | jq '.data[].id'

# Get model details
curl http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km

# Stop a running model (free VRAM)
curl -X POST http://localhost:1337/v1/models/stop \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Delete a model (free disk space)
curl -X DELETE http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km
```
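
The same pull, start, and stop calls can be issued from Python with only the standard library. This is a minimal sketch that mirrors the `curl` commands in this guide. The helper names and `BASE_URL` are assumptions, and the endpoints are taken from the examples above rather than from an API reference.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1337"  # assumed: the port mapped in this guide


def model_action_request(
    action: str, model: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for /v1/models/{pull,start,stop}."""
    if action not in {"pull", "start", "stop"}:
        raise ValueError(f"unsupported action: {action!r}")
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/models/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

A call such as `send(model_action_request("stop", "llama3.2:3b-gguf-q4-km"))` corresponds to the stop command above.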

### Securing the API with Basic Authentication

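Jan Server does not include authentication by default, so anyone who can reach port 1337 can use the API. One option is to put Nginx in front of it as a reverse proxy with HTTP basic authentication: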

```bash
apt-get install -y nginx apache2-utils

# Create password file
htpasswd -c /etc/nginx/.htpasswd admin

# Configure Nginx
cat > /etc/nginx/sites-available/jan-server << 'EOF'
server {
    listen 80;
    server_name _;

    location / {
        auth_basic "Jan Server";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:1337;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
    }
}
EOF

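# Enable the site; remove the stock default site so it does not shadow this one
rm -f /etc/nginx/sites-enabled/default
ln -sf /etc/nginx/sites-available/jan-server /etc/nginx/sites-enabled/jan-server
nginx -t && systemctl restart nginx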
```
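
Clients behind this proxy need to send an `Authorization: Basic ...` header. The sketch below builds that header with the standard library. The function names and the sample credentials are placeholders, and the URL you pass in depends on how the Nginx port is mapped on your server.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value for an HTTP Basic `Authorization` header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def authed_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request that carries basic-auth credentials."""
    return urllib.request.Request(
        url,
        headers={"Authorization": basic_auth_header(username, password)},
    )
```

Basic authentication only encodes the credentials; it does not encrypt them. On a public IP, consider terminating TLS in Nginx or reaching the proxy through an SSH tunnel.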

***

## GPU Acceleration

### Verifying CUDA Acceleration

Jan Server's Cortex engine auto-detects CUDA. Verify it's using the GPU:

```bash
# Check GPU memory usage after loading a model
nvidia-smi

# Should show the cortex process consuming VRAM
# Example output:
# | Processes:                                                            |
# |  GPU   GI   CI        PID   Type   Process name            GPU Memory |
# |    0    N/A  N/A    12345    C   /usr/local/bin/cortex    8192MiB |
```

### Switching Inference Backends

Cortex supports multiple backends:

```bash
# Check which backends are available inside the container
docker exec jan-server cortex engines list

# Use TensorRT-LLM backend for NVIDIA GPUs (faster, requires more setup)
docker exec jan-server cortex engines install tensorrt-llm

# Use llama.cpp backend (default, most compatible)
docker exec jan-server cortex engines install llama-cpp
```

### Context Window and Batch Size Tuning

```bash
# Customize model parameters for GPU performance
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "ctx_len": 8192,
    "ngl": 99,
    "n_batch": 512,
    "n_parallel": 4,
    "cpu_threads": 8
  }'
```

| Parameter    | Description                          | Recommendation                    |
| ------------ | ------------------------------------ | --------------------------------- |
| `ngl`        | GPU layers (higher = more GPU usage) | Set to `99` to max out GPU        |
| `ctx_len`    | Context window size                  | 4096–32768 depending on VRAM      |
| `n_batch`    | Batch size for prompt processing     | 512 for RTX 3090, 256 for smaller |
| `n_parallel` | Concurrent request slots             | 4–8 for API server use            |
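
The `ctx_len` recommendation depends on VRAM because the KV cache grows linearly with context length. The sketch below uses the standard transformer KV-cache formula. The architecture numbers for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) and the unquantized FP16 cache are assumptions made for the example, and how Cortex actually allocates the cache across `n_parallel` slots is not covered.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    ctx_len: int,
    bytes_per_element: int = 2,  # FP16 cache assumed
) -> int:
    """Size of the key/value cache for one sequence of `ctx_len` tokens."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_element


if __name__ == "__main__":
    # Llama 3.1 8B style architecture (assumed values, see lead-in).
    for ctx in (4096, 8192, 32768):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
        print(f"ctx_len={ctx}: ~{gib:.1f} GiB of KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so every 8192 tokens of context adds about 1 GiB on top of the model weights.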

***

## Tips & Best Practices

### 🎯 Model Selection for Clore.ai Budgets

```bash
# Minimum tier (~$0.10/hr, RTX 3060 12GB):
# Use Q4_K_M quants of 7B models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Recommended tier (~$0.20/hr, RTX 3090 24GB):
# Use Q5_K_M quants of 8B to 13B models, or Q4 quants of ~30B models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b-instruct-gguf-q5-km"}'

# Large-model tier (~$1.10/hr, A100 80GB):
# Run 70B models as Q4 quants (about 40 GB of VRAM)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b-instruct-gguf-q4-km"}'
```
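
The tier choice can be expressed as a small lookup: pick the cheapest GPU whose VRAM covers the model plus some headroom. The sketch below copies the tiers and approximate prices from the Hardware Requirements table in this guide. The 2 GB default headroom is an assumption, and actual Clore.ai prices vary by listing.

```python
from typing import List, NamedTuple, Optional


class Tier(NamedTuple):
    gpu: str
    vram_gb: int
    usd_per_hour: float


# Values copied from the Hardware Requirements table (approximate prices).
TIERS: List[Tier] = [
    Tier("RTX 3060 12GB", 12, 0.10),
    Tier("RTX 3090", 24, 0.20),
    Tier("RTX 4090", 24, 0.35),
    Tier("A100 80GB", 80, 1.10),
]


def cheapest_tier(required_vram_gb: float, headroom_gb: float = 2.0) -> Optional[Tier]:
    """Return the cheapest tier that fits the model, or None if none does."""
    needed = required_vram_gb + headroom_gb
    candidates = [tier for tier in TIERS if tier.vram_gb >= needed]
    return min(candidates, key=lambda tier: tier.usd_per_hour) if candidates else None
```

With the VRAM figures from the Model VRAM Reference table, `cheapest_tier(5)` selects the RTX 3060 and `cheapest_tier(45)` selects the A100 80GB, while a 220 GB model returns `None` because it needs a multi-GPU server.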

### 💾 Persistent Model Storage

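Clore.ai instances are ephemeral, so plan for models to be restored or re-downloaded on each new instance:

```bash
# Named volumes survive `docker compose down` (but not `docker compose down -v`),
# so models stay in the 'jan-models' volume for as long as the instance exists
docker compose down

# Named volumes do not outlive the rented instance. To restore models on a
# fresh instance, re-pull them on startup (or sync them from your own object storage):
cat > /workspace/startup.sh << 'EOF'
#!/bin/bash
cd /workspace/jan-server || exit 1
docker compose up -d
# Wait for the API instead of sleeping a fixed time (up to ~2 minutes)
for _ in $(seq 1 60); do
  curl -fs http://localhost:1337/health > /dev/null && break
  sleep 2
done
# Pre-pull your frequently used models
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
EOF
chmod +x /workspace/startup.sh
```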

### 🔗 Using Jan Server as OpenAI Drop-in

```python
# Python — use existing OpenAI client libraries
from openai import OpenAI

client = OpenAI(
    base_url="http://<CLORE_IP>:1337/v1",
    api_key="not-required"  # Jan Server has no auth by default
)

response = client.chat.completions.create(
    model="llama3.2:3b-gguf-q4-km",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```

```bash
# Streaming support
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": true
  }'
```
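
With `"stream": true`, OpenAI-style servers send the reply as server-sent events. The parser below is a sketch that assumes Jan Server follows the OpenAI chunk format, that is, `data: {...}` lines carrying `choices[].delta.content` and a final `data: [DONE]` line. The guide only states that the API is OpenAI-compatible, so verify this against real output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming response lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"role":"assistant"}}]}',
        'data: {"choices":[{"delta":{"content":"Silent fans"}}]}',
        'data: {"choices":[{"delta":{"content":" spin up"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

The function accepts any iterable of decoded lines, so it can be fed from a response object read line by line or from captured `curl` output.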

### 📊 Monitoring Resource Usage

```bash
# Watch GPU utilization in real-time
watch -n 1 nvidia-smi

# Check container resource usage
docker stats jan-server

# View detailed logs
docker compose logs --tail=100 jan-server

# Check model load times
docker compose logs jan-server | grep -E "(loaded|started|error)"
```

***

## Troubleshooting

### Container fails to start — GPU not found

```bash
# Verify NVIDIA Docker runtime is configured
docker info | grep -i nvidia

# Test GPU access directly
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If this fails, check Docker daemon config
cat /etc/docker/daemon.json
# Should contain: {"runtimes": {"nvidia": {...}}}
```

### Model download stuck or fails

```bash
# Check disk space
df -h /root

# Check container logs for error
docker compose logs jan-server | tail -50

# Retry the pull
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
```

### Out of VRAM (CUDA out of memory)

```bash
# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Stop all running models first
curl -s http://localhost:1337/v1/models | jq -r '.data[].id' | while read -r model; do
  curl -X POST http://localhost:1337/v1/models/stop \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}"
done

# Use a more heavily quantized model (Q3 or Q4 instead of Q8)
# Q4_K_M typically uses ~50% of the Q8 VRAM requirement
```
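
To act on the VRAM numbers in a script, the CSV output of the `nvidia-smi --query-gpu=memory.used,memory.free --format=csv` command above can be parsed. This sketch assumes the default CSV layout, with a header row followed by one row per GPU and values such as `8192 MiB`. The function name and the dictionary keys are illustrative.

```python
import csv
import io
from typing import Dict, List


def parse_gpu_memory(csv_text: str) -> List[Dict[str, int]]:
    """Parse `memory.used,memory.free` CSV rows into MiB integers per GPU."""
    reader = csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    rows = list(reader)
    gpus = []
    for row in rows[1:]:  # rows[0] is the header line
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        used, free = row
        gpus.append({
            "used_mib": int(used.split()[0]),  # "8192 MiB" -> 8192
            "free_mib": int(free.split()[0]),
        })
    return gpus
```

Comparing `free_mib` against a model's expected footprint before calling `/v1/models/start` is one way to avoid running into out-of-memory errors.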

### Cannot connect to API from outside the container

```bash
# Ensure port 1337 is bound on all interfaces
docker ps --format "table {{.Names}}\t{{.Ports}}"
# Should show: 0.0.0.0:1337->1337/tcp

# Check Clore.ai firewall rules — open port 1337 in the server settings
# Test locally first:
curl http://127.0.0.1:1337/health

# Then test from outside:
curl http://<CLORE_SERVER_IP>:<MAPPED_PORT>/health
```
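
When the `curl` test from outside hangs or fails, it helps to separate "port not reachable" from "server not answering HTTP". The sketch below performs only a TCP connection test. The function name is illustrative, and the host and port you pass are whatever Clore.ai shows as the mapped address for port 1337.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

If `port_open` returns `True` but `/health` still fails, the port mapping is fine and the problem is more likely inside the container; if it returns `False`, check the port settings of the Clore.ai order first.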

### Slow inference (CPU fallback)

```bash
# Confirm CUDA is being used (not CPU)
docker exec jan-server cortex ps
# Should show GPU memory allocated

# Force GPU layers in model start
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km", "ngl": 99}'
```

***

## Further Reading

* [Jan.ai Official Documentation](https://jan.ai/docs) — Full platform docs
* [Jan GitHub Repository](https://github.com/janhq/jan) — Source code and issues
* [Jan Server / Jan API](https://github.com/janhq/jan-server) — Server-specific docs
* [Cortex.cpp Engine](https://github.com/janhq/cortex.cpp) — The underlying inference engine
* [Clore.ai Getting Started](/guides/getting-started/getting-started.md) — Platform basics
* [GPU Comparison Guide](/guides/getting-started/gpu-comparison.md) — Choose the right GPU
* [Running Ollama on Clore.ai](/guides/language-models/ollama.md) — Alternative LLM server
* [Running vLLM on Clore.ai](/guides/language-models/vllm.md) — High-throughput inference server
* [Hugging Face Model Hub](https://huggingface.co/models?library=gguf) — Find GGUF models

> 💡 **Cost tip:** An RTX 3090 on Clore.ai (\~$0.20/hr) can run Llama 3.1 8B at **\~50 tokens/second** — enough for personal use or low-traffic APIs. For production workloads, consider vLLM (see [vLLM guide](/guides/language-models/vllm.md)) on an A100.


