Jan.ai Offline Assistant

Deploy Jan.ai Server on Clore.ai — a fully offline, OpenAI-compatible LLM server with model hub, conversation management, and GPU-accelerated inference powered by the Cortex engine.

Overview

Jan.ai is an open-source, privacy-first ChatGPT alternative with over 40,000 GitHub stars. While Jan is best known as a desktop application, its server component — Jan Server — exposes a fully OpenAI-compatible REST API that can be deployed on cloud GPU infrastructure like Clore.ai.

Jan Server is built on the Cortex.cpp inference engine, a high-performance runtime that supports llama.cpp, TensorRT-LLM, and ONNX backends. On Clore.ai you can rent a GPU server for as little as ~$0.10/hr, run Jan Server with Docker Compose, load any GGUF model, and serve it over an OpenAI-compatible API — all without your data leaving the machine.

Key features:

  • 🔒 100% offline — no data ever leaves your server

  • 🤖 OpenAI-compatible API (/v1/chat/completions, /v1/models, etc.)

  • 📦 Model hub with one-command model downloads

  • 🚀 GPU acceleration via CUDA (llama.cpp + TensorRT-LLM backends)

  • 💬 Built-in conversation management and thread history

  • 🔌 Drop-in replacement for OpenAI in existing applications


Requirements

Hardware Requirements

| Tier | GPU | VRAM | RAM | Storage | Clore.ai Price |
|---|---|---|---|---|---|
| Minimum | RTX 3060 12GB | 12 GB | 16 GB | 50 GB SSD | ~$0.10/hr |
| Recommended | RTX 3090 | 24 GB | 32 GB | 100 GB SSD | ~$0.20/hr |
| High-end | RTX 4090 | 24 GB | 64 GB | 200 GB SSD | ~$0.35/hr |
| Large models | A100 80GB | 80 GB | 128 GB | 500 GB SSD | ~$1.10/hr |

Model VRAM Reference

| Model | VRAM Required | Recommended GPU |
|---|---|---|
| Llama 3.1 8B (Q4) | ~5 GB | RTX 3060 |
| Llama 3.1 8B (FP16) | ~16 GB | RTX 3090 |
| Llama 3.3 70B (Q4) | ~40 GB | A100 40GB |
| Llama 3.1 405B (Q4) | ~220 GB | 4× A100 80GB |
| Mistral 7B (Q4) | ~4 GB | RTX 3060 |
| Qwen2.5 72B (Q4) | ~45 GB | A100 80GB |

Software Prerequisites

  • Clore.ai account with funded wallet

  • Basic Docker knowledge

  • (Optional) OpenSSH client for port forwarding


Quick Start

Step 1 — Rent a GPU Server on Clore.ai

  1. Browse to clore.ai and log in

  2. Filter servers: GPU Type → RTX 3090 or better, Docker → enabled

  3. Select a server and choose the Docker deployment option

  4. Use the official nvidia/cuda:12.1.0-devel-ubuntu22.04 base image or any CUDA image

  5. Open ports: 1337 (Jan Server API), 39281 (Cortex API), 22 (SSH)

Step 2 — Connect to Your Server
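
Clore.ai shows the SSH endpoint on your instance card after deployment. Connect with the host, port, and key from your dashboard (the values below are placeholders):

```bash
# Host and port come from the Clore.ai instance card
ssh -p <SSH_PORT> root@<SERVER_IP>
```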

Step 3 — Install Docker Compose (if not present)
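
Most Clore.ai Docker images already include Docker itself. If the Compose plugin is missing, one way to add it on Ubuntu 22.04 (assumes root access and that Docker's apt repository is configured):

```bash
# Skip installation if the plugin already works
docker compose version || {
  apt-get update
  apt-get install -y docker-compose-plugin
}
```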

Step 4 — Deploy Jan Server with Docker Compose
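
A minimal sketch of the flow, assuming the upstream Jan repository ships a Compose file (file and profile names vary by release, so check the repo first):

```bash
git clone https://github.com/janhq/jan.git
cd jan
# Start the stack in the background; exact service names depend on the release
docker compose up -d
```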

If the upstream compose file is unavailable or you want full control, create it manually:
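
The sketch below is one way to write it. The image name janhq/jan-server:latest is an assumption for illustration; substitute the image (or a build: stanza) that matches your setup. Ports and paths follow the defaults from the Configuration section.

```yaml
# docker-compose.yml
services:
  jan-server:
    image: janhq/jan-server:latest   # ASSUMPTION: replace with your actual image or a build: stanza
    container_name: jan-server
    restart: unless-stopped
    ports:
      - "1337:1337"     # Jan Server API
      - "39281:39281"   # Cortex engine API
    environment:
      - JAN_API_HOST=0.0.0.0
      - JAN_API_PORT=1337
    volumes:
      - ./jan-data:/root/jan            # conversations, settings
      - ./models:/root/cortex/models    # downloaded models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Then bring it up with docker compose up -d.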

Step 5 — Verify the Server is Running
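
A quick check from the server itself, using the OpenAI-compatible routes listed above:

```bash
# Should return an OpenAI-style JSON model list (empty until you pull a model)
curl http://localhost:1337/v1/models
```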

Step 6 — Pull Your First Model
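
A hedged example using the Cortex CLI inside the container; the model identifier llama3.1:8b is illustrative, and the available names depend on your Cortex version's model hub:

```bash
# Pull a quantized Llama 3.1 8B from the model hub
docker exec -it jan-server cortex pull llama3.1:8b
```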

Step 7 — Start the Model & Chat
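
A sketch of the round trip. The models/start call is Cortex's model-management route and its exact shape may vary by version; the chat call uses the standard OpenAI format:

```bash
# Load the model into VRAM (Cortex model-management route; verify for your version)
curl -X POST http://localhost:39281/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b"}'

# Chat via the OpenAI-compatible endpoint
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Hello from Clore.ai!"}]
      }'
```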


Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| JAN_API_HOST | 0.0.0.0 | Host to bind the API server |
| JAN_API_PORT | 1337 | Jan Server API port |
| CORTEX_API_PORT | 39281 | Internal Cortex engine port |
| CUDA_VISIBLE_DEVICES | all | Which GPUs to expose (comma-separated indices) |
| JAN_DATA_FOLDER | /root/jan | Path to Jan data folder |
| CORTEX_MODELS_PATH | /root/cortex/models | Path to model storage |
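
These map directly onto the environment: block of the Compose file from Step 4, for example:

```yaml
    environment:
      - JAN_API_PORT=1337
      - CUDA_VISIBLE_DEVICES=0              # pin Jan to the first GPU
      - CORTEX_MODELS_PATH=/root/cortex/models
```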

Multi-GPU Configuration

For servers with multiple GPUs (e.g., 2× RTX 3090 on Clore.ai):
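
Exposing every GPU is already the default in the Step 4 Compose sketch (count: all); making it explicit via the environment block:

```yaml
    environment:
      - CUDA_VISIBLE_DEVICES=all   # llama.cpp can split layers across all visible GPUs
```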

Or to dedicate specific GPUs:
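
Set the indices reported by nvidia-smi:

```yaml
    environment:
      - CUDA_VISIBLE_DEVICES=0,1   # expose only GPUs 0 and 1 to the server
```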

Custom Model Configuration
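
Cortex reads per-model settings from a model.yml stored next to the model files. A minimal sketch using the tuning parameters covered below (field names follow Cortex's llama.cpp settings; verify them against your Cortex version):

```yaml
# e.g. /root/cortex/models/<model>/model.yml (default path from the table above)
ngl: 99          # offload all layers to the GPU
ctx_len: 8192    # context window; lower it if you hit CUDA out-of-memory
n_batch: 512     # prompt-processing batch size
n_parallel: 4    # concurrent request slots for API serving
```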

Securing the API with a Token

Jan Server does not include authentication by default. Use Nginx as a reverse proxy:
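
A minimal sketch that only forwards requests carrying a shared bearer token (the token value and listen port are placeholders):

```nginx
server {
    listen 8080;

    location / {
        # Reject anything without the expected token
        if ($http_authorization != "Bearer CHANGE_ME_SECRET") {
            return 401;
        }
        proxy_pass http://127.0.0.1:1337;
        proxy_set_header Host $host;
    }
}
```

With this in place, keep port 1337 closed in the Clore.ai deployment and expose only the proxy port.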


GPU Acceleration

Verifying CUDA Acceleration

Jan Server's Cortex engine auto-detects CUDA. Verify it's using the GPU:
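
Two quick checks (the container name jan-server comes from the Step 4 sketch):

```bash
# GPU utilization and VRAM use should jump while a completion is generating
watch -n 1 nvidia-smi

# The startup logs should mention CUDA
docker logs jan-server 2>&1 | grep -i cuda
```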

Switching Inference Backends

Cortex supports multiple backends:
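
Engines are managed through the Cortex CLI with commands along these lines, though exact engine names vary by release (check cortex engines --help):

```bash
# List installed and available engines
docker exec -it jan-server cortex engines list

# Install an alternative backend (TensorRT-LLM shown; ONNX is also supported)
docker exec -it jan-server cortex engines install tensorrt-llm
```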

Context Window and Batch Size Tuning

| Parameter | Description | Recommendation |
|---|---|---|
| ngl | GPU layers (higher = more GPU usage) | Set to 99 to max out GPU |
| ctx_len | Context window size | 4096–32768 depending on VRAM |
| n_batch | Batch size for prompt processing | 512 for RTX 3090, 256 for smaller GPUs |
| n_parallel | Concurrent request slots | 4–8 for API server use |


Tips & Best Practices

🎯 Model Selection for Clore.ai Budgets
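
Match the Model VRAM Reference to the hardware tiers above: a ~$0.10/hr RTX 3060 comfortably runs Mistral 7B or Llama 3.1 8B at Q4, the ~$0.20/hr RTX 3090 handles Llama 3.1 8B at FP16, and 70B-class models at Q4 need an A100. On consumer GPUs, Q4 quantizations give the best quality per dollar.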

💾 Persistent Model Storage

Since Clore.ai instances are ephemeral, consider mounting external storage:
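
For example, point the model volume from the Step 4 Compose file at a persistent disk (the mount point below is an example):

```yaml
    volumes:
      - /mnt/persistent/models:/root/cortex/models   # survives container teardown
```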

🔗 Using Jan Server as OpenAI Drop-in
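
The official OpenAI SDKs read their base URL and key from the environment, so existing applications can be repointed without code changes. The key is a dummy because Jan Server itself does not check it; if you added the Nginx token above, use that instead:

```bash
export OPENAI_BASE_URL="http://<SERVER_IP>:1337/v1"
export OPENAI_API_KEY="dummy"   # unused unless a proxy enforces a token
```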

📊 Monitoring Resource Usage
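
Beyond nvidia-smi (shown in the GPU Acceleration section), standard tooling covers the rest:

```bash
# Per-container CPU, memory, and network usage
docker stats jan-server

# Disk headroom for model downloads
df -h
```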


Troubleshooting

Container fails to start — GPU not found
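
First confirm the host runtime can see the GPU at all, using the base image from Step 1. If this fails, the problem is the NVIDIA container toolkit or the GPU flag, not Jan Server:

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-devel-ubuntu22.04 nvidia-smi
```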

Model download stuck or fails
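
A full disk is the most common cause; check free space, then retry the pull:

```bash
df -h
docker exec -it jan-server cortex pull llama3.1:8b
```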

Out of VRAM (CUDA out of memory)
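
Trade context and batch size for VRAM headroom using the tuning parameters from the Configuration section:

```yaml
# model.yml adjustments for tight VRAM
ctx_len: 4096    # smaller context window
n_batch: 256     # smaller prompt-processing batches
ngl: 32          # offload fewer layers if the model still does not fit
```

Or switch to a smaller quantization (e.g., Q4 instead of FP16) per the Model VRAM Reference.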

Cannot connect to API from outside the container
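
Check that the server binds 0.0.0.0 (JAN_API_HOST), that the port mapping is active, and that port 1337 was opened in the Clore.ai deployment as in Step 1:

```bash
# Port mappings on the host
docker ps --format '{{.Names}}: {{.Ports}}'

# Listening sockets inside the container (ss/netstat availability depends on the image)
docker exec -it jan-server sh -c 'ss -tlnp || netstat -tlnp'
```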

Slow inference (CPU fallback)
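
If nvidia-smi shows ~0% GPU use during generation, layers are not being offloaded; check the logs for CUDA initialization errors and make sure ngl is set high enough (see the tuning table above):

```bash
docker logs jan-server 2>&1 | grep -iE 'cuda|gpu|ngl'
```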


Further Reading

💡 Cost tip: An RTX 3090 on Clore.ai (~$0.20/hr) can run Llama 3.1 8B at ~50 tokens/second — enough for personal use or low-traffic APIs. For production workloads, consider vLLM (see vLLM guide) on an A100.
