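# Jan.ai Offline Assistant

## Overview

[Jan.ai](https://github.com/janhq/jan) is an open-source, privacy-first ChatGPT alternative with more than 40,000 GitHub stars. Jan is best known as a desktop application, but its server component, **Jan Server**, exposes a fully OpenAI-compatible REST API and can be deployed on cloud GPU infrastructure such as Clore.ai.

Jan Server is built on the [Cortex.cpp](https://github.com/janhq/cortex.cpp) inference engine, a high-performance runtime that supports `llama.cpp`, `TensorRT-LLM`, and ONNX backends. On Clore.ai you can rent a GPU server for as little as **$0.20/hour**, run Jan Server with Docker Compose, load any GGUF or GPTQ model, and serve it through an OpenAI-compatible API, with no data leaving the machine.

**Key features:**

* 🔒 100% offline: data never leaves your server
* 🤖 OpenAI-compatible API (`/v1/chat/completions`, `/v1/models`, etc.)
* 📦 Model hub with one-click model downloads
* 🚀 GPU acceleration via CUDA (llama.cpp and TensorRT-LLM backends)
* 💬 Built-in conversation management and thread history
* 🔌 Drop-in replacement for OpenAI in existing applications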

***

## Requirements

### Hardware requirements

| Tier             | GPU             | VRAM  | RAM    | Storage    | Clore.ai price |
| ---------------- | --------------- | ----- | ------ | ---------- | -------------- |
| **Minimum**      | RTX 3060 12GB   | 12 GB | 16 GB  | 50 GB SSD  | \~$0.10/hour   |
| **Recommended**  | RTX 3090        | 24 GB | 32 GB  | 100 GB SSD | \~$0.20/hour   |
| **High-end**     | 24 GB-class GPU | 24 GB | 64 GB  | 200 GB SSD | \~$0.35/hour   |
| **Large models** | A100 80GB       | 80 GB | 128 GB | 500 GB SSD | \~$1.10/hour   |

### Model VRAM reference

| Model               | VRAM required | Recommended GPU |
| ------------------- | ------------- | --------------- |
| Llama 3.1 8B (Q4)   | \~5 GB        | RTX 3060 12GB   |
| Llama 3.1 8B (FP16) | \~16 GB       | RTX 3090        |
| Llama 3.3 70B (Q4)  | \~40 GB       | A100 80GB       |
| Llama 3.1 405B (Q4) | \~220 GB      | 4× A100 80GB    |
| Mistral 7B (Q4)     | \~4 GB        | RTX 3060 12GB   |
| Qwen2.5 72B (Q4)    | \~45 GB       | A100 80GB       |
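
The figures in the table can be sanity-checked with a common rule of thumb: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is only an approximation. The 4.5 bits per weight used for Q4-style quantization and the 1.5 GB runtime overhead are assumptions for illustration, not values from Jan or Cortex.

```python
# Rough VRAM estimate for model weights: a rule of thumb, not a measurement.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed fixed overhead fit into vram_gb."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # 4.5 bits/weight is an assumed average for Q4_K_M-style quantization.
    for name, params, bits in [("8B Q4", 8, 4.5), ("8B FP16", 8, 16), ("70B Q4", 70, 4.5)]:
        print(f"{name}: ~{weights_gb(params, bits):.1f} GB of weights")
```

The context window adds KV-cache memory on top of the weights, so treat the result as a lower bound when picking a GPU.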

### Software prerequisites

* A Clore.ai account with a funded wallet
* Basic Docker knowledge
* (Optional) An OpenSSH client for port forwarding

***

## Quick start

### Step 1: Rent a GPU server on Clore.ai

1. Browse to [clore.ai](https://clore.ai) and log in
2. Filter servers: **GPU type** → RTX 3090 or better, **Docker pre-installed** → enabled
3. Pick a server and choose the **Docker pre-installed** deployment option
4. Use the official `nvidia/cuda:12.1.0-devel-ubuntu22.04` base image or any other CUDA image
5. Open ports: **1337** (Jan Server API), **39281** (Cortex API), **22** (SSH)

### Step 2: Connect to your server

```bash
# SSH into your Clore.ai server
ssh -p <CLORE_SSH_PORT> root@<CLORE_SERVER_IP>

# Verify that the GPU is available
nvidia-smi
```

### Step 3: Install Docker Compose (if not installed)

```bash
# Check whether Docker Compose is available
docker compose version

# Install it if missing (Ubuntu/Debian)
apt-get update && apt-get install -y docker-compose-plugin

# Verify
docker compose version
```

### Step 4: Deploy Jan Server with Docker Compose

```bash
# Create a working directory
mkdir -p /workspace/jan-server && cd /workspace/jan-server

# Download the official Jan Server docker-compose.yml
curl -fsSL https://raw.githubusercontent.com/janhq/jan-server/main/docker-compose.yml \
  -o docker-compose.yml

# Review the configuration and edit it if needed
cat docker-compose.yml
```

If the upstream compose file is unavailable, or you want full control, create it manually:

```yaml
# /workspace/jan-server/docker-compose.yml
version: '3.8'

services:
  jan-server:
    image: ghcr.io/janhq/cortex:latest
    container_name: jan-server
    restart: unless-stopped
    ports:
      - "1337:1337"
      - "39281:39281"
    volumes:
      - jan-data:/root/jan
      - jan-models:/root/cortex/models
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - JAN_API_HOST=0.0.0.0
      - JAN_API_PORT=1337
      - CORTEX_API_PORT=39281
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:1337/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

volumes:
  jan-data:
    driver: local
  jan-models:
    driver: local
```

```bash
# Start Jan Server
docker compose up -d

# Follow the startup logs (wait for the "Server started" message)
docker compose logs -f jan-server
```

### Step 5: Verify that the server is running

```bash
# Check server health
curl http://localhost:1337/health

# List available models (initially empty)
curl http://localhost:1337/v1/models

# Expected response:
# {"object":"list","data":[]}
```
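
When the check runs from a script, the server may still be starting, so a single `curl` can fail spuriously. A minimal polling sketch using only the Python standard library follows. It assumes the `/health` endpoint shown above returns HTTP 200 once the server is ready; the helper names `http_ok` and `wait_until_healthy` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(url: str, attempts: int = 30, delay: float = 2.0,
                       probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll url until probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy("http://localhost:1337/health")` on the server returns `True` as soon as the endpoint responds, or `False` after about a minute with the default settings. The `probe` parameter exists so the loop can be exercised without a running server.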

### Step 6: Pull your first model

```bash
# Pull Llama 3.2 3B (a good starter model, ~2 GB)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Or pull Mistral 7B Instruct Q4
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Monitor download progress
curl http://localhost:1337/v1/models
```

### Step 7: Start the model and chat

```bash
# Start the model (loads it into GPU VRAM)
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Send your first chat request
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello! What can you help me with?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
    "stream": false
  }'
```

***

## Configuration

### Environment variables

| Variable               | Default               | Description                                     |
| ---------------------- | --------------------- | ----------------------------------------------- |
| `JAN_API_HOST`         | `0.0.0.0`             | Host the API server binds to                    |
| `JAN_API_PORT`         | `1337`                | Jan Server API port                             |
| `CORTEX_API_PORT`      | `39281`               | Internal Cortex engine port                     |
| `CUDA_VISIBLE_DEVICES` | `all`                 | Which GPUs to expose (comma-separated indices)  |
| `JAN_DATA_FOLDER`      | `/root/jan`           | Path to the Jan data folder                     |
| `CORTEX_MODELS_PATH`   | `/root/cortex/models` | Model storage path                              |

### Multi-GPU configuration

For servers with multiple GPUs (for example 2× RTX 3090 on Clore.ai):

```yaml
environment:
  - CUDA_VISIBLE_DEVICES=0,1  # use both GPUs
```

Or dedicate a specific GPU:

```bash
# Run Jan Server on GPU 0 only
docker run -d \
  --name jan-server \
  --gpus '"device=0"' \
  -p 1337:1337 \
  -v jan-data:/root/jan \
  -v jan-models:/root/cortex/models \
  ghcr.io/janhq/cortex:latest
```

### Custom model configuration

```bash
# List all pulled models
curl http://localhost:1337/v1/models | jq '.data[].id'

# Get model details
curl http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km

# Stop a running model (frees VRAM)
curl -X POST http://localhost:1337/v1/models/stop \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-gguf-q4-km"}'

# Delete a model (frees disk space)
curl -X DELETE http://localhost:1337/v1/models/llama3.2:3b-gguf-q4-km
```

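### Securing the API with basic authentication

Jan Server does not include authentication by default. Use Nginx as a reverse proxy to add HTTP basic authentication: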

```bash
apt-get install -y nginx apache2-utils

# Create the password file
htpasswd -c /etc/nginx/.htpasswd admin

# Configure Nginx
cat > /etc/nginx/sites-available/jan-server << 'EOF'
server {
    listen 80;
    server_name _;

    location / {
        auth_basic "Jan Server";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:1337;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
    }
}
EOF

ln -s /etc/nginx/sites-available/jan-server /etc/nginx/sites-enabled/
nginx -t && systemctl restart nginx
```
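
Clients that go through the proxy then have to send an `Authorization` header. The sketch below shows one way to build that header with the Python standard library. The helper names are hypothetical, `admin` matches the user created with `htpasswd` above, and the password is whatever you entered at the prompt.

```python
import base64
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_models_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET /v1/models request through the proxy."""
    req = urllib.request.Request(f"{base_url.rstrip('/')}/v1/models")
    req.add_header("Authorization", basic_auth_header(user, password))
    return req
```

Passing the prepared request to `urllib.request.urlopen(...)` sends it. Without the header, Nginx answers with `401 Unauthorized` before the request reaches Jan Server.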

***

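## GPU acceleration

### Verifying CUDA acceleration

Jan Server's Cortex engine detects CUDA automatically. To verify that it is using the GPU:

```bash
# Check GPU memory usage after loading a model
nvidia-smi

# A cortex process should appear, consuming VRAM
# Example output:
# | Processes:                                                            |
# |  GPU   GI   CI        PID   Type   Process name            GPU Memory |
# |    0    N/A  N/A    12345    C   /usr/local/bin/cortex    8192MiB |
```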

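### Switching inference backends

Cortex supports multiple backends:

```bash
# Check which backends are available inside the container
docker exec jan-server cortex engines list

# Use the TensorRT-LLM backend on NVIDIA GPUs (faster, but needs more setup)
docker exec jan-server cortex engines install tensorrt-llm

# Use the llama.cpp backend (default, best compatibility)
docker exec jan-server cortex engines install llama-cpp
```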

### Context window and batch size tuning

```bash
# Customize model parameters to optimize GPU performance
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "ctx_len": 8192,
    "ngl": 99,
    "n_batch": 512,
    "n_parallel": 4,
    "cpu_threads": 8
  }'
```

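| Parameter    | Description                                                   | Recommendation                                     |
| ------------ | ------------------------------------------------------------- | -------------------------------------------------- |
| `ngl`        | Number of layers offloaded to the GPU (higher = more GPU use) | Set to `99` to offload as many layers as possible  |
| `ctx_len`    | Context window size                                           | 4096–32768, depending on VRAM                      |
| `n_batch`    | Batch size for prompt processing                              | 512 on an RTX 3090, 256 on smaller cards           |
| `n_parallel` | Number of concurrent request slots                            | 4–8 when serving an API                            |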
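
The "depending on VRAM" advice for `ctx_len` can be made concrete with the usual transformer KV-cache estimate: keys and values are stored for every layer and every token. The sketch below is a rough calculation only. The architecture numbers (28 layers, 8 KV heads, head dimension 128) are assumptions for a Llama 3.2 3B-class model and do not come from this guide or from Cortex.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GiB for one full context.

    Two tensors (keys and values) are stored per layer, each holding
    n_kv_heads * head_dim values per token.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed architecture: 28 layers, 8 KV heads, head dimension 128, FP16 cache.
    for ctx in (4096, 8192, 32768):
        print(f"ctx_len={ctx}: ~{kv_cache_gib(28, 8, 128, ctx):.2f} GiB")
```

Because the cache grows linearly with `ctx_len`, doubling the context doubles this figure, which is worth checking before raising the value on a 12 GB card.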

***

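## Tips and best practices

### 🎯 Choosing models for your Clore.ai budget

```bash
# Budget tier (~$0.10/hour, RTX 3060 12GB):
# use the Q4_K_M quantization of a 7B model
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'

# Standard tier (~$0.20/hour, RTX 3090 24GB):
# use Q5_K_M quantization of 13B models or Q4 of 30B models
# (this example pulls an 8B model at Q5_K_M)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b-instruct-gguf-q5-km"}'

# High-end tier (~$1.10/hour, A100 80GB):
# run a full 70B model (this example pulls the Q4_K_M quantization)
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b-instruct-gguf-q4-km"}'
```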

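### 💾 Persistent model storage

Clore.ai instances are ephemeral, so consider keeping models on external storage:

```bash
# Use named volumes (they persist across container restarts)
docker compose down
# Models remain in the named volume 'jan-models'

# For storage that truly persists across instances,
# keep models in object storage, or re-pull them at startup as this script does:
cat > /workspace/startup.sh << 'EOF'
#!/bin/bash
# Run from the directory that holds docker-compose.yml
cd /workspace/jan-server || exit 1
docker compose up -d
sleep 30
# Pre-pull the models you use most
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
EOF
chmod +x /workspace/startup.sh
```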

### 🔗 Using Jan Server as a drop-in replacement for OpenAI

```python
# Python: use the existing OpenAI client library
from openai import OpenAI

client = OpenAI(
    base_url="http://<CLORE_IP>:1337/v1",
    api_key="not-required"  # Jan Server has no authentication by default
)

response = client.chat.completions.create(
    model="llama3.2:3b-gguf-q4-km",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```

```bash
# Streaming is supported
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-gguf-q4-km",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
    "stream": true
  }'
```
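
With `"stream": true`, OpenAI-style APIs return server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below shows how such lines could be turned back into text. It assumes Jan Server follows that OpenAI chunk layout (`choices[0].delta.content`), which this guide implies but does not show; the helper names are hypothetical.

```python
import json
from typing import Iterable, Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text delta carried by one stream line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines and other fields carry no text
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def collect_text(lines: Iterable[str]) -> str:
    """Join all text deltas from an iterable of stream lines."""
    return "".join(text for text in map(parse_sse_line, lines) if text)
```

In practice the OpenAI Python client handles this parsing when called with `stream=True`; the manual version is mainly useful for debugging raw `curl` output.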

### 📊 Monitoring resource usage

```bash
# Watch GPU utilization in real time
watch -n 1 nvidia-smi

# Check container resource usage
docker stats jan-server

# View detailed logs
docker compose logs --tail=100 jan-server

# Check model load times
docker compose logs jan-server | grep -E "(loaded|started|error)"
```

***

## Troubleshooting

### Container fails to start: GPU not found

```bash
# Verify that the NVIDIA Docker runtime is configured
docker info | grep -i nvidia

# Test GPU access directly
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# If that fails, check the Docker daemon configuration
cat /etc/docker/daemon.json
# Should contain: {"runtimes": {"nvidia": {...}}}
```

### Model download stalls or fails

```bash
# Check available disk space
df -h /root

# Check the container logs for errors
docker compose logs jan-server | tail -50

# Retry the pull
curl -X POST http://localhost:1337/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km"}'
```

### Out of VRAM (CUDA out of memory)

```bash
# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Stop all running models first
curl http://localhost:1337/v1/models | jq -r '.data[].id' | while read model; do
  curl -X POST http://localhost:1337/v1/models/stop \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}"
done

# Use a more aggressively quantized model (Q3 or Q4 instead of Q8)
# Q4_K_M typically needs roughly 50% of the VRAM that Q8 does
```

### Cannot reach the API from outside the container

```bash
# Make sure port 1337 is bound on all interfaces
docker ps --format "table {{.Names}}\t{{.Ports}}"
# Should show: 0.0.0.0:1337->1337/tcp

# Check the Clore.ai firewall rules: open port 1337 in the server settings
# Test locally first:
curl http://127.0.0.1:1337/health

# Then test from outside:
curl http://<CLORE_SERVER_IP>:<MAPPED_PORT>/health
```

### Slow inference (falling back to CPU)

```bash
# Confirm that CUDA is in use (not the CPU)
docker exec jan-server cortex ps
# Should show allocated GPU memory

# Force GPU layers when starting the model
curl -X POST http://localhost:1337/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral:7b-instruct-v0.3-gguf-q4-km", "ngl": 99}'
```

***

## Further reading

* [Jan.ai official documentation](https://jan.ai/docs): full platform documentation
* [Jan GitHub repository](https://github.com/janhq/jan): source code and issue tracker
* [Jan Server / Jan API](https://github.com/janhq/jan-server): server-specific documentation
* [Cortex.cpp engine](https://github.com/janhq/cortex.cpp): the underlying inference engine
* [Getting started with Clore.ai](/guides/guides_v2-zh/kai-shi-shi-yong/getting-started.md): platform basics
* [GPU comparison guide](/guides/guides_v2-zh/kai-shi-shi-yong/gpu-comparison.md): choosing the right GPU
* [Running Ollama on Clore.ai](/guides/guides_v2-zh/yu-yan-mo-xing/ollama.md): an alternative LLM server
* [Running vLLM on Clore.ai](/guides/guides_v2-zh/yu-yan-mo-xing/vllm.md): a high-throughput inference server
* [Hugging Face model hub](https://huggingface.co/models?library=gguf): find GGUF models

> 💡 **Cost tip:** A single RTX 3090 on Clore.ai (about $0.20/hour) can generate roughly **50 tokens/second**, enough for personal use or a low-traffic API. For production workloads, consider vLLM on an A100 (see the [vLLM guide](/guides/guides_v2-zh/yu-yan-mo-xing/vllm.md)).
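
To put the cost tip in per-token terms, the arithmetic can be sketched as below. It uses only the two figures quoted in the tip ($0.20/hour and about 50 tokens/second) and assumes the GPU generates continuously for the whole rented hour, which real traffic rarely achieves.

```python
def tokens_per_hour(tokens_per_second: float) -> float:
    """Tokens generated in one hour of continuous output."""
    return tokens_per_second * 3600


def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at full, continuous utilization."""
    return price_per_hour / tokens_per_hour(tokens_per_second) * 1_000_000


if __name__ == "__main__":
    print(f"{tokens_per_hour(50):,.0f} tokens/hour")
    print(f"${cost_per_million_tokens(0.20, 50):.2f} per million tokens")
```

Idle time raises the effective figure in proportion: at 25% utilization the cost per million tokens is four times higher.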

