# Aphrodite Engine

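Aphrodite Engine is an optimized LLM inference server built on vLLM and tailored to the creative-writing and roleplay communities. It supports GPUs from Pascal (GTX 1000 series) onward, which makes it a good choice for running language models on older or budget CLORE.AI GPU servers where other frameworks often fail to run. Aphrodite adds a Kobold-compatible API, Mirostat sampling, and advanced text sampling algorithms that mainstream serving frameworks lack.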

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server requirements

| Parameter | Minimum                            | Recommended    |
| --------- | ---------------------------------- | -------------- |
| RAM       | 16 GB                              | 32 GB+         |
| VRAM      | 6 GB                               | 16 GB+         |
| Disk      | 40 GB                              | 150 GB+        |
| GPU       | NVIDIA Pascal+ (GTX 1060 or newer) | RTX 3090, A100 |

{% hint style="info" %}
Aphrodite Engine is one of the few LLM servers that support Pascal-generation GPUs (GTX 10xx series). This makes it well suited to budget CLORE.AI servers with older GPUs and low rental prices.
{% endhint %}

## Quick deploy on CLORE.AI

**Docker image:** `alpindale/aphrodite-engine:latest`

**Ports:** `22/tcp`, `2242/http`

**Environment variables:**

| Variable          | Example                              | Description                        |
| ----------------- | ------------------------------------ | ---------------------------------- |
| `HF_TOKEN`        | `hf_xxx...`                          | HuggingFace token for gated models |
| `APHRODITE_MODEL` | `mistralai/Mistral-7B-Instruct-v0.3` | Model to load                      |

## Step-by-step setup

### 1. Rent a GPU server on CLORE.AI

Aphrodite's broad GPU support lets you rent budget-friendly servers on the [CLORE.AI Marketplace](https://clore.ai/marketplace):

* **Pascal (GTX 1060–1080 Ti)**: 6–11 GB VRAM, runs small 3B–7B models with quantization
* **Turing (RTX 2000 series)**: 8–24 GB VRAM, runs 7B–13B models with better performance
* **Ampere (RTX 3000/A100)**: 24–80 GB VRAM, runs 30B–70B models at full speed
* **Ada (RTX 4000 series)**: 16–24 GB VRAM, best performance-to-cost ratio

### 2. Connect via SSH

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Pull the Aphrodite Engine image

```bash
docker pull alpindale/aphrodite-engine:latest
```

### 4. Start Aphrodite Engine

**Basic launch example with a 7B model:**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --host 0.0.0.0 \
    --port 2242 \
    --max-model-len 4096
```

**With a HuggingFace token (Llama 3):**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  -e HF_TOKEN=hf_your_token_here \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 2242 \
    --dtype bfloat16 \
    --max-model-len 8192
```

**With GPTQ quantization (for limited VRAM):**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
    --host 0.0.0.0 \
    --port 2242 \
    --quantization gptq \
    --max-model-len 4096
```

**With AWQ quantization:**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model casperhansen/mistral-7b-instruct-v0.2-awq \
    --host 0.0.0.0 \
    --port 2242 \
    --quantization awq \
    --max-model-len 4096
```

**Running a GGUF model (Aphrodite supports GGUF natively):**

```bash
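# First download the GGUF file from inside a running Aphrodite container.
# /root/.cache/huggingface is the mounted volume, so the file lands in /root/models on the host.
docker exec -it aphrodite bash -c "
pip install huggingface_hub
python3 -c \"from huggingface_hub import hf_hub_download; hf_hub_download(
    repo_id='TheBloke/Mistral-7B-Instruct-v0.2-GGUF',
    filename='mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    local_dir='/root/.cache/huggingface/mistral-gguf'
)\"
"

# Remove the old container (its name is reused), then launch with the GGUF file
docker rm -f aphrodite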
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/models \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model /models/mistral-gguf/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 2242 \
    --tokenizer mistralai/Mistral-7B-Instruct-v0.2
```

### 5. Verify the server

```bash
# Follow the logs
docker logs -f aphrodite

# Health check
curl http://localhost:2242/health

# List the loaded models
curl http://localhost:2242/v1/models
```
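
Model loading can take several minutes, so a script that talks to the server usually needs to wait for the health check to pass first. The following is a minimal sketch using only the Python standard library; the helper names `is_healthy` and `wait_until_ready`, and the retry defaults, are illustrative assumptions rather than part of Aphrodite.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(check, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: is_healthy("http://localhost:2242"))`, which returns `False` if the server never becomes healthy within the retry budget.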

### 6. Access through the CLORE.AI HTTP proxy

The CLORE.AI order panel provides an `http_pub` URL for port 2242. Use it in your client applications:

```
https://<order-id>-2242.clore.ai/v1
```

***

## Usage examples

### Example 1: OpenAI-compatible chat

```bash
curl http://localhost:2242/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "system", "content": "You are a creative writer who specializes in fantasy fiction."},
      {"role": "user", "content": "Start a short story about a dragon that learns to paint."}
    ],
    "max_tokens": 500,
    "temperature": 0.9,
    "top_p": 0.95
  }'
```

### Example 2: Advanced sampling with Mirostat

Aphrodite supports Mirostat sampling, which aims to produce more coherent long-form text:

```bash
curl http://localhost:2242/v1/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "Once upon a time in a cyberpunk city,",
    "max_tokens": 400,
    "mirostat_mode": 2,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.1
  }'
```
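
The same Mirostat request can be sent from Python. This is a sketch that reuses the field names from the curl call above (`mirostat_mode`, `mirostat_tau`, `mirostat_eta`); the helper names `build_mirostat_request` and `post_json` are hypothetical, and only the standard library is used.

```python
import json
import urllib.request


def build_mirostat_request(model: str, prompt: str, tau: float = 5.0,
                           eta: float = 0.1, max_tokens: int = 400) -> dict:
    """Build a /v1/completions payload with the Mirostat fields shown in the curl example."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "mirostat_mode": 2,
        "mirostat_tau": tau,
        "mirostat_eta": eta,
    }


def post_json(url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `post_json("http://localhost:2242/v1/completions", build_mirostat_request("mistralai/Mistral-7B-Instruct-v0.3", "Once upon a time in a cyberpunk city,"))` should return a response whose text sits under `choices[0]["text"]`, as in Example 5.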

### Example 3: Kobold-compatible API

Aphrodite includes a Kobold-compatible endpoint that works with KoboldAI-based frontends:

```bash
# Kobold generate endpoint
curl http://localhost:2242/api/v1/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The ship jumped into hyperspace,",
    "max_length": 200,
    "temperature": 0.8,
    "top_p": 0.92,
    "rep_pen": 1.15
  }'
```
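
A client also has to read the generated text back out of the response. The sketch below assumes the usual KoboldAI response shape, `{"results": [{"text": "..."}]}`; that shape is an assumption about the Kobold convention and is not confirmed by this guide, so check an actual response from your server first. The helper names are hypothetical.

```python
def build_kobold_request(prompt: str, max_length: int = 200, temperature: float = 0.8,
                         top_p: float = 0.92, rep_pen: float = 1.15) -> dict:
    """Build a payload for /api/v1/generate using the fields from the curl example."""
    return {
        "prompt": prompt,
        "max_length": max_length,
        "temperature": temperature,
        "top_p": top_p,
        "rep_pen": rep_pen,
    }


def kobold_texts(response: dict) -> list:
    """Collect generated strings, assuming the shape {"results": [{"text": ...}]}."""
    return [item["text"] for item in response.get("results", []) if "text" in item]
```

Returning an empty list for an unexpected response keeps a frontend from crashing on a missing key, at the cost of hiding the error; raising instead may be preferable in a script.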

### Example 4: Python client with custom samplers

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:2242/v1",
    api_key="none",
)

# Creative writing with custom sampler settings
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {
            "role": "user",
            "content": "Write a poem about the silence between the stars.",
        }
    ],
    max_tokens=300,
    temperature=1.1,
    top_p=0.95,
    frequency_penalty=0.3,
    presence_penalty=0.2,
)

print(response.choices[0].message.content)
```

### Example 5: Batch completion requests

```python
import requests

BASE_URL = "http://localhost:2242"

prompts = [
    "The ancient wizard opened his tome,",
    "In the neon-lit alley, the detective noticed",
    "The last AI on Earth said to the robot:",
]

for prompt in prompts:
    response = requests.post(
        f"{BASE_URL}/v1/completions",
        json={
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": prompt,
            "max_tokens": 150,
            "temperature": 0.85,
            "top_k": 50,
            "top_p": 0.95,
            "repetition_penalty": 1.1,
        },
    )
    result = response.json()
    print(f"Prompt: {prompt}")
    print(f"Continuation: {result['choices'][0]['text']}\n")
```
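
The loop above sends one prompt at a time and waits for each answer. Because the server accepts many concurrent sequences (see `--max-num-seqs`), sending the prompts in parallel may let it process them together. Here is a sketch with a thread pool and the standard library only; `send_completion` and `complete_all` are hypothetical helper names, and the worker count is an arbitrary choice.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:2242"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"


def send_completion(prompt: str) -> str:
    """POST one prompt to /v1/completions and return the generated text."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 150, "temperature": 0.85}
    request = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


def complete_all(prompts, send=send_completion, max_workers=8):
    """Run send(prompt) for every prompt in worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Calling `complete_all(prompts)` returns the continuations in the same order as the prompts. The `send` parameter exists so the fan-out logic can be tried without a running server.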

***

## Configuration

### Main launch parameters

| Parameter                  | Default       | Description                                 |
| -------------------------- | ------------- | ------------------------------------------- |
| `--model`                  | required      | Model ID or local path                      |
| `--host`                   | `127.0.0.1`   | Bind address                                |
| `--port`                   | `2242`        | Server port                                 |
| `--dtype`                  | `auto`        | `float16`, `bfloat16`, `float32`            |
| `--quantization`           | none          | `awq`, `gptq`, `squeezellm`, `fp8`          |
| `--max-model-len`          | model maximum | Override the maximum context length         |
| `--gpu-memory-utilization` | `0.90`        | Fraction of GPU memory to use               |
| `--tensor-parallel-size`   | `1`           | Number of GPUs for tensor parallelism       |
| `--max-num-seqs`           | `256`         | Maximum number of concurrent sequences      |
| `--trust-remote-code`      | false         | Allow custom model code                     |
| `--api-keys`               | none          | Comma-separated API keys for authentication |
| `--served-model-name`      | model name    | Custom name used in API responses           |

### Adding API key authentication

```bash
python3 -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 \
  --port 2242 \
  --api-keys "mysecretkey1,mysecretkey2"
```

Then send `Authorization: Bearer mysecretkey1` as a header in your requests.
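
As a sketch of what an authenticated client request might look like in Python, the helper below attaches the bearer header to a chat request. The name `build_authed_request` is hypothetical, and the key is the placeholder from the command above.

```python
import json
import urllib.request


def build_authed_request(url: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request that carries the API key as a bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_authed_request(
    "http://localhost:2242/v1/chat/completions",
    {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    api_key="mysecretkey1",
)
# Pass `request` to urllib.request.urlopen(...) to actually send it.
```

With the `openai` client from Example 4, the equivalent would be passing the key as `api_key` instead of `"none"`.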

### Loading a local model

```bash
# Mount your model directory and reference it
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /path/to/your/model:/model \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model /model \
    --host 0.0.0.0 \
    --port 2242
```

***

## Performance tips

### 1. Choose the right quantization for your GPU

| GPU VRAM | 7B model        | 13B model       | 30B model       |
| -------- | --------------- | --------------- | --------------- |
| 6 GB     | GPTQ/AWQ Q4     | ❌               | ❌               |
| 8 GB     | GPTQ Q4         | GPTQ Q4 (tight) | ❌               |
| 12 GB    | GPTQ Q4         | GPTQ Q4         | ❌               |
| 16 GB    | Float16 (tight) | GPTQ Q4         | ❌               |
| 24 GB    | Float16         | GPTQ Q4         | GPTQ Q4 (tight) |
| 48 GB    | Float16         | Float16         | GPTQ Q4         |
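
When a model size or GPU is not in the table, a rough weights-only estimate can help: the weights take about `parameters × bits per weight / 8` bytes. The sketch below applies that rule; the 2 GB headroom for KV cache and runtime overhead is an assumed placeholder, not a measured value, and real usage also grows with context length and batch size.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed headroom must fit into VRAM."""
    return weight_memory_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


print(weight_memory_gb(7, 16))   # 7B in float16
print(weight_memory_gb(13, 4))   # 13B at 4 bits per weight
print(fits(13, 16, 24))          # 13B in float16 on a 24 GB card
```

Quantized checkpoints store scales and other metadata on top of the raw 4-bit weights, so treat the result as a lower bound.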

### 2. Tune GPU memory utilization

```bash
--gpu-memory-utilization 0.93  # squeeze out more KV cache
```

Start with a lower value and raise it only if no OOM errors occur.

### 3. Use bfloat16 on Ampere and newer GPUs

```bash
--dtype bfloat16
```

bfloat16 keeps the exponent range of float32, so it is less prone to overflow than float16, at the same speed.
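
Whether a card counts as "Ampere or newer" can be read from its CUDA compute capability: Ampere starts at 8.0, while Pascal is 6.x and Turing is 7.5. The sketch below picks a `--dtype` value from that number. It assumes a driver recent enough for `nvidia-smi --query-gpu=compute_cap`; the helper names are hypothetical.

```python
import subprocess


def parse_compute_capability(text: str) -> tuple:
    """Turn a string such as "8.6" into the tuple (8, 6)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def pick_dtype(capability: tuple) -> str:
    """bfloat16 on compute capability 8.0 and above, float16 otherwise."""
    return "bfloat16" if capability >= (8, 0) else "float16"


def query_compute_capability() -> tuple:
    """Ask nvidia-smi for the first GPU's compute capability (needs a recent driver)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        text=True,
    )
    return parse_compute_capability(output.splitlines()[0])
```

On a rented server, `pick_dtype(query_compute_capability())` gives a value to pass to `--dtype`; the default `auto` setting remains the simpler option when you have no specific preference.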

### 4. Optimize for roleplay and creative writing

These sampler settings work well for narrative text:

```json
{
  "temperature": 0.85,
  "top_p": 0.92,
  "top_k": 40,
  "repetition_penalty": 1.12,
  "mirostat_mode": 2,
  "mirostat_tau": 5.0
}
```

### 5. Pascal GPU tips (GTX 10xx)

On Pascal GPUs, avoid Flash Attention (it is not supported):

```bash
--dtype float16  # use float32 if you hit NaN errors
--max-model-len 2048  # reduce the context length to save memory
```

***

## Troubleshooting

### Problem: "CUDA capability sm\_6x not supported"

Pascal GPUs need special handling. Use:

```bash
--dtype float16
```

If it still fails, check whether the image version supports Pascal:

```bash
docker pull alpindale/aphrodite-engine:v0.5.4  # try a specific version
```

### Problem: "Out of memory" on small GPUs

```bash
--gpu-memory-utilization 0.85
--max-model-len 2048
--quantization gptq  # or awq
```

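### Problem: slow token generation

* Check that the GPU is actually being used by running `nvidia-smi` inside the container
* Tune the batch size with `--max-num-seqs` (for example `--max-num-seqs 64`)
* Try AWQ instead of GPTQ (often faster at inference)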

### Problem: model not found / 404 error

Always check that your model name matches exactly:

```bash
curl http://localhost:2242/v1/models
```

Use the exact model name given in the response in your requests.
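
A client can run this check itself before sending requests. The sketch below assumes `/v1/models` returns the OpenAI-style list shape, `{"data": [{"id": "..."}]}`, which is an assumption based on the endpoint being OpenAI-compatible; the helper names are hypothetical.

```python
def model_ids(models_response: dict) -> list:
    """Extract model names, assuming the shape {"data": [{"id": ...}, ...]}."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def find_model(requested: str, available: list):
    """Return the exact served name, matching case-insensitively, or None if absent."""
    by_lowercase = {name.lower(): name for name in available}
    return by_lowercase.get(requested.lower())


sample = {"object": "list", "data": [{"id": "mistralai/Mistral-7B-Instruct-v0.3"}]}
served = model_ids(sample)
print(find_model("mistralai/mistral-7b-instruct-v0.3", served))
print(find_model("some/other-model", served))
```

The case-insensitive lookup only catches capitalization mistakes. If `--served-model-name` was set, the served name can differ entirely from the HuggingFace ID, and `find_model` returns `None`.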

### Problem: highly repetitive output

Add repetition penalties:

```json
{
  "repetition_penalty": 1.15,
  "frequency_penalty": 0.3
}
```

### Problem: Docker container exits silently

```bash
docker logs aphrodite 2>&1 | tail -100
# Common causes: out of VRAM, invalid model path
```

***

## Documentation

* [GitHub](https://github.com/PygmalionAI/aphrodite-engine)
* [Documentation](https://aphrodite.pygmalion.chat)
* [Docker Hub](https://hub.docker.com/r/alpindale/aphrodite-engine)
* [Supported models](https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#supported-models)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## GPU recommendations for Clore.ai

| Use case            | Recommended GPU  | Estimated cost on Clore.ai |
| ------------------- | ---------------- | -------------------------- |
| Development/testing | RTX 3090 (24GB)  | \~$0.12 per GPU per hour   |
| Production (7B–13B) | RTX 4090 (24GB)  |                            |
| Large scale         | A100 80GB        |                            |
| Large models (70B+) | A100 80GB / H100 |                            |

> Rent [Clore.ai](https://clore.ai/marketplace) GPU servers: browse available GPUs and rent by the hour, with no commitment and full root access.
