# PowerInfer

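**Hybrid CPU/GPU LLM inference that exploits activation locality.** Run 70B-parameter models on a single consumer-grade GPU by intelligently partitioning computation between the CPU and the GPU.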

> 🌟 **8,000+ GitHub stars** | Developed by SJTU IPADS | MIT License

***

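## What is PowerInfer?

PowerInfer is a high-performance inference engine for large language models built on one key insight: **LLMs exhibit strong activation locality**. A small subset of neurons ("hot neurons") is activated consistently across most inference steps, while the majority of neurons stay inactive.

PowerInfer exploits this property to:

1. **Keep hot neurons on the GPU** for fast computation
2. **Offload cold neurons to CPU/RAM** without a significant loss in quality
3. **Route dynamically**, dispatching computation to the CPU or the GPU according to activation patterns

The result: you can run a 70B model with as little as **16GB of VRAM**, instead of needing 140GB+ of GPU memory to hold the full FP16 model.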
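The 140GB figure follows from simple arithmetic on the weights alone. The sketch below reproduces it; the helper name is made up for illustration, and the ~4.8 bits per weight used for the Q4 case is an approximate assumption, not a number from this guide.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in decimal GB.

    Ignores the KV cache, activations, and runtime buffers.
    """
    return n_params * bits_per_weight / 8 / 1e9


# 70B parameters at 16 bits each (FP16)
print(f"70B FP16: {weight_memory_gb(70e9, 16):.0f} GB")

# 70B parameters at an assumed ~4.8 bits each (rough figure for a Q4 quantization)
print(f"70B Q4:   {weight_memory_gb(70e9, 4.8):.0f} GB")
```

Even the quantized weights exceed a 24GB card, which is why splitting work between GPU and CPU matters for this model size.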

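### Key capabilities

* **Consumer-grade GPU support**: an RTX 3090/4090 can run 70B models
* **Neuron-aware scheduling**: a predictor decides at each inference step whether work runs on the CPU or the GPU
* **Minimal quality loss**: retains more than 95% of full-precision quality
* **llama.cpp compatible**: supports the GGUF format
* **NUMA-aware CPU offloading**: optimized for high-core-count CPUs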
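To make the hot/cold idea concrete, here is a minimal sketch of placing neurons under a GPU memory budget. It is a simplified greedy stand-in written for this guide, not PowerInfer's actual solver; the function names, the uniform per-neuron size, and the activation frequencies are all hypothetical.

```python
def split_hot_cold(activation_freq, neuron_bytes, gpu_budget_bytes):
    """Greedy placement: the most frequently activated neurons go to the GPU
    until the memory budget is exhausted; the rest stay on the CPU."""
    order = sorted(range(len(activation_freq)),
                   key=lambda i: activation_freq[i], reverse=True)
    hot, cold, used = [], [], 0
    for i in order:
        if used + neuron_bytes <= gpu_budget_bytes:
            hot.append(i)
            used += neuron_bytes
        else:
            cold.append(i)
    return hot, cold


def gpu_hit_rate(activation_freq, hot):
    """Share of all activations that land on GPU-resident neurons."""
    return sum(activation_freq[i] for i in hot) / sum(activation_freq)


# Hypothetical activation frequencies for five neurons of equal size.
freq = [0.9, 0.05, 0.7, 0.01, 0.3]
hot, cold = split_hot_cold(freq, neuron_bytes=100, gpu_budget_bytes=250)
print("hot:", hot, "cold:", cold)
print(f"GPU hit rate: {gpu_hit_rate(freq, hot):.0%}")
```

In this toy case two of five neurons fit on the GPU yet serve most activations, which is the effect activation locality is meant to exploit.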

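### Why use PowerInfer on Clore.ai?

Renting GPUs on Clore.ai costs far less than cloud alternatives. With PowerInfer you can:

* Run **Llama 2 70B** on a **single RTX 4090** (24GB VRAM)
* Cut GPU rental costs substantially compared with multi-GPU setups
* Use CPU RAM as overflow to handle long context windows
* Run models that previously required expensive A100/H100 instances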

***

## Hardware requirements

| Model size | Minimum VRAM | Recommended RAM | Performance |
| ---- | ---- | ----- | --- |
| 7B   | 4GB  | 16GB  | Excellent |
| 13B  | 6GB  | 32GB  | Very good |
| 34B  | 12GB | 64GB  | Good |
| 70B  | 16GB | 128GB | Moderate |

{% hint style="info" %}
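**The CPU matters:** PowerInfer offloads cold neurons to the CPU. High-core-count CPUs (AMD EPYC, Intel Xeon) and fast memory bandwidth significantly improve throughput for large models.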
{% endhint %}

***

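## Quick start on Clore.ai

### Step 1: Choose a server

On the [clore.ai](https://clore.ai) marketplace, filter for:

* An **NVIDIA GPU** with 16GB+ VRAM (RTX 3090, RTX 4090, A100)
* A **high CPU core count** (ideally 16+ cores)
* **64GB+ RAM** for 70B models (128GB recommended); 32GB is sufficient for 13B models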
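If you keep a shortlist of candidate servers, the criteria above can be applied mechanically. The sketch below does not call any Clore.ai API; the `Server` record and the listings are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    gpu: str
    vram_gb: int
    cpu_cores: int
    ram_gb: int
    price_per_hour: float


def meets_requirements(s, min_vram_gb=16, min_cores=16, min_ram_gb=64):
    """Apply the three filters from the checklist above."""
    return (s.vram_gb >= min_vram_gb
            and s.cpu_cores >= min_cores
            and s.ram_gb >= min_ram_gb)


def rank_candidates(servers, **requirements):
    """Return the servers that pass the filters, cheapest first."""
    ok = [s for s in servers if meets_requirements(s, **requirements)]
    return sorted(ok, key=lambda s: s.price_per_hour)


# Hypothetical listings, not real marketplace data.
LISTINGS = [
    Server("node-a", "RTX 3090", 24, 32, 128, 0.12),
    Server("node-b", "RTX 4090", 24, 8, 64, 0.70),   # too few CPU cores
    Server("node-c", "RTX 3060", 12, 24, 64, 0.05),  # too little VRAM
    Server("node-d", "A100", 40, 64, 256, 1.20),
]

for s in rank_candidates(LISTINGS):
    print(f"{s.name}: {s.gpu}, ${s.price_per_hour:.2f}/hr")
```

Passing `min_ram_gb=128` tightens the filter to the RAM recommended for 70B models.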

### Step 2: Create a custom Docker image

PowerInfer needs a custom Docker setup. Use this `Dockerfile`:

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    python3 \
    python3-pip \
    curl \
    wget \
    openssh-server \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH (change the default root password before exposing the port)
RUN mkdir /var/run/sshd && \
    echo 'root:powerinfer' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Clone and build PowerInfer
RUN git clone https://github.com/SJTU-IPADS/PowerInfer.git /app/PowerInfer
WORKDIR /app/PowerInfer

RUN mkdir build && cd build && \
    cmake .. -DLLAMA_CUBLAS=ON && \
    cmake --build . --config Release -j$(nproc)

# Install Python dependencies for the solver
RUN pip3 install torch numpy scipy

EXPOSE 22

CMD ["/bin/bash", "-c", "service ssh start && tail -f /dev/null"]
```

Build the image and push it to Docker Hub, or use it inline in Clore.ai:

```bash
docker build -t yourname/powerinfer:latest .
docker push yourname/powerinfer:latest
```

### Step 3: Deploy on Clore.ai

In your Clore.ai order, set:

* **Docker image:** `yourname/powerinfer:latest`
* **Port:** `22` (SSH); also expose `8080` if you plan to use server mode
* **Environment:** `NVIDIA_VISIBLE_DEVICES=all`

***

## Build PowerInfer from source

If you would rather build inside the container:

```bash
# SSH into your Clore.ai server
ssh root@<clore-node-ip> -p <ssh-port>

# Install prerequisites
apt-get update && apt-get install -y git cmake build-essential python3 python3-pip

# Clone PowerInfer
git clone https://github.com/SJTU-IPADS/PowerInfer.git
cd PowerInfer

# Build with CUDA support
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)

echo "Build complete!"
ls -la bin/
```

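### Verify the build

```bash
# Run from the PowerInfer repository root (cd .. if you are still in build/)
./build/bin/main --help
# Should print the PowerInfer CLI help text
```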

***

## Get a model

### Download a GGUF model

PowerInfer uses the GGUF format (the same as llama.cpp):

```bash
# Install the Hugging Face CLI
pip3 install huggingface_hub

# Download Llama 2 7B Q4 (recommended for testing)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir ./models

# Download Llama 2 70B Q4 (needs 16GB+ VRAM)
huggingface-cli download TheBloke/Llama-2-70B-Chat-GGUF \
  llama-2-70b-chat.Q4_K_M.gguf \
  --local-dir ./models
```
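Large downloads occasionally end up truncated or replaced by an error page. A quick sanity check is to read the file header: GGUF files begin with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. The helper below is a sketch written for this guide; its name is not part of any tool.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version stored in the file header.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("./models/llama-2-7b-chat.Q4_K_M.gguf")` should return a small integer for a valid download and raise `ValueError` otherwise. This only checks the header, not the integrity of the rest of the file.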

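### Generate the neuron predictor (required for PowerInfer)

PowerInfer needs a neuron activation predictor for each model. This is the key difference from llama.cpp:

```bash
# Install the Python solver dependencies
pip3 install torch numpy scipy

# Generate a predictor for your model
python3 PowerInfer/solver/solve.py \
  --model ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --output ./predictors/llama-2-7b-chat \
  --target-gpu-layers 20 \
  --gpu-memory-gb 16

# This creates the predictor files under ./predictors/
ls ./predictors/llama-2-7b-chat/
```

{% hint style="warning" %}
**Predictor generation time:** depending on model size, building the neuron predictor can take 30–60 minutes. It is a one-time operation; the predictor can be reused on later runs.
{% endhint %}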

***

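## Run inference

### Basic inference (no predictor)

For testing before a predictor has been generated (standard GPU/CPU layer split):

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --gpu-layers 20 \
  -p "Tell me about quantum computing" \
  -n 256
```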

### PowerInfer mode (with predictor)

Full PowerInfer mode with neuron-aware routing:

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  -p "What is the meaning of life?" \
  -n 512 \
  --ctx-size 4096
```

### Interactive chat mode

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  -i \
  --ctx-size 4096 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  --color
```

### Server mode (OpenAI-compatible API)

```bash
./build/bin/server \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096
```

***

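## Tuning the GPU layer split

The `--gpu-layers` parameter determines how many Transformer layers stay on the GPU. Tune it to your available VRAM:

```bash
# Check available VRAM
nvidia-smi --query-gpu=memory.free,memory.total --format=csv

# Rule of thumb for Q4 models (weights only; leave headroom for the KV cache):
# 7B:  ~0.13GB per layer -> all 32 layers fit on a 24GB card
# 13B: ~0.18GB per layer -> all 40 layers fit on a 24GB card
# 70B: ~0.5GB per layer  -> roughly 45 of 80 layers at most on a 24GB card
```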

**Layer allocation guide:**

| GPU VRAM | 7B model | 13B model | 34B model | 70B model |
| ------ | ------ | ------ | ------ | ------ |
| 8GB    | All (32) | 20 layers | 10 layers | 4 layers |
| 16GB   | All (32) | All (40) | 25 layers | 10 layers |
| 24GB   | All (32) | All (40) | All (60) | 20 layers |
| 48GB   | All (32) | All (40) | All (60) | All (80) |
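For a model that is not in the table, a rough upper bound on `--gpu-layers` can be estimated from the model file size. The sketch below assumes layers are equally sized and that a fixed reserve covers the KV cache and runtime buffers; the function name, the 2GB reserve, and the approximate file sizes in the example are assumptions rather than values from PowerInfer.

```python
def layers_that_fit(model_file_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is held back for the KV cache and runtime buffers.
    The result is an upper bound; real layer sizes and overheads vary.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed approximate Q4 file sizes: ~4.1GB for 7B, ~41.4GB for 70B.
print("7B  on 24GB:", layers_that_fit(4.08, 32, 24))
print("70B on 24GB:", layers_that_fit(41.4, 80, 24))
print("70B on 16GB:", layers_that_fit(41.4, 80, 16))
```

The table above is deliberately more conservative than this bound, so start from the table value and increase it only while `nvidia-smi` still shows free memory.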

***

## Performance benchmarks

### Throughput comparison (Llama 2 70B, RTX 3090)

| Engine | GPU layers | Tokens/sec |
| ---------------- | --------------- | ----------------- |
| llama.cpp (partial GPU offload) | 20/80 | \~4 t/s |
| llama.cpp (CPU only) | 0/80 | \~1 t/s |
| **PowerInfer** | **20/80 + predictor** | **\~12 t/s** |

{% hint style="success" %}
**3x speedup:** for large-model inference on consumer GPUs, PowerInfer's neuron-aware scheduling typically delivers about a 3x speedup over standard llama.cpp.
{% endhint %}

***

## Running as a service

Create a systemd service to keep the API server running persistently:

```bash
cat > /etc/systemd/system/powerinfer.service << 'EOF'
[Unit]
Description=PowerInfer LLM Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/app/PowerInfer
ExecStart=/app/PowerInfer/build/bin/server \
  -m /models/llama-2-13b-chat.Q4_K_M.gguf \
  --predictor-path /predictors/llama-2-13b-chat \
  --gpu-layers 30 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable powerinfer
systemctl start powerinfer
systemctl status powerinfer
```
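Loading a large model can take a while after `systemctl start`, so scripts that call the API right away may fail. One simple approach is to poll the TCP port until it accepts connections. The helper below is a sketch using only the standard library; its name and defaults are arbitrary, and an open port does not guarantee that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once the port accepts a connection,
    or False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the service above you might call `wait_for_port("127.0.0.1", 8080, timeout_s=300)` before sending the first request.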

***

## Using the API

Once the server is running, you can use any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<port>/v1",
    api_key="none"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain neural networks in simple terms"}
    ],
    max_tokens=256
)
print(response.choices[0].message.content)
```
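If you would rather not install the `openai` package, the same request can be sent with the standard library. This sketch assumes the server accepts the `/chat/completions` route under the same base URL that the client example above uses; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, max_tokens=256, model="local-model"):
    """Build a POST request for an OpenAI-style chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, timeout_s=120):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network.
req = build_chat_request("http://127.0.0.1:8080/v1", "Hello")
print(req.get_method(), req.full_url)
```

Calling `chat("http://<clore-node-ip>:<port>/v1", "Hello")` performs the actual request once the server is reachable.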

***

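## Troubleshooting

### CUDA out of memory

```bash
# Reduce the number of GPU layers
./build/bin/main -m model.gguf --gpu-layers 10  # down from 20

# Check what is using VRAM
nvidia-smi

# List processes holding the GPU devices (stop stale ones to free VRAM)
sudo fuser -v /dev/nvidia*
```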
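When you automate the choice of `--gpu-layers`, it helps to read free VRAM programmatically. The sketch below parses the output of `nvidia-smi --query-gpu=memory.free,memory.total --format=csv,noheader,nounits`, which prints one line of MiB values per GPU; the function names are illustrative.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'free, total' rows (MiB, one row per GPU) into dictionaries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        free_mib, total_mib = (int(field.strip()) for field in line.split(","))
        gpus.append({"free_mib": free_mib, "total_mib": total_mib})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed memory figures for each GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Parsing a sample line; no GPU is needed for this part.
sample = "23028, 24576\n"
print(parse_gpu_memory(sample))
```

On a GPU server, `query_gpu_memory()[0]["free_mib"] / 1024` gives the free memory of the first GPU in GiB.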

### Slow CPU inference

```bash
# Use all available CPU threads
./build/bin/main -m model.gguf --threads $(nproc) --gpu-layers 20

# Inspect the NUMA topology
numactl --hardware

# Bind to the NUMA node closest to the GPU (node 0 here; adjust to your topology)
numactl --cpunodebind=0 --membind=0 ./build/bin/main -m model.gguf
```

### Build failures

```bash
# Make sure the CUDA toolkit is installed
nvcc --version

# Check the CMake version (3.14+ required)
cmake --version

# Clean rebuild
rm -rf build && mkdir build
cd build && cmake .. -DLLAMA_CUBLAS=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
```

{% hint style="danger" %}
**Common issue:** if `cmake` cannot find CUDA, set the `CUDA_HOME` environment variable before running cmake: `export CUDA_HOME=/usr/local/cuda`.
{% endhint %}

***

## GPU recommendations on Clore.ai

PowerInfer's hybrid CPU/GPU design changes the economics of running large models. Clore.ai servers that pair a large-VRAM GPU with a capable CPU are the ideal choice.

| GPU       | VRAM  | Clore.ai price | Largest model (Q4) | Throughput (Llama 2 70B Q4) |
| --------- | ----- | ----------- | --------------- | ------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr  | 70B (needs 64GB+ RAM) | \~8–12 tokens/sec |
| RTX 4090  | 24 GB | \~$0.70/hr  | 70B (faster CPU offload) | \~12–18 tokens/sec |
| A100 40GB | 40 GB | \~$1.20/hr  | 70B (minimal offload) | \~35–45 tokens/sec |
| A100 80GB | 80 GB | \~$2.00/hr  | 70B (fully on GPU) | \~50–60 tokens/sec |

{% hint style="info" %}
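**Best pick for PowerInfer:** an RTX 3090 at about $0.12/hr running Llama 2 70B Q4 is a breakthrough for budget-conscious users. You get a 70B model at roughly 10–17x lower hourly cost than renting an A100. Throughput is lower (about 10 tokens/sec), but the value is excellent for research or low-traffic inference.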
{% endhint %}
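Hourly price alone hides the throughput difference, so it can help to compare cost per generated token. The sketch below uses the approximate prices from the table and the midpoint of each throughput range; the helper name is made up, and the midpoints are a simplification.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_sec):
    """Rental cost in USD to generate one million tokens at a steady rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# (price per hour, midpoint tokens/sec) taken from the table above.
OPTIONS = {
    "RTX 3090": (0.12, 10),
    "RTX 4090": (0.70, 15),
    "A100 40GB": (1.20, 40),
    "A100 80GB": (2.00, 55),
}

for gpu, (price, tps) in OPTIONS.items():
    print(f"{gpu}: ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

With these inputs the RTX 3090 is the cheapest per token despite being the slowest, though the estimate assumes the GPU is generating continuously for the whole rental period.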

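**The CPU matters as much as the GPU:** PowerInfer offloads "cold" neurons to the CPU. Clore.ai servers with AMD EPYC or Intel Xeon processors (many cores, high memory bandwidth) will significantly outperform single-socket consumer CPUs on large-model workloads. Check the server specs before renting.

**Memory bandwidth bottleneck:** for 70B models, CPU memory bandwidth is the limiting factor when computing cold neurons. Servers with DDR5 ECC memory or HBM-like high-bandwidth memory architectures deliver better throughput.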
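A back-of-the-envelope bound shows why bandwidth dominates: each generated token requires reading the active CPU-side weights from RAM once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below encodes that bound; the function name and all numbers in the example are illustrative assumptions, and the bound ignores compute time, the GPU portion, and PCIe transfers.

```python
def cpu_bound_tokens_per_sec(cold_weights_gb, active_fraction, mem_bandwidth_gb_s):
    """Upper bound on tokens/sec when CPU-side work is limited by memory bandwidth.

    cold_weights_gb:    weights kept in system RAM
    active_fraction:    share of those weights actually read per token (0-1]
    mem_bandwidth_gb_s: sustained memory bandwidth of the CPU
    """
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    gb_read_per_token = cold_weights_gb * active_fraction
    return mem_bandwidth_gb_s / gb_read_per_token


# Illustrative: 20GB of cold weights, 25% touched per token.
for bandwidth in (50, 100, 200):
    bound = cpu_bound_tokens_per_sec(20, 0.25, bandwidth)
    print(f"{bandwidth} GB/s -> at most {bound:.0f} tokens/sec")
```

The bound scales linearly with bandwidth, which is why multi-channel server memory helps more than extra cores once the CPU side is bandwidth-limited.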

***

## Resources

* 🐙 **GitHub:** [github.com/SJTU-IPADS/PowerInfer](https://github.com/SJTU-IPADS/PowerInfer)
* 📄 **Research paper:** [PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU](https://arxiv.org/abs/2312.12456)
* 🤗 **GGUF models:** [huggingface.co/TheBloke](https://huggingface.co/TheBloke)
* 🧩 **IPADS Lab, Shanghai Jiao Tong University:** [ipads.se.sjtu.edu.cn](https://ipads.se.sjtu.edu.cn)

