# PowerInfer

**CPU/GPU hybrid LLM inference exploiting activation locality** — run 70B parameter models on a single consumer GPU by intelligently splitting computation between CPU and GPU.

> 🌟 **8,000+ GitHub stars** | Developed at SJTU IPADS | MIT License

***

## What is PowerInfer?

PowerInfer is a high-performance inference engine for Large Language Models that exploits a key insight: **LLMs exhibit strong activation locality** — a small subset of neurons ("hot neurons") are consistently activated across most inference steps, while the majority remain inactive.

PowerInfer uses this property to:

1. **Keep hot neurons on GPU** for fast computation
2. **Offload cold neurons to CPU/RAM** without significant quality loss
3. **Dynamically route** computation between CPU and GPU based on activation patterns

The result: you can run a 70B model with only **16GB VRAM** instead of requiring 140GB+ all on GPU.
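The sketch below (plain Python/NumPy, not PowerInfer's actual code) illustrates the idea for a single feed-forward layer: hot-neuron weights live on the GPU and are always computed, while cold-neuron weights stay in CPU RAM and are only computed when a small predictor says they are likely to activate.

```python
import numpy as np

def ffn_forward(x, hot_weights, cold_weights, predictor):
    """Hypothetical hot/cold split for one ReLU feed-forward layer."""
    # Hot neurons: always computed (resident on the GPU in the real system).
    hot_out = np.maximum(hot_weights @ x, 0.0)

    # A lightweight predictor guesses which cold neurons will fire for this input.
    likely_active = predictor(x)                 # boolean mask over cold rows

    # Cold neurons: only the predicted-active rows are computed (on the CPU);
    # the rest are treated as zero, which is what activation sparsity buys you.
    cold_out = np.zeros(cold_weights.shape[0])
    cold_out[likely_active] = np.maximum(cold_weights[likely_active] @ x, 0.0)

    return np.concatenate([hot_out, cold_out])
```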

### Key Capabilities

* **Consumer GPU support** — RTX 3090/4090 can run 70B models
* **Neuron-aware scheduling** — predictor determines CPU vs GPU routing per inference
* **Minimal quality degradation** — maintains >95% of full-precision quality
* **llama.cpp compatibility** — GGUF format support
* **NUMA-aware CPU offloading** — optimized for high core count CPUs

### Why Use PowerInfer on Clore.ai?

Clore.ai rents GPUs at far lower cost than cloud alternatives. With PowerInfer:

* Run **Llama 2 70B** on a **single RTX 4090** (24GB VRAM)
* Slash GPU rental costs vs. multi-GPU setups
* Process long context windows with CPU RAM as overflow
* Run models previously requiring expensive A100/H100 instances

***

## Hardware Requirements

| Model Size | Min VRAM | Recommended RAM | Performance |
| ---------- | -------- | --------------- | ----------- |
| 7B         | 4GB      | 16GB            | Excellent   |
| 13B        | 6GB      | 32GB            | Very Good   |
| 34B        | 12GB     | 64GB            | Good        |
| 70B        | 16GB     | 128GB           | Moderate    |

{% hint style="info" %}
**CPU matters:** PowerInfer offloads cold neurons to CPU. A high core-count CPU (AMD EPYC, Intel Xeon) with fast memory bandwidth significantly improves throughput for large models.
{% endhint %}

***

## Quick Start on Clore.ai

### Step 1: Choose Your Server

On [clore.ai](https://clore.ai) marketplace, filter for:

* **NVIDIA GPU** with 16GB+ VRAM (RTX 3090, RTX 4090, A100)
* **High CPU core count** (16+ cores ideal)
* **64GB+ RAM** for 70B models (128GB recommended), 32GB for 13B models

### Step 2: Create Custom Docker Image

PowerInfer requires a custom Docker setup. Use this `Dockerfile`:

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    python3 \
    python3-pip \
    curl \
    wget \
    openssh-server \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:powerinfer' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Clone and build PowerInfer
RUN git clone https://github.com/SJTU-IPADS/PowerInfer.git /app/PowerInfer
WORKDIR /app/PowerInfer

RUN mkdir build && cd build && \
    cmake .. -DLLAMA_CUBLAS=ON && \
    cmake --build . --config Release -j$(nproc)

# Install Python dependencies for solver
RUN pip3 install torch numpy scipy

EXPOSE 22

CMD ["/bin/bash", "-c", "service ssh start && tail -f /dev/null"]
```

Build the image and push it to a registry (e.g. Docker Hub) that Clore.ai can pull from:

```bash
docker build -t yourname/powerinfer:latest .
docker push yourname/powerinfer:latest
```

### Step 3: Deploy on Clore.ai

In your Clore.ai order, set:

* **Docker image:** `yourname/powerinfer:latest`
* **Ports:** `22` (SSH)
* **Environment:** `NVIDIA_VISIBLE_DEVICES=all`

***

## Building PowerInfer from Source

If you prefer to build inside the container:

```bash
# SSH into your Clore.ai server
ssh root@<clore-node-ip> -p <ssh-port>

# Install prerequisites
apt-get update && apt-get install -y git cmake build-essential python3 python3-pip

# Clone PowerInfer
git clone https://github.com/SJTU-IPADS/PowerInfer.git
cd PowerInfer

# Build with CUDA support
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)

echo "Build complete!"
ls -la bin/
```

### Verify Build

```bash
./build/bin/main --help
# Should output PowerInfer CLI help
```

***

## Getting Models

### Download GGUF Models

PowerInfer uses GGUF format (same as llama.cpp):

```bash
# Install HuggingFace CLI
pip3 install huggingface_hub

# Download Llama 2 7B Q4 (recommended for testing)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir ./models

# Download Llama 2 70B Q4 (requires 16GB+ VRAM)  
huggingface-cli download TheBloke/Llama-2-70B-Chat-GGUF \
  llama-2-70b-chat.Q4_K_M.gguf \
  --local-dir ./models
```

### Generate Neuron Predictor (Required for PowerInfer)

PowerInfer needs a neuron activation predictor for each model. This is the key differentiator from llama.cpp:

```bash
# Install Python solver dependencies
pip3 install torch numpy scipy

# Generate predictor for your model
python3 PowerInfer/solver/solve.py \
  --model ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --output ./predictors/llama-2-7b-chat \
  --target-gpu-layers 20 \
  --gpu-memory-gb 16

# This creates predictor files in ./predictors/
ls ./predictors/llama-2-7b-chat/
```

{% hint style="warning" %}
**Predictor generation time:** Creating a neuron predictor can take 30–60 minutes depending on model size. This is a one-time operation — the predictor is reused on subsequent runs.
{% endhint %}

***

## Running Inference

### Basic Inference (No Predictor)

For testing without predictor generation (standard GPU/CPU split):

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --gpu-layers 20 \
  -p "Tell me about quantum computing" \
  -n 256
```

### PowerInfer Mode (With Predictor)

Full PowerInfer mode with neuron-aware routing:

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  -p "What is the meaning of life?" \
  -n 512 \
  --ctx-size 4096
```

### Interactive Chat Mode

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  -i \
  --ctx-size 4096 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  --color
```

### Server Mode (OpenAI-compatible API)

```bash
./build/bin/server \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096
```
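
Before pointing external clients at the server, a quick smoke test from the node itself confirms it responds. A minimal sketch using Python's `requests` (an assumption: install it with `pip3 install requests`), hitting the same OpenAI-compatible route used in the API Usage section below:

```python
import requests

# Assumes the server above is running on this node with --port 8080.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```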

***

## Optimizing GPU Layer Split

The `--gpu-layers` parameter determines how many transformer layers to keep on GPU. Tune this based on your VRAM:

```bash
# Check available VRAM
nvidia-smi --query-gpu=memory.free,memory.total --format=csv

# Rough per-layer VRAM cost for Q4_K_M weights (leave headroom for the
# KV cache and activation buffers):
# 7B:  ~0.13GB per layer × 32 layers → fits entirely on an 8GB card
# 13B: ~0.20GB per layer × 40 layers → fits entirely on a 16GB card
# 70B: ~0.50GB per layer × 80 layers → a 24GB card holds roughly 20-40 layers
```

**Layer allocation guide:**

| GPU VRAM | 7B Model | 13B Model | 34B Model | 70B Model |
| -------- | -------- | --------- | --------- | --------- |
| 8GB      | All (32) | 20 layers | 10 layers | 4 layers  |
| 16GB     | All (32) | All (40)  | 25 layers | 10 layers |
| 24GB     | All (32) | All (40)  | All (60)  | 20 layers |
| 48GB     | All (32) | All (40)  | All (60)  | All (80)  |
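
If you want a quick starting point instead of reading off the table, a back-of-envelope calculation works. The per-layer sizes below are rough Q4_K_M estimates (assumptions, not measured values), so treat the result as a ceiling: start lower and increase while watching `nvidia-smi`.

```python
# Rough --gpu-layers ceiling from free VRAM. Per-layer costs are estimates for
# Q4_K_M weights only; the headroom covers KV cache and activation buffers.
PER_LAYER_GB = {"7b": 0.13, "13b": 0.20, "34b": 0.33, "70b": 0.50}
TOTAL_LAYERS = {"7b": 32,   "13b": 40,   "34b": 60,   "70b": 80}

def suggest_gpu_layers(model: str, free_vram_gb: float, headroom_gb: float = 4.0) -> int:
    usable = max(free_vram_gb - headroom_gb, 0.0)
    return min(int(usable / PER_LAYER_GB[model]), TOTAL_LAYERS[model])

print(suggest_gpu_layers("70b", free_vram_gb=24.0))   # upper bound on a 24GB card
```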

***

## Performance Benchmarks

### Throughput Comparison (Llama 2 70B, RTX 3090)

| Engine               | GPU Layers            | Tokens/sec   |
| -------------------- | --------------------- | ------------ |
| llama.cpp (GPU only) | 20/80                 | \~4 t/s      |
| llama.cpp (CPU only) | 0/80                  | \~1 t/s      |
| **PowerInfer**       | **20/80 + predictor** | **\~12 t/s** |

{% hint style="success" %}
**3x speedup** over standard llama.cpp for large model inference on consumer GPUs is typical with PowerInfer's neuron-aware scheduling.
{% endhint %}

***

## Running as a Service

Create a systemd service for persistent API serving:

```bash
cat > /etc/systemd/system/powerinfer.service << 'EOF'
[Unit]
Description=PowerInfer LLM Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/app/PowerInfer
ExecStart=/app/PowerInfer/build/bin/server \
  -m /models/llama-2-13b-chat.Q4_K_M.gguf \
  --predictor-path /predictors/llama-2-13b-chat \
  --gpu-layers 30 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable powerinfer
systemctl start powerinfer
systemctl status powerinfer
```

***

## API Usage

Once the server is running, use any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<port>/v1",
    api_key="none"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain neural networks simply"}
    ],
    max_tokens=256
)
print(response.choices[0].message.content)
```
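
Streaming works through the same endpoint, assuming the PowerInfer server streams chat completions the way llama.cpp's server does; if not, drop `stream=True` and read the full reply as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<clore-node-ip>:<port>/v1", api_key="none")

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # print tokens as they arrive
print()
```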

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce GPU layers
./build/bin/main -m model.gguf --gpu-layers 10  # Reduce from 20

# Check what's using VRAM
nvidia-smi

# Find processes holding GPU memory (kill them to free VRAM)
sudo fuser -v /dev/nvidia*
```

### Slow CPU Inference

```bash
# Enable CPU threading optimization
./build/bin/main -m model.gguf --threads $(nproc) --gpu-layers 20

# Check NUMA topology
numactl --hardware

# Pin to NUMA node nearest GPU
numactl --cpunodebind=0 --membind=0 ./build/bin/main -m model.gguf
```

### Build Fails

```bash
# Ensure CUDA toolkit is installed
nvcc --version

# Check CMake version (need 3.14+)
cmake --version

# Clean build
rm -rf build && mkdir build
cd build && cmake .. -DLLAMA_CUBLAS=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
```

{% hint style="danger" %}
**Common issue:** If `cmake` can't find CUDA, set `CUDA_HOME` environment variable: `export CUDA_HOME=/usr/local/cuda` before running cmake.
{% endhint %}

***

## Clore.ai GPU Recommendations

PowerInfer's CPU/GPU hybrid design changes the economics of running large models. Clore.ai servers with high-VRAM GPUs AND fast CPUs are ideal.

| GPU       | VRAM  | Clore.ai Price | Max Model (Q4)           | Throughput (Llama 2 70B Q4) |
| --------- | ----- | -------------- | ------------------------ | --------------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | 70B (with 64GB+ RAM)     | \~8–12 tok/s                |
| RTX 4090  | 24 GB | \~$0.70/hr     | 70B (faster CPU offload) | \~12–18 tok/s               |
| A100 40GB | 40 GB | \~$1.20/hr     | 70B (minimal offload)    | \~35–45 tok/s               |
| A100 80GB | 80 GB | \~$2.00/hr     | 70B Q8 (no CPU offload)  | \~50–60 tok/s               |

{% hint style="info" %}
**PowerInfer sweet spot:** RTX 3090 at \~$0.12/hr running Llama 2 70B Q4 is a breakthrough for budget-conscious users. You get a 70B model at roughly a tenth of the cost of an A100 rental. Throughput is lower (\~10 tok/s), but for research or low-traffic inference it's unbeatable value.
{% endhint %}

**CPU matters as much as GPU:** PowerInfer offloads "cold" neurons to CPU. Clore.ai servers with AMD EPYC or Intel Xeon CPUs (many cores, high memory bandwidth) will outperform single-socket consumer CPUs significantly. Check the server specs before renting for large model work.

**Memory bandwidth bottleneck:** For 70B models, CPU RAM bandwidth is the limiting factor during cold-neuron computation. Servers with DDR5 memory and many memory channels (typical of recent EPYC/Xeon platforms) will see noticeably better throughput.
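
A very rough way to gauge read bandwidth on a candidate server is to time a streaming pass over a large buffer with NumPy (already installed by the Dockerfile above). Real offload throughput also depends on core count and NUMA placement, so treat this only as a sanity check:

```python
import time
import numpy as np

# Stream a 512 MB buffer once and report effective read bandwidth on one core.
buf = np.ones(512 * 1024 * 1024 // 8, dtype=np.float64)
start = time.perf_counter()
checksum = buf.sum()
elapsed = time.perf_counter() - start
print(f"~{buf.nbytes / elapsed / 1e9:.1f} GB/s single-core read (checksum={checksum:.0f})")
```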

***

## Resources

* 🐙 **GitHub:** [github.com/SJTU-IPADS/PowerInfer](https://github.com/SJTU-IPADS/PowerInfer)
* 📄 **Research Paper:** [PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU](https://arxiv.org/abs/2312.12456)
* 🤗 **GGUF Models:** [huggingface.co/TheBloke](https://huggingface.co/TheBloke)
* 🧩 **SJTU IPADS Lab:** [ipads.se.sjtu.edu.cn](https://ipads.se.sjtu.edu.cn)

