# PowerInfer

**CPU/GPU hybrid LLM inference exploiting activation locality** — run 70B parameter models on a single consumer GPU by intelligently splitting computation between CPU and GPU.

> 🌟 **8,000+ GitHub stars** | Developed at SJTU IPADS | MIT License

***

## What is PowerInfer?

PowerInfer is a high-performance inference engine for Large Language Models that exploits a key insight: **LLMs exhibit strong activation locality** — a small subset of neurons ("hot neurons") are consistently activated across most inference steps, while the majority remain inactive.

PowerInfer uses this property to:

1. **Keep hot neurons on GPU** for fast computation
2. **Offload cold neurons to CPU/RAM** without significant quality loss
3. **Dynamically route** computation between CPU and GPU based on activation patterns

The result: you can run a 70B model with only **16GB VRAM**, instead of the 140GB+ needed to hold a 70B model's FP16 weights (70B parameters × 2 bytes) entirely on GPU.

### Key Capabilities

* **Consumer GPU support** — RTX 3090/4090 can run 70B models
* **Neuron-aware scheduling** — activation predictors decide at inference time which neurons run on GPU and which on CPU
* **Minimal quality degradation** — maintains >95% of full-precision quality
* **llama.cpp compatibility** — GGUF format support
* **NUMA-aware CPU offloading** — optimized for high core count CPUs

### Why Use PowerInfer on Clore.ai?

Clore.ai rents GPUs at far lower cost than cloud alternatives. With PowerInfer:

* Run **Llama 2 70B** on a **single RTX 4090** (24GB VRAM)
* Slash GPU rental costs vs. multi-GPU setups
* Process long context windows with CPU RAM as overflow
* Run models previously requiring expensive A100/H100 instances

***

## Hardware Requirements

| Model Size | Min VRAM | Recommended RAM | Performance |
| ---------- | -------- | --------------- | ----------- |
| 7B         | 4GB      | 16GB            | Excellent   |
| 13B        | 6GB      | 32GB            | Very Good   |
| 34B        | 12GB     | 64GB            | Good        |
| 70B        | 16GB     | 128GB           | Moderate    |

{% hint style="info" %}
**CPU matters:** PowerInfer offloads cold neurons to CPU. A high core-count CPU (AMD EPYC, Intel Xeon) with fast memory bandwidth significantly improves throughput for large models.
{% endhint %}

***

## Quick Start on Clore.ai

### Step 1: Choose Your Server

On [clore.ai](https://clore.ai) marketplace, filter for:

* **NVIDIA GPU** with 16GB+ VRAM (RTX 3090, RTX 4090, A100)
* **High CPU core count** (16+ cores ideal)
* **64GB+ RAM** for 70B models, 32GB for 13B models

### Step 2: Create Custom Docker Image

PowerInfer requires a custom Docker setup. Use this `Dockerfile`:

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    python3 \
    python3-pip \
    curl \
    wget \
    openssh-server \
    && rm -rf /var/lib/apt/lists/*

# Configure SSH
RUN mkdir /var/run/sshd && \
    echo 'root:powerinfer' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# Clone and build PowerInfer
RUN git clone https://github.com/SJTU-IPADS/PowerInfer.git /app/PowerInfer
WORKDIR /app/PowerInfer

RUN mkdir build && cd build && \
    cmake .. -DLLAMA_CUBLAS=ON && \
    cmake --build . --config Release -j$(nproc)

# Install Python dependencies for solver
RUN pip3 install torch numpy scipy

EXPOSE 22

CMD ["/bin/bash", "-c", "service ssh start && tail -f /dev/null"]
```

Build the image and push it to Docker Hub (or another registry your Clore.ai order can pull from):

```bash
docker build -t yourname/powerinfer:latest .
docker push yourname/powerinfer:latest
```

### Step 3: Deploy on Clore.ai

In your Clore.ai order, set:

* **Docker image:** `yourname/powerinfer:latest`
* **Ports:** `22` (SSH)
* **Environment:** `NVIDIA_VISIBLE_DEVICES=all`
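
Before deploying, you can sanity-check the image on any machine with an NVIDIA GPU and the NVIDIA Container Toolkit installed. A minimal sketch, using the image name and SSH password from the Dockerfile above:

```bash
# Run the image with GPU access and expose SSH on port 2222
docker run -d --gpus all -p 2222:22 --name powerinfer-test yourname/powerinfer:latest

# Confirm the GPU is visible inside the container
docker exec powerinfer-test nvidia-smi

# SSH in (password: powerinfer, as set in the Dockerfile)
ssh root@localhost -p 2222
```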

***

## Building PowerInfer from Source

If you prefer to build inside the container:

```bash
# SSH into your Clore.ai server
ssh root@<clore-node-ip> -p <ssh-port>

# Install prerequisites
apt-get update && apt-get install -y git cmake build-essential python3 python3-pip

# Clone PowerInfer
git clone https://github.com/SJTU-IPADS/PowerInfer.git
cd PowerInfer

# Build with CUDA support
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)

echo "Build complete!"
ls -la bin/
```

### Verify Build

```bash
./build/bin/main --help
# Should output PowerInfer CLI help
```
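
To confirm the binary was actually built with CUDA support rather than falling back to CPU-only, you can check which libraries it links against. A quick sketch (assumes dynamic linking, which is the default):

```bash
# A CUDA-enabled build links against cuBLAS and the CUDA runtime
ldd ./build/bin/main | grep -iE 'cublas|cudart'
```

If nothing prints, rerun CMake with `-DLLAMA_CUBLAS=ON` and check its output for CUDA detection errors.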

***

## Getting Models

### Download GGUF Models

PowerInfer uses GGUF format (same as llama.cpp):

```bash
# Install HuggingFace CLI
pip3 install huggingface_hub

# Download Llama 2 7B Q4 (recommended for testing)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir ./models

# Download Llama 2 70B Q4 (requires 16GB+ VRAM)  
huggingface-cli download TheBloke/Llama-2-70B-Chat-GGUF \
  llama-2-70b-chat.Q4_K_M.gguf \
  --local-dir ./models
```
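
The 70B Q4_K_M file is roughly 40GB, so make sure the volume has enough room before starting the download, and verify the files afterwards:

```bash
# Free disk space on the current volume
df -h .

# Downloaded model files and their sizes
ls -lh ./models/
```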

### Generate Neuron Predictor (Required for PowerInfer)

PowerInfer needs a neuron activation predictor for each model. This is the key differentiator from llama.cpp:

```bash
# Install Python solver dependencies
pip3 install torch numpy scipy

# Generate predictor for your model
python3 PowerInfer/solver/solve.py \
  --model ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --output ./predictors/llama-2-7b-chat \
  --target-gpu-layers 20 \
  --gpu-memory-gb 16

# This creates predictor files in ./predictors/
ls ./predictors/llama-2-7b-chat/
```

{% hint style="warning" %}
**Predictor generation time:** Creating a neuron predictor can take 30–60 minutes depending on model size. This is a one-time operation — the predictor is reused on subsequent runs.
{% endhint %}

***

## Running Inference

### Basic Inference (No Predictor)

For testing without predictor generation (standard GPU/CPU split):

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --gpu-layers 20 \
  -p "Tell me about quantum computing" \
  -n 256
```

### PowerInfer Mode (With Predictor)

Full PowerInfer mode with neuron-aware routing:

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  -p "What is the meaning of life?" \
  -n 512 \
  --ctx-size 4096
```

### Interactive Chat Mode

```bash
./build/bin/main \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  -i \
  --ctx-size 4096 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  --color
```

### Server Mode (OpenAI-compatible API)

```bash
./build/bin/server \
  -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-7b-chat \
  --gpu-layers 20 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096
```
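
Once the server reports it is listening, you can exercise it directly with `curl`. This sketch assumes the OpenAI-compatible chat route shown in the API Usage section below; if your build only exposes the native llama.cpp endpoints, adjust the path accordingly:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```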

***

## Optimizing GPU Layer Split

The `--gpu-layers` parameter determines how many transformer layers to keep on GPU. Tune this based on your VRAM:

```bash
# Check available VRAM
nvidia-smi --query-gpu=memory.free,memory.total --format=csv

# Rule of thumb for Q4_K_M weights (leave headroom for KV cache and overhead):
# 7B:  ~0.13GB per layer → 24GB card fits all 32 layers
# 13B: ~0.2GB per layer  → 24GB card fits all 40 layers
# 70B: ~0.5GB per layer  → 24GB card fits ~40 of 80 layers by weights alone;
#      start around 20 and increase until VRAM is nearly full (see table below)
```

**Layer allocation guide:**

| GPU VRAM | 7B Model | 13B Model | 34B Model | 70B Model |
| -------- | -------- | --------- | --------- | --------- |
| 8GB      | All (32) | 20 layers | 10 layers | 4 layers  |
| 16GB     | All (32) | All (40)  | 25 layers | 10 layers |
| 24GB     | All (32) | All (40)  | All (60)  | 20 layers |
| 48GB     | All (32) | All (40)  | All (60)  | All (80)  |
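
As a starting point, you can derive a conservative `--gpu-layers` value from the free VRAM reported by `nvidia-smi`. This is a rough helper sketch, not part of PowerInfer; `PER_LAYER_MB` and `HEADROOM_MB` are assumptions based on the per-layer estimates above, so treat the result as a value to tune, not an exact answer:

```bash
#!/bin/bash
# Rough --gpu-layers estimate: (free VRAM - headroom) / per-layer cost
PER_LAYER_MB=500        # ~0.5GB per layer for a 70B Q4 model (see table above)
HEADROOM_MB=2048        # reserve ~2GB for KV cache and CUDA overhead

FREE_MB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
LAYERS=$(( (FREE_MB - HEADROOM_MB) / PER_LAYER_MB ))

echo "Free VRAM: ${FREE_MB} MiB -> suggested --gpu-layers: ${LAYERS}"
```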

***

## Performance Benchmarks

### Throughput Comparison (Llama 2 70B, RTX 3090)

| Engine               | GPU Layers            | Tokens/sec   |
| -------------------- | --------------------- | ------------ |
| llama.cpp (GPU only) | 20/80                 | \~4 t/s      |
| llama.cpp (CPU only) | 0/80                  | \~1 t/s      |
| **PowerInfer**       | **20/80 + predictor** | **\~12 t/s** |

{% hint style="success" %}
**3x speedup** over standard llama.cpp for large model inference on consumer GPUs is typical with PowerInfer's neuron-aware scheduling.
{% endhint %}
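
To sanity-check these numbers on your own rental, run a fixed-length generation and read the timing summary printed at the end of the run. A sketch, assuming the 70B model and its predictor were set up as described in the earlier sections:

```bash
./build/bin/main \
  -m ./models/llama-2-70b-chat.Q4_K_M.gguf \
  --predictor-path ./predictors/llama-2-70b-chat \
  --gpu-layers 20 \
  -p "Summarize the theory of relativity." \
  -n 128
# Look for the eval-time / tokens-per-second line in the timing output
```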

***

## Running as a Service

Create a systemd service for persistent API serving:

```bash
cat > /etc/systemd/system/powerinfer.service << 'EOF'
[Unit]
Description=PowerInfer LLM Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/app/PowerInfer
ExecStart=/app/PowerInfer/build/bin/server \
  -m /models/llama-2-13b-chat.Q4_K_M.gguf \
  --predictor-path /predictors/llama-2-13b-chat \
  --gpu-layers 30 \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable powerinfer
systemctl start powerinfer
systemctl status powerinfer
```
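
After starting the service, confirm it stays up and keep an eye on its logs with standard systemd tooling:

```bash
# Follow the server logs
journalctl -u powerinfer -f

# Confirm the API port is listening
ss -tlnp | grep 8080
```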

***

## API Usage

Once the server is running, use any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://<clore-node-ip>:<port>/v1",
    api_key="none"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain neural networks simply"}
    ],
    max_tokens=256
)
print(response.choices[0].message.content)
```

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce GPU layers
./build/bin/main -m model.gguf --gpu-layers 10  # Reduce from 20

# Check what's using VRAM
nvidia-smi

# Identify (and, if safe, kill) processes holding GPU memory
sudo fuser -v /dev/nvidia*
```

### Slow CPU Inference

```bash
# Enable CPU threading optimization
./build/bin/main -m model.gguf --threads $(nproc) --gpu-layers 20

# Check NUMA topology
numactl --hardware

# Pin to NUMA node nearest GPU
numactl --cpunodebind=0 --membind=0 ./build/bin/main -m model.gguf
```

### Build Fails

```bash
# Ensure CUDA toolkit is installed
nvcc --version

# Check CMake version (need 3.14+)
cmake --version

# Clean build
rm -rf build && mkdir build
cd build && cmake .. -DLLAMA_CUBLAS=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
```

{% hint style="danger" %}
**Common issue:** If `cmake` can't find CUDA, make sure `nvcc` is on your `PATH` (`export PATH=/usr/local/cuda/bin:$PATH`) or pass `-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda` as shown in the clean-build command above.
{% endhint %}

***

## Clore.ai GPU Recommendations

PowerInfer's CPU/GPU hybrid design changes the economics of running large models. Clore.ai servers with high-VRAM GPUs AND fast CPUs are ideal.

| GPU       | VRAM  | Clore.ai Price | Max Model (Q4)           | Throughput (Llama 2 70B Q4) |
| --------- | ----- | -------------- | ------------------------ | --------------------------- |
| RTX 3090  | 24 GB | \~$0.12/hr     | 70B (with 64GB+ RAM)     | \~8–12 tok/s                |
| RTX 4090  | 24 GB | \~$0.70/hr     | 70B (faster CPU offload) | \~12–18 tok/s               |
| A100 40GB | 40 GB | \~$1.20/hr     | 70B (minimal offload)    | \~35–45 tok/s               |
| A100 80GB | 80 GB | \~$2.00/hr     | 70B full precision       | \~50–60 tok/s               |

{% hint style="info" %}
**PowerInfer sweet spot:** RTX 3090 at \~$0.12/hr running Llama 2 70B Q4 is a breakthrough for budget-conscious users. You get a 70B model for 10–12× less than an A100 rental cost. Throughput is lower (\~10 tok/s), but for research or low-traffic inference it's unbeatable value.
{% endhint %}

**CPU matters as much as GPU:** PowerInfer offloads "cold" neurons to CPU. Clore.ai servers with AMD EPYC or Intel Xeon CPUs (many cores, high memory bandwidth) will outperform single-socket consumer CPUs significantly. Check the server specs before renting for large model work.

**Memory bandwidth bottleneck:** For 70B models, CPU RAM bandwidth is the limiting factor during cold-neuron computation. Servers with DDR5 and many populated memory channels (typical of EPYC/Xeon platforms) will see noticeably better throughput.
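
Before committing to a long rental for 70B work, it is worth a quick look at the CPU and memory side of the server. Standard Linux tooling is enough; note that `dmidecode` needs root and may not work inside a container:

```bash
# Core count, sockets, and NUMA layout
lscpu | grep -E 'Model name|Socket|Core|NUMA'

# Installed and free RAM
free -h

# DIMM sizes and speeds (requires root; may be unavailable in containers)
sudo dmidecode -t memory | grep -E 'Size|Speed|Locator' | head -n 20
```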

***

## Resources

* 🐙 **GitHub:** [github.com/SJTU-IPADS/PowerInfer](https://github.com/SJTU-IPADS/PowerInfer)
* 📄 **Research Paper:** [PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU](https://arxiv.org/abs/2312.12456)
* 🤗 **GGUF Models:** [huggingface.co/TheBloke](https://huggingface.co/TheBloke)
* 🧩 **SJTU IPADS Lab:** [ipads.se.sjtu.edu.cn](https://ipads.se.sjtu.edu.cn)
