PowerInfer

CPU/GPU hybrid LLM inference exploiting activation locality — run 70B parameter models on a single consumer GPU by intelligently splitting computation between CPU and GPU.

🌟 8,000+ GitHub stars | Developed at SJTU IPADS | MIT License


What is PowerInfer?

PowerInfer is a high-performance inference engine for Large Language Models that exploits a key insight: LLMs exhibit strong activation locality — a small subset of neurons ("hot neurons") are consistently activated across most inference steps, while the majority remain inactive.

PowerInfer uses this property to:

  1. Keep hot neurons on GPU for fast computation

  2. Offload cold neurons to CPU/RAM without significant quality loss

  3. Dynamically route computation between CPU and GPU based on activation patterns

The result: you can run a 70B model with only 16GB of VRAM, instead of the 140GB+ needed to hold the full model on GPU.

Key Capabilities

  • Consumer GPU support — RTX 3090/4090 can run 70B models

  • Neuron-aware scheduling — an activation predictor decides CPU vs. GPU routing at each inference step

  • Minimal quality degradation — maintains >95% of full-precision quality

  • llama.cpp compatibility — GGUF format support

  • NUMA-aware CPU offloading — optimized for high core count CPUs

Why Use PowerInfer on Clore.ai?

Clore.ai rents GPUs at far lower cost than cloud alternatives. With PowerInfer:

  • Run Llama 2 70B on a single RTX 4090 (24GB VRAM)

  • Slash GPU rental costs vs. multi-GPU setups

  • Process long context windows with CPU RAM as overflow

  • Run models previously requiring expensive A100/H100 instances


Hardware Requirements

| Model Size | Min VRAM | Recommended RAM | Performance |
| --- | --- | --- | --- |
| 7B | 4GB | 16GB | Excellent |
| 13B | 6GB | 32GB | Very Good |
| 34B | 12GB | 64GB | Good |
| 70B | 16GB | 128GB | Moderate |


CPU matters: PowerInfer offloads cold neurons to CPU. A high core-count CPU (AMD EPYC, Intel Xeon) with fast memory bandwidth significantly improves throughput for large models.


Quick Start on Clore.ai

Step 1: Choose Your Server

On the clore.ai marketplace, filter for:

  • NVIDIA GPU with 16GB+ VRAM (RTX 3090, RTX 4090, A100)

  • High CPU core count (16+ cores ideal)

  • 64GB+ RAM for 70B models, 32GB for 13B models

Step 2: Create Custom Docker Image

PowerInfer requires a custom Docker setup. Use this Dockerfile:
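A minimal sketch of such a Dockerfile is below. The CUDA base-image tag, the PowerInfer repository layout, and the CMake flags are assumptions — match the CUDA version to the host driver on your rented server and check the project README for current build instructions:

```docker
# Sketch only — adjust the CUDA tag to the host driver version.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git cmake build-essential python3 python3-pip openssh-server \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /run/sshd

# Build PowerInfer with CUDA offload enabled
RUN git clone https://github.com/SJTU-IPADS/PowerInfer /opt/PowerInfer \
    && cd /opt/PowerInfer \
    && cmake -S . -B build -DLLAMA_CUBLAS=ON \
    && cmake --build build --config Release -j

WORKDIR /opt/PowerInfer
CMD ["/usr/sbin/sshd", "-D"]
```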

Build and push to Docker Hub or use inline with Clore.ai:
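For example, assuming the Dockerfile above sits in the current directory and you are logged in to Docker Hub as `yourname` (a placeholder account):

```shell
# Build the image locally, then push so Clore.ai can pull it
docker build -t yourname/powerinfer:latest .
docker push yourname/powerinfer:latest
```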

Step 3: Deploy on Clore.ai

In your Clore.ai order, set:

  • Docker image: yourname/powerinfer:latest

  • Ports: 22 (SSH)

  • Environment: NVIDIA_VISIBLE_DEVICES=all


Building PowerInfer from Source

If you prefer to build inside the container:
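A sketch of the llama.cpp-style CMake build PowerInfer uses; exact flag names can differ between releases, so check the project README:

```shell
# Clone and build PowerInfer with CUDA offload
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt           # conversion/predictor tooling, if present
cmake -S . -B build -DLLAMA_CUBLAS=ON     # enable CUDA support
cmake --build build --config Release -j
```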

Verify Build
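The binary paths below follow llama.cpp conventions and are assumptions:

```shell
# The main binary should print usage without errors
./build/bin/main --help

# Confirm the GPU is visible inside the container
nvidia-smi
```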


Getting Models

Download GGUF Models

PowerInfer uses GGUF format (same as llama.cpp):
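For example, using `huggingface-cli`. The repository name (`PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF`) is illustrative — check the project's Hugging Face page for currently published models:

```shell
# Install the Hugging Face CLI, then pull a PowerInfer-format GGUF model
pip install -U "huggingface_hub[cli]"
huggingface-cli download PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF \
    --local-dir ./models/ReluLLaMA-7B
```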

Generate Neuron Predictor (Required for PowerInfer)

PowerInfer needs a neuron activation predictor for each model. This is the key differentiator from llama.cpp:
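A hedged sketch of the conversion step: PowerInfer ships a `convert.py` that merges the original model weights with per-layer predictor weights into a single PowerInfer GGUF. The paths below are placeholders:

```shell
# Merge original weights + predictor weights into one PowerInfer GGUF
python convert.py \
    --outfile ./models/llama-7b.powerinfer.gguf \
    /path/to/original/model \
    /path/to/predictor/weights
```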


Running Inference

Basic Inference (No Predictor)

For testing without predictor generation (standard GPU/CPU split):
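A plain llama.cpp-style split: pin some layers on GPU, run the rest on CPU. In llama.cpp-derived builds the layer flag is often spelled `-ngl`/`--n-gpu-layers`; model path and values are placeholders:

```shell
# Standard GPU/CPU layer split, no predictor involved
./build/bin/main \
    -m ./models/llama-2-13b.Q4_K_M.gguf \
    -ngl 20 -t 16 -n 256 \
    -p "Explain activation locality in one paragraph."
```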

PowerInfer Mode (With Predictor)

Full PowerInfer mode with neuron-aware routing:
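For example (flag spelling taken from the project README and may change between releases): `--vram-budget` caps GPU memory in GiB so PowerInfer can keep hot neurons on the GPU and spill cold ones to CPU:

```shell
# Neuron-aware routing: PowerInfer places hot neurons within the VRAM budget
./build/bin/main \
    -m ./models/llama-70b.powerinfer.gguf \
    -t 16 --vram-budget 22 -n 256 \
    -p "Explain activation locality in one paragraph."
```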

Interactive Chat Mode
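The llama.cpp-style interactive flags are assumed to carry over to PowerInfer:

```shell
# Interactive chat: -i drops into a REPL, -r sets the reverse prompt
./build/bin/main \
    -m ./models/llama-70b.powerinfer.gguf \
    -t 16 --vram-budget 22 \
    -i --interactive-first -r "User:"
```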

Server Mode (OpenAI-compatible API)
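This assumes the build produces a llama.cpp-style `server` binary exposing an OpenAI-compatible HTTP API; port 8080 is an arbitrary choice:

```shell
# Serve the model over HTTP; bind 0.0.0.0 so the Clore.ai port mapping works
./build/bin/server \
    -m ./models/llama-70b.powerinfer.gguf \
    -t 16 --vram-budget 22 \
    --host 0.0.0.0 --port 8080
```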


Optimizing GPU Layer Split

The --gpu-layers parameter determines how many transformer layers to keep on GPU. Tune this based on your VRAM:
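For example, pinning 20 of a 70B model's 80 layers on a 24GB GPU (in llama.cpp-derived builds the flag is commonly spelled `-ngl`/`--n-gpu-layers`; paths are placeholders):

```shell
# Keep 20 transformer layers resident on GPU, offload the remaining 60 to CPU
./build/bin/main -m ./models/llama-70b.gguf -ngl 20 -p "Hello"
```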

Layer allocation guide:

| GPU VRAM | 7B Model | 13B Model | 34B Model | 70B Model |
| --- | --- | --- | --- | --- |
| 8GB | All (32) | 20 layers | 10 layers | 4 layers |
| 16GB | All (32) | All (40) | 25 layers | 10 layers |
| 24GB | All (32) | All (40) | All (60) | 20 layers |
| 48GB | All (32) | All (40) | All (60) | All (80) |


Performance Benchmarks

Throughput Comparison (Llama 2 70B, RTX 3090)

| Engine | GPU Layers | Tokens/sec |
| --- | --- | --- |
| llama.cpp (GPU only) | 20/80 | ~4 t/s |
| llama.cpp (CPU only) | 0/80 | ~1 t/s |
| PowerInfer | 20/80 + predictor | ~12 t/s |


Running as a Service

Create a systemd service for persistent API serving:
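A sketch of such a unit file — paths, flags, and the model name are placeholders from the earlier steps:

```ini
# /etc/systemd/system/powerinfer.service
[Unit]
Description=PowerInfer OpenAI-compatible API server
After=network.target

[Service]
WorkingDirectory=/opt/PowerInfer
ExecStart=/opt/PowerInfer/build/bin/server \
    -m /opt/models/llama-70b.powerinfer.gguf \
    -t 16 --vram-budget 22 --host 0.0.0.0 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now powerinfer`.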


API Usage

Once the server is running, use any OpenAI-compatible client:
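A minimal Python sketch using only the standard library; the endpoint path follows the OpenAI chat-completions convention, and the port and model name are assumptions carried over from the server step:

```python
import json
import urllib.request

# Hypothetical endpoint — matches the --port used when starting the server
API_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt, model="llama-2-70b", max_tokens=128):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt):
    """POST the payload and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any OpenAI-compatible SDK pointed at the same base URL should work equally well.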


Troubleshooting

CUDA Out of Memory
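If the GPU runs out of memory, shrink the GPU footprint — fewer resident layers or a lower VRAM budget (flag spellings as used elsewhere in this guide; values are illustrative):

```shell
# Fewer layers resident on GPU (llama.cpp-style split)
./build/bin/main -m ./models/llama-70b.gguf -ngl 10 -p "Hello"

# Or lower the VRAM cap in PowerInfer mode
./build/bin/main -m ./models/llama-70b.powerinfer.gguf --vram-budget 12 -p "Hello"
```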

Slow CPU Inference
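Cold-neuron throughput depends on thread count and memory locality. Match `-t` to physical cores, and on multi-socket (NUMA) servers pin both CPU and memory to one node; `numactl` availability is an assumption:

```shell
# Check how many cores the rented server exposes
nproc

# Pin compute and memory to NUMA node 0 to avoid cross-socket traffic
numactl --cpunodebind=0 --membind=0 \
    ./build/bin/main -m ./models/llama-70b.powerinfer.gguf -t 16 -p "Hello"
```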

Build Fails
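Common causes are a missing CUDA toolkit or a stale CMake cache; a sketch of the usual checks:

```shell
# Confirm the CUDA compiler is installed and on PATH
nvcc --version

# Clean reconfigure and rebuild
rm -rf build
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release -j
```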


Clore.ai GPU Recommendations

PowerInfer's CPU/GPU hybrid design changes the economics of running large models. Clore.ai servers with high-VRAM GPUs AND fast CPUs are ideal.

| GPU | VRAM | Clore.ai Price | Max Model (Q4) | Throughput (Llama 2 70B Q4) |
| --- | --- | --- | --- | --- |
| RTX 3090 | 24 GB | ~$0.12/hr | 70B (with 64GB+ RAM) | ~8–12 tok/s |
| RTX 4090 | 24 GB | ~$0.70/hr | 70B (faster CPU offload) | ~12–18 tok/s |
| A100 40GB | 40 GB | ~$1.20/hr | 70B (minimal offload) | ~35–45 tok/s |
| A100 80GB | 80 GB | ~$2.00/hr | 70B full precision | ~50–60 tok/s |


PowerInfer sweet spot: an RTX 3090 at ~$0.12/hr running Llama 2 70B Q4 is a breakthrough for budget-conscious users. You get a 70B model at roughly a tenth of the cost of an A100 rental. Throughput is lower (~10 tok/s), but for research or low-traffic inference it's unbeatable value.

CPU matters as much as GPU: PowerInfer offloads "cold" neurons to CPU. Clore.ai servers with AMD EPYC or Intel Xeon CPUs (many cores, high memory bandwidth) will outperform single-socket consumer CPUs significantly. Check the server specs before renting for large model work.

Memory bandwidth bottleneck: For 70B models, CPU RAM bandwidth is the limiting factor during cold-neuron computation. Servers with DDR5 memory and many populated channels will see better throughput.


Resources
