Aphrodite Engine

Run Aphrodite Engine for LLM inference on legacy and modern GPUs on Clore.ai

Aphrodite Engine is an optimized LLM inference server built on top of vLLM, specifically tailored for the creative writing and roleplay community. It supports a wide range of GPUs starting from Pascal (GTX 1000 series), making it the perfect choice for running language models on older or budget CLORE.AI GPU servers where other frameworks fail. Aphrodite adds Kobold-compatible APIs, Mirostat sampling, and advanced text sampling algorithms not found in mainstream serving frameworks.

Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 16 GB | 32 GB+ |
| VRAM | 6 GB | 16 GB+ |
| Disk | 40 GB | 150 GB+ |
| GPU | NVIDIA Pascal+ (GTX 1060+) | RTX 3090, A100 |

Aphrodite Engine is one of the few LLM servers supporting Pascal-generation GPUs (GTX 10xx series). This makes it ideal for budget servers on CLORE.AI with older GPUs that have low rental prices.

Quick Deploy on CLORE.AI

Docker Image: alpindale/aphrodite-engine:latest

Ports: 22/tcp, 2242/http

Environment Variables:

| Variable | Example | Description |
| --- | --- | --- |
| HF_TOKEN | hf_xxx... | HuggingFace token for gated models |
| APHRODITE_MODEL | mistralai/Mistral-7B-Instruct-v0.3 | Model to load |

Step-by-Step Setup

1. Rent a GPU Server on CLORE.AI

Aphrodite's wide GPU support lets you grab budget-friendly servers on the CLORE.AI Marketplace:

  • Pascal (GTX 1060–1080 Ti): 6–11 GB VRAM — run small 3B-7B models with quantization

  • Turing (RTX 2000 series): 8–24 GB VRAM — 7B-13B models, better performance

  • Ampere (RTX 3000/A100): 24–80 GB VRAM — 30B-70B models, full speed

  • Ada (RTX 4000 series): 16–24 GB VRAM — best perf/cost ratio

2. Connect via SSH
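The CLORE.AI order panel shows the SSH connection details for your rental. A typical connection looks like this (host and port are placeholders; use the values from your panel):

```shell
# Connect using the IP and SSH port from your CLORE.AI order panel
ssh -p <ssh-port> root@<server-ip>
```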

3. Pull Aphrodite Engine Image
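The image bundles CUDA libraries and is a multi-gigabyte download, so pull it ahead of time:

```shell
docker pull alpindale/aphrodite-engine:latest
```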

4. Launch Aphrodite Engine

Basic launch with a 7B model:
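A minimal sketch, assuming the image's entrypoint reads the APHRODITE_MODEL variable described in the environment table above:

```shell
docker run -d --gpus all \
  --name aphrodite \
  -p 2242:2242 \
  -e APHRODITE_MODEL=mistralai/Mistral-7B-Instruct-v0.3 \
  alpindale/aphrodite-engine:latest
```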

With HuggingFace token (Llama 3):
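Gated models such as Llama 3 need the HF_TOKEN variable from the table above; pass your own token in place of the placeholder:

```shell
docker run -d --gpus all \
  --name aphrodite \
  -p 2242:2242 \
  -e HF_TOKEN=hf_xxx... \
  -e APHRODITE_MODEL=meta-llama/Meta-Llama-3-8B-Instruct \
  alpindale/aphrodite-engine:latest
```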

With GPTQ quantization (for limited VRAM):
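One way to pass engine flags is to override the container command with Aphrodite's vLLM-style OpenAI server module (the module path and the TheBloke model ID are assumptions; verify against your image version):

```shell
docker run -d --gpus all \
  -p 2242:2242 \
  alpindale/aphrodite-engine:latest \
  python -m aphrodite.endpoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
    --quantization gptq \
    --dtype float16 \
    --host 0.0.0.0 --port 2242
```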

With AWQ quantization:
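The same pattern works for AWQ checkpoints (again, the module path and model ID are illustrative):

```shell
docker run -d --gpus all \
  -p 2242:2242 \
  alpindale/aphrodite-engine:latest \
  python -m aphrodite.endpoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --quantization awq \
    --host 0.0.0.0 --port 2242
```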

Running a GGUF model (Aphrodite supports GGUF natively):
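A sketch of a GGUF launch: mount the directory holding the .gguf file and point --model at it (the filename is a placeholder; depending on the model you may also need --tokenizer pointed at the original HF repo):

```shell
docker run -d --gpus all \
  -p 2242:2242 \
  -v /root/models:/models \
  alpindale/aphrodite-engine:latest \
  python -m aphrodite.endpoints.openai.api_server \
    --model /models/mistral-7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 --port 2242
```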

5. Verify the Server
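Once the container is running, query the OpenAI-compatible models endpoint; a JSON response listing your model confirms the server is up:

```shell
curl http://localhost:2242/v1/models
```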

6. Access via CLORE.AI HTTP Proxy

The CLORE.AI order panel provides an http_pub URL for port 2242. Use it in your client applications:
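For example (the URL is a placeholder; copy the real one from your order panel):

```shell
export APHRODITE_URL="https://<your-http-pub-url>"
curl "$APHRODITE_URL/v1/models"
```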


Usage Examples

Example 1: OpenAI-Compatible Chat
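A standard OpenAI-style chat completion via curl; the model name must match exactly what the server loaded:

```shell
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
    "temperature": 0.7
  }'
```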

Example 2: Advanced Sampling with Mirostat

Aphrodite supports Mirostat sampling for more coherent long-form text:
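A sketch of a Mirostat request; the mirostat_mode/mirostat_tau/mirostat_eta parameter names follow the llama.cpp convention that Aphrodite's extended sampling API is understood to mirror (check your server's API docs to confirm):

```shell
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "The old lighthouse keeper climbed the stairs",
    "max_tokens": 256,
    "mirostat_mode": 2,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.1
  }'
```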

Example 3: Kobold-Compatible API

Aphrodite includes a Kobold-compatible endpoint for use with KoboldAI-based frontends:
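A minimal KoboldAI-style request (on some Aphrodite versions the Kobold endpoints must be enabled at launch; check your image's flags):

```shell
curl http://localhost:2242/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_length": 100, "temperature": 0.8}'
```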

Example 4: Python Client with Custom Samplers
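A sketch of a Python client that builds a completion request with Aphrodite's extended samplers. The min_p and repetition_penalty fields are extensions beyond the base OpenAI schema, and the URL is a placeholder for your CLORE.AI http_pub address:

```python
import json

# Placeholder; replace with your server's http_pub URL
BASE_URL = "http://localhost:2242/v1"

def build_completion_request(prompt: str) -> dict:
    """Build a /v1/completions payload with custom sampler settings."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 1.1,
        "min_p": 0.05,              # min-p sampling (Aphrodite extension)
        "repetition_penalty": 1.1,  # discourage loops in long-form text
    }

payload = build_completion_request("The ship drifted toward the reef")
print(json.dumps(payload, indent=2))

# To actually send it (requires a running server):
# import requests
# r = requests.post(f"{BASE_URL}/completions", json=payload, timeout=120)
# print(r.json()["choices"][0]["text"])
```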

Example 5: Batch Completions
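The /v1/completions endpoint accepts a list of prompts, which the server processes as one batched request; each prompt gets its own entry in choices:

```shell
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": ["Write a tagline for a bakery.", "Write a tagline for a gym."],
    "max_tokens": 32
  }'
```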


Configuration

Key Launch Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --model | (required) | Model ID or local path |
| --host | 127.0.0.1 | Bind address |
| --port | 2242 | Server port |
| --dtype | auto | float16, bfloat16, float32 |
| --quantization | none | awq, gptq, squeezellm, fp8 |
| --max-model-len | model max | Override max context length |
| --gpu-memory-utilization | 0.90 | GPU memory fraction |
| --tensor-parallel-size | 1 | Number of GPUs for tensor parallelism |
| --max-num-seqs | 256 | Max concurrent sequences |
| --trust-remote-code | false | Allow custom model code |
| --api-keys | none | Comma-separated API keys for auth |
| --served-model-name | model name | Custom name for API responses |

Adding API Key Authentication
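A launch sketch passing the --api-keys flag from the table above (module path assumed as in the earlier launch examples):

```shell
docker run -d --gpus all \
  -p 2242:2242 \
  alpindale/aphrodite-engine:latest \
  python -m aphrodite.endpoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --api-keys mysecretkey1 \
    --host 0.0.0.0 --port 2242
```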

Then use Authorization: Bearer mysecretkey1 in requests.

Loading Local Models
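Mount a host directory containing the weights into the container and point --model at the mounted path (the directory names are placeholders):

```shell
docker run -d --gpus all \
  -p 2242:2242 \
  -v /root/models:/models \
  alpindale/aphrodite-engine:latest \
  python -m aphrodite.endpoints.openai.api_server \
    --model /models/my-model \
    --host 0.0.0.0 --port 2242
```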


Performance Tips

1. Choose the Right Quantization for Your GPU

| GPU VRAM | 7B Model | 13B Model | 30B Model |
| --- | --- | --- | --- |
| 6 GB | GPTQ/AWQ Q4 | - | - |
| 8 GB | GPTQ Q4 | GPTQ Q4 (tight) | - |
| 12 GB | Float16 | GPTQ Q4 | - |
| 16 GB | Float16 | Float16 | GPTQ Q4 |
| 24 GB | Float16 | Float16 | GPTQ Q4 |
| 48 GB | Float16 | Float16 | Float16 |

2. Tune GPU Memory Utilization

Start with a lower value and raise it gradually as long as you don't hit OOM errors.
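For example, a conservative starting point to append to the server command:

```shell
--gpu-memory-utilization 0.85
```

If the model loads and serves requests without OOM errors, step toward 0.95 to reclaim more VRAM for KV cache.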

3. Use bfloat16 on Ampere+ GPUs

Better numerical stability than float16, same speed.
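Simply append the dtype flag to the server command:

```shell
--dtype bfloat16
```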

4. Optimize for Roleplay/Creative Writing

These samplers work well for narrative text:
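An illustrative request-body fragment; the exact values are starting points to tune, not fixed recommendations, and min_p/mirostat_* are Aphrodite sampler extensions:

```json
{
  "temperature": 1.05,
  "min_p": 0.05,
  "repetition_penalty": 1.1,
  "mirostat_mode": 2,
  "mirostat_tau": 5.0
}
```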

5. Pascal GPU Tips (GTX 10xx)

For Pascal GPUs, avoid Flash Attention (not supported):
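A launch sketch for Pascal: force float16 (Pascal has no bfloat16 support) and select a non-Flash attention backend. The APHRODITE_ATTENTION_BACKEND variable name is an assumption, modeled on vLLM's VLLM_ATTENTION_BACKEND; verify it against your image version:

```shell
docker run -d --gpus all \
  -p 2242:2242 \
  -e APHRODITE_ATTENTION_BACKEND=XFORMERS \
  -e APHRODITE_MODEL=TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
  alpindale/aphrodite-engine:latest
```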


Troubleshooting

Problem: "CUDA capability sm_6x not supported"

Pascal GPUs require special handling. Use:
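As a starting point, append these flags to the server command: float16 (Pascal lacks bfloat16) and eager mode, which skips CUDA graph capture:

```shell
--dtype float16 --enforce-eager
```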

If still failing, check if the image version supports Pascal:
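You can inspect which compute capabilities the container's PyTorch build was compiled for; sm_60/sm_61 must appear in the arch list for Pascal to work:

```shell
docker run --rm --gpus all alpindale/aphrodite-engine:latest \
  python -c "import torch; print(torch.cuda.get_device_capability()); print(torch.cuda.get_arch_list())"
```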

Problem: "out of memory" on small GPUs
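Reduce memory pressure by appending some combination of these flags: a shorter context, a lower memory fraction, quantized weights, and eager mode (no CUDA graph memory overhead):

```shell
--max-model-len 4096 \
--gpu-memory-utilization 0.85 \
--quantization awq \
--enforce-eager
```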

Problem: Slow token generation

  • Check that GPU is actually being used: nvidia-smi inside container

  • Tune --max-num-seqs: lower values (e.g. 64) free VRAM for KV cache on small GPUs, higher values increase batch throughput

  • Use AWQ instead of GPTQ (faster inference)

Problem: Model not found / 404 errors

Always check your model name matches exactly:
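List what the server actually registered:

```shell
curl http://localhost:2242/v1/models | python -m json.tool
```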

Use the exact model name from the response in your requests.

Problem: Repetitive output

Add repetition penalty:
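An illustrative request-body fragment; values between 1.05 and 1.2 for repetition_penalty are typical starting points:

```json
{
  "repetition_penalty": 1.15,
  "frequency_penalty": 0.2
}
```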

Problem: Docker container exits silently
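Inspect the stopped container's logs to find the actual error; OOM kills, unsupported GPU architectures, and bad model IDs all show up near the end of the log (the container name assumes you launched with --name aphrodite):

```shell
docker logs --tail 50 aphrodite
```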



Clore.ai GPU Recommendations

| Use Case | Recommended GPU | Est. Cost on Clore.ai |
| --- | --- | --- |
| Development/Testing | RTX 3090 (24GB) | ~$0.12/gpu/hr |
| Production (7B–13B) | RTX 4090 (24GB) | ~$0.70/gpu/hr |
| Large Models (70B+) | A100 80GB / H100 | ~$1.20/gpu/hr |

💡 All examples in this guide can be deployed on Clore.ai GPU servers. Browse available GPUs and rent by the hour: no commitments, full root access.
