# Aphrodite Engine

Aphrodite Engine is an optimized LLM inference server built on top of vLLM and tailored for the creative-writing and roleplay community. It supports a wide range of GPUs going back to Pascal (GTX 1000 series), which makes it a strong choice for older or budget CLORE.AI GPU servers where many mainstream frameworks won't run. Aphrodite adds Kobold-compatible APIs and advanced sampling algorithms, such as Mirostat, that most mainstream serving frameworks lack.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter | Minimum                    | Recommended    |
| --------- | -------------------------- | -------------- |
| RAM       | 16 GB                      | 32 GB+         |
| VRAM      | 6 GB                       | 16 GB+         |
| Disk      | 40 GB                      | 150 GB+        |
| GPU       | NVIDIA Pascal+ (GTX 1060+) | RTX 3090, A100 |

{% hint style="info" %}
Aphrodite Engine is one of the few LLM servers that still supports Pascal-generation GPUs (GTX 10xx series), making it ideal for older, low-priced GPUs on CLORE.AI.
{% endhint %}

## Quick Deploy on CLORE.AI

**Docker Image:** `alpindale/aphrodite-engine:latest`

**Ports:** `22/tcp`, `2242/http`

**Environment Variables:**

| Variable          | Example                              | Description                        |
| ----------------- | ------------------------------------ | ---------------------------------- |
| `HF_TOKEN`        | `hf_xxx...`                          | HuggingFace token for gated models |
| `APHRODITE_MODEL` | `mistralai/Mistral-7B-Instruct-v0.3` | Model to load                      |

## Step-by-Step Setup

### 1. Rent a GPU Server on CLORE.AI

Aphrodite's wide GPU support lets you pick budget-friendly servers on [CLORE.AI Marketplace](https://clore.ai/marketplace); a rough VRAM sizing sketch follows the list:

* **Pascal (GTX 1060–1080 Ti)**: 6–11 GB VRAM — run small 3B-7B models with quantization
* **Turing (RTX 2000 series)**: 8–24 GB VRAM — 7B-13B models, better performance
* **Ampere (RTX 3000/A100)**: 24–80 GB VRAM — 30B-70B models, full speed
* **Ada (RTX 4000 series)**: 16–24 GB VRAM — best perf/cost ratio
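
As a rough sizing rule, weight memory is parameter count × bytes per weight (2 bytes for float16, about 0.5 for 4-bit GPTQ/AWQ), plus headroom for KV cache and CUDA context. A back-of-the-envelope sketch (the `estimate_vram` helper is illustrative, not part of Aphrodite):

```bash
# Rough VRAM needed for a model: params(B) × bits/8, plus ~20% for KV cache and CUDA context
estimate_vram() {
  local params_b=$1 bits=$2
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "~%.1f GB\n", p * b / 8 * 1.2 }'
}
estimate_vram 7 16   # 7B at float16 -> ~16.8 GB (24 GB class)
estimate_vram 7 4    # 7B at 4-bit   -> ~4.2 GB  (fits a GTX 1060)
estimate_vram 13 4   # 13B at 4-bit  -> ~7.8 GB  (fits 8-12 GB cards)
```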

### 2. Connect via SSH

```bash
ssh -p <PORT> root@<SERVER_IP>
```

### 3. Pull Aphrodite Engine Image

```bash
docker pull alpindale/aphrodite-engine:latest
```

### 4. Launch Aphrodite Engine

**Basic launch with a 7B model:**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --host 0.0.0.0 \
    --port 2242 \
    --max-model-len 4096
```

**With HuggingFace token (Llama 3):**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  -e HF_TOKEN=hf_your_token_here \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 2242 \
    --dtype bfloat16 \
    --max-model-len 8192
```

**With GPTQ quantization (for limited VRAM):**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
    --host 0.0.0.0 \
    --port 2242 \
    --quantization gptq \
    --max-model-len 4096
```

**With AWQ quantization:**

```bash
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model casperhansen/mistral-7b-instruct-v0.2-awq \
    --host 0.0.0.0 \
    --port 2242 \
    --quantization awq \
    --max-model-len 4096
```

**Running a GGUF model (Aphrodite supports GGUF natively):**

```bash
# First download the GGUF file into the host models directory
# (uses a throwaway container; the image already ships huggingface_hub)
docker run --rm \
  -v /root/models:/models \
  alpindale/aphrodite-engine:latest \
  python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download(
    repo_id='TheBloke/Mistral-7B-Instruct-v0.2-GGUF',
    filename='mistral-7b-instruct-v0.2.Q4_K_M.gguf',
    local_dir='/models/mistral-gguf')"

# Then launch with GGUF
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /root/models:/models \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model /models/mistral-gguf/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 2242 \
    --tokenizer mistralai/Mistral-7B-Instruct-v0.2
```

### 5. Verify the Server

```bash
# Check logs
docker logs -f aphrodite

# Health check
curl http://localhost:2242/health

# List loaded models
curl http://localhost:2242/v1/models
```
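
On first launch the model download and load can take several minutes, so `/health` may refuse connections at first. A small wait loop (assuming the default port mapping above):

```bash
# Poll the health endpoint until the model finishes loading
until curl -sf http://localhost:2242/health > /dev/null; do
  echo "waiting for Aphrodite to come up..."
  sleep 10
done
echo "server is ready"
```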

### 6. Access via CLORE.AI HTTP Proxy

The CLORE.AI order panel provides an `http_pub` URL for port 2242. Use it in your client applications:

```
https://<order-id>-2242.clore.ai/v1
```
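
For example, to list models through the proxy (substitute your order's hostname; no SSH tunnel required):

```bash
curl https://<order-id>-2242.clore.ai/v1/models
```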

***

## Usage Examples

### Example 1: OpenAI-Compatible Chat

```bash
curl http://localhost:2242/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "system", "content": "You are a creative writer specializing in fantasy fiction."},
      {"role": "user", "content": "Begin a short story about a dragon who learns to paint."}
    ],
    "max_tokens": 500,
    "temperature": 0.9,
    "top_p": 0.95
  }'
```

### Example 2: Advanced Sampling with Mirostat

Aphrodite supports Mirostat sampling, which dynamically steers generation toward a target level of "surprise" for more coherent long-form text: `mirostat_tau` sets the target and `mirostat_eta` controls how quickly the sampler adapts (`mirostat_mode: 2` selects Mirostat 2.0):

```bash
curl http://localhost:2242/v1/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "Once upon a time in a cyberpunk city,",
    "max_tokens": 400,
    "mirostat_mode": 2,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.1
  }'
```

### Example 3: Kobold-Compatible API

Aphrodite includes a Kobold-compatible endpoint for use with KoboldAI-based frontends:

```bash
# Kobold generation endpoint
curl http://localhost:2242/api/v1/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The spaceship entered hyperspace,",
    "max_length": 200,
    "temperature": 0.8,
    "top_p": 0.92,
    "rep_pen": 1.15
  }'
```

### Example 4: Python Client with Custom Samplers

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:2242/v1",
    api_key="none",
)

# Creative writing with tailored samplers
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {
            "role": "user",
            "content": "Write a poem about the silence between stars.",
        }
    ],
    max_tokens=300,
    temperature=1.1,
    top_p=0.95,
    frequency_penalty=0.3,
    presence_penalty=0.2,
)

print(response.choices[0].message.content)
```

### Example 5: Batch Completions

```python
import requests

BASE_URL = "http://localhost:2242"

prompts = [
    "The ancient wizard opened his tome and",
    "In the neon-lit alley, the detective noticed",
    "The last AI on Earth said to the robot:",
]

for prompt in prompts:
    response = requests.post(
        f"{BASE_URL}/v1/completions",
        json={
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": prompt,
            "max_tokens": 150,
            "temperature": 0.85,
            "top_k": 50,
            "top_p": 0.95,
            "repetition_penalty": 1.1,
        },
    )
    result = response.json()
    print(f"Prompt: {prompt}")
    print(f"Continuation: {result['choices'][0]['text']}\n")
```
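
Aphrodite batches concurrent requests on the GPU automatically (continuous batching), so the sequential loop above underuses it. The completions endpoint also accepts a list of prompts in a single request; this mirrors vLLM's OpenAI-compatible server, so treat it as an assumption to verify on your build:

```bash
curl http://localhost:2242/v1/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": [
      "The ancient wizard opened his tome and",
      "In the neon-lit alley, the detective noticed"
    ],
    "max_tokens": 150,
    "temperature": 0.85
  }'
```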

***

## Configuration

### Key Launch Parameters

| Parameter                  | Default     | Description                           |
| -------------------------- | ----------- | ------------------------------------- |
| `--model`                  | required    | Model ID or local path                |
| `--host`                   | `127.0.0.1` | Bind address                          |
| `--port`                   | `2242`      | Server port                           |
| `--dtype`                  | `auto`      | `float16`, `bfloat16`, `float32`      |
| `--quantization`           | none        | `awq`, `gptq`, `squeezellm`, `fp8`    |
| `--max-model-len`          | model max   | Override max context length           |
| `--gpu-memory-utilization` | `0.90`      | GPU memory fraction                   |
| `--tensor-parallel-size`   | `1`         | Number of GPUs for tensor parallelism |
| `--max-num-seqs`           | `256`       | Max concurrent sequences              |
| `--trust-remote-code`      | false       | Allow custom model code               |
| `--api-keys`               | none        | Comma-separated API keys for auth     |
| `--served-model-name`      | model name  | Custom name for API responses         |
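
For example, a two-GPU launch combining several of these flags (values are illustrative):

```bash
python3 -m aphrodite.endpoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 2242 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2 \
  --served-model-name llama3-8b
```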

### Adding API Key Authentication

```bash
python3 -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 \
  --port 2242 \
  --api-keys "mysecretkey1,mysecretkey2"
```

Then use `Authorization: Bearer mysecretkey1` in requests.
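
For example:

```bash
curl http://localhost:2242/v1/models \
  -H 'Authorization: Bearer mysecretkey1'
```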

### Loading Local Models

```bash
# Mount your model directory and reference it
docker run -d \
  --name aphrodite \
  --gpus all \
  --ipc host \
  -p 2242:2242 \
  -v /path/to/your/model:/model \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model /model \
    --host 0.0.0.0 \
    --port 2242
```

***

## Performance Tips

### 1. Choose the Right Quantization for Your GPU

| GPU VRAM | 7B Model    | 13B Model       | 30B Model |
| -------- | ----------- | --------------- | --------- |
| 6 GB     | GPTQ/AWQ Q4 | ❌               | ❌         |
| 8 GB     | GPTQ Q4     | GPTQ Q4 (tight) | ❌         |
| 12 GB    | Float16     | GPTQ Q4         | ❌         |
| 16 GB    | Float16     | Float16         | GPTQ Q4   |
| 24 GB    | Float16     | Float16         | GPTQ Q4   |
| 48 GB    | Float16     | Float16         | Float16   |

### 2. Tune GPU Memory Utilization

```bash
--gpu-memory-utilization 0.93  # Squeeze more KV cache
```

Start lower and increase gradually; back off if you hit out-of-memory errors.

### 3. Use bfloat16 on Ampere+ GPUs

```bash
--dtype bfloat16
```

Better numerical stability than float16, same speed.

### 4. Optimize for Roleplay/Creative Writing

These samplers work well for narrative text:

```json
{
  "temperature": 0.85,
  "top_p": 0.92,
  "top_k": 40,
  "repetition_penalty": 1.12,
  "mirostat_mode": 2,
  "mirostat_tau": 5.0
}
```
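
Applied to a completion request, the settings look like this:

```bash
curl http://localhost:2242/v1/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "The tavern door creaked open and",
    "max_tokens": 300,
    "temperature": 0.85,
    "top_p": 0.92,
    "top_k": 40,
    "repetition_penalty": 1.12,
    "mirostat_mode": 2,
    "mirostat_tau": 5.0
  }'
```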

### 5. Pascal GPU Tips (GTX 10xx)

Flash Attention is not supported on Pascal GPUs, so keep settings conservative:

```bash
--dtype float16  # float32 if you get NaN errors
--max-model-len 2048  # Reduce context for memory savings
```

***

## Troubleshooting

### Problem: "CUDA capability sm\_6x not supported"

Pascal GPUs require special handling. Use:

```bash
--dtype float16
```

If still failing, check if the image version supports Pascal:

```bash
docker pull alpindale/aphrodite-engine:v0.5.4  # Try specific version
```

### Problem: "out of memory" on small GPUs

```bash
--gpu-memory-utilization 0.85
--max-model-len 2048
--quantization gptq  # Or awq
```

### Problem: Slow token generation

* Check that the GPU is actually being used: run `nvidia-smi` inside the container
* Tune concurrency with `--max-num-seqs`: values below the default of 256 (e.g. `--max-num-seqs 64`) leave more memory for KV cache and can improve per-request speed
* Try AWQ instead of GPTQ (generally faster inference)

### Problem: Model not found / 404 errors

Always check your model name matches exactly:

```bash
curl http://localhost:2242/v1/models
```

Use the exact model name from the response in your requests.
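
To pull out just the served model id (assuming the standard OpenAI-style response shape):

```bash
curl -s http://localhost:2242/v1/models | \
  python3 -c "import json, sys; print(json.load(sys.stdin)['data'][0]['id'])"
```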

### Problem: Repetitive output

Add repetition penalty:

```json
{
  "repetition_penalty": 1.15,
  "frequency_penalty": 0.3
}
```

### Problem: Docker container exits silently

```bash
docker logs aphrodite 2>&1 | tail -100
# Common causes: insufficient VRAM, invalid model path
```
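
Re-running the same command in the foreground (drop `-d`, add `--rm`) usually surfaces the error directly:

```bash
# Foreground run: the crash message prints straight to your terminal
docker run --rm --gpus all --ipc host \
  -v /root/models:/root/.cache/huggingface \
  alpindale/aphrodite-engine:latest \
  python3 -m aphrodite.endpoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 --host 0.0.0.0 --port 2242
```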

***

## Links

* [GitHub](https://github.com/PygmalionAI/aphrodite-engine)
* [Documentation](https://aphrodite.pygmalion.chat)
* [Docker Hub](https://hub.docker.com/r/alpindale/aphrodite-engine)
* [Supported Models](https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#supported-models)
* [CLORE.AI Marketplace](https://clore.ai/marketplace)

***

## Clore.ai GPU Recommendations

| Use Case            | Recommended GPU  | Est. Cost on Clore.ai |
| ------------------- | ---------------- | --------------------- |
| Development/Testing | RTX 3090 (24GB)  | \~$0.12/gpu/hr        |
| Production (7B–13B) | RTX 4090 (24GB)  | \~$0.70/gpu/hr        |
| Large Models (70B+) | A100 80GB / H100 | \~$1.20/gpu/hr        |

> 💡 All examples in this guide can be deployed on [Clore.ai](https://clore.ai/marketplace) GPU servers. Browse available GPUs and rent by the hour — no commitments, full root access.


***

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/aphrodite-engine.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
