# Llama.cpp Server

Run LLMs efficiently with llama.cpp server on GPU.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum       | Recommended |
| ------------ | ------------- | ----------- |
| RAM          | 8GB           | 16GB+       |
| VRAM         | 6GB           | 8GB+        |
| Network      | 200Mbps       | 500Mbps+    |
| Startup Time | \~2-5 minutes | -           |

{% hint style="info" %}
Llama.cpp is memory-efficient thanks to GGUF quantization: a 7B model quantized to Q4 fits comfortably in 6-8GB of VRAM.
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Llama.cpp?

Llama.cpp is a fast, lightweight inference engine that runs LLMs on CPU, GPU, or both:

* Supports GGUF quantized models
* Low memory usage
* OpenAI-compatible API
* Multi-user support

## Quantization Levels

| Format   | Size (7B) | Speed   | Quality   |
| -------- | --------- | ------- | --------- |
| Q2\_K    | 2.8GB     | Fastest | Low       |
| Q4\_K\_M | 4.1GB     | Fast    | Good      |
| Q5\_K\_M | 4.8GB     | Medium  | Great     |
| Q6\_K    | 5.5GB     | Slower  | Excellent |
| Q8\_0    | 7.2GB     | Slowest | Best      |
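
As a rough rule of thumb, the GGUF file needs to fit in VRAM with some headroom for the KV cache and CUDA buffers. Below is a minimal sketch of that check, assuming a single GPU and the Q4_K_M file from the Quick Deploy section below; the ~1.5GB headroom figure is an approximation, not a hard limit.

```bash
# Free VRAM (MB) on the first GPU vs. GGUF file size (MB)
FREE_MB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
MODEL_MB=$(( $(stat -c%s Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf) / 1024 / 1024 ))
echo "Free VRAM: ${FREE_MB} MB, model file: ${MODEL_MB} MB"

# Leave ~1.5GB headroom for KV cache and CUDA buffers (rough assumption)
if [ "$FREE_MB" -gt $(( MODEL_MB + 1536 )) ]; then
    echo "Full offload (-ngl 99) should fit"
else
    echo "Use a smaller quant or partial offload (lower -ngl)"
fi
```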

## Quick Deploy

**Docker Image:**

```
ghcr.io/ggerganov/llama.cpp:server-cuda
```

**Ports:**

```
22/tcp
8080/http
```

**Command:**

```bash
# Download model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run server
./llama-server \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
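
One convenient pattern is to keep that URL in a shell variable and reuse it in the commands below (the hostname here is a placeholder):

```bash
# Replace with the http_pub hostname shown in My Orders
export LLAMA_BASE_URL="https://abc123.clorecloud.net"
curl "$LLAMA_BASE_URL/health"
```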

### Verify It's Working

```bash
# Check health
curl https://your-http-pub.clorecloud.net/health

# Get server info
curl https://your-http-pub.clorecloud.net/props
```

{% hint style="warning" %}
If you get HTTP 502, the service may still be starting or downloading the model. Wait 2-5 minutes and retry.
{% endhint %}
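
If you prefer not to retry manually, a small loop can poll the health endpoint until the model has finished loading (replace the hostname with your `http_pub` URL):

```bash
# Poll /health every 10 seconds until the server responds with HTTP 200
until curl -sf -o /dev/null https://your-http-pub.clorecloud.net/health; do
    echo "still starting..."
    sleep 10
done
echo "server is ready"
```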

## Complete API Reference

### Standard Endpoints

| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List models                         |
| `/v1/chat/completions` | POST   | Chat (OpenAI compatible)            |
| `/v1/completions`      | POST   | Text completion (OpenAI compatible) |
| `/v1/embeddings`       | POST   | Generate embeddings                 |
| `/completion`          | POST   | Native completion endpoint          |
| `/tokenize`            | POST   | Tokenize text                       |
| `/detokenize`          | POST   | Detokenize tokens                   |
| `/props`               | GET    | Server properties                   |
| `/metrics`             | GET    | Prometheus metrics                  |

#### Tokenize Text

```bash
curl https://your-http-pub.clorecloud.net/tokenize \
    -H "Content-Type: application/json" \
    -d '{"content": "Hello world"}'
```

Response:

```json
{"tokens": [15496, 1917]}
```
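
#### Detokenize Tokens

The `/detokenize` endpoint reverses the mapping: it takes token IDs and returns the reconstructed text (a `content` field in current builds).

```bash
curl https://your-http-pub.clorecloud.net/detokenize \
    -H "Content-Type: application/json" \
    -d '{"tokens": [15496, 1917]}'
```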

#### Server Properties

```bash
curl https://your-http-pub.clorecloud.net/props
```

Response:

```json
{
  "total_slots": 1,
  "chat_template": "...",
  "default_generation_settings": {...}
}
```

## Build from Source

```bash
# Clone repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA via CMake (recent releases use the GGML_CUDA flag;
# older ones used LLAMA_CUDA / LLAMA_CUBLAS)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Binaries are placed in build/bin/ (e.g. build/bin/llama-server)

# Older releases also supported a Makefile build:
# make LLAMA_CUDA=1
```

## Download Models

```bash
# Llama 3.1 8B
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Mistral 7B
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

# Mixtral 8x7B
wget https://huggingface.co/bartowski/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf

# Phi-4
wget https://huggingface.co/bartowski/Phi-4-GGUF/resolve/main/Phi-4-Q4_K_M.gguf

# CodeLlama 7B
wget https://huggingface.co/bartowski/CodeLlama-7B-Instruct-GGUF/resolve/main/CodeLlama-7B-Instruct-Q4_K_M.gguf
```
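
If the `wget` links change or you prefer the Hugging Face tooling, `huggingface-cli` can fetch a single GGUF file as well (this assumes `pip install -U huggingface_hub` has been run; repo and file names are the same as above):

```bash
# Download one quantization file into the current directory
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir .
```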

## Server Options

### Basic Server

```bash
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080
```

### Full GPU Offload

```bash
# -ngl: GPU layers to offload (99 = all), -c: context size,
# -t: CPU threads, --parallel: concurrent request slots
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 4096 \
    -t 8 \
    --parallel 4
```

### All Options

```bash
# -m: model file               --host / --port: bind address and port
# -ngl: GPU layers             -c: context size
# -t: CPU threads              -b: batch size
# --parallel: request slots    --mlock: lock model in RAM
# --no-mmap: disable mmap      --cont-batching: continuous batching
# --flash-attn: flash attention
# --metrics: enable the /metrics endpoint
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096 \
    -t 8 \
    -b 512 \
    --parallel 4 \
    --mlock \
    --no-mmap \
    --cont-batching \
    --flash-attn \
    --metrics
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Text Completion

```python
response = client.completions.create(
    model="llama-3.1-8b",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].text)
```

### Embeddings

Embedding requests work only if the server was started with embedding support enabled (the `--embeddings` flag in recent llama.cpp builds).

```python
response = client.embeddings.create(
    model="llama-3.1-8b",
    input="Hello, world!"
)

print(f"Embedding: {response.data[0].embedding[:5]}...")
```

## cURL Examples

### Chat

```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ]
    }'
```

### Completion

```bash
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Building a website requires",
        "n_predict": 128,
        "temperature": 0.7
    }'
```

### Health Check

```bash
curl http://localhost:8080/health
```

### Metrics

```bash
curl http://localhost:8080/metrics
```

## Multi-GPU

```bash
# Split layers across GPUs; --tensor-split takes one fraction per device,
# --main-gpu selects the primary device
./llama-server \
    -m model.gguf \
    -ngl 99 \
    --tensor-split 0.5,0.5 \
    --main-gpu 0
```
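
Before choosing a split, confirm how many GPUs the instance actually exposes; `--tensor-split` expects one fraction per visible device:

```bash
# List visible GPUs
nvidia-smi -L

# Optionally restrict llama-server to specific devices
CUDA_VISIBLE_DEVICES=0,1 ./llama-server -m model.gguf -ngl 99 --tensor-split 0.5,0.5
```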

## Memory Optimization

### For Limited VRAM

```bash
# Partial offload
./llama-server -m model.gguf -ngl 20 -c 2048

# Or use a smaller quantization: download Q2_K or Q3_K instead of Q4_K
```

### For Maximum Speed

```bash
./llama-server \
    -m model.gguf \
    -ngl 99 \
    --flash-attn \
    --cont-batching \
    --parallel 8 \
    -b 1024
```

## Model-Specific Templates

Most GGUF models ship with an embedded chat template that llama-server uses automatically; pass `--chat-template` only to override it (accepted template names vary by llama.cpp version).

### Llama 3 Instruct

```bash
./llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --chat-template llama3
```

### Mistral Instruct

```bash
./llama-server -m mistral-7b-instruct.gguf \
    --chat-template mistral
```

### ChatML (Many Models)

```bash
./llama-server -m model.gguf \
    --chat-template chatml
```
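
To see which template the running server is actually applying, the `/props` endpoint exposes it (this example assumes `jq` is installed):

```bash
# Print the chat template reported by the running server
curl -s https://your-http-pub.clorecloud.net/props | jq -r .chat_template
```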

## Python Server Wrapper

```python
import subprocess
import requests
import time

class LlamaCppServer:
    def __init__(self, model_path, port=8080, gpu_layers=35):
        self.port = port
        self.process = subprocess.Popen([
            "./llama-server",
            "-m", model_path,
            "--host", "0.0.0.0",
            "--port", str(port),
            "-ngl", str(gpu_layers),
            "-c", "4096"
        ])
        self._wait_for_ready()

    def _wait_for_ready(self, timeout=60):
        start = time.time()
        while time.time() - start < timeout:
            try:
                r = requests.get(f"http://localhost:{self.port}/health")
                if r.status_code == 200:
                    return
            except requests.RequestException:
                pass
            time.sleep(1)
        raise TimeoutError("Server didn't start")

    def chat(self, messages, **kwargs):
        response = requests.post(
            f"http://localhost:{self.port}/v1/chat/completions",
            json={"messages": messages, **kwargs}
        )
        return response.json()

    def stop(self):
        self.process.terminate()

# Usage
server = LlamaCppServer("llama-3.1-8b.gguf")
result = server.chat([{"role": "user", "content": "Hello!"}])
print(result["choices"][0]["message"]["content"])
server.stop()
```

## Benchmarking

```bash
# Built-in benchmark
./llama-bench -m model.gguf -ngl 99

# Output includes:
# - Tokens per second
# - Memory usage
# - Load time
```
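
llama-bench also accepts multiple `-m` flags and custom prompt/generation lengths, which makes it easy to compare quantizations side by side (the file names below are just examples):

```bash
# Compare two quantizations: 512-token prompt, 128 generated tokens
./llama-bench \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
    -ngl 99 -p 512 -n 128
```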

## Performance Comparison

| Model        | GPU      | Quantization | Tokens/sec |
| ------------ | -------- | ------------ | ---------- |
| Llama 3.1 8B | RTX 3090 | Q4\_K\_M     | \~100      |
| Llama 3.1 8B | RTX 4090 | Q4\_K\_M     | \~150      |
| Llama 3.1 8B | RTX 3060 | Q4\_K\_M     | \~60       |
| Mistral 7B   | RTX 3090 | Q4\_K\_M     | \~110      |
| Mixtral 8x7B | A100     | Q4\_K\_M     | \~50       |

## Troubleshooting

### CUDA Not Detected

```bash
# Rebuild with CUDA enabled (GGML_CUDA on recent releases)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Check that the driver sees the GPU
nvidia-smi
```

### Out of Memory

```bash
# Reduce GPU layers
-ngl 20   # instead of 99

# Reduce context
-c 2048   # instead of 4096

# Use a smaller quant: Q4_K_S instead of Q4_K_M
```

### Slow Generation

```bash
# Increase batch size
-b 1024

# Enable flash attention
--flash-attn

# Enable continuous batching
--cont-batching
```

## Production Setup

### Systemd Service

```ini
# /etc/systemd/system/llama.service
[Unit]
Description=Llama.cpp Server
After=network.target

[Service]
Type=simple
ExecStart=/opt/llama.cpp/llama-server -m /models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080
Restart=always

[Install]
WantedBy=multi-user.target
```
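
After writing the unit file, reload systemd and enable the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now llama
sudo systemctl status llama
```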

### With nginx

```nginx
upstream llama {
    server localhost:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* vLLM Inference - Higher throughput
* [ExLlamaV2](https://docs.clore.ai/guides/language-models/exllamav2-fast) - Faster inference
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Web interface

