# Llama.cpp Server

Run LLMs efficiently with llama.cpp server on GPU.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum       | Recommended |
| ------------ | ------------- | ----------- |
| RAM          | 8GB           | 16GB+       |
| VRAM         | 6GB           | 8GB+        |
| Network      | 200Mbps       | 500Mbps+    |
| Startup Time | \~2-5 minutes | -           |

{% hint style="info" %}
Llama.cpp is memory-efficient thanks to GGUF quantization: a 7B model quantized to Q4 fits comfortably in 6-8GB of VRAM.
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## What is Llama.cpp?

Llama.cpp is a fast, lightweight inference engine that runs LLMs on CPU, GPU, or both:

* Supports GGUF quantized models
* Low memory usage
* OpenAI-compatible API
* Multi-user support

## Quantization Levels

| Format   | Size (7B) | Speed   | Quality   |
| -------- | --------- | ------- | --------- |
| Q2\_K    | 2.8GB     | Fastest | Low       |
| Q4\_K\_M | 4.1GB     | Fast    | Good      |
| Q5\_K\_M | 4.8GB     | Medium  | Great     |
| Q6\_K    | 5.5GB     | Slower  | Excellent |
| Q8\_0    | 7.2GB     | Slowest | Best      |
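
As a rough rule of thumb, the GGUF file needs to fit in VRAM with some headroom for the KV cache and CUDA buffers. Below is a minimal sketch of that check, assuming a single GPU and the Q4_K_M file from the Quick Deploy section below; the ~1.5GB headroom figure is an approximation, not a hard limit.

```bash
# Free VRAM (MB) on the first GPU vs. GGUF file size (MB)
FREE_MB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
MODEL_MB=$(( $(stat -c%s Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf) / 1024 / 1024 ))
echo "Free VRAM: ${FREE_MB} MB, model file: ${MODEL_MB} MB"

# Leave ~1.5GB headroom for KV cache and CUDA buffers (rough assumption)
if [ "$FREE_MB" -gt $(( MODEL_MB + 1536 )) ]; then
    echo "Full offload (-ngl 99) should fit"
else
    echo "Use a smaller quant or partial offload (lower -ngl)"
fi
```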

## Quick Deploy

**Docker Image:**

```
ghcr.io/ggerganov/llama.cpp:server-cuda
```

**Ports:**

```
22/tcp
8080/http
```

**Command:**

```bash
# Download model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run server
./llama-server \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
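
One convenient pattern is to keep that URL in a shell variable and reuse it in the commands below (the hostname here is a placeholder):

```bash
# Replace with the http_pub hostname shown in My Orders
export LLAMA_BASE_URL="https://abc123.clorecloud.net"
curl "$LLAMA_BASE_URL/health"
```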

### Verify It's Working

```bash
# Check health
curl https://your-http-pub.clorecloud.net/health

# Get server info
curl https://your-http-pub.clorecloud.net/props
```

{% hint style="warning" %}
If you get HTTP 502, the service may still be starting or downloading the model. Wait 2-5 minutes and retry.
{% endhint %}
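
If you prefer not to retry manually, a small loop can poll the health endpoint until the model has finished loading (replace the hostname with your `http_pub` URL):

```bash
# Poll /health every 10 seconds until the server responds with HTTP 200
until curl -sf -o /dev/null https://your-http-pub.clorecloud.net/health; do
    echo "still starting..."
    sleep 10
done
echo "server is ready"
```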

## Complete API Reference

### Standard Endpoints

| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List models                         |
| `/v1/chat/completions` | POST   | Chat (OpenAI compatible)            |
| `/v1/completions`      | POST   | Text completion (OpenAI compatible) |
| `/v1/embeddings`       | POST   | Generate embeddings                 |
| `/completion`          | POST   | Native completion endpoint          |
| `/tokenize`            | POST   | Tokenize text                       |
| `/detokenize`          | POST   | Detokenize tokens                   |
| `/props`               | GET    | Server properties                   |
| `/metrics`             | GET    | Prometheus metrics                  |

#### Tokenize Text

```bash
curl https://your-http-pub.clorecloud.net/tokenize \
    -H "Content-Type: application/json" \
    -d '{"content": "Hello world"}'
```

Response:

```json
{"tokens": [15496, 1917]}
```
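
#### Detokenize Tokens

The `/detokenize` endpoint reverses the mapping: it takes token IDs and returns the reconstructed text (a `content` field in current builds).

```bash
curl https://your-http-pub.clorecloud.net/detokenize \
    -H "Content-Type: application/json" \
    -d '{"tokens": [15496, 1917]}'
```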

#### Server Properties

```bash
curl https://your-http-pub.clorecloud.net/props
```

Response:

```json
{
  "total_slots": 1,
  "chat_template": "...",
  "default_generation_settings": {...}
}
```

## Build from Source

```bash
# Clone repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA via CMake (recent releases use the GGML_CUDA flag;
# older ones used LLAMA_CUDA / LLAMA_CUBLAS)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Binaries are placed in build/bin/ (e.g. build/bin/llama-server)

# Older releases also supported a Makefile build:
# make LLAMA_CUDA=1
```

## Download Models

```bash
# Llama 3.1 8B
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Mistral 7B
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

# Mixtral 8x7B
wget https://huggingface.co/bartowski/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf

# Phi-4
wget https://huggingface.co/bartowski/Phi-4-GGUF/resolve/main/Phi-4-Q4_K_M.gguf

# CodeLlama 7B
wget https://huggingface.co/bartowski/CodeLlama-7B-Instruct-GGUF/resolve/main/CodeLlama-7B-Instruct-Q4_K_M.gguf
```
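
If the `wget` links change or you prefer the Hugging Face tooling, `huggingface-cli` can fetch a single GGUF file as well (this assumes `pip install -U huggingface_hub` has been run; repo and file names are the same as above):

```bash
# Download one quantization file into the current directory
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir .
```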

## Server Options

### Basic Server

```bash
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080
```

### Full GPU Offload

```bash
# -ngl: GPU layers to offload (99 = all), -c: context size,
# -t: CPU threads, --parallel: concurrent request slots
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 4096 \
    -t 8 \
    --parallel 4
```

### All Options

```bash
# -m: model file               --host / --port: bind address and port
# -ngl: GPU layers             -c: context size
# -t: CPU threads              -b: batch size
# --parallel: request slots    --mlock: lock model in RAM
# --no-mmap: disable mmap      --cont-batching: continuous batching
# --flash-attn: flash attention
# --metrics: enable the /metrics endpoint
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096 \
    -t 8 \
    -b 512 \
    --parallel 4 \
    --mlock \
    --no-mmap \
    --cont-batching \
    --flash-attn \
    --metrics
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Text Completion

```python
response = client.completions.create(
    model="llama-3.1-8b",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].text)
```

### Embeddings

Embedding requests work only if the server was started with embedding support enabled (the `--embeddings` flag in recent llama.cpp builds).

```python
response = client.embeddings.create(
    model="llama-3.1-8b",
    input="Hello, world!"
)

print(f"Embedding: {response.data[0].embedding[:5]}...")
```

## cURL Examples

### Chat

```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ]
    }'
```

### Completion

```bash
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Building a website requires",
        "n_predict": 128,
        "temperature": 0.7
    }'
```

### Health Check

```bash
curl http://localhost:8080/health
```

### Metrics

```bash
curl http://localhost:8080/metrics
```

## Multi-GPU

```bash
# Split layers across GPUs; --tensor-split takes one fraction per device,
# --main-gpu selects the primary device
./llama-server \
    -m model.gguf \
    -ngl 99 \
    --tensor-split 0.5,0.5 \
    --main-gpu 0
```
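
Before choosing a split, confirm how many GPUs the instance actually exposes; `--tensor-split` expects one fraction per visible device:

```bash
# List visible GPUs
nvidia-smi -L

# Optionally restrict llama-server to specific devices
CUDA_VISIBLE_DEVICES=0,1 ./llama-server -m model.gguf -ngl 99 --tensor-split 0.5,0.5
```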

## Memory Optimization

### For Limited VRAM

```bash
# Partial offload
./llama-server -m model.gguf -ngl 20 -c 2048

# Or use a smaller quantization: download Q2_K or Q3_K instead of Q4_K
```

### For Maximum Speed

```bash
./llama-server \
    -m model.gguf \
    -ngl 99 \
    --flash-attn \
    --cont-batching \
    --parallel 8 \
    -b 1024
```

## Model-Specific Templates

Most GGUF models ship with an embedded chat template that llama-server uses automatically; pass `--chat-template` only to override it (accepted template names vary by llama.cpp version).

### Llama 3 Instruct

```bash
./llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --chat-template llama3
```

### Mistral Instruct

```bash
./llama-server -m mistral-7b-instruct.gguf \
    --chat-template mistral
```

### ChatML (Many Models)

```bash
./llama-server -m model.gguf \
    --chat-template chatml
```
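
To see which template the running server is actually applying, the `/props` endpoint exposes it (this example assumes `jq` is installed):

```bash
# Print the chat template reported by the running server
curl -s https://your-http-pub.clorecloud.net/props | jq -r .chat_template
```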

## Python Server Wrapper

```python
import subprocess
import requests
import time

class LlamaCppServer:
    def __init__(self, model_path, port=8080, gpu_layers=35):
        self.port = port
        self.process = subprocess.Popen([
            "./llama-server",
            "-m", model_path,
            "--host", "0.0.0.0",
            "--port", str(port),
            "-ngl", str(gpu_layers),
            "-c", "4096"
        ])
        self._wait_for_ready()

    def _wait_for_ready(self, timeout=60):
        start = time.time()
        while time.time() - start < timeout:
            try:
                r = requests.get(f"http://localhost:{self.port}/health")
                if r.status_code == 200:
                    return
            except requests.RequestException:
                pass
            time.sleep(1)
        raise TimeoutError("Server didn't start")

    def chat(self, messages, **kwargs):
        response = requests.post(
            f"http://localhost:{self.port}/v1/chat/completions",
            json={"messages": messages, **kwargs}
        )
        return response.json()

    def stop(self):
        self.process.terminate()

# Usage
server = LlamaCppServer("llama-3.1-8b.gguf")
result = server.chat([{"role": "user", "content": "Hello!"}])
print(result["choices"][0]["message"]["content"])
server.stop()
```

## Benchmarking

```bash
# Built-in benchmark
./llama-bench -m model.gguf -ngl 99

# Output includes:
# - Tokens per second
# - Memory usage
# - Load time
```
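
llama-bench also accepts multiple `-m` flags and custom prompt/generation lengths, which makes it easy to compare quantizations side by side (the file names below are just examples):

```bash
# Compare two quantizations: 512-token prompt, 128 generated tokens
./llama-bench \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
    -ngl 99 -p 512 -n 128
```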

## Performance Comparison

| Model        | GPU      | Quantization | Tokens/sec |
| ------------ | -------- | ------------ | ---------- |
| Llama 3.1 8B | RTX 3090 | Q4\_K\_M     | \~100      |
| Llama 3.1 8B | RTX 4090 | Q4\_K\_M     | \~150      |
| Llama 3.1 8B | RTX 3060 | Q4\_K\_M     | \~60       |
| Mistral 7B   | RTX 3090 | Q4\_K\_M     | \~110      |
| Mixtral 8x7B | A100     | Q4\_K\_M     | \~50       |

## Troubleshooting

### CUDA Not Detected

```bash
# Rebuild with CUDA enabled (GGML_CUDA on recent releases)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Check that the driver sees the GPU
nvidia-smi
```

### Out of Memory

```bash
# Reduce GPU layers
-ngl 20   # instead of 99

# Reduce context
-c 2048   # instead of 4096

# Use a smaller quant: Q4_K_S instead of Q4_K_M
```

### Slow Generation

```bash
# Increase batch size
-b 1024

# Enable flash attention
--flash-attn

# Enable continuous batching
--cont-batching
```

## Production Setup

### Systemd Service

```ini
# /etc/systemd/system/llama.service
[Unit]
Description=Llama.cpp Server
After=network.target

[Service]
Type=simple
ExecStart=/opt/llama.cpp/llama-server -m /models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080
Restart=always

[Install]
WantedBy=multi-user.target
```
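
After writing the unit file, reload systemd and enable the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now llama
sudo systemctl status llama
```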

### With nginx

```nginx
upstream llama {
    server localhost:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

## Next Steps

* vLLM Inference - Higher throughput
* [ExLlamaV2](https://docs.clore.ai/guides/language-models/exllamav2-fast) - Faster inference
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Web interface

