Llama.cpp Server

Run LLMs efficiently with llama.cpp server on GPU.


Server Requirements

| Parameter | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB+ |
| VRAM | 6GB | 8GB+ |
| Network | 200Mbps | 500Mbps+ |
| Startup Time | ~2-5 minutes | - |


Llama.cpp is memory-efficient due to GGUF quantization. 7B models can run on 6-8GB VRAM.

Renting on CLORE.AI

  1. Filter by GPU type, VRAM, and price

  2. Choose On-Demand (fixed rate) or Spot (bid price)

  3. Configure your order:

    • Select Docker image

    • Set ports (TCP for SSH, HTTP for web UIs)

    • Add environment variables if needed

    • Enter startup command

  4. Select payment: CLORE, BTC, or USDT/USDC

  5. Create order and wait for deployment

Access Your Server

  • Find connection details in My Orders

  • Web interfaces: Use the HTTP port URL

  • SSH: ssh -p <port> root@<proxy-address>

What is Llama.cpp?

Llama.cpp is a lightweight C/C++ inference engine that runs LLMs efficiently on both CPU and GPU:

  • Supports GGUF quantized models

  • Low memory usage

  • OpenAI-compatible API

  • Multi-user support

Quantization Levels

| Format | Size (7B) | Speed | Quality |
|---|---|---|---|
| Q2_K | 2.8GB | Fastest | Low |
| Q4_K_M | 4.1GB | Fast | Good |
| Q5_K_M | 4.8GB | Medium | Great |
| Q6_K | 5.5GB | Slower | Excellent |
| Q8_0 | 7.2GB | Slowest | Best |

Quick Deploy

You need three things: a Docker image, the ports to expose, and a startup command.
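A representative setup, sketched below; the image tag and model path are assumptions, not fixed values, so substitute whichever GGUF you plan to serve:

```bash
# Docker image (assumption: the official CUDA server build of llama.cpp):
#   ghcr.io/ggml-org/llama.cpp:server-cuda
# Ports: HTTP 8080 for the API, plus TCP 22 if you want SSH access

# Startup command (the model path is a placeholder):
./llama-server \
  --model /models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99
```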

Accessing Your Service

After deployment, find your http_pub URL in My Orders:

  1. Go to My Orders page

  2. Click on your order

  3. Find the http_pub URL (e.g., abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in examples below.

Verify It's Working

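Once the order shows as running, hit the health endpoint with a plain GET (the exact response body varies slightly across llama.cpp versions):

```bash
curl https://YOUR_HTTP_PUB_URL/health
# Expect HTTP 200 with a body like {"status":"ok"} once the model has loaded
```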

Complete API Reference

Standard Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /v1/models | GET | List models |
| /v1/chat/completions | POST | Chat (OpenAI compatible) |
| /v1/completions | POST | Text completion (OpenAI compatible) |
| /v1/embeddings | POST | Generate embeddings |
| /completion | POST | Native completion endpoint |
| /tokenize | POST | Tokenize text |
| /detokenize | POST | Detokenize tokens |
| /props | GET | Server properties |
| /metrics | GET | Prometheus metrics |

Tokenize Text
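The endpoint takes a JSON body with a `content` field:

```bash
curl -X POST https://YOUR_HTTP_PUB_URL/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello, world!"}'
```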

Response:
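The reply is a list of token IDs; the values below are illustrative, since the actual IDs depend on the model's tokenizer:

```json
{"tokens": [9906, 11, 1917, 0]}
```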

Server Properties
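A plain GET:

```bash
curl https://YOUR_HTTP_PUB_URL/props
```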

Response:
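An abbreviated example of the reply; the exact keys vary by llama.cpp version:

```json
{
  "default_generation_settings": { "n_ctx": 8192, "temperature": 0.8 },
  "total_slots": 4,
  "chat_template": "..."
}
```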

Build from Source
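A typical CUDA build; note the CMake flag has been renamed across releases (`GGML_CUDA` is the current spelling, older trees used `LLAMA_CUBLAS`):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Binaries land in build/bin/ (llama-server, llama-cli, llama-bench, ...)
```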

Download Models
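One common route is pulling a GGUF file from Hugging Face. The repo and filename below are examples; substitute whichever model and quantization from the table above you want:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```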

Server Options

Basic Server
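The minimum needed to serve a model over HTTP:

```bash
./llama-server --model ./models/model.gguf --host 0.0.0.0 --port 8080
```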

Full GPU Offload
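Setting `--n-gpu-layers` higher than the model's layer count offloads everything; 99 is a common shorthand:

```bash
./llama-server --model ./models/model.gguf \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 99
```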

All Options
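`./llama-server --help` prints the full list; flag spellings drift between releases, but these are the ones you will touch most often:

```bash
./llama-server \
  --model ./models/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --parallel 4 \
  --threads 8 \
  --flash-attn
# --ctx-size    context window in tokens
# --parallel    number of concurrent request slots
# --threads     CPU threads for any non-offloaded layers
# --flash-attn  enable flash attention on supported GPUs
```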

API Usage

Chat Completions (OpenAI Compatible)
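Because the API is OpenAI-compatible, the standard `openai` Python client works; point `base_url` at your server (the API key can be any placeholder unless you started the server with `--api-key`):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_HTTP_PUB_URL/v1",
    api_key="none",  # ignored unless the server enforces --api-key
)

response = client.chat.completions.create(
    model="local",  # llama.cpp serves a single model; the name is not checked
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain GGUF in one sentence."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```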

Streaming
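Pass `stream=True` and iterate over the chunks:

```python
from openai import OpenAI

client = OpenAI(base_url="https://YOUR_HTTP_PUB_URL/v1", api_key="none")

stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # delta.content is None for role/stop chunks, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```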

Text Completion
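The legacy completions endpoint takes a raw prompt instead of a message list:

```python
from openai import OpenAI

client = OpenAI(base_url="https://YOUR_HTTP_PUB_URL/v1", api_key="none")

response = client.completions.create(
    model="local",
    prompt="The three laws of robotics are",
    max_tokens=100,
)
print(response.choices[0].text)
```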

Embeddings
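Embeddings only work if the server was started with the `--embeddings` flag (ideally serving an embedding-tuned GGUF model):

```python
from openai import OpenAI

client = OpenAI(base_url="https://YOUR_HTTP_PUB_URL/v1", api_key="none")

response = client.embeddings.create(
    model="local",
    input="llama.cpp runs GGUF models on modest hardware",
)
vector = response.data[0].embedding
print(len(vector))  # dimensionality depends on the model
```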

cURL Examples

Chat
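A minimal chat request:

```bash
curl https://YOUR_HTTP_PUB_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```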

Completion
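Against the native `/completion` endpoint, which uses `n_predict` rather than `max_tokens`:

```bash
curl https://YOUR_HTTP_PUB_URL/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "n_predict": 16}'
```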

Health Check
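```bash
curl https://YOUR_HTTP_PUB_URL/health
```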

Metrics
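Prometheus-format metrics; the endpoint is only served when the process was started with `--metrics`:

```bash
curl https://YOUR_HTTP_PUB_URL/metrics
```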

Multi-GPU
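llama.cpp can split a model across cards; `--tensor-split` sets the proportion of the model per GPU (here an even split across two):

```bash
CUDA_VISIBLE_DEVICES=0,1 ./llama-server \
  --model ./models/model.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 1,1
```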

Memory Optimization

For Limited VRAM
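Offload only some layers, shrink the context, and prefer a smaller quantization; the layer count below is a starting point to tune against `nvidia-smi`:

```bash
./llama-server --model ./models/model-Q4_K_M.gguf \
  --n-gpu-layers 20 \
  --ctx-size 2048
```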

For Maximum Speed
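Offload everything and enable flash attention:

```bash
./llama-server --model ./models/model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --flash-attn \
  --batch-size 512
```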

Model-Specific Templates

Llama 2 Chat
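If a model's GGUF metadata lacks a chat template, you can pass one explicitly (`--chat-template llama2`) or format prompts yourself. Llama 2's format:

```
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]
```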

Mistral Instruct
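Mistral Instruct uses the same `[INST]` markers but no system block:

```
<s>[INST] {user_message} [/INST]
```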

ChatML (Many Models)
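ChatML, used by Qwen, OpenHermes, and many fine-tunes (`--chat-template chatml`):

```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
```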

Python Server Wrapper
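A minimal sketch of a launcher that starts `llama-server` as a subprocess and blocks until `/health` answers; the paths and flags are assumptions to adapt:

```python
import subprocess
import time
import urllib.request

def start_server(model_path: str, port: int = 8080) -> subprocess.Popen:
    """Launch llama-server and wait until its health endpoint responds."""
    proc = subprocess.Popen([
        "./llama-server", "--model", model_path,
        "--host", "0.0.0.0", "--port", str(port),
        "--n-gpu-layers", "99",
    ])
    url = f"http://localhost:{port}/health"
    for _ in range(120):  # wait up to ~2 minutes for the model to load
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return proc
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(1)
    proc.terminate()
    raise RuntimeError("llama-server did not become healthy in time")

if __name__ == "__main__":
    # Hypothetical model path for illustration
    server = start_server("models/llama-3.1-8b-instruct-Q4_K_M.gguf")
    print("server ready")
```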

Benchmarking
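`llama-bench` ships alongside the server binary; `-p` measures prompt processing and `-n` measures generation:

```bash
./llama-bench -m ./models/model.gguf -ngl 99 -p 512 -n 128
```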

Performance Comparison

| Model | GPU | Quantization | Tokens/sec |
|---|---|---|---|
| Llama 3.1 8B | RTX 3090 | Q4_K_M | ~100 |
| Llama 3.1 8B | RTX 4090 | Q4_K_M | ~150 |
| Llama 3.1 8B | RTX 3090 | Q8_0 | ~60 |
| Mistral 7B | RTX 3090 | Q4_K_M | ~110 |
| Mixtral 8x7B | A100 | Q4_K_M | ~50 |

Troubleshooting

CUDA Not Detected
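First confirm the container actually sees the GPU, then make sure the binary was built with CUDA enabled:

```bash
# Should list your GPU; if not, the problem is the host/container setup
nvidia-smi

# If llama-server logs show no CUDA device, rebuild with CUDA enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```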

Out of Memory
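Work down this list until it fits: fewer offloaded layers, a smaller context, a smaller quantization:

```bash
./llama-server --model ./models/model-Q4_K_M.gguf \
  --n-gpu-layers 24 \
  --ctx-size 2048
```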

Slow Generation
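Slow output usually means layers are running on the CPU; offload them all and confirm VRAM is actually in use:

```bash
./llama-server --model ./models/model.gguf --n-gpu-layers 99
watch -n1 nvidia-smi   # VRAM usage should be high during generation
```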

Production Setup

Systemd Service
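A minimal unit-file sketch, assuming the binary lives in `/opt/llama.cpp/build/bin`; adjust paths to your install:

```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    --model /models/model.gguf \
    --host 0.0.0.0 --port 8080 --n-gpu-layers 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `systemctl daemon-reload && systemctl enable --now llama-server`.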

With nginx
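A reverse-proxy sketch; the server name is a placeholder, and `proxy_buffering off` matters for streaming responses:

```nginx
server {
    listen 80;
    server_name llm.example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
```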

Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU | Hourly Rate | Daily Rate | 4-Hour Session |
|---|---|---|---|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.

Save money:

  • Use Spot market for flexible workloads (often 30-50% cheaper)

  • Pay with CLORE tokens

  • Compare prices across different providers

