Llama.cpp Server

Efficient LLM inference with the llama.cpp server on Clore.ai GPUs

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.

Server Requirements

Parameter | Minimum

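Renting on CLORE.AI

1. Go to the CLORE.AI Marketplace.
2. Filter by GPU type, VRAM, and price.
3. Choose On-Demand (fixed rate) or Spot (bid price).
4. Configure your order:
   - Select a Docker image
   - Set ports (TCP for SSH, HTTP for web interfaces)
   - Add environment variables if needed
   - Enter the startup command
5. Choose a payment method: CLORE, BTC, or USDT/USDC.
6. Create the order and wait for deployment.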

Accessing Your Server

- Find the connection details under My Orders.
- Web interface: use the URL of the HTTP port.
- SSH: ssh -p <port> root@<proxy-address>

What Is Llama.cpp?

Llama.cpp is one of the fastest CPU/GPU inference engines for LLMs:

- Supports GGUF quantized models
- Low memory usage
- OpenAI-compatible API
- Multi-user support

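Quantization Levels

| Format | Size (7B) | Speed   | Quality   |
|--------|-----------|---------|-----------|
| Q2_K   | 2.8 GB    | Fastest | Low       |
| Q4_K_M | 4.1 GB    | Fast    | Good      |
| Q5_K_M | 4.8 GB    | Medium  | Great     |
| Q6_K   | 5.5 GB    | Slower  | Excellent |
| Q8_0   | 7.2 GB    | Slowest | Best      |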
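
As a rough illustration of how the table can guide a choice, the sketch below picks the highest-quality format whose 7B file fits a given VRAM budget. The sizes are the table's figures; `pick_quant` and the `headroom_gb` reserve for the KV cache and runtime buffers are hypothetical names and values chosen for this example, not measurements.

```python
# Approximate 7B file sizes (GB) from the table, ordered from lowest to highest quality.
QUANT_SIZES_GB = [
    ("Q2_K", 2.8),
    ("Q4_K_M", 4.1),
    ("Q5_K_M", 4.8),
    ("Q6_K", 5.5),
    ("Q8_0", 7.2),
]


def pick_quant(vram_gb, headroom_gb=1.5):
    """Return the highest-quality format that fits, or None if nothing fits.

    headroom_gb is an assumed reserve for the KV cache and runtime buffers.
    """
    budget = vram_gb - headroom_gb
    fitting = [name for name, size in QUANT_SIZES_GB if size <= budget]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for vram in (4, 6, 8, 12):
        print(f"{vram} GB VRAM -> {pick_quant(vram)}")
```

Larger models scale these sizes roughly in proportion to parameter count, so the same approach applies with a different size list.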

Quick Deploy

Docker image:

ghcr.io/ggerganov/llama.cpp:server-cuda

Ports:

22/tcp
8080/http

Command:

# Download the model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run the server (-ngl: layers offloaded to the GPU, -c: context size in tokens)
./llama-server \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

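Accessing Your Service

After deployment, find your http_pub URL under My Orders:

1. Go to the My Orders page.
2. Click your order.
3. Find the http_pub URL (for example, abc123.clorecloud.net).

In the examples below, use https://YOUR_HTTP_PUB_URL instead of localhost.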

Verifying It Works

# Check health status
curl https://your-http-pub.clorecloud.net/health

# Get server info
curl https://your-http-pub.clorecloud.net/props

If you get HTTP 502, the service may still be starting or downloading the model. Wait 2-5 minutes and retry.
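
The wait-and-retry advice can be automated. The following is a minimal sketch, assuming only that /health returns 200 once the model is loaded; `http_status` and `wait_until_healthy` are hypothetical helper names, and the 300-second limit simply mirrors the upper end of the 2-5 minute guidance.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 from the proxy while the service starts
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


def wait_until_healthy(base_url, max_wait_s=300, interval_s=10,
                       get_status=http_status, sleep=time.sleep):
    """Poll <base_url>/health until it returns 200 or max_wait_s has elapsed."""
    waited = 0
    while True:
        if get_status(f"{base_url}/health") == 200:
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A call such as `wait_until_healthy("https://your-http-pub.clorecloud.net")` returns True once the server is ready and False if it never comes up within the limit.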

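Full API Reference

Standard Endpoints

| Endpoint             | Method | Description                         |
|----------------------|--------|-------------------------------------|
| /health              | GET    | Health check                        |
| /v1/models           | GET    | List models                         |
| /v1/chat/completions | POST   | Chat (OpenAI-compatible)            |
| /v1/completions      | POST   | Text completion (OpenAI-compatible) |
| /v1/embeddings       | POST   | Generate embeddings                 |
| /completion          | POST   | Native completion endpoint          |
| /tokenize            | POST   | Tokenize text                       |
| /detokenize          | POST   | Convert tokens back to text         |
| /props               | GET    | Server properties                   |
| /metrics             | GET    | Prometheus metrics                  |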

Tokenizing Text

curl https://your-http-pub.clorecloud.net/tokenize \
    -H "Content-Type: application/json" \
    -d '{"content": "Hello world"}'

Example response (the token IDs depend on the model's tokenizer):

{"tokens": [15496, 1917]}

Server Properties

curl https://your-http-pub.clorecloud.net/props

Response (abbreviated):

{
  "total_slots": 1,
  "chat_template": "...",
  "default_generation_settings": {...}
}

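Building from Source

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA using make (older releases only; the flag was LLAMA_CUDA=1 before the GGML_ rename)
make GGML_CUDA=1

# Or build with CMake (binaries are written to build/bin/)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release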

Downloading Models

# Llama 3.1 8B
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Mistral 7B
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

# Mixtral 8x7B
wget https://huggingface.co/bartowski/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf

# Phi-4
wget https://huggingface.co/bartowski/Phi-4-GGUF/resolve/main/Phi-4-Q4_K_M.gguf

# CodeLlama 7B
wget https://huggingface.co/bartowski/CodeLlama-7B-Instruct-GGUF/resolve/main/CodeLlama-7B-Instruct-Q4_K_M.gguf

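Server Options

Basic Server

./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080

Full GPU Offload

# Flags (a comment cannot follow a trailing backslash, so they are listed here):
# -ngl 99       GPU layers (99 = all)
# -c 4096       context size
# -t 8          CPU threads
# --parallel 4  concurrent requests
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 4096 \
    -t 8 \
    --parallel 4

All Options

# -m               model file
# --host           bind address
# --port           port
# -ngl             GPU layers
# -c               context size
# -t               threads
# -b               batch size
# --parallel       parallel requests
# --mlock          lock the model in memory
# --no-mmap        disable mmap
# --cont-batching  continuous batching
# --flash-attn     use flash attention (syntax varies by version; check ./llama-server --help)
# --metrics        enable the metrics endpoint
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096 \
    -t 8 \
    -b 512 \
    --parallel 4 \
    --mlock \
    --no-mmap \
    --cont-batching \
    --flash-attn \
    --metrics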
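
The context size and the number of parallel slots interact. The sketch below assumes the long-standing behaviour in which the total context (-c) is divided evenly across the --parallel slots; newer builds may share the cache differently, so treat this as an assumption to check against your version. `build_server_args` is a hypothetical helper, not part of llama.cpp.

```python
import shlex


def build_server_args(model, ctx_per_request, parallel=1, gpu_layers=99, port=8080):
    """Build a llama-server argument list so each slot gets ctx_per_request tokens.

    Assumes the total context (-c) is split evenly across --parallel slots.
    """
    total_ctx = ctx_per_request * parallel
    return [
        "./llama-server",
        "-m", model,
        "--host", "0.0.0.0",
        "--port", str(port),
        "-ngl", str(gpu_layers),
        "-c", str(total_ctx),
        "--parallel", str(parallel),
    ]


if __name__ == "__main__":
    # Four concurrent requests with 4096 tokens each need -c 16384 under this assumption.
    print(shlex.join(build_server_args("model.gguf", ctx_per_request=4096, parallel=4)))
```

Under that assumption, running `-c 4096 --parallel 4` would leave each request with only 1024 tokens, which is an easy mistake to make.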

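API Usage

Chat Completions (OpenAI-Compatible)

# Requires the OpenAI Python client: pip install openai
import openai

# Point the client at the llama.cpp server; use your http_pub URL instead of localhost when remote
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # placeholder; only checked if the server was started with --api-key
)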

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming

# Reuses the client created in the chat completions example
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
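
Without the OpenAI client, the stream arrives as server-sent events: lines of the form `data: {json}`, ending with `data: [DONE]`. The sketch below shows one way to pull the text out of such lines; `iter_stream_text` is a hypothetical helper, and the sample lines are hand-written in that shape rather than captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines (illustrative only).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"Once"}}]}',
    'data: {"choices":[{"delta":{"content":" upon"}}]}',
    'data: [DONE]',
]

print("".join(iter_stream_text(sample)))
```

In practice the lines would come from an HTTP response read incrementally, for example `requests.post(..., stream=True).iter_lines(decode_unicode=True)`.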

Text Completions

# Reuses the client created in the chat completions example
response = client.completions.create(
    model="llama-3.1-8b",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].text)

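Embeddings

# Reuses the client created above; the server generally needs to be started with --embeddings
response = client.embeddings.create(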
    model="llama-3.1-8b",
    input="Hello, world!"
)

print(f"Embedding: {response.data[0].embedding[:5]}...")
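
Embedding vectors are usually compared rather than inspected directly. A minimal sketch of cosine similarity in plain Python follows; `cosine_similarity` is a hypothetical helper, and the short vectors in the usage lines stand in for real embeddings, which have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors standing in for response.data[i].embedding
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values near 1 indicate texts the model considers similar; values near 0 indicate unrelated texts.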

cURL Examples

Chat

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ]
    }'

Completion

curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Building a website requires",
        "n_predict": 128,
        "temperature": 0.7
    }'

Health Check

curl http://localhost:8080/health

Metrics

curl http://localhost:8080/metrics
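
The metrics endpoint returns Prometheus text format, which is simple to read without a Prometheus server. The sketch below parses it into a dictionary; `parse_metrics` is a hypothetical helper, and the metric names in the sample are assumptions for illustration, so check your server's actual output.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


# Illustrative sample; the metric names are assumed, not copied from real output.
sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 128
llamacpp:tokens_predicted_total 512
"""

for name, value in parse_metrics(sample).items():
    print(name, value)
```

This simple parser does not handle the optional timestamp that Prometheus allows after the value.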

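Multi-GPU

# Split the model across multiple GPUs
# --tensor-split 0.5,0.5  share of the model placed on each of 2 GPUs
# --main-gpu 0            primary GPU
./llama-server \
    -m model.gguf \
    -ngl 99 \
    --tensor-split 0.5,0.5 \
    --main-gpu 0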
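
When the GPUs have different amounts of memory, an even split wastes capacity on the larger card. The sketch below derives a --tensor-split value proportional to each GPU's VRAM; `tensor_split_arg` is a hypothetical helper, and splitting by VRAM is one reasonable heuristic rather than a rule from llama.cpp.

```python
def tensor_split_arg(vram_gb_per_gpu):
    """Return a --tensor-split value proportional to each GPU's VRAM."""
    total = sum(vram_gb_per_gpu)
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    return ",".join(f"{v / total:.2f}" for v in vram_gb_per_gpu)


if __name__ == "__main__":
    print(tensor_split_arg([24, 24]))  # two equal GPUs
    print(tensor_split_arg([24, 12]))  # one 24 GB and one 12 GB GPU
```

The values are treated as proportions, so rounding that leaves them summing to slightly less or more than 1 should not matter.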

Memory Optimization

For Limited VRAM

# Partial offload
./llama-server -m model.gguf -ngl 20 -c 2048

# Use a smaller quantization:
# download Q2_K or Q3_K instead of Q4_K
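
A starting value for -ngl can be estimated from the model file size. The sketch below assumes every layer takes an equal share of the file and ignores the embedding and output tensors, so it is only a first guess to refine by watching nvidia-smi. `estimate_gpu_layers`, the 4.9 GB file size, the 32-layer count, and the 1 GB reserve are all assumptions for illustration.

```python
import math


def estimate_gpu_layers(model_size_gb, n_layers, free_vram_gb, reserve_gb=1.0):
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is an assumed allowance for the KV cache and runtime buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = free_vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Assumed example: a 4.9 GB file with 32 layers on GPUs with 4, 6 and 24 GB free
    for free in (4, 6, 24):
        print(f"{free} GB free -> -ngl {estimate_gpu_layers(4.9, 32, free)}")
```

If the server still runs out of memory with the estimated value, lower -ngl by a few layers or reduce the context size.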

For Maximum Speed

./llama-server \
    -m model.gguf \
    -ngl 99 \
    --flash-attn \
    --cont-batching \
    --parallel 8 \
    -b 1024

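Model-Specific Templates

Built-in template names vary between llama.cpp versions; ./llama-server --help lists the ones your build accepts.

Llama 2 Chat

./llama-server -m llama-2-7b-chat.gguf \
    --chat-template llama2

Mistral Instruct

./llama-server -m mistral-7b-instruct.gguf \
    --chat-template mistral

ChatML (Many Models)

./llama-server -m model.gguf \
    --chat-template chatml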
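
To make concrete what a chat template does, the sketch below renders a message list into the ChatML prompt format by hand. The server performs this step itself when a template is selected; `render_chatml` is a hypothetical helper shown only to illustrate the resulting text.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style messages as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]))
```

A model served with a template it was not trained on tends to produce poor or rambling output, which is why the template should match the model.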

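Python Server Wrapper

import subprocess
import time

import requests


class LlamaCppServer:
    """Start a local llama-server process and talk to its HTTP API."""

    def __init__(self, model_path, port=8080, gpu_layers=35):
        self.port = port
        # Launch the server as a child process
        self.process = subprocess.Popen([
            "./llama-server",
            "-m", model_path,
            "--host", "0.0.0.0",
            "--port", str(port),
            "-ngl", str(gpu_layers),
            "-c", "4096"
        ])
        self._wait_for_ready()

    def _wait_for_ready(self, timeout=60):
        """Poll /health until the server answers 200 or the timeout expires."""
        start = time.time()
        while time.time() - start < timeout:
            try:
                r = requests.get(f"http://localhost:{self.port}/health", timeout=5)
                if r.status_code == 200:
                    return
            except requests.RequestException:
                pass  # server is not accepting connections yet
            time.sleep(1)
        self.stop()  # avoid leaving an orphaned process behind
        raise TimeoutError("Server didn't start")

    def chat(self, messages, **kwargs):
        """Send an OpenAI-style chat request and return the parsed JSON."""
        response = requests.post(
            f"http://localhost:{self.port}/v1/chat/completions",
            json={"messages": messages, **kwargs}
        )
        response.raise_for_status()
        return response.json()

    def stop(self):
        """Terminate the server process and wait for it to exit."""
        self.process.terminate()
        self.process.wait()


# Usage
server = LlamaCppServer("llama-3.1-8b.gguf")
try:
    result = server.chat([{"role": "user", "content": "Hello!"}])
    print(result["choices"][0]["message"]["content"])
finally:
    server.stop()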

Benchmarking

# Built-in benchmark
./llama-bench -m model.gguf -ngl 99

# The output includes:
# - Tokens per second
# - Memory usage
# - Load time

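Performance Comparison

| Model        | GPU           | Quantization | Tokens/sec |
|--------------|---------------|--------------|------------|
| Llama 3.1 8B | (unspecified) | Q4_K_M       | ~100       |
| Llama 3.1 8B | (unspecified) | Q4_K_M       | ~150       |
| Llama 3.1 8B | (unspecified) | Q4_K_M       | ~60        |
| Mistral 7B   | (unspecified) | Q4_K_M       | ~110       |
| Mixtral 8x7B | A100          | Q4_K_M       | ~50        |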

Troubleshooting

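CUDA Not Detected

# Rebuild with CUDA (the flag was LLAMA_CUDA=1 before the GGML_ rename)
make clean
make GGML_CUDA=1

# Check that the driver sees the GPU
nvidia-smi

Out of Memory

# Reduce GPU layers
-ngl 20  # instead of 99

# Reduce context
-c 2048  # instead of 4096

# Use a smaller quantization, e.g. Q4_K_S instead of Q4_K_M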
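
Reducing the context helps because the KV cache grows linearly with it. The sketch below applies the standard estimate of 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per value. `kv_cache_bytes` is a hypothetical helper, and the example figures (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumed to match Llama 3.1 8B; other models need their own values.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


if __name__ == "__main__":
    # Assumed architecture values for Llama 3.1 8B with a 16-bit cache
    for ctx in (2048, 4096, 8192):
        mib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 1024**2
        print(f"-c {ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions, halving the context from 4096 to 2048 frees roughly 256 MiB, which can be enough to offload a few more layers.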

Slow Generation

# Increase batch size
-b 1024

# Enable flash attention
--flash-attn

# Enable continuous batching
--cont-batching

Production Setup

Systemd Service


# /etc/systemd/system/llama.service
[Unit]
Description=Llama.cpp Server
After=network.target

[Service]
Type=simple
ExecStart=/opt/llama.cpp/llama-server -m /models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080
Restart=always

[Install]
WantedBy=multi-user.target

With nginx

upstream llama {
    server localhost:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}

Download all required checkpoints

- Check file integrity
- GPU: verify CUDA compatibility

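Cost Estimate

Typical CLORE.AI Marketplace rates (as of 2024):

| GPU           | Hourly rate | Daily rate | 4-hour session |
|---------------|-------------|------------|----------------|
| RTX 3060      | ~$0.03      | ~$0.70     | ~$0.12         |
| (unspecified) | ~$0.06      | ~$1.50     | ~$0.25         |
| (unspecified) | ~$0.10      | ~$2.30     | ~$0.40         |
| A100 40GB     | ~$0.17      | ~$4.00     | ~$0.70         |
| A100 80GB     | ~$0.25      | ~$6.00     | ~$1.00         |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.

To save money, use the Spot market for flexible workloads (often 30-50% cheaper).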
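
The session and daily figures follow directly from the hourly rate. The sketch below does that arithmetic and applies an optional Spot discount; `session_cost` is a hypothetical helper, and the 30% discount in the usage lines is just the lower end of the range quoted above, not a guaranteed price.

```python
def session_cost(hourly_usd, hours, spot_discount=0.0):
    """Return the estimated cost in USD for a rental of the given length.

    spot_discount is a fraction between 0 and 1 (e.g. 0.3 for 30% cheaper).
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return round(hourly_usd * hours * (1.0 - spot_discount), 2)


if __name__ == "__main__":
    print(session_cost(0.03, 4))                      # 4-hour session at $0.03/hour
    print(session_cost(0.25, 24))                     # one day at $0.25/hour
    print(session_cost(0.25, 24, spot_discount=0.3))  # same day with an assumed 30% Spot discount
```

The results differ slightly from the table where the listed figures are rounded.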

Related Guides

- vLLM Inference - higher throughput
- ExLlamaV2 - faster inference
- Text Generation WebUI - web interface
