Llama.cpp Server

Efficient LLM inference with the llama.cpp server on Clore.ai GPUs

Run LLMs efficiently on GPUs with the llama.cpp server.

All examples can be run on GPU servers rented through the CLORE.AI Marketplace.

Server requirements

| Parameter | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 8 GB | 16 GB+ |
| VRAM | 6 GB | 8 GB+ |
| Network | 200 Mbps | 500 Mbps+ |

Startup time: ~2-5 minutes

Llama.cpp is memory-efficient thanks to GGUF quantization. 7B models can run in 6-8 GB of VRAM.

Rent on CLORE.AI

1. Visit the CLORE.AI Marketplace.
2. Filter by GPU type, VRAM, and price.
3. Choose On-Demand (fixed rate) or Spot (bid price).
4. Configure your order:
   - Select a Docker image
   - Set the ports (TCP for SSH, HTTP for web interfaces)
   - Add environment variables if needed
   - Enter the startup command
5. Select a payment method: CLORE, BTC, or USDT/USDC.
6. Create the order and wait for deployment.

Access your server

- Find the connection details in My Orders
- Web interfaces: use the HTTP port URL
- SSH: ssh -p <port> root@<proxy-address>

What is Llama.cpp?

Llama.cpp is a fast, lightweight inference engine for LLMs that runs on CPU and GPU:

- Supports GGUF quantized models
- Low memory usage
- OpenAI-compatible API
- Multi-user support

Quantization levels

| Format | Size (7B) | Speed | Quality |
|--------|-----------|-------|---------|
| Q2_K | 2.8 GB | Fastest | Low |
| Q4_K_M | 4.1 GB | Fast | Good |
| Q5_K_M | 4.8 GB | Medium | Great |
| Q6_K | 5.5 GB | Slower | Excellent |
| Q8_0 | 7.2 GB | Slowest | Best |
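
The table can be turned into a quick selection rule. The sketch below picks the highest-quality format whose 7B file size fits a given VRAM budget. The sizes are copied from the table; the `overhead_gb` reserve for the KV cache and runtime buffers is an assumed figure for illustration, not a measured one, and `pick_quant` is a hypothetical helper name.

```python
# File sizes (GB) for a 7B model, copied from the table above,
# ordered from lowest to highest quality.
QUANT_SIZES_GB = [
    ("Q2_K", 2.8),
    ("Q4_K_M", 4.1),
    ("Q5_K_M", 4.8),
    ("Q6_K", 5.5),
    ("Q8_0", 7.2),
]


def pick_quant(vram_gb, overhead_gb=1.5):
    """Return the highest-quality format that fits, or None if none does.

    overhead_gb is an assumed reserve for the KV cache and runtime buffers.
    """
    budget = vram_gb - overhead_gb
    best = None
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= budget:
            best = name  # later entries are higher quality
    return best


print(pick_quant(6))   # Q4_K_M
print(pick_quant(8))   # Q6_K
print(pick_quant(24))  # Q8_0
```

Larger contexts need a larger reserve, so treat the result as a starting point and confirm with nvidia-smi once the server is running.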

Quick deploy

Docker image:

ghcr.io/ggerganov/llama.cpp:server-cuda

Ports:

22/tcp
8080/http

Command:


# Download the model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run the server
./llama-server \
    -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

Accessing your service

After deployment, find your http_pub URL in My Orders:

1. Go to the My Orders page
2. Click your order
3. Find the http_pub URL (for example, abc123.clorecloud.net)

Use https://YOUR_HTTP_PUB_URL instead of localhost in the examples below.

Verify it works

# Check health
curl https://your-http-pub.clorecloud.net/health

# Get server information
curl https://your-http-pub.clorecloud.net/props

If you get HTTP 502, the service may still be starting or downloading the model. Wait 2-5 minutes and try again.
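
Waiting and retrying by hand can be automated. The following is a minimal sketch, using only the Python standard library, that polls the /health endpoint until it answers 200 or a deadline passes. The function names, the 300-second limit, and the 5-second interval are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. the 502 mentioned above
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def wait_until_healthy(url, max_wait=300.0, interval=5.0,
                       get_status=http_status, sleep=time.sleep):
    """Poll url until it answers 200; return True on success, False on timeout."""
    waited = 0.0
    while True:
        if get_status(url) == 200:
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example call (replace the host with your own http_pub URL):
# wait_until_healthy("https://your-http-pub.clorecloud.net/health")
```

The `get_status` and `sleep` parameters exist only so the loop can be exercised without a network connection.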

Full API reference

Standard endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check |
| /v1/models | GET | List models |
| /v1/chat/completions | POST | Chat (OpenAI-compatible) |
| /v1/completions | POST | Text completion (OpenAI-compatible) |
| /v1/embeddings | POST | Generate embeddings |
| /completion | POST | Native completion endpoint |
| /tokenize | POST | Tokenize text |
| /detokenize | POST | Detokenize tokens |
| /props | GET | Server properties |
| /metrics | GET | Prometheus metrics (requires --metrics) |

Tokenize text

curl https://your-http-pub.clorecloud.net/tokenize \
    -H "Content-Type: application/json" \
    -d '{"content": "Hello world"}'

Response (the exact token IDs depend on the model's tokenizer):

{"tokens": [15496, 1917]}
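
One practical use of /tokenize is checking whether a prompt will fit in the context window before sending it. The sketch below counts the tokens in a response body like the one above and compares the total against the context size. `count_tokens` and `fits_context` are hypothetical helper names, and the 4096 default mirrors the `-c 4096` used in this guide.

```python
import json


def count_tokens(response_body):
    """Count the token IDs in a /tokenize response body (a JSON string)."""
    return len(json.loads(response_body)["tokens"])


def fits_context(prompt_tokens, max_new_tokens, ctx_size=4096):
    """True if the prompt plus the requested completion fits in the context."""
    return prompt_tokens + max_new_tokens <= ctx_size


body = '{"tokens": [15496, 1917]}'  # the sample response shown above
n = count_tokens(body)
print(n)                        # 2
print(fits_context(n, 500))     # True
print(fits_context(4000, 500))  # False
```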

Server properties

curl https://your-http-pub.clorecloud.net/props

Response:

{
  "total_slots": 1,
  "chat_template": "...",
  "default_generation_settings": {...}
}

Build from source


# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA using CMake
# (current releases use GGML_CUDA; older ones used LLAMA_CUDA and `make LLAMA_CUDA=1`)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# The binaries, including llama-server, are written to build/bin/

Download models


# Llama 3.1 8B
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Mistral 7B
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

# Mixtral 8x7B
wget https://huggingface.co/bartowski/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1-Q4_K_M.gguf

# Phi-4
wget https://huggingface.co/bartowski/Phi-4-GGUF/resolve/main/Phi-4-Q4_K_M.gguf

# CodeLlama 7B
wget https://huggingface.co/bartowski/CodeLlama-7B-Instruct-GGUF/resolve/main/CodeLlama-7B-Instruct-Q4_K_M.gguf

Server options

Basic server

./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080

Full GPU offload

# Flags used below:
#   -ngl 99       GPU layers (99 = all)
#   -c 4096       context size
#   -t 8          CPU threads
#   --parallel 4  concurrent requests
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    -c 4096 \
    -t 8 \
    --parallel 4

All options

# Flags used below:
#   -m               model file
#   --host, --port   bind address and port
#   -ngl             GPU layers
#   -c               context size
#   -t               CPU threads
#   -b               batch size
#   --parallel       parallel requests
#   --mlock          lock the model in memory
#   --no-mmap        disable mmap
#   --cont-batching  continuous batching
#   --flash-attn     flash attention
#   --metrics        enable the metrics endpoint
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096 \
    -t 8 \
    -b 512 \
    --parallel 4 \
    --mlock \
    --no-mmap \
    --cont-batching \
    --flash-attn \
    --metrics
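
When the server is launched from a script, it can be convenient to assemble these flags from named settings rather than editing a long shell line. The sketch below builds an argument list from a subset of the options listed above. `build_server_args` is a hypothetical helper, and its defaults are illustrative rather than recommended values.

```python
import shlex


def build_server_args(model, host="0.0.0.0", port=8080, gpu_layers=99,
                      ctx_size=4096, parallel=1, metrics=False,
                      binary="./llama-server"):
    """Assemble a llama-server argument list from keyword settings."""
    args = [
        binary,
        "-m", model,
        "--host", host,
        "--port", str(port),
        "-ngl", str(gpu_layers),
        "-c", str(ctx_size),
        "--parallel", str(parallel),
    ]
    if metrics:
        args.append("--metrics")  # expose the /metrics endpoint
    return args


# Print the command as a single shell-safe line.
print(shlex.join(build_server_args("model.gguf", parallel=4, metrics=True)))
```

The returned list can be passed directly to `subprocess.Popen`, which avoids shell quoting issues with model paths that contain spaces.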

API usage

Chat Completions (OpenAI-compatible)

import openai

# Point the OpenAI client at the llama.cpp server instead of api.openai.com.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
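
If the full reply is needed after streaming (for logging or further processing), the deltas can be accumulated instead of only printed. The sketch below joins the text fragments of a stream shaped like the one above. `collect_stream` and `fake_chunk` are hypothetical helpers; the fake chunks stand in for a live stream so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None and empty deltas (e.g. role-only chunks)
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk(None), fake_chunk("Once "), fake_chunk("upon a time"),
        SimpleNamespace(choices=[])]
print(collect_stream(demo))  # Once upon a time
```

With a real server, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` to `collect_stream`.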

Text completion

response = client.completions.create(
    model="llama-3.1-8b",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.8
)

print(response.choices[0].text)

Embeddings

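# Requires a server started with embeddings enabled
# (the --embedding / --embeddings flag, depending on the build).
response = client.embeddings.create(
    model="llama-3.1-8b",
    input="Hello, world!"
)

print(f"Embedding: {response.data[0].embedding[:5]}...")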
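
Embeddings are typically compared with cosine similarity. As a minimal sketch in plain Python, the function below computes it for two vectors such as `response.data[0].embedding`; the short vectors in the example are made-up values, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 indicate texts the model considers similar; values near 0 indicate unrelated texts.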

cURL examples

Chat

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ]
    }'

Completion

curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Building a website requires",
        "n_predict": 128,
        "temperature": 0.7
    }'
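
The same native request can be sent from Python without extra dependencies. The sketch below builds the POST request with the standard library, using the same payload fields as the cURL call. `completion_request` and `complete` are hypothetical helper names, and reading the generated text from a `content` field in the response is an assumption about the response shape.

```python
import json
import urllib.request


def completion_request(base_url, prompt, n_predict=128, temperature=0.7):
    """Build a POST request for the native /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict,
               "temperature": temperature}
    return urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **kwargs):
    """Send the request and return the generated text."""
    req = completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body.get("content", "")  # assumed name of the text field


req = completion_request("http://localhost:8080", "Building a website requires")
print(req.get_method(), req.full_url)  # POST http://localhost:8080/completion
```

Calling `complete("http://localhost:8080", "Building a website requires")` performs the actual request once a server is running.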

Health check

curl http://localhost:8080/health

Metrics

# Available when the server was started with --metrics
curl http://localhost:8080/metrics
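
The metrics endpoint returns the Prometheus text format, which is easy to read from a script for quick checks. The sketch below parses simple `name value` lines into a dictionary. The metric names in `sample` are made up for illustration and are not the names the server actually emits, and the parser deliberately ignores timestamps and other advanced parts of the format.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into a {name: value} dict.

    Only handles simple 'name value' samples without timestamps.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser cannot read
    return metrics


# Hypothetical sample; real metric names come from the server.
sample = """\
# HELP example_tokens_total Hypothetical counter for illustration.
# TYPE example_tokens_total counter
example_tokens_total 1234
example_tokens_per_second 87.5
"""

print(parse_metrics(sample))
```

For continuous monitoring, a Prometheus server scraping the endpoint is the more robust option; this parser is only meant for ad hoc inspection.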

Multi-GPU


# Split the model across GPUs:
#   --tensor-split 0.5,0.5  share of the model placed on each of 2 GPUs
#   --main-gpu 0            primary GPU
./llama-server \
    -m model.gguf \
    -ngl 99 \
    --tensor-split 0.5,0.5 \
    --main-gpu 0
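
When the GPUs have different amounts of memory, an even split wastes capacity on the larger card. A simple starting point, sketched below, is to make each share proportional to that GPU's VRAM. `tensor_split` is a hypothetical helper, and proportional-to-VRAM is a heuristic assumption rather than a tuned setting.

```python
def tensor_split(vram_gb):
    """Return a --tensor-split value proportional to each GPU's VRAM."""
    total = sum(vram_gb)
    if total <= 0:
        raise ValueError("total VRAM must be positive")
    return ",".join(f"{v / total:.2f}" for v in vram_gb)


print(tensor_split([24, 24]))  # 0.50,0.50
print(tensor_split([24, 12]))  # 0.67,0.33
```

The resulting string can be passed as the value of `--tensor-split`.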

Memory optimization

For limited VRAM


# Partial offload: keep only 20 layers on the GPU and shrink the context
./llama-server -m model.gguf -ngl 20 -c 2048

# Or use a smaller quantization:
# download Q2_K or Q3_K instead of Q4_K
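
A rough way to choose the `-ngl` value for partial offload is to assume every layer takes an equal share of the model file and see how many fit in the available VRAM. The sketch below does that arithmetic. `layers_that_fit` is a hypothetical helper, the equal-share assumption is a simplification, and the `reserve_gb` default for the KV cache and buffers is an assumed figure.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit on the GPU (equal-size layers assumed)."""
    usable = max(vram_gb - reserve_gb, 0.0)
    per_layer = model_size_gb / n_layers
    return min(n_layers, int(usable / per_layer))


# A 4.1 GB file with 32 layers on a 4 GB card:
print(layers_that_fit(4.1, 32, 4))  # 19
# The same file on an 8 GB card fits entirely:
print(layers_that_fit(4.1, 32, 8))  # 32
```

Treat the estimate as a first guess: start there, watch nvidia-smi, and lower the value if the server runs out of memory.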

For maximum speed

./llama-server \
    -m model.gguf \
    -ngl 99 \
    --flash-attn \
    --cont-batching \
    --parallel 8 \
    -b 1024

Model-specific templates

Llama 2 Chat

./llama-server -m llama-2-7b-chat.gguf \
    --chat-template llama2

Mistral Instruct

./llama-server -m mistral-7b-instruct.gguf \
    --chat-template mistral

ChatML (many models)

./llama-server -m model.gguf \
    --chat-template chatml
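
To make clear what a chat template does, the sketch below renders a message list into the ChatML layout by hand. The server applies the template itself, so this code is not needed in practice; `render_chatml` is a hypothetical function for illustration, and whitespace details may differ from the server's own rendering.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style messages into a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Choosing the wrong template usually still produces output, but the model sees turn markers it was not trained on, which tends to degrade answers.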

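Python server wrapper

import subprocess
import time

import requests


class LlamaCppServer:
    """Start a local llama-server process and talk to it over HTTP."""

    def __init__(self, model_path, port=8080, gpu_layers=35):
        self.port = port
        self.base_url = f"http://localhost:{port}"
        self.process = subprocess.Popen([
            "./llama-server",
            "-m", model_path,
            "--host", "0.0.0.0",
            "--port", str(port),
            "-ngl", str(gpu_layers),
            "-c", "4096",
        ])
        try:
            self._wait_for_ready()
        except Exception:
            self.stop()  # do not leave an orphaned process behind
            raise

    def _wait_for_ready(self, timeout=60):
        """Poll /health until the model is loaded or the timeout expires."""
        start = time.time()
        while time.time() - start < timeout:
            if self.process.poll() is not None:
                raise RuntimeError("llama-server exited during startup")
            try:
                r = requests.get(f"{self.base_url}/health", timeout=5)
                if r.status_code == 200:
                    return
            except requests.RequestException:
                pass  # server is not accepting connections yet
            time.sleep(1)
        raise TimeoutError("Server didn't start")

    def chat(self, messages, **kwargs):
        """Send an OpenAI-style chat request and return the parsed JSON."""
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={"messages": messages, **kwargs},
            timeout=300,
        )
        response.raise_for_status()
        return response.json()

    def stop(self):
        """Terminate the server process and wait for it to exit."""
        self.process.terminate()
        self.process.wait()


# Usage
server = LlamaCppServer("llama-3.1-8b.gguf")
try:
    result = server.chat([{"role": "user", "content": "Hello!"}])
    print(result["choices"][0]["message"]["content"])
finally:
    server.stop()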

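Benchmarking


# Built-in benchmark
./llama-bench -m model.gguf -ngl 99

# The output table reports, for each test:
# - Model size and parameter count
# - Backend and number of GPU layers
# - Tokens per second for prompt processing (pp) and text generation (tg)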

Performance comparison

| Model | GPU | Quantization | Tokens/sec |
|-------|-----|--------------|------------|
| Llama 3.1 8B | RTX 3090 | Q4_K_M | ~100 |
| Llama 3.1 8B | RTX 4090 | Q4_K_M | ~150 |
| Llama 3.1 8B | RTX 3090 | Q4_K_M | ~60 |
| Mistral 7B | RTX 3090 | Q4_K_M | ~110 |
| Mixtral 8x7B | A100 | Q4_K_M | ~50 |

Troubleshooting

CUDA not detected


# Rebuild with CUDA (GGML_CUDA replaced LLAMA_CUDA in current releases)
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Verify that the driver sees the GPU
nvidia-smi

Out of memory


# Reduce GPU layers
-ngl 20  # Instead of 99

# Reduce the context
-c 2048  # Instead of 4096

# Use a smaller quantization
# Q4_K_S instead of Q4_K_M

Slow generation


# Increase the batch size
-b 1024

# Enable flash attention
--flash-attn

# Enable continuous batching
--cont-batching

Production setup

Systemd service


# /etc/systemd/system/llama.service
[Unit]
Description=Llama.cpp Server
After=network.target

[Service]
Type=simple
ExecStart=/opt/llama.cpp/llama-server -m /models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080
Restart=always

[Install]
WantedBy=multi-user.target

With nginx

upstream llama {
    server localhost:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;      # pass streamed tokens through as they arrive
        proxy_read_timeout 300s;  # allow long generations
    }
}

Cost estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU | Hourly rate | Daily rate | 4-hour session |
|-----|-------------|------------|----------------|
| RTX 3060 | ~$0.03 | ~$0.70 | ~$0.12 |
| RTX 3090 | ~$0.06 | ~$1.50 | ~$0.25 |
| RTX 4090 | ~$0.10 | ~$2.30 | ~$0.40 |
| A100 40GB | ~$0.17 | ~$4.00 | ~$0.70 |
| A100 80GB | ~$0.25 | ~$6.00 | ~$1.00 |

Prices vary by provider and demand. Check the CLORE.AI Marketplace for current rates.
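
Hourly rates become easier to compare across GPUs when combined with throughput. The sketch below computes the cost of a session and the cost per million generated tokens, using rates and tokens/sec figures quoted in this guide. It assumes a single request stream that keeps the GPU busy for the whole hour, which real workloads rarely achieve, and the function names are hypothetical.

```python
def session_cost(hourly_usd, hours):
    """Cost in USD of renting a GPU for the given number of hours."""
    return hourly_usd * hours


def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    """Cost in USD to generate one million tokens at a steady rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# RTX 4090 at ~$0.10/h for a 4-hour session
print(round(session_cost(0.10, 4), 2))               # 0.4
# RTX 4090 at ~$0.10/h and ~150 tokens/sec
print(round(cost_per_million_tokens(0.10, 150), 3))  # 0.185
# RTX 3090 at ~$0.06/h and ~100 tokens/sec
print(round(cost_per_million_tokens(0.06, 100), 3))  # 0.167
```

By this measure the cheaper card can be the better deal per token even though it is slower, as long as latency is not the main concern.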

Ways to save money:

- Use the Spot market for flexible workloads (often 30-50% cheaper)
- Pay with CLORE tokens
- Compare prices across providers

Next steps

- vLLM Inference - Higher throughput
- ExLlamaV2 - Faster inference
- Text Generation WebUI - Web interface

