# LocalAI

Run a self-hosted OpenAI-compatible API with LocalAI.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum          | Recommended |
| ------------ | ---------------- | ----------- |
| RAM          | 8GB              | 16GB+       |
| VRAM         | 6GB              | 8GB+        |
| Network      | 200Mbps          | 500Mbps+    |
| Startup Time | **5-10 minutes** | -           |

{% hint style="warning" %}
**Important:** LocalAI takes 5-10 minutes to fully initialize on first startup. HTTP 502 during this time is normal - the service is downloading and loading models.
{% endhint %}

{% hint style="info" %}
LocalAI itself is lightweight; the models are not. For running LLMs (7B+ models), choose servers with 16GB+ RAM and 8GB+ VRAM.
{% endhint %}

## What is LocalAI?

LocalAI provides:

* Drop-in OpenAI API replacement
* Support for multiple model formats
* Text, image, audio, and embedding generation
* No GPU required (but faster with GPU)

## Supported Models

| Type       | Formats     | Examples            |
| ---------- | ----------- | ------------------- |
| LLM        | GGUF, GGML  | Llama, Mistral, Phi |
| Embeddings | GGUF        | all-MiniLM, BGE     |
| Images     | Diffusers   | SD 1.5, SDXL        |
| Audio      | Whisper     | Speech-to-text      |
| TTS        | Piper, Bark | Text-to-speech      |

## Quick Deploy

**Docker Image:**

```
localai/localai:master-aio-gpu-nvidia-cuda-12
```

**Ports:**

```
22/tcp
8080/http
```

**No command needed** - server starts automatically.

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/readyz

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

# Get version
curl https://your-http-pub.clorecloud.net/version
```

{% hint style="warning" %}
If you get HTTP 502, wait 5-10 minutes - LocalAI takes longer to initialize than other services.
{% endhint %}
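Rather than retrying by hand, a short script can poll `/readyz` until the service answers. A minimal sketch using only the standard library (the `wait_ready` helper is illustrative, not part of LocalAI):

```python
import time
import urllib.error
import urllib.request

def wait_ready(base_url: str, timeout: float = 600, interval: float = 10) -> bool:
    """Poll LocalAI's /readyz endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/readyz", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service still starting; 502s and refused connections land here
        time.sleep(interval)
    return False

# Example (replace with your http_pub URL from My Orders):
# wait_ready("https://your-http-pub.clorecloud.net")
```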

## Pre-Built Models

LocalAI ships with several models ready to use out of the box:

| Model Name                 | Type       | Description             |
| -------------------------- | ---------- | ----------------------- |
| `gpt-4`                    | Chat       | General-purpose LLM     |
| `gpt-4o`                   | Chat       | General-purpose LLM     |
| `gpt-4o-mini`              | Chat       | Smaller, faster LLM     |
| `whisper-1`                | STT        | Speech-to-text          |
| `tts-1`                    | TTS        | Text-to-speech          |
| `text-embedding-ada-002`   | Embeddings | 384-dimensional vectors |
| `jina-reranker-v1-base-en` | Reranking  | Document reranking      |

{% hint style="info" %}
These models work immediately after startup without additional configuration.
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access LocalAI via the `http_pub` URL:

```bash
# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

{% hint style="info" %}
All `localhost:8080` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Docker Deploy (Alternative)

```bash
docker run -d \
    --gpus all \
    -p 8080:8080 \
    -v /workspace/models:/models \
    -e THREADS=4 \
    -e CONTEXT_SIZE=4096 \
    localai/localai:master-aio-gpu-nvidia-cuda-12
```

## Download Models

### From Model Gallery

LocalAI has a built-in model gallery:

```bash
# List available models
curl http://localhost:8080/models/available

# Install from gallery
curl http://localhost:8080/models/apply -d '{"id": "mistral-7b-instruct"}'
```

### From Hugging Face

```bash
mkdir -p /workspace/models

# Llama 3.1 8B GGUF
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -O /workspace/models/llama-3.1-8b.gguf

# Mistral 7B GGUF
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    -O /workspace/models/mistral-7b.gguf
```

## Model Configuration

Create YAML config for each model:

**models/llama-3.1-8b.yaml:**

```yaml
name: llama-3.1-8b
backend: llama-cpp
parameters:
  model: llama-3.1-8b.gguf
  context_size: 4096
  threads: 8
  gpu_layers: 35
template:
  chat: |
    {{.Input}}
    ### Response:
  completion: |
    {{.Input}}
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
import openai

# For external access, use your http_pub URL:
client = openai.OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or via SSH tunnel:
# client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Embeddings

```python
response = client.embeddings.create(
    model="all-minilm",
    input="The quick brown fox jumps over the lazy dog"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
```
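Embeddings are usually compared with cosine similarity. A minimal, dependency-free sketch for scoring two vectors returned by the endpoint above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g. score two embeddings returned by client.embeddings.create(...):
# similarity = cosine_similarity(emb_a, emb_b)
```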

### Image Generation

```python
response = client.images.generate(
    model="stablediffusion",
    prompt="a beautiful sunset over mountains",
    size="512x512",
    n=1
)

image_url = response.data[0].url
```

## cURL Examples

### Chat

```bash
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistral-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

### Embeddings

```bash
curl https://your-http-pub.clorecloud.net/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "text-embedding-ada-002",
        "input": "Your text here"
    }'
```

Response:

```json
{
  "data": [{"embedding": [0.1, -0.2, ...], "index": 0}],
  "model": "text-embedding-ada-002",
  "usage": {"prompt_tokens": 4, "total_tokens": 4}
}
```

### Text-to-Speech (TTS)

```bash
curl https://your-http-pub.clorecloud.net/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tts-1",
        "input": "Hello, welcome to LocalAI!",
        "voice": "alloy"
    }' \
    --output speech.wav
```

Available voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
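The same request can be made from Python with the `client` created in the API Usage section. A minimal sketch, assuming the SDK's binary response exposes raw bytes via `.content` (the `synthesize` helper and its voice check are illustrative, not part of LocalAI):

```python
# Reuses the `client` from the API Usage section above.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def synthesize(client, text, voice="alloy", path="speech.wav"):
    """Generate speech via /v1/audio/speech and save the audio bytes to a file."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {sorted(VOICES)}")
    resp = client.audio.speech.create(model="tts-1", voice=voice, input=text)
    with open(path, "wb") as f:
        f.write(resp.content)  # raw audio bytes (WAV by default)
    return path

# synthesize(client, "Hello, welcome to LocalAI!", voice="nova")
```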

### Speech-to-Text (STT)

```bash
curl https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
    -F "file=@audio.mp3" \
    -F "model=whisper-1"
```

Response:

```json
{"text": "Transcribed text here..."}
```

### Reranking

Rerank documents by relevance to a query:

```bash
curl https://your-http-pub.clorecloud.net/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "jina-reranker-v1-base-en",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of AI",
            "The weather is nice today",
            "Deep learning uses neural networks"
        ],
        "top_n": 2
    }'
```

Response:

```json
{
  "results": [
    {"index": 0, "relevance_score": 0.95},
    {"index": 2, "relevance_score": 0.82}
  ]
}
```
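The response lists only indexes and scores, so the caller maps them back to the original document texts. A minimal sketch of that mapping (the `top_documents` helper is illustrative):

```python
def top_documents(documents, results):
    """Map reranker results (index + relevance_score) back to document
    texts, highest score first."""
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ordered]

docs = [
    "Machine learning is a subset of AI",
    "The weather is nice today",
    "Deep learning uses neural networks",
]
results = [{"index": 0, "relevance_score": 0.95},
           {"index": 2, "relevance_score": 0.82}]
best = top_documents(docs, results)
print(best[0])  # ('Machine learning is a subset of AI', 0.95)
```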

## Complete API Reference

### Standard Endpoints (OpenAI Compatible)

| Endpoint                   | Method | Description           |
| -------------------------- | ------ | --------------------- |
| `/v1/models`               | GET    | List available models |
| `/v1/chat/completions`     | POST   | Chat completion       |
| `/v1/completions`          | POST   | Text completion       |
| `/v1/embeddings`           | POST   | Generate embeddings   |
| `/v1/audio/speech`         | POST   | Text-to-speech        |
| `/v1/audio/transcriptions` | POST   | Speech-to-text        |
| `/v1/images/generations`   | POST   | Image generation      |

### Additional Endpoints

| Endpoint            | Method | Description                |
| ------------------- | ------ | -------------------------- |
| `/readyz`           | GET    | Readiness check            |
| `/healthz`          | GET    | Health check               |
| `/version`          | GET    | Get LocalAI version        |
| `/v1/rerank`        | POST   | Document reranking         |
| `/models/available` | GET    | List gallery models        |
| `/models/apply`     | POST   | Install model from gallery |
| `/swagger/`         | GET    | Swagger UI documentation   |
| `/metrics`          | GET    | Prometheus metrics         |

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/version
```

Response:

```json
{"version": "v2.26.0"}
```

#### Swagger Documentation

Open in browser for interactive API documentation:

```
https://your-http-pub.clorecloud.net/swagger/
```

## GPU Acceleration

### CUDA Backend

```yaml
# In model config
parameters:
  gpu_layers: 35  # Number of layers on GPU
  f16: true       # Use FP16
```

### Full GPU Offload

```yaml
parameters:
  gpu_layers: 99  # All layers on GPU
  main_gpu: 0     # Primary GPU ID
```
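How many layers fit on the GPU depends on the model file size and available VRAM. A rough, unofficial heuristic for picking a starting `gpu_layers` value (the `max_gpu_layers` helper and its numbers are assumptions; adjust empirically):

```python
def max_gpu_layers(model_file_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough heuristic: how many layers fit in VRAM, keeping reserve_gb
    free for the KV cache and CUDA overhead. Not an official formula;
    treat the result as a starting point and tune from there."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# e.g. a ~4.9 GB Q4 8B model (32 layers) on an 8 GB card:
print(max_gpu_layers(4.9, 32, 8))  # 32 -> full offload
print(max_gpu_layers(4.9, 32, 4))  # partial offload on a 4 GB card
```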

## Multiple Models

LocalAI can serve multiple models simultaneously:

```
models/
├── llama-3.1-8b.yaml
├── llama-3.1-8b.gguf
├── mistral-7b.yaml
├── mistral-7b.gguf
├── whisper.yaml
└── whisper-base.bin
```

Each model is addressed by the `name` from its YAML config in API calls.

## Performance Tuning

### For Speed

```yaml
parameters:
  threads: 8
  gpu_layers: 99
  batch_size: 512
  use_mmap: true
  use_mlock: true
```

### For Memory

```yaml
parameters:
  gpu_layers: 20  # Partial offload
  context_size: 2048  # Smaller context
  batch_size: 256
```

## Benchmarks

| Model           | GPU      | Tokens/sec |
| --------------- | -------- | ---------- |
| Llama 3.1 8B Q4 | RTX 3090 | \~100      |
| Mistral 7B Q4   | RTX 3090 | \~110      |
| Llama 3.1 8B Q4 | RTX 4090 | \~140      |
| Mixtral 8x7B Q4 | A100     | \~60       |

*Benchmarks updated January 2026.*
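These throughput figures translate directly into response latency; for example, a 500-token reply at \~100 tokens/sec takes roughly five seconds:

```python
def estimated_latency(tokens, tokens_per_sec):
    """Approximate generation time, in seconds, for a response of a given length."""
    return tokens / tokens_per_sec

# 500-token response on an RTX 3090 (~100 tok/s):
print(estimated_latency(500, 100))  # 5.0
```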

## Troubleshooting

### HTTP 502 on http\_pub URL

LocalAI takes longer to start than other services. Wait **5-10 minutes** and retry:

```bash
# Check readiness
curl https://your-http-pub.clorecloud.net/readyz

# Check health
curl https://your-http-pub.clorecloud.net/healthz
```

### Model Not Loading

* Check file path in YAML
* Verify GGUF format compatibility
* Check available VRAM

### Slow Responses

* Increase `gpu_layers`
* Enable `use_mmap`
* Reduce `context_size`

### Out of Memory

* Reduce `gpu_layers`
* Use smaller quantization (Q4 instead of Q8)
* Reduce batch size

### Image Generation Issues

{% hint style="warning" %}
Stable Diffusion may have CUDA compatibility issues on some GPU configurations. If you encounter CUDA errors with image generation, consider using a dedicated Stable Diffusion image instead.
{% endhint %}

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For       |
| -------- | ---- | ---------- | -------------- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models      |
| RTX 3090 | 24GB | $0.30–1.00 | 13B models     |
| RTX 4090 | 24GB | $0.50–2.00 | Fast inference |
| A100     | 40GB | $1.50–3.00 | Large models   |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

## Next Steps

* [vLLM Inference](https://docs.clore.ai/guides/language-models/vllm) - Higher throughput
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) - Simpler setup
* [RAG with LangChain](https://docs.clore.ai/guides/advanced/api-integration) - Build applications
