LocalAI

Run a self-hosted OpenAI-compatible API with LocalAI.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8GB | 16GB+ |
| VRAM | 6GB | 8GB+ |
| Network | 200Mbps | 500Mbps+ |
| Startup Time | 5-10 minutes | - |


LocalAI itself is lightweight, but to run LLMs (7B+ models), choose servers with 16GB+ RAM and 8GB+ VRAM.

What is LocalAI?

LocalAI provides:

  • Drop-in OpenAI API replacement

  • Support for multiple model formats

  • Text, image, audio, and embedding generation

  • No GPU required (but faster with GPU)

Supported Models

| Type | Formats | Examples |
| --- | --- | --- |
| LLM | GGUF, GGML | Llama, Mistral, Phi |
| Embeddings | GGUF | all-MiniLM, BGE |
| Images | Diffusers | SD 1.5, SDXL |
| Audio | Whisper | Speech-to-text |
| TTS | Piper, Bark | Text-to-speech |

Quick Deploy

Docker Image:

Ports:

No command is needed; the server starts automatically.
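For reference, LocalAI publishes all-in-one (AIO) images that bundle the pre-built models listed in this guide. A typical pull might look like this (the exact image tag is an assumption; check the LocalAI registry for current tags):

```shell
# All-in-one CPU image (tag assumed; verify before use)
docker pull localai/localai:latest-aio-cpu

# LocalAI listens on port 8080 by default
```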

Verify It's Working

After deployment, find your http_pub URL in My Orders and test:
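A minimal check using the readiness and model-listing endpoints from the API reference below, assuming the default port 8080 (replace localhost:8080 with your http_pub URL for external access):

```shell
# Readiness probe - returns 200 once models are loaded
curl -i http://localhost:8080/readyz

# List the models the server exposes
curl http://localhost:8080/v1/models
```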


Pre-Built Models

LocalAI comes with several models available out of the box:

| Model Name | Type | Description |
| --- | --- | --- |
| gpt-4 | Chat | General-purpose LLM |
| gpt-4o | Chat | General-purpose LLM |
| gpt-4o-mini | Chat | Smaller, faster LLM |
| whisper-1 | STT | Speech-to-text |
| tts-1 | TTS | Text-to-speech |
| text-embedding-ada-002 | Embeddings | 384-dimensional vectors |
| jina-reranker-v1-base-en | Reranking | Document reranking |


These models work immediately after startup without additional configuration.

Accessing Your Service

When deployed on CLORE.AI, access LocalAI via the http_pub URL:


All localhost:8080 examples below work when connected via SSH. For external access, replace with your https://your-http-pub.clorecloud.net/ URL.

Docker Deploy (Alternative)
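If you prefer to run the container yourself, a sketch of a typical invocation (the image tag and the container-side models path are assumptions; check the LocalAI docs for your version):

```shell
# GPU variant; mount a local models directory so downloads persist
docker run -d --name local-ai \
  --gpus all \
  -p 8080:8080 \
  -v $PWD/models:/build/models \
  localai/localai:latest-aio-gpu-nvidia-cuda-12
```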

Download Models

LocalAI has a built-in model gallery:
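The gallery is driven by the /models/available and /models/apply endpoints listed in the API reference; a sketch (the model id is an example):

```shell
# Browse models available in the gallery
curl http://localhost:8080/models/available

# Install a model from the gallery (id is illustrative)
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "phi-2"}'
```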

From Hugging Face
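Recent LocalAI builds also accept Hugging Face model URIs directly via the CLI; a hedged sketch (the repo and file names are examples, and the URI form may differ between versions):

```shell
# Pull and run a GGUF file straight from Hugging Face
local-ai run huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```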

Model Configuration

Create YAML config for each model:

models/llama-3.1-8b.yaml:
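A minimal sketch of such a config (field names follow LocalAI's model config schema; all values are illustrative):

```yaml
# models/llama-3.1-8b.yaml (illustrative values)
name: llama-3.1-8b
parameters:
  model: llama-3.1-8b-q4_k_m.gguf   # GGUF file placed in the models directory
context_size: 4096
f16: true
gpu_layers: 35        # layers offloaded to the GPU
```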

API Usage

Chat Completions (OpenAI Compatible)
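Because the API is OpenAI-compatible, the official openai Python SDK works by pointing base_url at your instance; a sketch (URL and key are placeholders, and LocalAI ignores the key's value):

```python
from openai import OpenAI

# The SDK requires a non-empty API key even though LocalAI does not check it
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="gpt-4",  # one of the pre-built model names
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```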

Streaming
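Streaming uses the same endpoint with stream=True, yielding incremental deltas instead of one final message; a sketch under the same placeholder assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial delta; content can be None on some chunks
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```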

Embeddings
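Embedding requests follow the OpenAI embeddings API; a sketch using the pre-built embedding model:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input="LocalAI runs models on your own hardware.",
)
vector = resp.data[0].embedding
print(len(vector))  # dimension depends on the backing model
```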

Image Generation
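Image generation goes through /v1/images/generations; a sketch (the model name assumes you have installed a diffusers model under that name):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

img = client.images.generate(
    model="stablediffusion",  # assumed model name; match your installed config
    prompt="a lighthouse at sunset, oil painting",
    size="512x512",
)
print(img.data[0].url)
```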

cURL Examples

Chat
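A minimal chat request in the standard OpenAI wire format:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```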

Embeddings
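The same pattern for embeddings:

```shell
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "Hello world"
  }'
```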

Response:
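The response follows the OpenAI embeddings schema; an illustrative, truncated shape:

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0123, -0.0456, ...]
    }
  ],
  "model": "text-embedding-ada-002"
}
```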

Text-to-Speech (TTS)
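A sketch against the /v1/audio/speech endpoint from the reference table (the output filename is arbitrary):

```shell
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello from LocalAI", "voice": "alloy"}' \
  --output speech.wav
```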

Available voices: alloy, echo, fable, onyx, nova, shimmer

Speech-to-Text (STT)
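Transcription uses a multipart upload, as in the OpenAI transcription API (the audio file name is an example):

```shell
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@recording.wav \
  -F model=whisper-1
```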

Response:
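An illustrative response shape:

```json
{"text": "transcribed speech goes here"}
```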

Reranking

Rerank documents by relevance to a query:
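LocalAI's /v1/rerank endpoint accepts a Jina/Cohere-style rerank request; a sketch with example documents:

```shell
curl http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jina-reranker-v1-base-en",
    "query": "What is LocalAI?",
    "documents": [
      "LocalAI is a self-hosted OpenAI-compatible API.",
      "Bananas are rich in potassium."
    ]
  }'
```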

Response:
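An illustrative response shape (scores are made-up placeholders):

```json
{
  "model": "jina-reranker-v1-base-en",
  "results": [
    {"index": 0, "relevance_score": 0.98},
    {"index": 1, "relevance_score": 0.02}
  ]
}
```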

Complete API Reference

Standard Endpoints (OpenAI Compatible)

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completion |
| /v1/completions | POST | Text completion |
| /v1/embeddings | POST | Generate embeddings |
| /v1/audio/speech | POST | Text-to-speech |
| /v1/audio/transcriptions | POST | Speech-to-text |
| /v1/images/generations | POST | Image generation |

Additional Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /readyz | GET | Readiness check |
| /healthz | GET | Health check |
| /version | GET | Get LocalAI version |
| /v1/rerank | POST | Document reranking |
| /models/available | GET | List gallery models |
| /models/apply | POST | Install model from gallery |
| /swagger/ | GET | Swagger UI documentation |
| /metrics | GET | Prometheus metrics |

Get Version
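The version endpoint needs no parameters:

```shell
curl http://localhost:8080/version
```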

Response:
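An illustrative response shape (the version string is a placeholder):

```json
{"version": "v2.x.x"}
```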

Swagger Documentation

Open in browser for interactive API documentation:
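The UI lives under the /swagger/ path listed in the endpoint table:

```
https://your-http-pub.clorecloud.net/swagger/
```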

GPU Acceleration

CUDA Backend
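A sketch of running a CUDA build of the image (the tag and container-side models path are assumptions; check the LocalAI registry):

```shell
docker run -d --gpus all -p 8080:8080 \
  -v $PWD/models:/build/models \
  localai/localai:latest-gpu-nvidia-cuda-12
```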

Full GPU Offload
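In the model's YAML config, setting gpu_layers at or above the model's layer count offloads everything; an illustrative fragment:

```yaml
gpu_layers: 99   # any value >= the model's layer count offloads all layers
f16: true
```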

Multiple Models

LocalAI can serve multiple models simultaneously:
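One YAML file per model in the models directory is enough; each config's name becomes a model id (file names below are examples):

```
models/
├── llama-3.1-8b.yaml
├── mistral-7b.yaml
└── all-minilm.yaml
```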

Access each via model name in API calls.

Performance Tuning

For Speed
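Illustrative knobs in the model YAML, using the field names referenced in Troubleshooting below:

```yaml
gpu_layers: 99      # offload as many layers as possible
use_mmap: true      # memory-map model weights
context_size: 2048  # smaller context speeds up prompt processing
```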

For Memory
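The same fields tuned the other way, trading speed for a smaller footprint:

```yaml
gpu_layers: 20      # fewer layers on the GPU
context_size: 2048  # smaller KV cache
# also prefer Q4 quantizations over Q8 for smaller model files
```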

Benchmarks

| Model | GPU | Tokens/sec |
| --- | --- | --- |
| Llama 3.1 8B Q4 | RTX 3090 | ~100 |
| Mistral 7B Q4 | RTX 3090 | ~110 |
| Llama 3.1 8B Q4 | RTX 4090 | ~140 |
| Mixtral 8x7B Q4 | A100 | ~60 |

Benchmarks updated January 2026.

Troubleshooting

HTTP 502 on http_pub URL

LocalAI takes longer to start than other services. Wait 5-10 minutes and retry:
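A simple polling loop against the readiness endpoint (replace localhost:8080 with your http_pub URL for external access):

```shell
# Poll until the server reports ready (can take 5-10 minutes)
until curl -fsS http://localhost:8080/readyz >/dev/null 2>&1; do
  echo "still starting..."
  sleep 15
done
echo "LocalAI is ready"
```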

Model Not Loading

  • Check file path in YAML

  • Verify GGUF format compatibility

  • Check available VRAM

Slow Responses

  • Increase gpu_layers

  • Enable use_mmap

  • Reduce context_size

Out of Memory

  • Reduce gpu_layers

  • Use smaller quantization (Q4 instead of Q8)

  • Reduce batch size

Image Generation Issues


Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU | CLORE/day | Approx USD/hr | Good For |
| --- | --- | --- | --- |
| RTX 3060 | ~80 | ~$0.02 | 7B models |
| RTX 3090 | ~150 | ~$0.03 | 13B models |
| RTX 4090 | ~200 | ~$0.04 | Fast inference |
| A100 40GB | ~400 | ~$0.08 | Large models |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates. Pay with CLORE tokens for best value.

Next Steps
