# LocalAI

Run a self-hosted OpenAI-compatible API with LocalAI.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum          | Recommended |
| ------------ | ---------------- | ----------- |
| RAM          | 8GB              | 16GB+       |
| VRAM         | 6GB              | 8GB+        |
| Network      | 200Mbps          | 500Mbps+    |
| Startup Time | **5-10 minutes** | -           |

{% hint style="warning" %}
**Important:** LocalAI takes 5-10 minutes to fully initialize on first startup. HTTP 502 during this time is normal - the service is downloading and loading models.
{% endhint %}

{% hint style="info" %}
LocalAI itself is lightweight; the models are not. For running LLMs (7B+ models), choose servers with 16GB+ RAM and 8GB+ VRAM.
{% endhint %}

## What is LocalAI?

LocalAI provides:

* Drop-in OpenAI API replacement
* Support for multiple model formats
* Text, image, audio, and embedding generation
* No GPU required (but faster with GPU)

## Supported Models

| Type       | Formats     | Examples            |
| ---------- | ----------- | ------------------- |
| LLM        | GGUF, GGML  | Llama, Mistral, Phi |
| Embeddings | GGUF        | all-MiniLM, BGE     |
| Images     | Diffusers   | SD 1.5, SDXL        |
| Audio      | Whisper     | Speech-to-text      |
| TTS        | Piper, Bark | Text-to-speech      |

## Quick Deploy

**Docker Image:**

```
localai/localai:master-aio-gpu-nvidia-cuda-12
```

**Ports:**

```
22/tcp
8080/http
```

**No command needed** - server starts automatically.

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/readyz

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

# Get version
curl https://your-http-pub.clorecloud.net/version
```

{% hint style="warning" %}
If you get HTTP 502, wait 5-10 minutes - LocalAI takes longer to initialize than other services.
{% endhint %}
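Rather than retrying by hand, a short script can poll `/readyz` until the service answers. A minimal sketch using only the standard library (the `wait_ready` helper is illustrative, not part of LocalAI):

```python
import time
import urllib.error
import urllib.request

def wait_ready(base_url: str, timeout: float = 600, interval: float = 10) -> bool:
    """Poll LocalAI's /readyz endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/readyz", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service still starting; 502s and refused connections land here
        time.sleep(interval)
    return False

# Example (replace with your http_pub URL from My Orders):
# wait_ready("https://your-http-pub.clorecloud.net")
```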

## Pre-Built Models

LocalAI ships with several models ready to use out of the box:

| Model Name                 | Type       | Description             |
| -------------------------- | ---------- | ----------------------- |
| `gpt-4`                    | Chat       | General-purpose LLM     |
| `gpt-4o`                   | Chat       | General-purpose LLM     |
| `gpt-4o-mini`              | Chat       | Smaller, faster LLM     |
| `whisper-1`                | STT        | Speech-to-text          |
| `tts-1`                    | TTS        | Text-to-speech          |
| `text-embedding-ada-002`   | Embeddings | 384-dimensional vectors |
| `jina-reranker-v1-base-en` | Reranking  | Document reranking      |

{% hint style="info" %}
These models work immediately after startup without additional configuration.
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access LocalAI via the `http_pub` URL:

```bash
# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

{% hint style="info" %}
All `localhost:8080` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Docker Deploy (Alternative)

```bash
docker run -d \
    --gpus all \
    -p 8080:8080 \
    -v /workspace/models:/models \
    -e THREADS=4 \
    -e CONTEXT_SIZE=4096 \
    localai/localai:master-aio-gpu-nvidia-cuda-12
```

## Download Models

### From Model Gallery

LocalAI has a built-in model gallery:

```bash
# List available models
curl http://localhost:8080/models/available

# Install from gallery
curl http://localhost:8080/models/apply -d '{"id": "mistral-7b-instruct"}'
```

### From Hugging Face

```bash
mkdir -p /workspace/models

# Llama 3.1 8B GGUF
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -O /workspace/models/llama-3.1-8b.gguf

# Mistral 7B GGUF
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    -O /workspace/models/mistral-7b.gguf
```

## Model Configuration

Create YAML config for each model:

**models/llama-3.1-8b.yaml:**

```yaml
name: llama-3.1-8b
backend: llama-cpp
parameters:
  model: llama-3.1-8b.gguf
  context_size: 4096
  threads: 8
  gpu_layers: 35
template:
  chat: |
    {{.Input}}
    ### Response:
  completion: |
    {{.Input}}
```

## API Usage

### Chat Completions (OpenAI Compatible)

```python
import openai

# For external access, use your http_pub URL:
client = openai.OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or via SSH tunnel:
# client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Embeddings

```python
response = client.embeddings.create(
    model="all-minilm",
    input="The quick brown fox jumps over the lazy dog"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
```
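Embeddings are usually compared with cosine similarity. A minimal, dependency-free sketch for scoring two vectors returned by the endpoint above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g. score two embeddings returned by client.embeddings.create(...):
# similarity = cosine_similarity(emb_a, emb_b)
```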

### Image Generation

```python
response = client.images.generate(
    model="stablediffusion",
    prompt="a beautiful sunset over mountains",
    size="512x512",
    n=1
)

image_url = response.data[0].url
```

## cURL Examples

### Chat

```bash
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistral-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

### Embeddings

```bash
curl https://your-http-pub.clorecloud.net/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "text-embedding-ada-002",
        "input": "Your text here"
    }'
```

Response:

```json
{
  "data": [{"embedding": [0.1, -0.2, ...], "index": 0}],
  "model": "text-embedding-ada-002",
  "usage": {"prompt_tokens": 4, "total_tokens": 4}
}
```

### Text-to-Speech (TTS)

```bash
curl https://your-http-pub.clorecloud.net/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tts-1",
        "input": "Hello, welcome to LocalAI!",
        "voice": "alloy"
    }' \
    --output speech.wav
```

Available voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
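The same request can be made from Python with the `client` created in the API Usage section. A minimal sketch, assuming the SDK's binary response exposes raw bytes via `.content` (the `synthesize` helper and its voice check are illustrative, not part of LocalAI):

```python
# Reuses the `client` from the API Usage section above.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def synthesize(client, text, voice="alloy", path="speech.wav"):
    """Generate speech via /v1/audio/speech and save the audio bytes to a file."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {sorted(VOICES)}")
    resp = client.audio.speech.create(model="tts-1", voice=voice, input=text)
    with open(path, "wb") as f:
        f.write(resp.content)  # raw audio bytes (WAV by default)
    return path

# synthesize(client, "Hello, welcome to LocalAI!", voice="nova")
```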

### Speech-to-Text (STT)

```bash
curl https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
    -F "file=@audio.mp3" \
    -F "model=whisper-1"
```

Response:

```json
{"text": "Transcribed text here..."}
```

### Reranking

Rerank documents by relevance to a query:

```bash
curl https://your-http-pub.clorecloud.net/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "jina-reranker-v1-base-en",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of AI",
            "The weather is nice today",
            "Deep learning uses neural networks"
        ],
        "top_n": 2
    }'
```

Response:

```json
{
  "results": [
    {"index": 0, "relevance_score": 0.95},
    {"index": 2, "relevance_score": 0.82}
  ]
}
```
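The response lists only indexes and scores, so the caller maps them back to the original document texts. A minimal sketch of that mapping (the `top_documents` helper is illustrative):

```python
def top_documents(documents, results):
    """Map reranker results (index + relevance_score) back to document
    texts, highest score first."""
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ordered]

docs = [
    "Machine learning is a subset of AI",
    "The weather is nice today",
    "Deep learning uses neural networks",
]
results = [{"index": 0, "relevance_score": 0.95},
           {"index": 2, "relevance_score": 0.82}]
best = top_documents(docs, results)
print(best[0])  # ('Machine learning is a subset of AI', 0.95)
```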

## Complete API Reference

### Standard Endpoints (OpenAI Compatible)

| Endpoint                   | Method | Description           |
| -------------------------- | ------ | --------------------- |
| `/v1/models`               | GET    | List available models |
| `/v1/chat/completions`     | POST   | Chat completion       |
| `/v1/completions`          | POST   | Text completion       |
| `/v1/embeddings`           | POST   | Generate embeddings   |
| `/v1/audio/speech`         | POST   | Text-to-speech        |
| `/v1/audio/transcriptions` | POST   | Speech-to-text        |
| `/v1/images/generations`   | POST   | Image generation      |

### Additional Endpoints

| Endpoint            | Method | Description                |
| ------------------- | ------ | -------------------------- |
| `/readyz`           | GET    | Readiness check            |
| `/healthz`          | GET    | Health check               |
| `/version`          | GET    | Get LocalAI version        |
| `/v1/rerank`        | POST   | Document reranking         |
| `/models/available` | GET    | List gallery models        |
| `/models/apply`     | POST   | Install model from gallery |
| `/swagger/`         | GET    | Swagger UI documentation   |
| `/metrics`          | GET    | Prometheus metrics         |

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/version
```

Response:

```json
{"version": "v2.26.0"}
```

#### Swagger Documentation

Open in browser for interactive API documentation:

```
https://your-http-pub.clorecloud.net/swagger/
```

## GPU Acceleration

### CUDA Backend

```yaml
# In model config
parameters:
  gpu_layers: 35  # Number of layers on GPU
  f16: true       # Use FP16
```

### Full GPU Offload

```yaml
parameters:
  gpu_layers: 99  # All layers on GPU
  main_gpu: 0     # Primary GPU ID
```
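How many layers fit on the GPU depends on the model file size and available VRAM. A rough, unofficial heuristic for picking a starting `gpu_layers` value (the `max_gpu_layers` helper and its numbers are assumptions; adjust empirically):

```python
def max_gpu_layers(model_file_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough heuristic: how many layers fit in VRAM, keeping reserve_gb
    free for the KV cache and CUDA overhead. Not an official formula;
    treat the result as a starting point and tune from there."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# e.g. a ~4.9 GB Q4 8B model (32 layers) on an 8 GB card:
print(max_gpu_layers(4.9, 32, 8))  # 32 -> full offload
print(max_gpu_layers(4.9, 32, 4))  # partial offload on a 4 GB card
```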

## Multiple Models

LocalAI can serve multiple models simultaneously:

```
models/
├── llama-3.1-8b.yaml
├── llama-3.1-8b.gguf
├── mistral-7b.yaml
├── mistral-7b.gguf
├── whisper.yaml
└── whisper-base.bin
```

Each model is addressed by the `name` from its YAML config in API calls.

## Performance Tuning

### For Speed

```yaml
parameters:
  threads: 8
  gpu_layers: 99
  batch_size: 512
  use_mmap: true
  use_mlock: true
```

### For Memory

```yaml
parameters:
  gpu_layers: 20  # Partial offload
  context_size: 2048  # Smaller context
  batch_size: 256
```

## Benchmarks

| Model           | GPU      | Tokens/sec |
| --------------- | -------- | ---------- |
| Llama 3.1 8B Q4 | RTX 3090 | \~100      |
| Mistral 7B Q4   | RTX 3090 | \~110      |
| Llama 3.1 8B Q4 | RTX 4090 | \~140      |
| Mixtral 8x7B Q4 | A100     | \~60       |

*Benchmarks updated January 2026.*
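These throughput figures translate directly into response latency; for example, a 500-token reply at \~100 tokens/sec takes roughly five seconds:

```python
def estimated_latency(tokens, tokens_per_sec):
    """Approximate generation time, in seconds, for a response of a given length."""
    return tokens / tokens_per_sec

# 500-token response on an RTX 3090 (~100 tok/s):
print(estimated_latency(500, 100))  # 5.0
```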

## Troubleshooting

### HTTP 502 on http\_pub URL

LocalAI takes longer to start than other services. Wait **5-10 minutes** and retry:

```bash
# Check readiness
curl https://your-http-pub.clorecloud.net/readyz

# Check health
curl https://your-http-pub.clorecloud.net/healthz
```

### Model Not Loading

* Check file path in YAML
* Verify GGUF format compatibility
* Check available VRAM

### Slow Responses

* Increase `gpu_layers`
* Enable `use_mmap`
* Reduce `context_size`

### Out of Memory

* Reduce `gpu_layers`
* Use smaller quantization (Q4 instead of Q8)
* Reduce batch size

### Image Generation Issues

{% hint style="warning" %}
Stable Diffusion may have CUDA compatibility issues on some GPU configurations. If you encounter CUDA errors with image generation, consider using a dedicated Stable Diffusion image instead.
{% endhint %}

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For       |
| -------- | ---- | ---------- | -------------- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models      |
| RTX 3090 | 24GB | $0.30–1.00 | 13B models     |
| RTX 4090 | 24GB | $0.50–2.00 | Fast inference |
| A100     | 40GB | $1.50–3.00 | Large models   |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

## Next Steps

* [vLLM Inference](https://docs.clore.ai/guides/language-models/vllm) - Higher throughput
* [Ollama Guide](https://docs.clore.ai/guides/language-models/ollama) - Simpler setup
* [RAG with LangChain](https://docs.clore.ai/guides/advanced/api-integration) - Build applications
