# LocalAI

Run a self-hosted OpenAI-compatible API with LocalAI.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum          | Recommended |
| ------------ | ---------------- | ----------- |
| RAM          | 8GB              | 16GB+       |
| VRAM         | 6GB              | 8GB+        |
| Network      | 200Mbps          | 500Mbps+    |
| Startup Time | **5-10 minutes** | -           |

{% hint style="warning" %}
**Important:** LocalAI takes 5-10 minutes to fully initialize on first startup. HTTP 502 during this time is normal: the service is downloading and loading models.
{% endhint %}

{% hint style="info" %}
LocalAI is lightweight. For running LLMs (7B+ models), choose servers with 16GB+ RAM and 8GB+ VRAM.
{% endhint %}

## What is LocalAI?

LocalAI provides:

* Drop-in OpenAI API replacement
* Support for multiple model formats
* Text, image, audio, and embedding generation
* No GPU required (but faster with GPU)

## Supported Models

| Type       | Formats     | Examples            |
| ---------- | ----------- | ------------------- |
| LLM        | GGUF, GGML  | Llama, Mistral, Phi |
| Embeddings | GGUF        | all-MiniLM, BGE     |
| Images     | Diffusers   | SD 1.5, SDXL        |
| Audio      | Whisper     | Speech-to-text      |
| TTS        | Piper, Bark | Text-to-speech      |

## Quick Deploy

**Docker Image:**

```
localai/localai:master-aio-gpu-nvidia-cuda-12
```

**Ports:**

```
22/tcp
8080/http
```

**No command needed**; the server starts automatically.

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/readyz

# List available models
curl https://your-http-pub.clorecloud.net/v1/models

# Get version
curl https://your-http-pub.clorecloud.net/version
```

{% hint style="warning" %}
If you get HTTP 502, wait 5-10 minutes; LocalAI takes longer to initialize than other services.
{% endhint %}
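
If you would rather script the wait, here is a minimal polling sketch, assuming Python with the `requests` package and your own `http_pub` URL substituted:

```python
import time
import requests

BASE_URL = "https://your-http-pub.clorecloud.net"  # replace with your http_pub URL

# Poll /readyz until LocalAI reports ready, giving up after ~12 minutes
for attempt in range(72):
    try:
        r = requests.get(f"{BASE_URL}/readyz", timeout=10)
        if r.status_code == 200:
            print("LocalAI is ready")
            break
    except requests.RequestException:
        pass  # 502s and connection errors are expected during startup
    time.sleep(10)
else:
    print("LocalAI did not become ready in time")
```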

## Pre-Built Models

LocalAI comes with several models available out of the box:

| Model Name                 | Type       | Description             |
| -------------------------- | ---------- | ----------------------- |
| `gpt-4`                    | Chat       | General-purpose LLM     |
| `gpt-4o`                   | Chat       | General-purpose LLM     |
| `gpt-4o-mini`              | Chat       | Smaller, faster LLM     |
| `whisper-1`                | STT        | Speech-to-text          |
| `tts-1`                    | TTS        | Text-to-speech          |
| `text-embedding-ada-002`   | Embeddings | 384-dimensional vectors |
| `jina-reranker-v1-base-en` | Reranking  | Document reranking      |

{% hint style="info" %}
These models work immediately after startup without additional configuration.
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access LocalAI via the `http_pub` URL:

```bash
# Chat completion
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

{% hint style="info" %}
All `localhost:8080` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Docker Deploy (Alternative)

```bash
docker run -d \
    --gpus all \
    -p 8080:8080 \
    -v /workspace/models:/models \
    -e THREADS=4 \
    -e CONTEXT_SIZE=4096 \
    localai/localai:master-aio-gpu-nvidia-cuda-12
```

## Download Models

### From Model Gallery

LocalAI has a built-in model gallery:

```bash
# List available models
curl http://localhost:8080/models/available

# Install from gallery
curl http://localhost:8080/models/apply \
    -H "Content-Type: application/json" \
    -d '{"id": "mistral-7b-instruct"}'
```
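
To script an install end to end, a sketch like the following works, assuming the gallery id is valid and that the installed model appears under the same name in `/v1/models` (adjust the match if your gallery entry installs under a different name):

```python
import time
import requests

BASE_URL = "http://localhost:8080"
MODEL_ID = "mistral-7b-instruct"  # pick an id from /models/available

# Kick off the gallery install
resp = requests.post(f"{BASE_URL}/models/apply", json={"id": MODEL_ID}, timeout=30)
resp.raise_for_status()

# Poll the model list until the install finishes (downloads can take a while)
while True:
    data = requests.get(f"{BASE_URL}/v1/models", timeout=10).json().get("data", [])
    if any(MODEL_ID in m["id"] for m in data):
        print("installed:", MODEL_ID)
        break
    time.sleep(15)
```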

### From Hugging Face

```bash
mkdir -p /workspace/models

# Llama 3.1 8B GGUF
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -O /workspace/models/llama-3.1-8b.gguf

# Mistral 7B GGUF
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
    -O /workspace/models/mistral-7b.gguf
```
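
If you prefer Python, roughly the same download can be done with the `huggingface_hub` package, which also resumes interrupted transfers:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Download the same Llama 3.1 8B quant into the models directory
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="/workspace/models",
)
print("saved to", path)
```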

## Model Configuration

Create a YAML config file for each model:

**models/llama-3.1-8b.yaml:**

```yaml
name: llama-3.1-8b
backend: llama-cpp
parameters:
  model: llama-3.1-8b.gguf
context_size: 4096
threads: 8
gpu_layers: 35
template:
  chat: |
    {{.Input}}
    ### Response:
  completion: |
    {{.Input}}
```
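
Note that the `chat` template above is a generic example; in practice it should match the prompt format your model was trained with. LocalAI reads model configs at startup, so after adding the YAML (a restart may be required) a quick check confirms the model is registered:

```python
import requests

# The configured name should now appear in the model list
models = requests.get("http://localhost:8080/v1/models", timeout=10).json()
print([m["id"] for m in models.get("data", [])])  # expect "llama-3.1-8b" in the output
```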

## API Usage

### Chat Completions (OpenAI Compatible)

```python
import openai

# For external access, use your http_pub URL:
client = openai.OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="not-needed"
)

# Or via SSH tunnel:
# client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Embeddings

```python
response = client.embeddings.create(
    model="all-minilm",
    input="The quick brown fox jumps over the lazy dog"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
```
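
Embeddings are most useful for similarity search. A minimal scoring sketch, reusing the `client` from above and the pre-built embedding model (the document strings are placeholders):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

docs = ["LocalAI serves models locally", "The cat sat on the mat"]
query = "self-hosted model serving"

# Embed the documents and the query
vecs = [
    client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding
    for text in docs + [query]
]

# Rank each document against the query embedding
for doc, vec in zip(docs, vecs[:-1]):
    print(f"{cosine(vec, vecs[-1]):.3f}  {doc}")
```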

### Image Generation

```python
response = client.images.generate(
    model="stablediffusion",
    prompt="a beautiful sunset over mountains",
    size="512x512",
    n=1
)

image_url = response.data[0].url
```
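
The returned `url` points back at the serving host; a quick way to save the image locally (a sketch assuming the `requests` package and the `image_url` from above):

```python
import requests

# Fetch the generated image from the returned URL and save it to disk
img = requests.get(image_url, timeout=60)
img.raise_for_status()
with open("sunset.png", "wb") as f:
    f.write(img.content)
```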

## cURL Examples

### Chat

```bash
curl https://your-http-pub.clorecloud.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistral-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```

### Embeddings

```bash
curl https://your-http-pub.clorecloud.net/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "text-embedding-ada-002",
        "input": "Your text here"
    }'
```

Response:

```json
{
  "data": [{"embedding": [0.1, -0.2, ...], "index": 0}],
  "model": "text-embedding-ada-002",
  "usage": {"prompt_tokens": 4, "total_tokens": 4}
}
```

### Text-to-Speech (TTS)

```bash
curl https://your-http-pub.clorecloud.net/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tts-1",
        "input": "Hello, welcome to LocalAI!",
        "voice": "alloy"
    }' \
    --output speech.wav
```

Available voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
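
The same call works through the OpenAI Python client; a sketch reusing the `client` from above, where the response body is raw audio bytes:

```python
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello, welcome to LocalAI!",
)

# The response wraps the raw audio; write the bytes straight to disk
with open("speech.wav", "wb") as f:
    f.write(speech.content)
```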

### Speech-to-Text (STT)

```bash
curl https://your-http-pub.clorecloud.net/v1/audio/transcriptions \
    -F "file=@audio.mp3" \
    -F "model=whisper-1"
```

Response:

```json
{"text": "Transcribed text here..."}
```
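
Via the Python client, reusing the `client` from above:

```python
# Upload a local audio file for transcription
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcript.text)
```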

### Reranking

Rerank documents by relevance to a query:

```bash
curl https://your-http-pub.clorecloud.net/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "jina-reranker-v1-base-en",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of AI",
            "The weather is nice today",
            "Deep learning uses neural networks"
        ],
        "top_n": 2
    }'
```

Response:

```json
{
  "results": [
    {"index": 0, "relevance_score": 0.95},
    {"index": 2, "relevance_score": 0.82}
  ]
}
```
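
The OpenAI Python client has no rerank method, so call the endpoint directly; a sketch with the `requests` package:

```python
import requests

resp = requests.post(
    "https://your-http-pub.clorecloud.net/v1/rerank",
    json={
        "model": "jina-reranker-v1-base-en",
        "query": "What is machine learning?",
        "documents": [
            "Machine learning is a subset of AI",
            "The weather is nice today",
            "Deep learning uses neural networks",
        ],
        "top_n": 2,
    },
    timeout=60,
)

# Print each hit's original index and relevance score
for hit in resp.json()["results"]:
    print(hit["index"], hit["relevance_score"])
```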

## Complete API Reference

### Standard Endpoints (OpenAI Compatible)

| Endpoint                   | Method | Description           |
| -------------------------- | ------ | --------------------- |
| `/v1/models`               | GET    | List available models |
| `/v1/chat/completions`     | POST   | Chat completion       |
| `/v1/completions`          | POST   | Text completion       |
| `/v1/embeddings`           | POST   | Generate embeddings   |
| `/v1/audio/speech`         | POST   | Text-to-speech        |
| `/v1/audio/transcriptions` | POST   | Speech-to-text        |
| `/v1/images/generations`   | POST   | Image generation      |

### Additional Endpoints

| Endpoint            | Method | Description                |
| ------------------- | ------ | -------------------------- |
| `/readyz`           | GET    | Readiness check            |
| `/healthz`          | GET    | Health check               |
| `/version`          | GET    | Get LocalAI version        |
| `/v1/rerank`        | POST   | Document reranking         |
| `/models/available` | GET    | List gallery models        |
| `/models/apply`     | POST   | Install model from gallery |
| `/swagger/`         | GET    | Swagger UI documentation   |
| `/metrics`          | GET    | Prometheus metrics         |

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/version
```

Response:

```json
{"version": "v2.26.0"}
```

#### Swagger Documentation

Open in browser for interactive API documentation:

```
https://your-http-pub.clorecloud.net/swagger/
```

## GPU Acceleration

### CUDA Backend

```yaml
# In the model config (these are top-level fields)
gpu_layers: 35  # Number of layers offloaded to the GPU
f16: true       # Use FP16 precision
```

### Full GPU Offload

```yaml
gpu_layers: 99  # All layers on GPU
main_gpu: 0     # Primary GPU ID
```

## Multiple Models

LocalAI can serve multiple models simultaneously:

```
models/
├── llama-3.1-8b.yaml
├── llama-3.1-8b.gguf
├── mistral-7b.yaml
├── mistral-7b.gguf
├── whisper.yaml
└── whisper-base.bin
```

Access each model by its configured `name` in API calls.
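
For a quick smoke test across everything that is loaded, a sketch reusing the `client` from above (non-chat models such as Whisper will reject chat calls, hence the try/except):

```python
# Send the same prompt to every model the server lists
for m in client.models.list().data:
    try:
        reply = client.chat.completions.create(
            model=m.id,
            messages=[{"role": "user", "content": "Reply with one short sentence."}],
            max_tokens=50,
        )
        print(f"{m.id}: {reply.choices[0].message.content}")
    except Exception as exc:
        print(f"{m.id}: skipped ({exc})")  # embeddings/audio models raise here
```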

## Performance Tuning

### For Speed

```yaml
threads: 8
gpu_layers: 99
batch: 512
mmap: true
mmlock: true
```

### For Memory

```yaml
gpu_layers: 20      # Partial offload
context_size: 2048  # Smaller context
batch: 256
```

## Benchmarks

| Model           | GPU      | Tokens/sec |
| --------------- | -------- | ---------- |
| Llama 3.1 8B Q4 | RTX 3090 | \~100      |
| Mistral 7B Q4   | RTX 3090 | \~110      |
| Llama 3.1 8B Q4 | RTX 4090 | \~140      |
| Mixtral 8x7B Q4 | A100     | \~60       |

*Benchmarks updated January 2026.*
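
To measure your own server, here is a rough throughput probe reusing the `client` from above; it counts streamed chunks, which approximate tokens for llama.cpp backends:

```python
import time

start, tokens = time.time(), 0
for chunk in client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Count from 1 to 100."}],
    max_tokens=256,
    stream=True,
):
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # roughly one token per streamed chunk

print(f"~{tokens / (time.time() - start):.1f} tokens/sec")
```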

## Troubleshooting

### HTTP 502 on http\_pub URL

LocalAI takes longer to start than other services. Wait **5-10 minutes** and retry:

```bash
# Check readiness
curl https://your-http-pub.clorecloud.net/readyz

# Check health
curl https://your-http-pub.clorecloud.net/healthz
```

### Model Not Loading

* Check file path in YAML
* Verify GGUF format compatibility
* Check available VRAM

### Slow Responses

* Increase `gpu_layers`
* Enable `mmap`
* Reduce `context_size`

### Out of Memory

* Reduce `gpu_layers`
* Use smaller quantization (Q4 instead of Q8)
* Reduce batch size

### Image Generation Issues

{% hint style="warning" %}
Stable Diffusion may have CUDA compatibility issues on some GPU configurations. If you encounter CUDA errors with image generation, consider using a dedicated Stable Diffusion image instead.
{% endhint %}

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For       |
| -------- | ---- | ---------- | -------------- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models      |
| RTX 3090 | 24GB | $0.30–1.00 | 13B models     |
| RTX 4090 | 24GB | $0.50–2.00 | Fast inference |
| A100     | 40GB | $1.50–3.00 | Large models   |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

## Next Steps

* [vLLM Inference](/guides/language-models/vllm.md) - Higher throughput
* [Ollama Guide](/guides/language-models/ollama.md) - Simpler setup
* [RAG with LangChain](/guides/advanced/api-integration.md) - Build applications

