# GPT4All Local LLM

## Overview

[GPT4All](https://github.com/nomic-ai/gpt4all) by Nomic AI is one of the most popular open-source local LLM projects, with over **72,000 GitHub stars**. It lets you run large language models completely offline on your own hardware — no internet connection required, no data sent to third parties.

GPT4All is best known for its polished desktop application, but it also includes a **Python library** (`gpt4all` package) and a built-in **OpenAI-compatible API server** running on port **4891**. On Clore.ai, you can deploy GPT4All in a Docker container on a rented GPU, serve it over HTTP, and connect any OpenAI-compatible client to it.

> **Docker note:** GPT4All does not publish an official Docker image for the server component. This guide uses a custom Docker setup with the `gpt4all` Python package. For a more production-ready Docker alternative that runs the **same GGUF model files**, see the [LocalAI alternative section](#alternative-localai-docker-image) — LocalAI is Docker-first and supports the identical model format.

**Key features:**

* 🔒 100% offline — all inference runs locally
* 🤖 OpenAI-compatible REST API (port 4891)
* 📚 LocalDocs — RAG over your own documents
* 🧩 Runs models in the widely used GGUF format
* 🐍 Full Python API with `pip install gpt4all`
* 💬 Beautiful desktop UI (not relevant for server, but good for local testing)

***

## Requirements

### Hardware Requirements

| Tier            | GPU           | VRAM  | RAM   | Storage    | Clore.ai Price          |
| --------------- | ------------- | ----- | ----- | ---------- | ----------------------- |
| **CPU-only**    | None          | —     | 16 GB | 50 GB SSD  | \~$0.02/hr (CPU server) |
| **Entry GPU**   | RTX 3060 12GB | 12 GB | 16 GB | 50 GB SSD  | \~$0.10/hr              |
| **Recommended** | RTX 3090      | 24 GB | 32 GB | 100 GB SSD | \~$0.20/hr              |
| **High-end**    | RTX 4090      | 24 GB | 64 GB | 200 GB SSD | \~$0.35/hr              |

> **Note:** GPT4All GPU support uses CUDA via llama.cpp under the hood. Unlike vLLM, it does **not** require a specific CUDA compute capability — RTX 10xx and newer generally work.

### Model VRAM Requirements (GGUF Q4\_K\_M)

| Model                 | Size on Disk | VRAM    | Min GPU     |
| --------------------- | ------------ | ------- | ----------- |
| Phi-3 Mini 3.8B       | \~2.4 GB     | \~3 GB  | RTX 3060    |
| Mistral 7B Instruct   | \~4.1 GB     | \~5 GB  | RTX 3060    |
| Llama 3.1 8B Instruct | \~4.7 GB     | \~6 GB  | RTX 3060    |
| Llama 3 70B Instruct  | \~40 GB      | \~45 GB | A100 80GB   |
| Mixtral 8x7B          | \~26 GB      | \~30 GB | 2× RTX 3090 |
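The VRAM figures above follow a rough pattern: the model's file size, plus overhead for compute buffers, plus a KV cache that grows with context length. A small rule-of-thumb helper (an illustrative sketch, not an official formula — the 10% overhead and per-4k-token KV cache figures are approximations for Q4 7B-class models):

```python
# Rough VRAM fit check for a GGUF model (rule-of-thumb sketch, not exact).
# Assumes full GPU offload, ~10% buffer overhead, and a KV cache of roughly
# 0.5 GB per 4k tokens of context for a 7B-class model.
def fits_in_vram(file_size_gb: float, vram_gb: float,
                 n_ctx: int = 4096, kv_gb_per_4k: float = 0.5) -> bool:
    overhead = 0.10 * file_size_gb             # compute buffers, scratch space
    kv_cache = kv_gb_per_4k * (n_ctx / 4096)   # grows linearly with context
    return file_size_gb + overhead + kv_cache <= vram_gb

# Mistral 7B Q4 (~4.1 GB on disk) on an RTX 3060 12 GB:
print(fits_in_vram(4.1, 12))   # True
# Llama 3 70B Q4 (~40 GB on disk) on an RTX 3090 24 GB:
print(fits_in_vram(40, 24))    # False
```

This matches the table: ~4.1 GB on disk works out to roughly 5 GB of VRAM at a 4k context.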

***

## Quick Start

### Step 1 — Rent a GPU Server on Clore.ai

1. Log in to [clore.ai](https://clore.ai)
2. Filter: **Docker enabled**, **GPU**: RTX 3090 (for 7B–13B models)
3. Deploy with image: `nvidia/cuda:12.1.0-runtime-ubuntu22.04`
4. Open ports: **4891** (GPT4All API), **22** (SSH)
5. Allocate at least **50 GB** of disk space

### Step 2 — Connect via SSH

```bash
ssh -p <CLORE_SSH_PORT> root@<CLORE_SERVER_IP>

# Verify GPU
nvidia-smi
# Should list your GPU with driver version
```

### Step 3 — Build the GPT4All Docker Image

Since there's no official GPT4All Docker image, we'll build one:

```bash
mkdir -p /workspace/gpt4all-server && cd /workspace/gpt4all-server

cat > Dockerfile << 'EOF'
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3-pip \
    curl \
    wget \
    git \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Make python3.11 the default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Install GPT4All with CUDA support. Quote the version spec so the shell
# doesn't treat ">=" as a redirect, and invoke pip via python3.11 so the
# packages land in the interpreter the server actually runs on.
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install "gpt4all>=2.8.0" fastapi uvicorn aiofiles pydantic

# Create directories
RUN mkdir -p /models /workspace /app

WORKDIR /app

# Copy server script (will be mounted or baked in)
COPY server.py .

EXPOSE 4891

CMD ["python", "server.py"]
EOF
```

### Step 4 — Create the API Server Script

```bash
cat > /workspace/gpt4all-server/server.py << 'PYEOF'
#!/usr/bin/env python3
"""
GPT4All OpenAI-compatible API Server
Runs on port 4891 (GPT4All default)
"""

import os
import time
import json
import asyncio
from typing import Optional, List, Dict, Any
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import uvicorn
from gpt4all import GPT4All

# Configuration
MODEL_NAME = os.environ.get("MODEL_NAME", "mistral-7b-instruct-v0.1.Q4_0.gguf")
MODEL_PATH = os.environ.get("MODEL_PATH", "/models")
API_HOST = os.environ.get("API_HOST", "0.0.0.0")
API_PORT = int(os.environ.get("API_PORT", "4891"))
DEVICE = os.environ.get("DEVICE", "gpu")  # 'gpu', 'cpu', 'metal'
N_CTX = int(os.environ.get("N_CTX", "4096"))

app = FastAPI(title="GPT4All API Server", version="1.0.0")

# Global model instance
model = None

def load_model():
    global model
    print(f"Loading model: {MODEL_NAME}")
    print(f"Model path: {MODEL_PATH}")
    print(f"Device: {DEVICE}")
    model = GPT4All(
        model_name=MODEL_NAME,
        model_path=MODEL_PATH,
        device=DEVICE,
        n_ctx=N_CTX,
        allow_download=True,  # Downloads from GPT4All hub if not present
        verbose=True
    )
    print("Model loaded successfully!")

# --- Pydantic models ---

class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 512
    top_p: float = 0.95
    top_k: int = 40
    stream: bool = False

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 512
    stream: bool = False

# --- API Routes ---

@app.get("/health")
async def health():
    return {"status": "ok", "model": MODEL_NAME, "device": DEVICE}

@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [{
            "id": MODEL_NAME,
            "object": "model",
            "created": int(time.time()),
            "owned_by": "gpt4all",
        }]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Format messages into a single prompt
    prompt_parts = []
    for msg in request.messages:
        if msg.role == "system":
            prompt_parts.append(f"### System:\n{msg.content}")
        elif msg.role == "user":
            prompt_parts.append(f"### Human:\n{msg.content}")
        elif msg.role == "assistant":
            prompt_parts.append(f"### Assistant:\n{msg.content}")
    prompt_parts.append("### Assistant:")
    full_prompt = "\n\n".join(prompt_parts)

    # The prompt is already fully formatted above, so call generate() directly;
    # wrapping it in chat_session() would apply the model's own prompt template
    # a second time on top of our "### Human:/### Assistant:" formatting.
    response_text = model.generate(
        full_prompt,
        max_tokens=request.max_tokens,
        temp=request.temperature,
        top_p=request.top_p,
        top_k=request.top_k,
    )

    return {
        "id": f"chatcmpl-{int(time.time())}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": response_text},
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": len(full_prompt.split()),
            "completion_tokens": len(response_text.split()),
            "total_tokens": len(full_prompt.split()) + len(response_text.split())
        }
    }

@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    response_text = model.generate(
        request.prompt,
        max_tokens=request.max_tokens,
        temp=request.temperature,
    )

    return {
        "id": f"cmpl-{int(time.time())}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "text": response_text,
            "index": 0,
            "finish_reason": "stop"
        }]
    }

if __name__ == "__main__":
    load_model()
    uvicorn.run(app, host=API_HOST, port=API_PORT, log_level="info")
PYEOF
```
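The server above accepts a `stream` flag but silently ignores it. A hedged sketch of how streaming could be added: the gpt4all library's `generate(..., streaming=True)` returns a token iterator, which can be wrapped into OpenAI-style SSE chunks and returned through the `StreamingResponse` that `server.py` already imports. The helper below is written against any token iterable so it can be tested without a loaded model:

```python
# Sketch: turn a token iterator into OpenAI-style "chat.completion.chunk"
# SSE events. With gpt4all, `tokens` would come from
# model.generate(full_prompt, streaming=True); here it can be any iterable.
import json
import time

def sse_chunks(tokens, model_name):
    created = int(time.time())
    for token in tokens:
        chunk = {
            "id": f"chatcmpl-{created}",
            "object": "chat.completion.chunk",
            "created": created,
            "model": model_name,
            "choices": [{"index": 0,
                         "delta": {"content": token},
                         "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # OpenAI clients stop reading at the [DONE] sentinel
    yield "data: [DONE]\n\n"

# In the /v1/chat/completions handler this would be returned as:
#   StreamingResponse(
#       sse_chunks(model.generate(full_prompt, streaming=True), request.model),
#       media_type="text/event-stream")
```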

### Step 5 — Build and Run

```bash
cd /workspace/gpt4all-server

# Build the Docker image
docker build -t gpt4all-server:latest .

# Download a model first (optional — server can also auto-download)
mkdir -p /workspace/models
wget -O /workspace/models/mistral-7b-instruct-v0.1.Q4_0.gguf \
  https://gpt4all.io/models/gguf/mistral-7b-instruct-v0.1.Q4_0.gguf

# Run with GPU support
docker run -d \
  --name gpt4all-server \
  --gpus all \
  --restart unless-stopped \
  -p 4891:4891 \
  -v /workspace/models:/models \
  -v /workspace/gpt4all-server/server.py:/app/server.py \
  -e MODEL_NAME="mistral-7b-instruct-v0.1.Q4_0.gguf" \
  -e MODEL_PATH="/models" \
  -e DEVICE="gpu" \
  -e N_CTX="4096" \
  gpt4all-server:latest

# Follow logs
docker logs -f gpt4all-server
```

### Step 6 — Test the API

```bash
# Health check
curl http://localhost:4891/health

# List models
curl http://localhost:4891/v1/models

# Chat completion
curl http://localhost:4891/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct-v0.1.Q4_0.gguf",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```

***

## Alternative: LocalAI Docker Image

For a more robust, production-ready Docker deployment that runs the **same GGUF models** as GPT4All, LocalAI is the recommended choice. It has an official Docker image, CUDA support, and is actively maintained:

```bash
# Pull LocalAI with CUDA support
docker pull localai/localai:latest-aio-gpu-nvidia-cuda-12

# Create models directory and download a GGUF model
mkdir -p /workspace/localai-models
wget -O /workspace/localai-models/mistral-7b.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

# Create model config
cat > /workspace/localai-models/mistral-7b.yaml << 'EOF'
name: mistral-7b
parameters:
  model: mistral-7b.gguf
  temperature: 0.7
  top_p: 0.95
  top_k: 40
  max_tokens: 2048
context_size: 4096
f16: true
gpu_layers: 35
threads: 8
EOF

# Run LocalAI
docker run -d \
  --name localai \
  --gpus all \
  --restart unless-stopped \
  -p 8080:8080 \
  -v /workspace/localai-models:/build/models \
  -e DEBUG=true \
  localai/localai:latest-aio-gpu-nvidia-cuda-12

# Test LocalAI (same OpenAI-compatible API)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

***

## Configuration

### Environment Variables for GPT4All Server

| Variable     | Default                  | Description                        |
| ------------ | ------------------------ | ---------------------------------- |
| `MODEL_NAME` | `mistral-7b-instruct...` | Model filename or GPT4All hub name |
| `MODEL_PATH` | `/models`                | Directory containing model files   |
| `DEVICE`     | `gpu`                    | `gpu`, `cpu`, or `metal` (macOS)   |
| `N_CTX`      | `4096`                   | Context window size (tokens)       |
| `API_HOST`   | `0.0.0.0`                | Bind address                       |
| `API_PORT`   | `4891`                   | Port for the API server            |

### Docker Compose Setup

```yaml
# /workspace/gpt4all-server/docker-compose.yml
version: '3.8'

services:
  gpt4all-server:
    build: .
    container_name: gpt4all-server
    restart: unless-stopped
    ports:
      - "4891:4891"
    volumes:
      - /workspace/models:/models
      - ./server.py:/app/server.py
    environment:
      - MODEL_NAME=mistral-7b-instruct-v0.1.Q4_0.gguf
      - MODEL_PATH=/models
      - DEVICE=gpu
      - N_CTX=4096
      - API_PORT=4891
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4891/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
```

```bash
docker compose up -d
docker compose logs -f
```

***

## GPU Acceleration

### Verifying GPU Usage

The GPT4All Python library runs inference through `llama.cpp`, which provides the CUDA support:

```bash
# Check GPU VRAM usage after model load
watch -n 2 nvidia-smi

# Check inside container that CUDA is available
docker exec gpt4all-server python3 -c "
from gpt4all import GPT4All
devices = GPT4All.list_gpus()
print('Available GPUs:', devices)
"
```

### Selecting GPU Layers

The number of transformer layers offloaded to the GPU controls how much of the model runs on GPU vs CPU. LocalAI exposes this as `gpu_layers` in its model YAML (llama.cpp calls it `n_gpu_layers`); the GPT4All Python library exposes it as the `ngl` constructor parameter:

```python
# In server.py — offload all layers to GPU
model = GPT4All(
    model_name=MODEL_NAME,
    model_path=MODEL_PATH,
    device="gpu",
    n_ctx=N_CTX,
    ngl=100,  # layers to offload; 100 covers all layers of typical 7B models
)
```

```bash
# Restart the container after editing server.py (it is bind-mounted, so no rebuild needed)
docker stop gpt4all-server && docker rm gpt4all-server
docker run -d \
  --name gpt4all-server \
  --gpus all \
  -p 4891:4891 \
  -v /workspace/models:/models \
  -v /workspace/gpt4all-server/server.py:/app/server.py \
  -e DEVICE=gpu \
  -e MODEL_NAME=mistral-7b-instruct-v0.1.Q4_0.gguf \
  gpt4all-server:latest
```

### CPU Fallback Mode

If no GPU is available (e.g., CPU-only Clore.ai server for testing):

```bash
docker run -d \
  --name gpt4all-server-cpu \
  -p 4891:4891 \
  -v /workspace/models:/models \
  -e DEVICE=cpu \
  -e MODEL_NAME=Phi-3-mini-4k-instruct.Q4_0.gguf \
  gpt4all-server:latest
```

> ⚠️ CPU inference is **10–50× slower** than GPU. For CPU-only servers, use small models (Phi-3 Mini, TinyLlama) and expect 2–5 tokens/sec.
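The practical impact of those throughput figures is easy to quantify: at CPU speeds, even a short completion takes over a minute of wall-clock time. A back-of-the-envelope helper (illustrative only; the tokens/sec figures are the rough estimates quoted above):

```python
# Estimate wall-clock generation time from throughput (tokens/sec).
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    return n_tokens / tokens_per_sec

# A 256-token reply on CPU at ~3 tok/s vs an RTX 3090 at ~40 tok/s:
print(round(generation_seconds(256, 3), 1))   # ~85.3 seconds on CPU
print(round(generation_seconds(256, 40), 1))  # ~6.4 seconds on GPU
```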

***

## Tips & Best Practices

### 📥 Pre-downloading Models

Instead of relying on auto-download at startup, pre-download models for faster restarts:

```bash
# Download popular GPT4All models
mkdir -p /workspace/models

# Mistral 7B (most popular, good quality)
wget -q -O /workspace/models/mistral-7b-instruct-v0.1.Q4_0.gguf \
  "https://gpt4all.io/models/gguf/mistral-7b-instruct-v0.1.Q4_0.gguf"

# Phi-3 Mini (fastest, smallest)
wget -q -O /workspace/models/Phi-3-mini-4k-instruct.Q4_0.gguf \
  "https://gpt4all.io/models/gguf/Phi-3-mini-4k-instruct.Q4_0.gguf"

# Llama 3 (best quality in 8B range)
wget -q -O /workspace/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf \
  "https://gpt4all.io/models/gguf/Meta-Llama-3-8B-Instruct.Q4_0.gguf"

ls -lh /workspace/models/
```
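Interrupted downloads are a common cause of "model fails to load" errors, so it is worth verifying files after downloading. A small streaming SHA-256 helper (the expected hash values themselves would come from the model's download page; none are assumed here):

```python
# Compute a file's SHA-256 in streaming fashion so multi-GB GGUF files
# don't need to fit in memory.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 1 MB chunks until read() returns the empty-bytes sentinel
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: compare against the checksum published alongside the model
# print(sha256_of("/workspace/models/mistral-7b-instruct-v0.1.Q4_0.gguf"))
```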

### 🔌 Using with Python Applications

```python
# Direct Python usage (without Docker API)
from gpt4all import GPT4All

model = GPT4All(
    model_name="mistral-7b-instruct-v0.1.Q4_0.gguf",
    model_path="/workspace/models",
    device="gpu"
)

# Simple generation
with model.chat_session():
    response = model.generate("Explain GPU computing in simple terms", max_tokens=200)
    print(response)

# Using the API server with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4891/v1",
    api_key="not-needed"
)

completion = client.chat.completions.create(
    model="mistral-7b-instruct-v0.1.Q4_0.gguf",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(completion.choices[0].message.content)
```

### 💰 Cost Optimization on Clore.ai

```bash
# RTX 3090 @ $0.20/hr — use for 7B models (best value)
# Expected throughput: ~40 tokens/sec for Mistral 7B Q4
# Cost per 1M tokens generated: ~$1.40 ($0.20/hr ÷ ~144k tokens/hr) — far cheaper than most hosted APIs

# RTX 4090 @ $0.35/hr — use for 13B models or when speed matters
# Expected throughput: ~60 tokens/sec for Mistral 7B Q4

# For batch processing: pre-load model, process all prompts, shut down
docker run --rm \
  --gpus all \
  -v /workspace/models:/models \
  -v /workspace/prompts:/prompts \
  gpt4all-server:latest \
  python3 -c "
from gpt4all import GPT4All
import json

model = GPT4All('mistral-7b-instruct-v0.1.Q4_0.gguf', '/models', device='gpu')
prompts = open('/prompts/batch.txt').readlines()
results = []
for p in prompts:
    with model.chat_session():
        results.append(model.generate(p.strip(), max_tokens=256))
json.dump(results, open('/prompts/results.json', 'w'))
print(f'Processed {len(results)} prompts')
"
```
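The per-token economics above reduce to one formula: hourly price divided by tokens generated per hour. A tiny calculator (illustrative; the throughput numbers are the rough estimates from the comments above):

```python
# Cost per 1M generated tokens = hourly price / (tokens/sec * 3600) * 1e6
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# RTX 3090 at $0.20/hr, ~40 tok/s:
print(round(cost_per_million_tokens(0.20, 40), 2))  # ~$1.39 per 1M tokens
# RTX 4090 at $0.35/hr, ~60 tok/s:
print(round(cost_per_million_tokens(0.35, 60), 2))  # ~$1.62 per 1M tokens
```

Note these figures assume the GPU is generating continuously; idle time between requests raises the effective cost.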

***

## Troubleshooting

### Model fails to load — file not found

```bash
# Check model file exists and has correct name
ls -lh /workspace/models/
docker exec gpt4all-server ls /models/

# GPT4All is case-sensitive with model names
# Use exact filename from ls output as MODEL_NAME
docker stop gpt4all-server && docker rm gpt4all-server
docker run -d --gpus all -p 4891:4891 \
  -v /workspace/models:/models \
  -e MODEL_NAME=mistral-7b-instruct-v0.1.Q4_0.gguf \
  gpt4all-server:latest
```

### CUDA error: no kernel image for this architecture

```bash
# Your GPU might not be compatible with the CUDA version
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# If < 6.0, fall back to CPU mode (omit --gpus since the GPU won't be used)
docker run -d -p 4891:4891 \
  -v /workspace/models:/models \
  -e DEVICE=cpu \
  -e MODEL_NAME=Phi-3-mini-4k-instruct.Q4_0.gguf \
  gpt4all-server:latest
```

### API returns 503 — model not loaded

```bash
# Check startup logs
docker logs gpt4all-server | head -50

# Model loading can take 30–120 seconds
# Wait and retry:
sleep 60 && curl http://localhost:4891/health

# Check if the model file is corrupted (requires `pip install gpt4all` on the
# host, or run it inside the container via `docker exec`)
python3 -c "
from gpt4all import GPT4All
m = GPT4All('mistral-7b-instruct-v0.1.Q4_0.gguf', '/workspace/models')
print('Model OK:', m)
"
```
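Rather than a fixed `sleep 60`, a small loop can poll `/health` until the model finishes loading. A sketch (the `fetch` parameter is injectable purely so the function can be tested without a live server):

```python
# Poll a health endpoint until it returns HTTP 200 or retries are exhausted.
import time
import urllib.error
import urllib.request

def wait_for_health(url="http://localhost:4891/health",
                    retries=30, delay=5.0, fetch=None):
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status
    for _ in range(retries):
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the delay
        time.sleep(delay)
    return False

# if wait_for_health():
#     print("Server ready")
```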

### Port 4891 not accessible from outside

```bash
# Verify port binding
docker ps | grep 4891
# Should show: 0.0.0.0:4891->4891/tcp

# Check if Clore.ai has firewall rules
# In Clore.ai server settings, ensure port 4891 is listed as open

# Test internally:
curl http://127.0.0.1:4891/health

# Note: Clore.ai maps ports randomly — use the port shown in your server dashboard
```

***

## Further Reading

* [GPT4All GitHub](https://github.com/nomic-ai/gpt4all) — Main repository
* [GPT4All Python Docs](https://docs.gpt4all.io/) — Python API reference
* [GPT4All Model Explorer](https://gpt4all.io/models/gguf/) — Browse available models
* [LocalAI Documentation](https://localai.io/) — Docker-friendly alternative
* [Ollama on Clore.ai](https://docs.clore.ai/guides/language-models/ollama) — Easier Docker LLM deployment
* [vLLM on Clore.ai](https://docs.clore.ai/guides/language-models/vllm) — Production inference server
* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) — Pick the right Clore.ai GPU
* [TheBloke on HuggingFace](https://huggingface.co/TheBloke) — Thousands of GGUF quantizations
* [GGUF Format Explained](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) — Model format docs

> 💡 **Recommendation:** If you want the simplest Docker deployment for local LLMs, consider [Ollama](https://docs.clore.ai/guides/language-models/ollama) instead — it has an official Docker image, built-in GPU support, and is specifically designed for server-side deployment. GPT4All's strength is its beautiful desktop UI and LocalDocs (RAG) features, which aren't available in server mode.
