# GPT4All Local LLM

## Overview

[GPT4All](https://github.com/nomic-ai/gpt4all) by Nomic AI is one of the most popular open-source local LLM projects, with over **72,000 GitHub stars**. It lets you run large language models completely offline on your own hardware — no internet connection required, no data sent to third parties.

GPT4All is best known for its polished desktop application, but it also includes a **Python library** (`gpt4all` package) and a built-in **OpenAI-compatible API server** running on port **4891**. On Clore.ai, you can deploy GPT4All in a Docker container on a rented GPU, serve it over HTTP, and connect any OpenAI-compatible client to it.

> **Docker note:** GPT4All does not publish an official Docker image for the server component. This guide uses a custom Docker setup with the `gpt4all` Python package. For a more production-ready Docker alternative that runs the **same GGUF model files**, see the [LocalAI alternative section](#alternative-localai-docker-image) — LocalAI is Docker-first and supports the identical model format.

**Key features:**

* 🔒 100% offline — all inference runs locally
* 🤖 OpenAI-compatible REST API (port 4891)
* 📚 LocalDocs — RAG over your own documents
* 🧩 Runs models in the widely used GGUF format
* 🐍 Full Python API with `pip install gpt4all`
* 💬 Beautiful desktop UI (not relevant for server, but good for local testing)

***

## Requirements

### Hardware Requirements

| Tier            | GPU           | VRAM  | RAM   | Storage    | Clore.ai Price          |
| --------------- | ------------- | ----- | ----- | ---------- | ----------------------- |
| **CPU-only**    | None          | —     | 16 GB | 50 GB SSD  | \~$0.02/hr (CPU server) |
| **Entry GPU**   | RTX 3060 12GB | 12 GB | 16 GB | 50 GB SSD  | \~$0.10/hr              |
| **Recommended** | RTX 3090      | 24 GB | 32 GB | 100 GB SSD | \~$0.20/hr              |
| **High-end**    | RTX 4090      | 24 GB | 64 GB | 200 GB SSD | \~$0.35/hr              |

> **Note:** GPT4All GPU support uses CUDA via llama.cpp under the hood. Unlike vLLM, it does **not** require a specific CUDA compute capability — RTX 10xx and newer generally work.

### Model VRAM Requirements (GGUF Q4\_K\_M)

| Model                 | Size on Disk | VRAM    | Min GPU     |
| --------------------- | ------------ | ------- | ----------- |
| Phi-3 Mini 3.8B       | \~2.4 GB     | \~3 GB  | RTX 3060    |
| Mistral 7B Instruct   | \~4.1 GB     | \~5 GB  | RTX 3060    |
| Llama 3.1 8B Instruct | \~4.7 GB     | \~6 GB  | RTX 3060    |
| Llama 3 70B Instruct  | \~40 GB      | \~45 GB | A100 80GB   |
| Mixtral 8x7B          | \~26 GB      | \~30 GB | 2× RTX 3090 |
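The VRAM figures above follow a rough pattern: the model's file size, plus overhead for compute buffers, plus a KV cache that grows with context length. A small rule-of-thumb helper (an illustrative sketch, not an official formula — the 10% overhead and per-4k-token KV cache figures are approximations for Q4 7B-class models):

```python
# Rough VRAM fit check for a GGUF model (rule-of-thumb sketch, not exact).
# Assumes full GPU offload, ~10% buffer overhead, and a KV cache of roughly
# 0.5 GB per 4k tokens of context for a 7B-class model.
def fits_in_vram(file_size_gb: float, vram_gb: float,
                 n_ctx: int = 4096, kv_gb_per_4k: float = 0.5) -> bool:
    overhead = 0.10 * file_size_gb             # compute buffers, scratch space
    kv_cache = kv_gb_per_4k * (n_ctx / 4096)   # grows linearly with context
    return file_size_gb + overhead + kv_cache <= vram_gb

# Mistral 7B Q4 (~4.1 GB on disk) on an RTX 3060 12 GB:
print(fits_in_vram(4.1, 12))   # True
# Llama 3 70B Q4 (~40 GB on disk) on an RTX 3090 24 GB:
print(fits_in_vram(40, 24))    # False
```

This matches the table: ~4.1 GB on disk works out to roughly 5 GB of VRAM at a 4k context.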

***

## Quick Start

### Step 1 — Rent a GPU Server on Clore.ai

1. Log in to [clore.ai](https://clore.ai)
2. Filter: **Docker enabled**, **GPU**: RTX 3090 (for 7B–13B models)
3. Deploy with image: `nvidia/cuda:12.1.0-runtime-ubuntu22.04`
4. Open ports: **4891** (GPT4All API), **22** (SSH)
5. Allocate at least **50 GB** of disk space

### Step 2 — Connect via SSH

```bash
ssh -p <CLORE_SSH_PORT> root@<CLORE_SERVER_IP>

# Verify GPU
nvidia-smi
# Should list your GPU with driver version
```

### Step 3 — Build the GPT4All Docker Image

Since there's no official GPT4All Docker image, we'll build one:

```bash
mkdir -p /workspace/gpt4all-server && cd /workspace/gpt4all-server

cat > Dockerfile << 'EOF'
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3-pip \
    curl \
    wget \
    git \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Make python3.11 the default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Install GPT4All with CUDA support. Quote the version spec so the shell
# doesn't treat ">=" as a redirect, and invoke pip via python3.11 so the
# packages land in the interpreter the server actually runs on.
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install "gpt4all>=2.8.0" fastapi uvicorn aiofiles pydantic

# Create directories
RUN mkdir -p /models /workspace /app

WORKDIR /app

# Copy server script (will be mounted or baked in)
COPY server.py .

EXPOSE 4891

CMD ["python", "server.py"]
EOF
```

### Step 4 — Create the API Server Script

```bash
cat > /workspace/gpt4all-server/server.py << 'PYEOF'
#!/usr/bin/env python3
"""
GPT4All OpenAI-compatible API Server
Runs on port 4891 (GPT4All default)
"""

import os
import time
import json
import asyncio
from typing import Optional, List, Dict, Any
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import uvicorn
from gpt4all import GPT4All

# Configuration
MODEL_NAME = os.environ.get("MODEL_NAME", "mistral-7b-instruct-v0.1.Q4_0.gguf")
MODEL_PATH = os.environ.get("MODEL_PATH", "/models")
API_HOST = os.environ.get("API_HOST", "0.0.0.0")
API_PORT = int(os.environ.get("API_PORT", "4891"))
DEVICE = os.environ.get("DEVICE", "gpu")  # 'gpu', 'cpu', 'metal'
N_CTX = int(os.environ.get("N_CTX", "4096"))

app = FastAPI(title="GPT4All API Server", version="1.0.0")

# Global model instance
model = None

def load_model():
    global model
    print(f"Loading model: {MODEL_NAME}")
    print(f"Model path: {MODEL_PATH}")
    print(f"Device: {DEVICE}")
    model = GPT4All(
        model_name=MODEL_NAME,
        model_path=MODEL_PATH,
        device=DEVICE,
        n_ctx=N_CTX,
        allow_download=True,  # Downloads from GPT4All hub if not present
        verbose=True
    )
    print("Model loaded successfully!")

# --- Pydantic models ---

class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: float = 0.7
    max_tokens: int = 512
    top_p: float = 0.95
    top_k: int = 40
    stream: bool = False

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 512
    stream: bool = False

# --- API Routes ---

@app.get("/health")
async def health():
    return {"status": "ok", "model": MODEL_NAME, "device": DEVICE}

@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [{
            "id": MODEL_NAME,
            "object": "model",
            "created": int(time.time()),
            "owned_by": "gpt4all",
        }]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Format messages into a single prompt
    prompt_parts = []
    for msg in request.messages:
        if msg.role == "system":
            prompt_parts.append(f"### System:\n{msg.content}")
        elif msg.role == "user":
            prompt_parts.append(f"### Human:\n{msg.content}")
        elif msg.role == "assistant":
            prompt_parts.append(f"### Assistant:\n{msg.content}")
    prompt_parts.append("### Assistant:")
    full_prompt = "\n\n".join(prompt_parts)

    # The prompt is already fully formatted above, so call generate() directly;
    # wrapping it in chat_session() would apply the model's own prompt template
    # a second time on top of our "### Human:/### Assistant:" formatting.
    response_text = model.generate(
        full_prompt,
        max_tokens=request.max_tokens,
        temp=request.temperature,
        top_p=request.top_p,
        top_k=request.top_k,
    )

    return {
        "id": f"chatcmpl-{int(time.time())}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": response_text},
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": len(full_prompt.split()),
            "completion_tokens": len(response_text.split()),
            "total_tokens": len(full_prompt.split()) + len(response_text.split())
        }
    }

@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    response_text = model.generate(
        request.prompt,
        max_tokens=request.max_tokens,
        temp=request.temperature,
    )

    return {
        "id": f"cmpl-{int(time.time())}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "text": response_text,
            "index": 0,
            "finish_reason": "stop"
        }]
    }

if __name__ == "__main__":
    load_model()
    uvicorn.run(app, host=API_HOST, port=API_PORT, log_level="info")
PYEOF
```
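The server above accepts a `stream` flag but silently ignores it. A hedged sketch of how streaming could be added: the gpt4all library's `generate(..., streaming=True)` returns a token iterator, which can be wrapped into OpenAI-style SSE chunks and returned through the `StreamingResponse` that `server.py` already imports. The helper below is written against any token iterable so it can be tested without a loaded model:

```python
# Sketch: turn a token iterator into OpenAI-style "chat.completion.chunk"
# SSE events. With gpt4all, `tokens` would come from
# model.generate(full_prompt, streaming=True); here it can be any iterable.
import json
import time

def sse_chunks(tokens, model_name):
    created = int(time.time())
    for token in tokens:
        chunk = {
            "id": f"chatcmpl-{created}",
            "object": "chat.completion.chunk",
            "created": created,
            "model": model_name,
            "choices": [{"index": 0,
                         "delta": {"content": token},
                         "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # OpenAI clients stop reading at the [DONE] sentinel
    yield "data: [DONE]\n\n"

# In the /v1/chat/completions handler this would be returned as:
#   StreamingResponse(
#       sse_chunks(model.generate(full_prompt, streaming=True), request.model),
#       media_type="text/event-stream")
```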

### Step 5 — Build and Run

```bash
cd /workspace/gpt4all-server

# Build the Docker image
docker build -t gpt4all-server:latest .

# Download a model first (optional — server can also auto-download)
mkdir -p /workspace/models
wget -O /workspace/models/mistral-7b-instruct-v0.1.Q4_0.gguf \
  https://gpt4all.io/models/gguf/mistral-7b-instruct-v0.1.Q4_0.gguf

# Run with GPU support
docker run -d \
  --name gpt4all-server \
  --gpus all \
  --restart unless-stopped \
  -p 4891:4891 \
  -v /workspace/models:/models \
  -v /workspace/gpt4all-server/server.py:/app/server.py \
  -e MODEL_NAME="mistral-7b-instruct-v0.1.Q4_0.gguf" \
  -e MODEL_PATH="/models" \
  -e DEVICE="gpu" \
  -e N_CTX="4096" \
  gpt4all-server:latest

# Follow logs
docker logs -f gpt4all-server
```

### Step 6 — Test the API

```bash
# Health check
curl http://localhost:4891/health

# List models
curl http://localhost:4891/v1/models

# Chat completion
curl http://localhost:4891/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct-v0.1.Q4_0.gguf",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```

***

## Alternative: LocalAI Docker Image

For a more robust, production-ready Docker deployment that runs the **same GGUF models** as GPT4All, LocalAI is the recommended choice. It has an official Docker image, CUDA support, and is actively maintained:

```bash
# Pull LocalAI with CUDA support
docker pull localai/localai:latest-aio-gpu-nvidia-cuda-12

# Create models directory and download a GGUF model
mkdir -p /workspace/localai-models
wget -O /workspace/localai-models/mistral-7b.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

# Create model config
cat > /workspace/localai-models/mistral-7b.yaml << 'EOF'
name: mistral-7b
parameters:
  model: mistral-7b.gguf
  temperature: 0.7
  top_p: 0.95
  top_k: 40
  max_tokens: 2048
context_size: 4096
f16: true
gpu_layers: 35
threads: 8
EOF

# Run LocalAI
docker run -d \
  --name localai \
  --gpus all \
  --restart unless-stopped \
  -p 8080:8080 \
  -v /workspace/localai-models:/build/models \
  -e DEBUG=true \
  localai/localai:latest-aio-gpu-nvidia-cuda-12

# Test LocalAI (same OpenAI-compatible API)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

***

## Configuration

### Environment Variables for GPT4All Server

| Variable     | Default                  | Description                        |
| ------------ | ------------------------ | ---------------------------------- |
| `MODEL_NAME` | `mistral-7b-instruct...` | Model filename or GPT4All hub name |
| `MODEL_PATH` | `/models`                | Directory containing model files   |
| `DEVICE`     | `gpu`                    | `gpu`, `cpu`, or `metal` (macOS)   |
| `N_CTX`      | `4096`                   | Context window size (tokens)       |
| `API_HOST`   | `0.0.0.0`                | Bind address                       |
| `API_PORT`   | `4891`                   | Port for the API server            |

### Docker Compose Setup

```yaml
# /workspace/gpt4all-server/docker-compose.yml
version: '3.8'

services:
  gpt4all-server:
    build: .
    container_name: gpt4all-server
    restart: unless-stopped
    ports:
      - "4891:4891"
    volumes:
      - /workspace/models:/models
      - ./server.py:/app/server.py
    environment:
      - MODEL_NAME=mistral-7b-instruct-v0.1.Q4_0.gguf
      - MODEL_PATH=/models
      - DEVICE=gpu
      - N_CTX=4096
      - API_PORT=4891
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4891/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
```

```bash
docker compose up -d
docker compose logs -f
```

***

## GPU Acceleration

### Verifying GPU Usage

The GPT4All Python library runs inference through `llama.cpp`, which provides the CUDA support:

```bash
# Check GPU VRAM usage after model load
watch -n 2 nvidia-smi

# Check inside container that CUDA is available
docker exec gpt4all-server python3 -c "
from gpt4all import GPT4All
devices = GPT4All.list_gpus()
print('Available GPUs:', devices)
"
```

### Selecting GPU Layers

The number of transformer layers offloaded to the GPU controls how much of the model runs on GPU vs CPU. LocalAI exposes this as `gpu_layers` in its model YAML (llama.cpp calls it `n_gpu_layers`); the GPT4All Python library exposes it as the `ngl` constructor parameter:

```python
# In server.py — offload all layers to GPU
model = GPT4All(
    model_name=MODEL_NAME,
    model_path=MODEL_PATH,
    device="gpu",
    n_ctx=N_CTX,
    ngl=100,  # layers to offload; 100 covers all layers of typical 7B models
)
```

```bash
# Restart the container after editing server.py (it is bind-mounted, so no rebuild needed)
docker stop gpt4all-server && docker rm gpt4all-server
docker run -d \
  --name gpt4all-server \
  --gpus all \
  -p 4891:4891 \
  -v /workspace/models:/models \
  -v /workspace/gpt4all-server/server.py:/app/server.py \
  -e DEVICE=gpu \
  -e MODEL_NAME=mistral-7b-instruct-v0.1.Q4_0.gguf \
  gpt4all-server:latest
```

### CPU Fallback Mode

If no GPU is available (e.g., CPU-only Clore.ai server for testing):

```bash
docker run -d \
  --name gpt4all-server-cpu \
  -p 4891:4891 \
  -v /workspace/models:/models \
  -e DEVICE=cpu \
  -e MODEL_NAME=Phi-3-mini-4k-instruct.Q4_0.gguf \
  gpt4all-server:latest
```

> ⚠️ CPU inference is **10–50× slower** than GPU. For CPU-only servers, use small models (Phi-3 Mini, TinyLlama) and expect 2–5 tokens/sec.
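The practical impact of those throughput figures is easy to quantify: at CPU speeds, even a short completion takes over a minute of wall-clock time. A back-of-the-envelope helper (illustrative only; the tokens/sec figures are the rough estimates quoted above):

```python
# Estimate wall-clock generation time from throughput (tokens/sec).
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    return n_tokens / tokens_per_sec

# A 256-token reply on CPU at ~3 tok/s vs an RTX 3090 at ~40 tok/s:
print(round(generation_seconds(256, 3), 1))   # ~85.3 seconds on CPU
print(round(generation_seconds(256, 40), 1))  # ~6.4 seconds on GPU
```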

***

## Tips & Best Practices

### 📥 Pre-downloading Models

Instead of relying on auto-download at startup, pre-download models for faster restarts:

```bash
# Download popular GPT4All models
mkdir -p /workspace/models

# Mistral 7B (most popular, good quality)
wget -q -O /workspace/models/mistral-7b-instruct-v0.1.Q4_0.gguf \
  "https://gpt4all.io/models/gguf/mistral-7b-instruct-v0.1.Q4_0.gguf"

# Phi-3 Mini (fastest, smallest)
wget -q -O /workspace/models/Phi-3-mini-4k-instruct.Q4_0.gguf \
  "https://gpt4all.io/models/gguf/Phi-3-mini-4k-instruct.Q4_0.gguf"

# Llama 3 (best quality in 8B range)
wget -q -O /workspace/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf \
  "https://gpt4all.io/models/gguf/Meta-Llama-3-8B-Instruct.Q4_0.gguf"

ls -lh /workspace/models/
```
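Interrupted downloads are a common cause of "model fails to load" errors, so it is worth verifying files after downloading. A small streaming SHA-256 helper (the expected hash values themselves would come from the model's download page; none are assumed here):

```python
# Compute a file's SHA-256 in streaming fashion so multi-GB GGUF files
# don't need to fit in memory.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 1 MB chunks until read() returns the empty-bytes sentinel
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: compare against the checksum published alongside the model
# print(sha256_of("/workspace/models/mistral-7b-instruct-v0.1.Q4_0.gguf"))
```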

### 🔌 Using with Python Applications

```python
# Direct Python usage (without Docker API)
from gpt4all import GPT4All

model = GPT4All(
    model_name="mistral-7b-instruct-v0.1.Q4_0.gguf",
    model_path="/workspace/models",
    device="gpu"
)

# Simple generation
with model.chat_session():
    response = model.generate("Explain GPU computing in simple terms", max_tokens=200)
    print(response)

# Using the API server with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4891/v1",
    api_key="not-needed"
)

completion = client.chat.completions.create(
    model="mistral-7b-instruct-v0.1.Q4_0.gguf",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(completion.choices[0].message.content)
```

### 💰 Cost Optimization on Clore.ai

```bash
# RTX 3090 @ $0.20/hr — use for 7B models (best value)
# Expected throughput: ~40 tokens/sec for Mistral 7B Q4
# Cost per 1M tokens generated: ~$1.40 ($0.20/hr ÷ ~144k tokens/hr) — far cheaper than most hosted APIs

# RTX 4090 @ $0.35/hr — use for 13B models or when speed matters
# Expected throughput: ~60 tokens/sec for Mistral 7B Q4

# For batch processing: pre-load model, process all prompts, shut down
docker run --rm \
  --gpus all \
  -v /workspace/models:/models \
  -v /workspace/prompts:/prompts \
  gpt4all-server:latest \
  python3 -c "
from gpt4all import GPT4All
import json

model = GPT4All('mistral-7b-instruct-v0.1.Q4_0.gguf', '/models', device='gpu')
prompts = open('/prompts/batch.txt').readlines()
results = []
for p in prompts:
    with model.chat_session():
        results.append(model.generate(p.strip(), max_tokens=256))
json.dump(results, open('/prompts/results.json', 'w'))
print(f'Processed {len(results)} prompts')
"
```
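The per-token economics above reduce to one formula: hourly price divided by tokens generated per hour. A tiny calculator (illustrative; the throughput numbers are the rough estimates from the comments above):

```python
# Cost per 1M generated tokens = hourly price / (tokens/sec * 3600) * 1e6
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# RTX 3090 at $0.20/hr, ~40 tok/s:
print(round(cost_per_million_tokens(0.20, 40), 2))  # ~$1.39 per 1M tokens
# RTX 4090 at $0.35/hr, ~60 tok/s:
print(round(cost_per_million_tokens(0.35, 60), 2))  # ~$1.62 per 1M tokens
```

Note these figures assume the GPU is generating continuously; idle time between requests raises the effective cost.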

***

## Troubleshooting

### Model fails to load — file not found

```bash
# Check model file exists and has correct name
ls -lh /workspace/models/
docker exec gpt4all-server ls /models/

# GPT4All is case-sensitive with model names
# Use exact filename from ls output as MODEL_NAME
docker stop gpt4all-server && docker rm gpt4all-server
docker run -d --gpus all -p 4891:4891 \
  -v /workspace/models:/models \
  -e MODEL_NAME=mistral-7b-instruct-v0.1.Q4_0.gguf \
  gpt4all-server:latest
```

### CUDA error: no kernel image for this architecture

```bash
# Your GPU might not be compatible with the CUDA version
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# If < 6.0, fall back to CPU mode (omit --gpus since the GPU won't be used)
docker run -d -p 4891:4891 \
  -v /workspace/models:/models \
  -e DEVICE=cpu \
  -e MODEL_NAME=Phi-3-mini-4k-instruct.Q4_0.gguf \
  gpt4all-server:latest
```

### API returns 503 — model not loaded

```bash
# Check startup logs
docker logs gpt4all-server | head -50

# Model loading can take 30–120 seconds
# Wait and retry:
sleep 60 && curl http://localhost:4891/health

# Check if the model file is corrupted (requires `pip install gpt4all` on the
# host, or run it inside the container via `docker exec`)
python3 -c "
from gpt4all import GPT4All
m = GPT4All('mistral-7b-instruct-v0.1.Q4_0.gguf', '/workspace/models')
print('Model OK:', m)
"
```
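Rather than a fixed `sleep 60`, a small loop can poll `/health` until the model finishes loading. A sketch (the `fetch` parameter is injectable purely so the function can be tested without a live server):

```python
# Poll a health endpoint until it returns HTTP 200 or retries are exhausted.
import time
import urllib.error
import urllib.request

def wait_for_health(url="http://localhost:4891/health",
                    retries=30, delay=5.0, fetch=None):
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status
    for _ in range(retries):
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the delay
        time.sleep(delay)
    return False

# if wait_for_health():
#     print("Server ready")
```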

### Port 4891 not accessible from outside

```bash
# Verify port binding
docker ps | grep 4891
# Should show: 0.0.0.0:4891->4891/tcp

# Check if Clore.ai has firewall rules
# In Clore.ai server settings, ensure port 4891 is listed as open

# Test internally:
curl http://127.0.0.1:4891/health

# Note: Clore.ai maps ports randomly — use the port shown in your server dashboard
```

***

## Further Reading

* [GPT4All GitHub](https://github.com/nomic-ai/gpt4all) — Main repository
* [GPT4All Python Docs](https://docs.gpt4all.io/) — Python API reference
* [GPT4All Model Explorer](https://gpt4all.io/models/gguf/) — Browse available models
* [LocalAI Documentation](https://localai.io/) — Docker-friendly alternative
* [Ollama on Clore.ai](https://docs.clore.ai/guides/language-models/ollama) — Easier Docker LLM deployment
* [vLLM on Clore.ai](https://docs.clore.ai/guides/language-models/vllm) — Production inference server
* [GPU Comparison Guide](https://docs.clore.ai/guides/getting-started/gpu-comparison) — Pick the right Clore.ai GPU
* [TheBloke on HuggingFace](https://huggingface.co/TheBloke) — Thousands of GGUF quantizations
* [GGUF Format Explained](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) — Model format docs

> 💡 **Recommendation:** If you want the simplest Docker deployment for local LLMs, consider [Ollama](https://docs.clore.ai/guides/language-models/ollama) instead — it has an official Docker image, built-in GPU support, and is specifically designed for server-side deployment. GPT4All's strength is its beautiful desktop UI and LocalDocs (RAG) features, which aren't available in server mode.
