# Ollama

The easiest way to run LLMs locally on CLORE.AI GPUs.

{% hint style="info" %}
**Current Version: v0.6+** — This guide covers Ollama v0.6 and later. Key new features include structured outputs (JSON schema enforcement), OpenAI-compatible embeddings endpoint (`/api/embed`), and concurrent model loading (run multiple models simultaneously without swapping). See [New in v0.6+](#new-in-v06) for details.
{% endhint %}

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum      | Recommended |
| ------------ | ------------ | ----------- |
| RAM          | 8GB          | 16GB+       |
| VRAM         | 6GB          | 8GB+        |
| Network      | 100Mbps      | 500Mbps+    |
| Startup Time | \~30 seconds | -           |

{% hint style="info" %}
Ollama is lightweight and works on most GPU servers. For larger models (13B+), choose servers with 16GB+ RAM and 12GB+ VRAM.
{% endhint %}

## Why Ollama?

* **One-command setup** - No Python, no dependencies
* **Model library** - Download models with `ollama pull`
* **OpenAI-compatible API** - Drop-in replacement
* **GPU acceleration** - Automatic CUDA detection
* **Multi-model** - Run multiple models simultaneously (v0.6+)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
ollama/ollama
```

**Ports:**

```
22/tcp
11434/http
```

**Command:**

```bash
ollama serve
```

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
# Replace with your actual http_pub URL
curl https://your-http-pub.clorecloud.net/

# Expected response: "Ollama is running"
```

{% hint style="warning" %}
If you get HTTP 502, wait 30-60 seconds - the service is still starting.
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access your Ollama instance via the `http_pub` URL:

```bash
# Find your http_pub in My Orders, then:
curl https://your-http-pub.clorecloud.net/api/tags

# For API calls, use your http_pub URL:
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```

{% hint style="info" %}
All `localhost:11434` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Installation

### Using Docker (Recommended)

```bash
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

### Manual Installation

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

This single command installs the latest version of Ollama, sets up the systemd service, and configures GPU detection automatically. Works on Ubuntu, Debian, Fedora, and most modern Linux distributions.

## Running Models

### Pull and Run

```bash
# Pull model
ollama pull llama3.2

# Run interactive chat
ollama run llama3.2

# Run with prompt
ollama run llama3.2 "Explain quantum computing"
```

### Popular Models

| Model               | Size    | Use Case              |
| ------------------- | ------- | --------------------- |
| `llama3.2`          | 3B      | Fast, general purpose |
| `llama3.1`          | 8B      | Better quality        |
| `llama3.1:70b`      | 70B     | Best quality          |
| `mistral`           | 7B      | Fast, good quality    |
| `mixtral`           | 47B     | MoE, high quality     |
| `codellama`         | 7-34B   | Code generation       |
| `deepseek-coder-v2` | 16B     | Best for code         |
| `deepseek-r1`       | 7B-671B | Reasoning model       |
| `deepseek-r1:32b`   | 32B     | Balanced reasoning    |
| `qwen2.5`           | 7B      | Multilingual          |
| `qwen2.5:72b`       | 72B     | Best Qwen quality     |
| `phi4`              | 14B     | Microsoft's latest    |
| `gemma2`            | 9B      | Google's model        |

### Model Variants

```bash
# Quantization variants
ollama pull llama3.1:8b-instruct-q4_K_M   # 4-bit (smaller, faster)
ollama pull llama3.1:8b-instruct-q8_0     # 8-bit (better quality)
ollama pull llama3.1:8b-instruct-fp16     # Full precision

# Size variants
ollama pull llama3.1:8b    # 8 billion parameters
ollama pull llama3.1:70b   # 70 billion parameters

# New models (v0.6+ era)
ollama pull deepseek-r1:7b      # Reasoning, budget
ollama pull deepseek-r1:14b     # Reasoning, efficient
ollama pull deepseek-r1:32b     # Reasoning, balanced
ollama pull deepseek-r1:70b     # Reasoning, high quality
ollama pull qwen2.5:72b         # Largest Qwen, top quality
ollama pull phi4                # Microsoft Phi-4 14B
```
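After pulling a few variants, it helps to see what is actually on disk. A minimal sketch that lists downloaded tags and their sizes via `/api/tags` (the `OLLAMA_URL` constant and helper names are illustrative; swap in your `http_pub` URL for external access):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # or your https://your-http-pub.clorecloud.net

def fmt_size(num_bytes: int) -> str:
    """Format a byte count as GiB with one decimal place."""
    return f"{num_bytes / 1073741824:.1f} GiB"

def list_models(base_url: str = OLLAMA_URL) -> list[tuple[str, str]]:
    """Return (name, size) pairs for every downloaded model tag."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        data = json.load(resp)
    return [(m["name"], fmt_size(m["size"])) for m in data["models"]]

if __name__ == "__main__":
    for name, size in list_models():
        print(f"{name:40s} {size}")
```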

## New in v0.6+

Ollama v0.6 introduced several major features for production workloads:

### Structured Outputs (JSON Schema)

Force model responses to match a specific JSON schema. Useful for building applications that need reliable, parseable output:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "capital": {"type": "string"},
      "population": {"type": "integer"},
      "languages": {
        "type": "array",
        "items": {"type": "string"}
      }
    },
    "required": ["name", "capital", "population", "languages"]
  },
  "stream": false
}'
```

Python example with structured outputs:

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "List 3 programming languages with their main use cases"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "use_case": {"type": "string"},
                                "popularity_rank": {"type": "integer"}
                            }
                        }
                    }
                }
            }
        }
    }
)

data = json.loads(response.choices[0].message.content)
print(data)
```
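If you use Pydantic, you can generate the schema from a class instead of writing it by hand and validate the reply in one step. A minimal sketch against the native `/api/chat` endpoint (assumes `pydantic` is installed; `Country` and `ask_structured` are illustrative names):

```python
import json
import urllib.request

from pydantic import BaseModel  # third-party: pip install pydantic

class Country(BaseModel):
    name: str
    capital: str
    population: int
    languages: list[str]

def ask_structured(prompt: str, base_url: str = "http://localhost:11434") -> Country:
    """Send a chat request whose 'format' field is a Pydantic-generated JSON schema."""
    body = json.dumps({
        "model": "llama3.2",
        "messages": [{"role": "user", "content": prompt}],
        "format": Country.model_json_schema(),  # schema derived from the class
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # model_validate_json raises if the reply doesn't match the schema
    return Country.model_validate_json(reply["message"]["content"])

if __name__ == "__main__":
    print(ask_structured("Tell me about Canada."))
```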

### OpenAI-Compatible Embeddings Endpoint (`/api/embed`)

New in v0.6+: the `/api/embed` endpoint is fully OpenAI-compatible and supports batched inputs:

```bash
# Single text embedding
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world"
}'

# Batch embeddings (new in v0.6)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["First document", "Second document", "Third document"]
}'
```

OpenAI client works directly with `/v1/embeddings`:

```python
from openai import OpenAI
import numpy as np

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Pull embedding model first: ollama pull nomic-embed-text
response = client.embeddings.create(
    model="nomic-embed-text",
    input=["Hello world", "Goodbye world"]
)

emb1 = np.array(response.data[0].embedding)
emb2 = np.array(response.data[1].embedding)

# Cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(f"Similarity: {similarity:.4f}")
```
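The same similarity measure scales to a small semantic search: embed your documents once, then rank them against each query vector. A sketch using only NumPy (`rank_by_similarity` is an illustrative helper; any equal-dimension vectors work, including ones from the embeddings call above):

```python
import numpy as np

def rank_by_similarity(query_emb, doc_embs):
    """Return document indices sorted by cosine similarity to the query, best first."""
    q = np.asarray(query_emb, dtype=float)
    d = np.asarray(doc_embs, dtype=float)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    return list(np.argsort(-sims)), sims

# Tiny synthetic example: doc 0 points the same way as the query
order, sims = rank_by_similarity([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
print(order)  # [0, 2, 1]
```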

Popular embedding models:

```bash
ollama pull nomic-embed-text      # 137M, fast, good quality
ollama pull mxbai-embed-large     # 335M, higher quality
ollama pull all-minilm            # 23M, fastest
```

### Concurrent Model Loading

Before v0.6, Ollama would unload one model to load another. v0.6+ supports running multiple models simultaneously, limited only by available VRAM:

```bash
# Preload two models (a request with no prompt loads a model without generating)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2"}'
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:7b"}'

# Check what's running
curl http://localhost:11434/api/ps
```

Configure concurrency:

```bash
# Allow up to 4 models loaded simultaneously
OLLAMA_MAX_LOADED_MODELS=4 ollama serve

# Serve up to 2 requests per model in parallel
OLLAMA_NUM_PARALLEL=2 ollama serve
```

This is especially useful for:

* A/B testing different models
* Specialized models for different tasks (coding + chat)
* Keeping frequently-used models warm in VRAM

## API Usage

### Chat Completion

```bash
# Via http_pub (external access):
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

# Via SSH tunnel (localhost):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```

{% hint style="info" %}
Add `"stream": false` to get the complete response at once instead of streaming.
{% endhint %}

### OpenAI-Compatible Endpoint

```python
from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="ollama"  # any string works
)

# Or via SSH tunnel:
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ]
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Embeddings

```bash
# Legacy endpoint (still works)
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'

# New v0.6+ endpoint (batch support, OpenAI-compatible)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello world", "Another text"]
}'
```

### Text Generation (Non-Chat)

```bash
curl https://your-http-pub.clorecloud.net/api/generate -d '{
  "model": "llama3.2",
  "prompt": "The meaning of life is",
  "stream": false
}'
```

## Complete API Reference

All endpoints work with both `http://localhost:11434` (via SSH) and `https://your-http-pub.clorecloud.net` (external).

### Model Management

| Endpoint       | Method | Description                   |
| -------------- | ------ | ----------------------------- |
| `/api/tags`    | GET    | List all downloaded models    |
| `/api/show`    | POST   | Get model details             |
| `/api/pull`    | POST   | Download a model              |
| `/api/delete`  | DELETE | Remove a model                |
| `/api/ps`      | GET    | List currently running models |
| `/api/version` | GET    | Get Ollama version            |

#### List Models

```bash
curl https://your-http-pub.clorecloud.net/api/tags
```

Response:

```json
{
  "models": [
    {"name": "llama3.2:latest", "size": 2019393189, "digest": "...", "modified_at": "..."}
  ]
}
```

#### Show Model Details

```bash
curl https://your-http-pub.clorecloud.net/api/show -d '{"name": "llama3.2"}'
```

#### Pull Model via API

```bash
curl https://your-http-pub.clorecloud.net/api/pull -d '{
  "name": "mistral:7b",
  "stream": false
}'
```

Response:

```json
{"status": "success"}
```
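With streaming left on (the default), `/api/pull` returns one JSON status object per line, with `completed` and `total` byte counts you can turn into a progress display. A sketch (`pull_with_progress` and `pct` are illustrative names):

```python
import json
import urllib.request

def pct(status: dict):
    """Percent complete for one status line, or None if it has no byte counts."""
    if status.get("total"):
        return 100.0 * status.get("completed", 0) / status["total"]
    return None

def pull_with_progress(name: str, base_url: str = "http://localhost:11434") -> None:
    """Stream /api/pull and print download progress as it arrives."""
    body = json.dumps({"name": name}).encode()
    req = urllib.request.Request(f"{base_url}/api/pull", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line
            status = json.loads(line)
            p = pct(status)
            label = status.get("status", "")
            print(f"{label} {p:.1f}%" if p is not None else label)

if __name__ == "__main__":
    pull_with_progress("mistral:7b")
```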

{% hint style="warning" %}
Large models may take several minutes to download. For very large models (30GB+), consider using SSH and the CLI: `ollama pull model-name`
{% endhint %}

#### Delete Model

```bash
curl -X DELETE https://your-http-pub.clorecloud.net/api/delete -d '{"name": "mistral:7b"}'
```

#### List Running Models

```bash
curl https://your-http-pub.clorecloud.net/api/ps
```

Response:

```json
{
  "models": [
    {"name": "llama3.2:latest", "size": 2019393189, "expires_at": "2025-01-25T12:00:00Z"}
  ]
}
```

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/api/version
```

Response:

```json
{"version": "0.6.8"}
```

### Inference Endpoints

| Endpoint               | Method | Description                                          |
| ---------------------- | ------ | ---------------------------------------------------- |
| `/api/generate`        | POST   | Text completion                                      |
| `/api/chat`            | POST   | Chat completion                                      |
| `/api/embeddings`      | POST   | Generate embeddings (legacy)                         |
| `/api/embed`           | POST   | Generate embeddings v0.6+ (batch, OpenAI-compatible) |
| `/v1/chat/completions` | POST   | OpenAI-compatible chat                               |
| `/v1/embeddings`       | POST   | OpenAI-compatible embeddings                         |

### Custom Model Creation

Create custom models with specific system prompts via API:

```bash
curl https://your-http-pub.clorecloud.net/api/create -d '{
  "model": "my-assistant",
  "from": "llama3.2",
  "system": "You are a helpful coding assistant."
}'
```

## GPU Configuration

### Check GPU Usage

```bash
# In container or server
nvidia-smi

# Ollama shows GPU in logs
ollama run llama3.2 --verbose
```

### Multi-GPU

Ollama automatically uses available GPUs. For specific GPU:

```bash
CUDA_VISIBLE_DEVICES=0 ollama serve
```

### Memory Management

```bash
# Keep model loaded
OLLAMA_KEEP_ALIVE=24h ollama serve

# Allow concurrent models (v0.6+)
OLLAMA_MAX_LOADED_MODELS=3 ollama serve

# Reduce KV-cache memory usage with flash attention
OLLAMA_FLASH_ATTENTION=1 ollama serve
```
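When running the official Docker image on CLORE.AI, the same settings are passed as container environment variables instead of a shell prefix. A sketch of the equivalent `docker run` (values are examples, tune to your workload):

```shell
docker run -d --gpus all -p 11434:11434 \
  -e OLLAMA_KEEP_ALIVE=24h \
  -e OLLAMA_MAX_LOADED_MODELS=3 \
  -v ollama:/root/.ollama \
  --name ollama ollama/ollama
```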

## Custom Models (Modelfile)

Create custom models with system prompts:

```dockerfile
# Modelfile
FROM llama3.2

SYSTEM You are a helpful coding assistant. Always provide code examples.

PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

```bash
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
```

## Running as Service

### Systemd

```ini
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
Environment="OLLAMA_HOST=0.0.0.0"

[Install]
WantedBy=multi-user.target
```

```bash
systemctl enable ollama
systemctl start ollama
```

## Performance Tips

1. **Use appropriate quantization**
   * Q4\_K\_M for speed
   * Q8\_0 for quality
   * fp16 for maximum quality
2. **Match model to VRAM**
   * 8GB: 7B models (Q4)
   * 16GB: 13B models or 7B (Q8)
   * 24GB: 34B models (Q4)
   * 48GB+: 70B models
3. **Keep model loaded**

   ```bash
   OLLAMA_KEEP_ALIVE=1h ollama serve
   ```
4. **Fast SSD improves performance**
   * Model loading and KV cache benefit from fast storage
   * Servers with NVMe SSD can achieve 2-3x better performance

## Benchmarks

### Generation Speed (tokens/sec)

| Model                | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| -------------------- | -------- | -------- | -------- | --------- |
| Llama 3.2 3B (Q4)    | 120      | 160      | 200      | 220       |
| Llama 3.1 8B (Q4)    | 60       | 100      | 130      | 150       |
| Llama 3.1 8B (Q8)    | 45       | 80       | 110      | 130       |
| Mistral 7B (Q4)      | 70       | 110      | 140      | 160       |
| Mixtral 8x7B (Q4)    | -        | 35       | 55       | 75        |
| Llama 3.1 70B (Q4)   | -        | -        | 18       | 35        |
| DeepSeek-R1 7B (Q4)  | 65       | 105      | 135      | 155       |
| DeepSeek-R1 32B (Q4) | -        | -        | 22       | 42        |
| Qwen2.5 72B (Q4)     | -        | -        | 15       | 30        |
| Phi-4 14B (Q4)       | -        | 50       | 75       | 90        |

*Benchmarks updated January 2026. Actual speeds may vary based on server configuration.*

### Time to First Token (ms)

| Model | RTX 3090 | RTX 4090 | A100 |
| ----- | -------- | -------- | ---- |
| 3B    | 50       | 35       | 25   |
| 7-8B  | 120      | 80       | 60   |
| 13B   | 250      | 150      | 100  |
| 34B   | 600      | 350      | 200  |
| 70B   | -        | 1200     | 500  |

### Context Length vs VRAM (Q4)

| Model | 2K ctx | 4K ctx | 8K ctx | 16K ctx |
| ----- | ------ | ------ | ------ | ------- |
| 7B    | 5GB    | 6GB    | 8GB    | 12GB    |
| 13B   | 8GB    | 10GB   | 14GB   | 22GB    |
| 34B   | 20GB   | 24GB   | 32GB   | 48GB    |
| 70B   | 40GB   | 48GB   | 64GB   | 96GB    |

## GPU Requirements

| Model | Q4 VRAM | Q8 VRAM |
| ----- | ------- | ------- |
| 3B    | 3GB     | 5GB     |
| 7-8B  | 5GB     | 9GB     |
| 13B   | 8GB     | 15GB    |
| 34B   | 20GB    | 38GB    |
| 70B   | 40GB    | 75GB    |
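The table above roughly follows a rule of thumb: weights take `parameters × bits / 8` bytes, plus around 20% overhead for runtime buffers at short context. A back-of-the-envelope sketch (the 1.2 overhead factor is an assumption, not an Ollama constant; real usage grows with context length):

```python
def est_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for runtime buffers."""
    return round(params_billion * bits / 8 * overhead, 1)

# Roughly matches the table above
print(est_vram_gb(8, 4))    # ~4.8 GB, table says 5GB
print(est_vram_gb(8, 8))    # ~9.6 GB, table says 9GB
print(est_vram_gb(70, 4))   # ~42.0 GB, table says 40GB
```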

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For         |
| -------- | ---- | ---------- | ---------------- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models        |
| RTX 3090 | 24GB | $0.30–1.00 | 13B-34B models   |
| RTX 4090 | 24GB | $0.50–2.00 | 34B models, fast |
| A100     | 40GB | $1.50–3.00 | 70B models       |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

## Troubleshooting

### Model won't load

```bash
# Check available memory
nvidia-smi

# Try smaller quantization
ollama pull llama3.1:8b-instruct-q4_0
```

### Slow generation

```bash
# Check if GPU is used
ollama run llama3.2 --verbose

# Ensure CUDA is available
nvidia-smi
```

### Connection refused

```bash
# Make sure server is running
ollama serve

# Check if binding to all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
```

### HTTP 502 on http\_pub URL

This means the service is still starting. Wait 30-60 seconds and retry:

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/

# Expected: "Ollama is running"
# If 502: wait and retry
```

## Next Steps

* [Open WebUI](https://docs.clore.ai/guides/language-models/open-webui) - Beautiful chat interface for Ollama
* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - High-throughput production serving
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning model
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best general model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual alternative
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Advanced features
