# Ollama

The easiest way to run LLMs on CLORE.AI GPU servers.

{% hint style="info" %}
**Current Version: v0.6+** — This guide covers Ollama v0.6 and later. Key new features include structured outputs (JSON schema enforcement), OpenAI-compatible embeddings endpoint (`/api/embed`), and concurrent model loading (run multiple models simultaneously without swapping). See [New in v0.6+](#new-in-v06) for details.
{% endhint %}

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum      | Recommended |
| ------------ | ------------ | ----------- |
| RAM          | 8GB          | 16GB+       |
| VRAM         | 6GB          | 8GB+        |
| Network      | 100Mbps      | 500Mbps+    |
| Startup Time | \~30 seconds | -           |

{% hint style="info" %}
Ollama is lightweight and works on most GPU servers. For larger models (13B+), choose servers with 16GB+ RAM and 12GB+ VRAM.
{% endhint %}

## Why Ollama?

* **One-command setup** - No Python, no dependencies
* **Model library** - Download models with `ollama pull`
* **OpenAI-compatible API** - Drop-in replacement
* **GPU acceleration** - Automatic CUDA detection
* **Multi-model** - Run multiple models simultaneously (v0.6+)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
ollama/ollama
```

**Ports:**

```
22/tcp
11434/http
```

**Command:**

```bash
ollama serve
```

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
# Replace with your actual http_pub URL
curl https://your-http-pub.clorecloud.net/

# Expected response: "Ollama is running"
```

{% hint style="warning" %}
If you get HTTP 502, wait 30-60 seconds - the service is still starting.
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access your Ollama instance via the `http_pub` URL:

```bash
# Find your http_pub in My Orders, then:
curl https://your-http-pub.clorecloud.net/api/tags

# For API calls, use your http_pub URL:
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```

{% hint style="info" %}
All `localhost:11434` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}
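
If you script against the API, it helps to keep the base URL configurable so the same code works over an SSH tunnel or via `http_pub`. A minimal sketch using `requests`; the `OLLAMA_BASE_URL` variable is just a convention for these examples, not a setting Ollama itself reads:

```python
import os

import requests

# OLLAMA_BASE_URL is our own convention for these examples, not an Ollama setting:
# set it to your http_pub URL for external access, or leave it unset when using
# an SSH tunnel or a local install on localhost:11434.
BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")

resp = requests.get(f"{BASE_URL}/")
print(resp.status_code, resp.text)  # expect: 200 and "Ollama is running"
```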

## Installation

### Using Docker (Recommended)

```bash
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

### Manual Installation

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

This single command installs the latest version of Ollama, sets up the systemd service, and configures GPU detection automatically. Works on Ubuntu, Debian, Fedora, and most modern Linux distributions.

## Running Models

### Pull and Run

```bash
# Pull model
ollama pull llama3.2

# Run interactive chat
ollama run llama3.2

# Run with prompt
ollama run llama3.2 "Explain quantum computing"
```

### Popular Models

| Model               | Size    | Use Case              |
| ------------------- | ------- | --------------------- |
| `llama3.2`          | 3B      | Fast, general purpose |
| `llama3.1`          | 8B      | Better quality        |
| `llama3.1:70b`      | 70B     | Best quality          |
| `mistral`           | 7B      | Fast, good quality    |
| `mixtral`           | 47B     | MoE, high quality     |
| `codellama`         | 7-34B   | Code generation       |
| `deepseek-coder-v2` | 16B     | Best for code         |
| `deepseek-r1`       | 7B-671B | Reasoning model       |
| `deepseek-r1:32b`   | 32B     | Balanced reasoning    |
| `qwen2.5`           | 7B      | Multilingual          |
| `qwen2.5:72b`       | 72B     | Best Qwen quality     |
| `phi4`              | 14B     | Microsoft's latest    |
| `gemma2`            | 9B      | Google's model        |

### Model Variants

```bash
# Quantization variants
ollama pull llama3.1:8b-instruct-q4_K_M   # 4-bit (smaller, faster)
ollama pull llama3.1:8b-instruct-q8_0     # 8-bit (better quality)
ollama pull llama3.1:8b-instruct-fp16     # Full precision

# Size variants
ollama pull llama3.1:8b    # 8 billion parameters
ollama pull llama3.1:70b   # 70 billion parameters

# New models (v0.6+ era)
ollama pull deepseek-r1:7b      # Reasoning, budget
ollama pull deepseek-r1:14b     # Reasoning, efficient
ollama pull deepseek-r1:32b     # Reasoning, balanced
ollama pull deepseek-r1:70b     # Reasoning, high quality
ollama pull qwen2.5:72b         # Largest Qwen, top quality
ollama pull phi4                # Microsoft Phi-4 14B
```

## New in v0.6+

Ollama v0.6 introduced several major features for production workloads:

### Structured Outputs (JSON Schema)

Force model responses to match a specific JSON schema. Useful for building applications that need reliable, parseable output:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "capital": {"type": "string"},
      "population": {"type": "integer"},
      "languages": {
        "type": "array",
        "items": {"type": "string"}
      }
    },
    "required": ["name", "capital", "population", "languages"]
  },
  "stream": false
}'
```

Python example with structured outputs:

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "List 3 programming languages with their main use cases"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "use_case": {"type": "string"},
                                "popularity_rank": {"type": "integer"}
                            }
                        }
                    }
                }
            }
        }
    }
)

data = json.loads(response.choices[0].message.content)
print(data)
```

### OpenAI-Compatible Embeddings Endpoint (`/api/embed`)

New in v0.6+: the `/api/embed` endpoint uses the OpenAI-style `input` field, accepts single strings or batches, and is mirrored by the OpenAI-compatible `/v1/embeddings` endpoint:

```bash
# Single text embedding
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world"
}'

# Batch embeddings (new in v0.6)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["First document", "Second document", "Third document"]
}'
```

OpenAI client works directly with `/v1/embeddings`:

```python
from openai import OpenAI
import numpy as np

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Pull embedding model first: ollama pull nomic-embed-text
response = client.embeddings.create(
    model="nomic-embed-text",
    input=["Hello world", "Goodbye world"]
)

emb1 = np.array(response.data[0].embedding)
emb2 = np.array(response.data[1].embedding)

# Cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(f"Similarity: {similarity:.4f}")
```

Popular embedding models:

```bash
ollama pull nomic-embed-text      # 137M, fast, good quality
ollama pull mxbai-embed-large     # 335M, higher quality
ollama pull all-minilm            # 23M, fastest
```
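
As a small illustration of the batch endpoint, the sketch below embeds a few documents and a query with `nomic-embed-text` and ranks them by cosine similarity. It relies on the native `/api/embed` response containing an `embeddings` list (one vector per input); the documents are made up for the example:

```python
import numpy as np
import requests

BASE_URL = "http://localhost:11434"  # or your http_pub URL

docs = [
    "Ollama runs LLMs on your own hardware.",
    "CLORE.AI rents out GPU servers by the day.",
    "Bananas are yellow.",
]
query = "Where can I rent a GPU?"

def embed(texts):
    # /api/embed returns {"embeddings": [[...], ...]} - one vector per input
    r = requests.post(f"{BASE_URL}/api/embed",
                      json={"model": "nomic-embed-text", "input": texts})
    return np.array(r.json()["embeddings"])

doc_vecs = embed(docs)
q_vec = embed([query])[0]

# Cosine similarity between the query and each document
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```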

### Concurrent Model Loading

Before v0.6, Ollama would unload one model to load another. v0.6+ can keep multiple models loaded and serve them simultaneously, limited only by available VRAM:

```bash
# Load two models at the same time
ollama run llama3.2 "Hello" &
ollama run deepseek-r1:7b "Hello" &

# Check what's running
curl http://localhost:11434/api/ps
```

Configure concurrency:

```bash
# Allow up to 4 models loaded simultaneously
OLLAMA_MAX_LOADED_MODELS=4 ollama serve

# Handle up to 2 requests per loaded model in parallel
OLLAMA_NUM_PARALLEL=2 ollama serve
```

This is especially useful for:

* A/B testing different models
* Specialized models for different tasks (coding + chat)
* Keeping frequently-used models warm in VRAM
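
With `OLLAMA_MAX_LOADED_MODELS` set to 2 or more and both models pulled, here is a rough sketch of querying two resident models back to back, then confirming via `/api/ps` that neither was swapped out:

```python
import requests

BASE_URL = "http://localhost:11434"  # or your http_pub URL

# Assumes llama3.2 and deepseek-r1:7b are pulled and the server allows >= 2 loaded models
for model in ["llama3.2", "deepseek-r1:7b"]:
    r = requests.post(f"{BASE_URL}/api/chat", json={
        "model": model,
        "messages": [{"role": "user", "content": "In one sentence, what are you good at?"}],
        "stream": False,
    })
    print(f"{model}: {r.json()['message']['content']}")

# Both models should now show up as loaded
for m in requests.get(f"{BASE_URL}/api/ps").json()["models"]:
    print("loaded:", m["name"])
```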

## API Usage

### Chat Completion

```bash
# Via http_pub (external access):
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

# Via SSH tunnel (localhost):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```

{% hint style="info" %}
Add `"stream": false` to get the complete response at once instead of streaming.
{% endhint %}

### OpenAI-Compatible Endpoint

```python
from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="ollama"  # any string works
)

# Or via SSH tunnel:
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ]
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Embeddings

```bash
# Legacy endpoint (still works)
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'

# New v0.6+ endpoint (batch support, OpenAI-compatible)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello world", "Another text"]
}'
```

### Text Generation (Non-Chat)

```bash
curl https://your-http-pub.clorecloud.net/api/generate -d '{
  "model": "llama3.2",
  "prompt": "The meaning of life is",
  "stream": false
}'
```

## Complete API Reference

All endpoints work with both `http://localhost:11434` (via SSH) and `https://your-http-pub.clorecloud.net` (external).

### Model Management

| Endpoint       | Method | Description                   |
| -------------- | ------ | ----------------------------- |
| `/api/tags`    | GET    | List all downloaded models    |
| `/api/show`    | POST   | Get model details             |
| `/api/pull`    | POST   | Download a model              |
| `/api/delete`  | DELETE | Remove a model                |
| `/api/ps`      | GET    | List currently running models |
| `/api/version` | GET    | Get Ollama version            |

#### List Models

```bash
curl https://your-http-pub.clorecloud.net/api/tags
```

Response:

```json
{
  "models": [
    {"name": "llama3.2:latest", "size": 2019393189, "digest": "...", "modified_at": "..."}
  ]
}
```

#### Show Model Details

```bash
curl https://your-http-pub.clorecloud.net/api/show -d '{"name": "llama3.2"}'
```

#### Pull Model via API

```bash
curl https://your-http-pub.clorecloud.net/api/pull -d '{
  "name": "mistral:7b",
  "stream": false
}'
```

Response:

```json
{"status": "success"}
```

{% hint style="warning" %}
Large models may take several minutes to download. For very large models (30GB+), consider using SSH and the CLI: `ollama pull model-name`
{% endhint %}
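
If you do pull through the API, the streaming response (the default) emits JSON lines with `status` plus `completed`/`total` byte counts during the download, which can be turned into a simple progress indicator. A rough sketch; the model name is only an example:

```python
import json

import requests

BASE_URL = "http://localhost:11434"  # or your http_pub URL

# Stream pull progress as JSON lines (streaming is the default for /api/pull)
with requests.post(f"{BASE_URL}/api/pull",
                   json={"name": "mistral:7b"}, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        if "completed" in event and "total" in event:
            pct = 100 * event["completed"] / event["total"]
            print(f"\r{event['status']}: {pct:.1f}%", end="")
        else:
            print(f"\n{event['status']}", end="")
print()
```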

#### Delete Model

```bash
curl -X DELETE https://your-http-pub.clorecloud.net/api/delete -d '{"name": "mistral:7b"}'
```

#### List Running Models

```bash
curl https://your-http-pub.clorecloud.net/api/ps
```

Response:

```json
{
  "models": [
    {"name": "llama3.2:latest", "size": 2019393189, "expires_at": "2025-01-25T12:00:00Z"}
  ]
}
```

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/api/version
```

Response:

```json
{"version": "0.6.8"}
```

### Inference Endpoints

| Endpoint               | Method | Description                                          |
| ---------------------- | ------ | ---------------------------------------------------- |
| `/api/generate`        | POST   | Text completion                                      |
| `/api/chat`            | POST   | Chat completion                                      |
| `/api/embeddings`      | POST   | Generate embeddings (legacy)                         |
| `/api/embed`           | POST   | Generate embeddings v0.6+ (batch, OpenAI-compatible) |
| `/v1/chat/completions` | POST   | OpenAI-compatible chat                               |
| `/v1/embeddings`       | POST   | OpenAI-compatible embeddings                         |

### Custom Model Creation

Create custom models with specific system prompts via API:

```bash
curl https://your-http-pub.clorecloud.net/api/create -d '{
  "name": "my-assistant",
  "modelfile": "FROM llama3.2\nSYSTEM You are a helpful coding assistant."
}'
```
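
Once created, the custom model can be called like any other model name, for example through the OpenAI-compatible endpoint (assuming the `my-assistant` model created above):

```python
from openai import OpenAI

client = OpenAI(base_url="https://your-http-pub.clorecloud.net/v1", api_key="ollama")

response = client.chat.completions.create(
    model="my-assistant",  # the custom model created above
    messages=[{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
)
print(response.choices[0].message.content)
```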

## GPU Configuration

### Check GPU Usage

```bash
# In container or server
nvidia-smi

# --verbose prints per-response timing stats; GPU offload details appear in the server logs
ollama run llama3.2 --verbose
```

### Multi-GPU

Ollama automatically uses available GPUs. For specific GPU:

```bash
CUDA_VISIBLE_DEVICES=0 ollama serve
```

### Memory Management

```bash
# Set GPU memory limit
OLLAMA_GPU_MEMORY=8GiB ollama serve

# Keep model loaded
OLLAMA_KEEP_ALIVE=24h ollama serve

# Allow concurrent models (v0.6+)
OLLAMA_MAX_LOADED_MODELS=3 ollama serve
```

## Custom Models (Modelfile)

Create custom models with system prompts:

```dockerfile
# Modelfile
FROM llama3.2

SYSTEM You are a helpful coding assistant. Always provide code examples.

PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

```bash
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
```

## Running as Service

### Systemd

```ini
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
Environment="OLLAMA_HOST=0.0.0.0"

[Install]
WantedBy=multi-user.target
```

```bash
systemctl enable ollama
systemctl start ollama
```

## Performance Tips

1. **Use appropriate quantization**
   * Q4\_K\_M for speed
   * Q8\_0 for quality
   * fp16 for maximum quality
2. **Match model to VRAM**
   * 8GB: 7B models (Q4)
   * 16GB: 13B models or 7B (Q8)
   * 24GB: 34B models (Q4)
   * 48GB+: 70B models
3. **Keep model loaded**

   ```bash
   OLLAMA_KEEP_ALIVE=1h ollama serve
   ```
4. **Fast SSD improves performance**
   * Model loading and KV cache benefit from fast storage
   * Servers with NVMe SSD can achieve 2-3x better performance

## Benchmarks

### Generation Speed (tokens/sec)

| Model                | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| -------------------- | -------- | -------- | -------- | --------- |
| Llama 3.2 3B (Q4)    | 120      | 160      | 200      | 220       |
| Llama 3.1 8B (Q4)    | 60       | 100      | 130      | 150       |
| Llama 3.1 8B (Q8)    | 45       | 80       | 110      | 130       |
| Mistral 7B (Q4)      | 70       | 110      | 140      | 160       |
| Mixtral 8x7B (Q4)    | -        | 35       | 55       | 75        |
| Llama 3.1 70B (Q4)   | -        | -        | 18       | 35        |
| DeepSeek-R1 7B (Q4)  | 65       | 105      | 135      | 155       |
| DeepSeek-R1 32B (Q4) | -        | -        | 22       | 42        |
| Qwen2.5 72B (Q4)     | -        | -        | 15       | 30        |
| Phi-4 14B (Q4)       | -        | 50       | 75       | 90        |

*Benchmarks updated January 2026. Actual speeds may vary based on server configuration.*
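
To see how your own server compares, the non-streaming `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds), which is enough for a quick tokens-per-second check. A minimal sketch:

```python
import requests

BASE_URL = "http://localhost:11434"  # or your http_pub URL

r = requests.post(f"{BASE_URL}/api/generate", json={
    "model": "llama3.2",
    "prompt": "Write a short paragraph about GPUs.",
    "stream": False,
}).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tokens_per_sec = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{r['eval_count']} tokens at {tokens_per_sec:.1f} tok/s")
```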

### Time to First Token (ms)

| Model | RTX 3090 | RTX 4090 | A100 |
| ----- | -------- | -------- | ---- |
| 3B    | 50       | 35       | 25   |
| 7-8B  | 120      | 80       | 60   |
| 13B   | 250      | 150      | 100  |
| 34B   | 600      | 350      | 200  |
| 70B   | -        | 1200     | 500  |

### Context Length vs VRAM (Q4)

| Model | 2K ctx | 4K ctx | 8K ctx | 16K ctx |
| ----- | ------ | ------ | ------ | ------- |
| 7B    | 5GB    | 6GB    | 8GB    | 12GB    |
| 13B   | 8GB    | 10GB   | 14GB   | 22GB    |
| 34B   | 20GB   | 24GB   | 32GB   | 48GB    |
| 70B   | 40GB   | 48GB   | 64GB   | 96GB    |

## GPU Requirements

| Model | Q4 VRAM | Q8 VRAM |
| ----- | ------- | ------- |
| 3B    | 3GB     | 5GB     |
| 7-8B  | 5GB     | 9GB     |
| 13B   | 8GB     | 15GB    |
| 34B   | 20GB    | 38GB    |
| 70B   | 40GB    | 75GB    |

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For         |
| -------- | ---- | ---------- | ---------------- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models        |
| RTX 3090 | 24GB | $0.30–1.00 | 13B-34B models   |
| RTX 4090 | 24GB | $0.50–2.00 | 34B models, fast |
| A100     | 40GB | $1.50–3.00 | 70B models       |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

## Troubleshooting

### Model won't load

```bash
# Check available memory
nvidia-smi

# Try smaller quantization
ollama pull llama3.1:8b-instruct-q4_K_M
```

### Slow generation

```bash
# Check if GPU is used
ollama run llama3.2 --verbose

# Ensure CUDA is available
nvidia-smi
```

### Connection refused

```bash
# Make sure server is running
ollama serve

# Check if binding to all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
```

### HTTP 502 on http\_pub URL

This usually means the service is still starting. Wait 30-60 seconds and retry:

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/

# Expected: "Ollama is running"
# If 502: wait and retry
```
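
In scripts, you can poll until the service reports ready instead of retrying by hand. A simple sketch; replace the URL with your own `http_pub`:

```python
import time

import requests

BASE_URL = "https://your-http-pub.clorecloud.net"  # replace with your http_pub URL

# Poll until the service responds (give up after ~5 minutes)
for attempt in range(30):
    try:
        r = requests.get(BASE_URL, timeout=10)
        if r.status_code == 200:
            print("Ready:", r.text)  # expect: "Ollama is running"
            break
    except requests.RequestException:
        pass
    print(f"Not ready yet (attempt {attempt + 1}), retrying in 10s...")
    time.sleep(10)
```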

## Next Steps

* [Open WebUI](/guides/language-models/open-webui.md) - Beautiful chat interface for Ollama
* [vLLM](/guides/language-models/vllm.md) - High-throughput production serving
* [DeepSeek-R1](/guides/language-models/deepseek-r1.md) - Reasoning model
* [DeepSeek-V3](/guides/language-models/deepseek-v3.md) - Best general model
* [Qwen2.5](/guides/language-models/qwen25.md) - Multilingual alternative
* [Text Generation WebUI](/guides/language-models/text-generation-webui.md) - Advanced features


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/language-models/ollama.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
