# Ollama

The easiest way to run LLMs locally on CLORE.AI GPUs.

{% hint style="info" %}
**Current Version: v0.6+** — This guide covers Ollama v0.6 and later. Key new features include structured outputs (JSON schema enforcement), OpenAI-compatible embeddings endpoint (`/api/embed`), and concurrent model loading (run multiple models simultaneously without swapping). See [New in v0.6+](#new-in-v06) for details.
{% endhint %}

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Server Requirements

| Parameter    | Minimum      | Recommended |
| ------------ | ------------ | ----------- |
| RAM          | 8GB          | 16GB+       |
| VRAM         | 6GB          | 8GB+        |
| Network      | 100Mbps      | 500Mbps+    |
| Startup Time | \~30 seconds | -           |

{% hint style="info" %}
Ollama is lightweight and works on most GPU servers. For larger models (13B+), choose servers with 16GB+ RAM and 12GB+ VRAM.
{% endhint %}

## Why Ollama?

* **One-command setup** - No Python, no dependencies
* **Model library** - Download models with `ollama pull`
* **OpenAI-compatible API** - Drop-in replacement
* **GPU acceleration** - Automatic CUDA detection
* **Multi-model** - Run multiple models simultaneously (v0.6+)

## Quick Deploy on CLORE.AI

**Docker Image:**

```
ollama/ollama
```

**Ports:**

```
22/tcp
11434/http
```

**Command:**

```bash
ollama serve
```

### Verify It's Working

After deployment, find your `http_pub` URL in **My Orders** and test:

```bash
# Replace with your actual http_pub URL
curl https://your-http-pub.clorecloud.net/

# Expected response: "Ollama is running"
```

{% hint style="warning" %}
If you get HTTP 502, wait 30-60 seconds - the service is still starting.
{% endhint %}

## Accessing Your Service

When deployed on CLORE.AI, access your Ollama instance via the `http_pub` URL:

```bash
# Find your http_pub in My Orders, then:
curl https://your-http-pub.clorecloud.net/api/tags

# For API calls, use your http_pub URL:
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```

{% hint style="info" %}
All `localhost:11434` examples below work when connected via SSH. For external access, replace with your `https://your-http-pub.clorecloud.net/` URL.
{% endhint %}

## Installation

### Using Docker (Recommended)

```bash
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

### Manual Installation

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

This single command installs the latest version of Ollama, sets up the systemd service, and configures GPU detection automatically. Works on Ubuntu, Debian, Fedora, and most modern Linux distributions.

## Running Models

### Pull and Run

```bash
# Pull model
ollama pull llama3.2

# Run interactive chat
ollama run llama3.2

# Run with prompt
ollama run llama3.2 "Explain quantum computing"
```

### Popular Models

| Model               | Size    | Use Case              |
| ------------------- | ------- | --------------------- |
| `llama3.2`          | 3B      | Fast, general purpose |
| `llama3.1`          | 8B      | Better quality        |
| `llama3.1:70b`      | 70B     | Best quality          |
| `mistral`           | 7B      | Fast, good quality    |
| `mixtral`           | 47B     | MoE, high quality     |
| `codellama`         | 7-34B   | Code generation       |
| `deepseek-coder-v2` | 16B     | Best for code         |
| `deepseek-r1`       | 7B-671B | Reasoning model       |
| `deepseek-r1:32b`   | 32B     | Balanced reasoning    |
| `qwen2.5`           | 7B      | Multilingual          |
| `qwen2.5:72b`       | 72B     | Best Qwen quality     |
| `phi4`              | 14B     | Microsoft's latest    |
| `gemma2`            | 9B      | Google's model        |

### Model Variants

```bash
# Quantization variants
ollama pull llama3.1:8b-instruct-q4_K_M   # 4-bit (smaller, faster)
ollama pull llama3.1:8b-instruct-q8_0     # 8-bit (better quality)
ollama pull llama3.1:8b-instruct-fp16     # Full precision

# Size variants
ollama pull llama3.1:8b    # 8 billion parameters
ollama pull llama3.1:70b   # 70 billion parameters

# New models (v0.6+ era)
ollama pull deepseek-r1:7b      # Reasoning, budget
ollama pull deepseek-r1:14b     # Reasoning, efficient
ollama pull deepseek-r1:32b     # Reasoning, balanced
ollama pull deepseek-r1:70b     # Reasoning, high quality
ollama pull qwen2.5:72b         # Largest Qwen, top quality
ollama pull phi4                # Microsoft Phi-4 14B
```
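After pulling a few variants, it helps to see what is actually on disk. A minimal sketch that lists downloaded tags and their sizes via `/api/tags` (the `OLLAMA_URL` constant and helper names are illustrative; swap in your `http_pub` URL for external access):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # or your https://your-http-pub.clorecloud.net

def fmt_size(num_bytes: int) -> str:
    """Format a byte count as GiB with one decimal place."""
    return f"{num_bytes / 1073741824:.1f} GiB"

def list_models(base_url: str = OLLAMA_URL) -> list[tuple[str, str]]:
    """Return (name, size) pairs for every downloaded model tag."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        data = json.load(resp)
    return [(m["name"], fmt_size(m["size"])) for m in data["models"]]

if __name__ == "__main__":
    for name, size in list_models():
        print(f"{name:40s} {size}")
```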

## New in v0.6+

Ollama v0.6 introduced several major features for production workloads:

### Structured Outputs (JSON Schema)

Force model responses to match a specific JSON schema. Useful for building applications that need reliable, parseable output:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "format": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "capital": {"type": "string"},
      "population": {"type": "integer"},
      "languages": {
        "type": "array",
        "items": {"type": "string"}
      }
    },
    "required": ["name", "capital", "population", "languages"]
  },
  "stream": false
}'
```

Python example with structured outputs:

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "List 3 programming languages with their main use cases"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "languages",
            "schema": {
                "type": "object",
                "properties": {
                    "languages": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "use_case": {"type": "string"},
                                "popularity_rank": {"type": "integer"}
                            }
                        }
                    }
                }
            }
        }
    }
)

data = json.loads(response.choices[0].message.content)
print(data)
```
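If you use Pydantic, you can generate the schema from a class instead of writing it by hand and validate the reply in one step. A minimal sketch against the native `/api/chat` endpoint (assumes `pydantic` is installed; `Country` and `ask_structured` are illustrative names):

```python
import json
import urllib.request

from pydantic import BaseModel  # third-party: pip install pydantic

class Country(BaseModel):
    name: str
    capital: str
    population: int
    languages: list[str]

def ask_structured(prompt: str, base_url: str = "http://localhost:11434") -> Country:
    """Send a chat request whose 'format' field is a Pydantic-generated JSON schema."""
    body = json.dumps({
        "model": "llama3.2",
        "messages": [{"role": "user", "content": prompt}],
        "format": Country.model_json_schema(),  # schema derived from the class
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # model_validate_json raises if the reply doesn't match the schema
    return Country.model_validate_json(reply["message"]["content"])

if __name__ == "__main__":
    print(ask_structured("Tell me about Canada."))
```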

### OpenAI-Compatible Embeddings Endpoint (`/api/embed`)

New in v0.6+: the `/api/embed` endpoint is fully OpenAI-compatible and supports batched inputs:

```bash
# Single text embedding
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Hello world"
}'

# Batch embeddings (new in v0.6)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["First document", "Second document", "Third document"]
}'
```

OpenAI client works directly with `/v1/embeddings`:

```python
from openai import OpenAI
import numpy as np

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Pull embedding model first: ollama pull nomic-embed-text
response = client.embeddings.create(
    model="nomic-embed-text",
    input=["Hello world", "Goodbye world"]
)

emb1 = np.array(response.data[0].embedding)
emb2 = np.array(response.data[1].embedding)

# Cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(f"Similarity: {similarity:.4f}")
```
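The same similarity measure scales to a small semantic search: embed your documents once, then rank them against each query vector. A sketch using only NumPy (`rank_by_similarity` is an illustrative helper; any equal-dimension vectors work, including ones from the embeddings call above):

```python
import numpy as np

def rank_by_similarity(query_emb, doc_embs):
    """Return document indices sorted by cosine similarity to the query, best first."""
    q = np.asarray(query_emb, dtype=float)
    d = np.asarray(doc_embs, dtype=float)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    return list(np.argsort(-sims)), sims

# Tiny synthetic example: doc 0 points the same way as the query
order, sims = rank_by_similarity([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
print(order)  # [0, 2, 1]
```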

Popular embedding models:

```bash
ollama pull nomic-embed-text      # 137M, fast, good quality
ollama pull mxbai-embed-large     # 335M, higher quality
ollama pull all-minilm            # 23M, fastest
```

### Concurrent Model Loading

Before v0.6, Ollama would unload one model to load another. v0.6+ supports running multiple models simultaneously, limited only by available VRAM:

```bash
# Preload two models (a request with no prompt loads a model without generating)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2"}'
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:7b"}'

# Check what's running
curl http://localhost:11434/api/ps
```

Configure concurrency:

```bash
# Allow up to 4 models loaded simultaneously
OLLAMA_MAX_LOADED_MODELS=4 ollama serve

# Serve up to 2 requests per model in parallel
OLLAMA_NUM_PARALLEL=2 ollama serve
```

This is especially useful for:

* A/B testing different models
* Specialized models for different tasks (coding + chat)
* Keeping frequently-used models warm in VRAM

## API Usage

### Chat Completion

```bash
# Via http_pub (external access):
curl https://your-http-pub.clorecloud.net/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

# Via SSH tunnel (localhost):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```

{% hint style="info" %}
Add `"stream": false` to get the complete response at once instead of streaming.
{% endhint %}

### OpenAI-Compatible Endpoint

```python
from openai import OpenAI

# For external access, use your http_pub URL:
client = OpenAI(
    base_url="https://your-http-pub.clorecloud.net/v1",
    api_key="ollama"  # any string works
)

# Or via SSH tunnel:
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ]
)

print(response.choices[0].message.content)
```

### Streaming

```python
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Embeddings

```bash
# Legacy endpoint (still works)
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'

# New v0.6+ endpoint (batch support, OpenAI-compatible)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["Hello world", "Another text"]
}'
```

### Text Generation (Non-Chat)

```bash
curl https://your-http-pub.clorecloud.net/api/generate -d '{
  "model": "llama3.2",
  "prompt": "The meaning of life is",
  "stream": false
}'
```

## Complete API Reference

All endpoints work with both `http://localhost:11434` (via SSH) and `https://your-http-pub.clorecloud.net` (external).

### Model Management

| Endpoint       | Method | Description                   |
| -------------- | ------ | ----------------------------- |
| `/api/tags`    | GET    | List all downloaded models    |
| `/api/show`    | POST   | Get model details             |
| `/api/pull`    | POST   | Download a model              |
| `/api/delete`  | DELETE | Remove a model                |
| `/api/ps`      | GET    | List currently running models |
| `/api/version` | GET    | Get Ollama version            |

#### List Models

```bash
curl https://your-http-pub.clorecloud.net/api/tags
```

Response:

```json
{
  "models": [
    {"name": "llama3.2:latest", "size": 2019393189, "digest": "...", "modified_at": "..."}
  ]
}
```

#### Show Model Details

```bash
curl https://your-http-pub.clorecloud.net/api/show -d '{"name": "llama3.2"}'
```

#### Pull Model via API

```bash
curl https://your-http-pub.clorecloud.net/api/pull -d '{
  "name": "mistral:7b",
  "stream": false
}'
```

Response:

```json
{"status": "success"}
```
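With streaming left on (the default), `/api/pull` returns one JSON status object per line, with `completed` and `total` byte counts you can turn into a progress display. A sketch (`pull_with_progress` and `pct` are illustrative names):

```python
import json
import urllib.request

def pct(status: dict):
    """Percent complete for one status line, or None if it has no byte counts."""
    if status.get("total"):
        return 100.0 * status.get("completed", 0) / status["total"]
    return None

def pull_with_progress(name: str, base_url: str = "http://localhost:11434") -> None:
    """Stream /api/pull and print download progress as it arrives."""
    body = json.dumps({"name": name}).encode()
    req = urllib.request.Request(f"{base_url}/api/pull", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line
            status = json.loads(line)
            p = pct(status)
            label = status.get("status", "")
            print(f"{label} {p:.1f}%" if p is not None else label)

if __name__ == "__main__":
    pull_with_progress("mistral:7b")
```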

{% hint style="warning" %}
Large models may take several minutes to download. For very large models (30GB+), consider using SSH and the CLI: `ollama pull model-name`
{% endhint %}

#### Delete Model

```bash
curl -X DELETE https://your-http-pub.clorecloud.net/api/delete -d '{"name": "mistral:7b"}'
```

#### List Running Models

```bash
curl https://your-http-pub.clorecloud.net/api/ps
```

Response:

```json
{
  "models": [
    {"name": "llama3.2:latest", "size": 2019393189, "expires_at": "2025-01-25T12:00:00Z"}
  ]
}
```

#### Get Version

```bash
curl https://your-http-pub.clorecloud.net/api/version
```

Response:

```json
{"version": "0.6.8"}
```

### Inference Endpoints

| Endpoint               | Method | Description                                          |
| ---------------------- | ------ | ---------------------------------------------------- |
| `/api/generate`        | POST   | Text completion                                      |
| `/api/chat`            | POST   | Chat completion                                      |
| `/api/embeddings`      | POST   | Generate embeddings (legacy)                         |
| `/api/embed`           | POST   | Generate embeddings v0.6+ (batch, OpenAI-compatible) |
| `/v1/chat/completions` | POST   | OpenAI-compatible chat                               |
| `/v1/embeddings`       | POST   | OpenAI-compatible embeddings                         |

### Custom Model Creation

Create custom models with specific system prompts via API:

```bash
curl https://your-http-pub.clorecloud.net/api/create -d '{
  "model": "my-assistant",
  "from": "llama3.2",
  "system": "You are a helpful coding assistant."
}'
```

## GPU Configuration

### Check GPU Usage

```bash
# In container or server
nvidia-smi

# Ollama shows GPU in logs
ollama run llama3.2 --verbose
```

### Multi-GPU

Ollama automatically uses available GPUs. For specific GPU:

```bash
CUDA_VISIBLE_DEVICES=0 ollama serve
```

### Memory Management

```bash
# Keep model loaded
OLLAMA_KEEP_ALIVE=24h ollama serve

# Allow concurrent models (v0.6+)
OLLAMA_MAX_LOADED_MODELS=3 ollama serve

# Reduce KV-cache memory usage with flash attention
OLLAMA_FLASH_ATTENTION=1 ollama serve
```
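When running the official Docker image on CLORE.AI, the same settings are passed as container environment variables instead of a shell prefix. A sketch of the equivalent `docker run` (values are examples, tune to your workload):

```shell
docker run -d --gpus all -p 11434:11434 \
  -e OLLAMA_KEEP_ALIVE=24h \
  -e OLLAMA_MAX_LOADED_MODELS=3 \
  -v ollama:/root/.ollama \
  --name ollama ollama/ollama
```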

## Custom Models (Modelfile)

Create custom models with system prompts:

```dockerfile
# Modelfile
FROM llama3.2

SYSTEM You are a helpful coding assistant. Always provide code examples.

PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

```bash
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
```

## Running as Service

### Systemd

```ini
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
Environment="OLLAMA_HOST=0.0.0.0"

[Install]
WantedBy=multi-user.target
```

```bash
systemctl enable ollama
systemctl start ollama
```

## Performance Tips

1. **Use appropriate quantization**
   * Q4\_K\_M for speed
   * Q8\_0 for quality
   * fp16 for maximum quality
2. **Match model to VRAM**
   * 8GB: 7B models (Q4)
   * 16GB: 13B models or 7B (Q8)
   * 24GB: 34B models (Q4)
   * 48GB+: 70B models
3. **Keep model loaded**

   ```bash
   OLLAMA_KEEP_ALIVE=1h ollama serve
   ```
4. **Fast SSD improves performance**
   * Model loading and KV cache benefit from fast storage
   * Servers with NVMe SSD can achieve 2-3x better performance

## Benchmarks

### Generation Speed (tokens/sec)

| Model                | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| -------------------- | -------- | -------- | -------- | --------- |
| Llama 3.2 3B (Q4)    | 120      | 160      | 200      | 220       |
| Llama 3.1 8B (Q4)    | 60       | 100      | 130      | 150       |
| Llama 3.1 8B (Q8)    | 45       | 80       | 110      | 130       |
| Mistral 7B (Q4)      | 70       | 110      | 140      | 160       |
| Mixtral 8x7B (Q4)    | -        | 35       | 55       | 75        |
| Llama 3.1 70B (Q4)   | -        | -        | 18       | 35        |
| DeepSeek-R1 7B (Q4)  | 65       | 105      | 135      | 155       |
| DeepSeek-R1 32B (Q4) | -        | -        | 22       | 42        |
| Qwen2.5 72B (Q4)     | -        | -        | 15       | 30        |
| Phi-4 14B (Q4)       | -        | 50       | 75       | 90        |

*Benchmarks updated January 2026. Actual speeds may vary based on server configuration.*

### Time to First Token (ms)

| Model | RTX 3090 | RTX 4090 | A100 |
| ----- | -------- | -------- | ---- |
| 3B    | 50       | 35       | 25   |
| 7-8B  | 120      | 80       | 60   |
| 13B   | 250      | 150      | 100  |
| 34B   | 600      | 350      | 200  |
| 70B   | -        | 1200     | 500  |

### Context Length vs VRAM (Q4)

| Model | 2K ctx | 4K ctx | 8K ctx | 16K ctx |
| ----- | ------ | ------ | ------ | ------- |
| 7B    | 5GB    | 6GB    | 8GB    | 12GB    |
| 13B   | 8GB    | 10GB   | 14GB   | 22GB    |
| 34B   | 20GB   | 24GB   | 32GB   | 48GB    |
| 70B   | 40GB   | 48GB   | 64GB   | 96GB    |

## GPU Requirements

| Model | Q4 VRAM | Q8 VRAM |
| ----- | ------- | ------- |
| 3B    | 3GB     | 5GB     |
| 7-8B  | 5GB     | 9GB     |
| 13B   | 8GB     | 15GB    |
| 34B   | 20GB    | 38GB    |
| 70B   | 40GB    | 75GB    |
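The table above roughly follows a rule of thumb: weights take `parameters × bits / 8` bytes, plus around 20% overhead for runtime buffers at short context. A back-of-the-envelope sketch (the 1.2 overhead factor is an assumption, not an Ollama constant; real usage grows with context length):

```python
def est_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for runtime buffers."""
    return round(params_billion * bits / 8 * overhead, 1)

# Roughly matches the table above
print(est_vram_gb(8, 4))    # ~4.8 GB, table says 5GB
print(est_vram_gb(8, 8))    # ~9.6 GB, table says 9GB
print(est_vram_gb(70, 4))   # ~42.0 GB, table says 40GB
```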

## Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU      | VRAM | Price/day  | Good For         |
| -------- | ---- | ---------- | ---------------- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models        |
| RTX 3090 | 24GB | $0.30–1.00 | 13B-34B models   |
| RTX 4090 | 24GB | $0.50–2.00 | 34B models, fast |
| A100     | 40GB | $1.50–3.00 | 70B models       |

*Prices in USD/day. Rates vary by provider — check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

## Troubleshooting

### Model won't load

```bash
# Check available memory
nvidia-smi

# Try smaller quantization
ollama pull llama3.1:8b-instruct-q4_0
```

### Slow generation

```bash
# Check if GPU is used
ollama run llama3.2 --verbose

# Ensure CUDA is available
nvidia-smi
```

### Connection refused

```bash
# Make sure server is running
ollama serve

# Check if binding to all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
```

### HTTP 502 on http\_pub URL

This means the service is still starting. Wait 30-60 seconds and retry:

```bash
# Check if service is ready
curl https://your-http-pub.clorecloud.net/

# Expected: "Ollama is running"
# If 502: wait and retry
```

## Next Steps

* [Open WebUI](https://docs.clore.ai/guides/language-models/open-webui) - Beautiful chat interface for Ollama
* [vLLM](https://docs.clore.ai/guides/language-models/vllm) - High-throughput production serving
* [DeepSeek-R1](https://docs.clore.ai/guides/language-models/deepseek-r1) - Reasoning model
* [DeepSeek-V3](https://docs.clore.ai/guides/language-models/deepseek-v3) - Best general model
* [Qwen2.5](https://docs.clore.ai/guides/language-models/qwen25) - Multilingual alternative
* [Text Generation WebUI](https://docs.clore.ai/guides/language-models/text-generation-webui) - Advanced features
