Ollama

Run LLMs locally with Ollama on Clore.ai GPUs

The easiest way to run LLMs locally on CLORE.AI GPUs.


Current Version: v0.6+ — This guide covers Ollama v0.6 and later. Key new features include structured outputs (JSON schema enforcement), OpenAI-compatible embeddings endpoint (/api/embed), and concurrent model loading (run multiple models simultaneously without swapping). See New in v0.6+ for details.


Server Requirements

| Parameter | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8GB | 16GB+ |
| VRAM | 6GB | 8GB+ |
| Network | 100Mbps | 500Mbps+ |
| Startup Time | ~30 seconds | - |


Ollama is lightweight and works on most GPU servers. For larger models (13B+), choose servers with 16GB+ RAM and 12GB+ VRAM.

Why Ollama?

  • One-command setup - No Python, no dependencies

  • Model library - Download models with ollama pull

  • OpenAI-compatible API - Drop-in replacement

  • GPU acceleration - Automatic CUDA detection

  • Multi-model - Run multiple models simultaneously (v0.6+)

Quick Deploy on CLORE.AI

Docker Image: `ollama/ollama:latest` (the official image)

Ports: `11434/tcp` (Ollama HTTP API)

Command: none required; the image's default entrypoint starts `ollama serve`

Verify It's Working

After deployment, find your http_pub URL in My Orders and test:
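A quick way to verify the deployment is to hit `/api/version`. The sketch below uses only the Python standard library; `BASE_URL` is a placeholder you replace with your http_pub URL:

```python
import json
import urllib.request

# Replace with your https://your-http-pub.clorecloud.net URL for external access
BASE_URL = "http://localhost:11434"

def parse_version(body: str) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]

if __name__ == "__main__":
    with urllib.request.urlopen(f"{BASE_URL}/api/version") as resp:
        print("Ollama version:", parse_version(resp.read().decode()))
```

A successful response confirms the API is reachable and tells you which Ollama version is running.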


Accessing Your Service

When deployed on CLORE.AI, access your Ollama instance via the http_pub URL:


All localhost:11434 examples below work when connected via SSH. For external access, replace with your https://your-http-pub.clorecloud.net/ URL.

Installation

Manual Installation

A single command — `curl -fsSL https://ollama.com/install.sh | sh` — installs the latest version of Ollama, sets up the systemd service, and configures GPU detection automatically. It works on Ubuntu, Debian, Fedora, and most modern Linux distributions.

Running Models

Pull and Run

Download a model with `ollama pull <model>`, or start an interactive chat with `ollama run <model>` (which pulls the model automatically if it is not present):

| Model | Size | Use Case |
| --- | --- | --- |
| llama3.2 | 3B | Fast, general purpose |
| llama3.1 | 8B | Better quality |
| llama3.1:70b | 70B | Best quality |
| mistral | 7B | Fast, good quality |
| mixtral | 47B | MoE, high quality |
| codellama | 7-34B | Code generation |
| deepseek-coder-v2 | 16B | Best for code |
| deepseek-r1 | 7B-671B | Reasoning model |
| deepseek-r1:32b | 32B | Balanced reasoning |
| qwen2.5 | 7B | Multilingual |
| qwen2.5:72b | 72B | Best Qwen quality |
| phi4 | 14B | Microsoft's latest |
| gemma2 | 9B | Google's model |

Model Variants

Most models are published under multiple tags: parameter sizes (`llama3.1:8b`, `llama3.1:70b`) and quantization levels (`llama3.1:8b-instruct-q4_K_M`, `llama3.1:8b-instruct-q8_0`). The default tag is typically a Q4 quantization of the instruct variant, which balances quality and VRAM usage.

New in v0.6+

Ollama v0.6 introduced several major features for production workloads:

Structured Outputs (JSON Schema)

Force model responses to match a specific JSON schema. Useful for building applications that need reliable, parseable output:

Python example with structured outputs:
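The sketch below shows the request shape, assuming a local server at `BASE_URL`; the model name and schema are illustrative. The `format` field carries the JSON schema that constrains the model's output:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # replace with your http_pub URL if external

def structured_chat_payload(model: str, prompt: str, schema: dict) -> dict:
    """Build an /api/chat request whose response must match the JSON schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,   # v0.6+: pass a JSON schema to constrain the output
        "stream": False,
    }

COUNTRY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "capital", "population"],
}

if __name__ == "__main__":
    payload = structured_chat_payload("llama3.2", "Tell me about France.", COUNTRY_SCHEMA)
    req = urllib.request.Request(
        f"{BASE_URL}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
        # message.content is a JSON string conforming to COUNTRY_SCHEMA
        print(json.loads(reply["message"]["content"]))
```

Because the server enforces the schema, the reply can be parsed with `json.loads` directly, without retry-and-repair logic.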

OpenAI-Compatible Embeddings Endpoint (/api/embed)

New in v0.6+: the /api/embed endpoint is fully OpenAI-compatible and supports batched inputs:

OpenAI client works directly with /v1/embeddings:
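A minimal batched call, sketched with the standard library (model name and `BASE_URL` are illustrative); `/api/embed` accepts a list under `input` and returns one vector per item:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"

def embed_payload(model: str, texts: list) -> dict:
    """Build a batched /api/embed request (v0.6+ accepts a list of inputs)."""
    return {"model": model, "input": texts}

def parse_embeddings(body: str) -> list:
    """Return the list of embedding vectors from an /api/embed response."""
    return json.loads(body)["embeddings"]

if __name__ == "__main__":
    payload = embed_payload("nomic-embed-text", ["first sentence", "second sentence"])
    req = urllib.request.Request(
        f"{BASE_URL}/api/embed",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        vectors = parse_embeddings(resp.read().decode())
        print(f"{len(vectors)} vectors, {len(vectors[0])} dimensions each")
```

The official `openai` Python client also works against `/v1/embeddings` by pointing `base_url` at `http://localhost:11434/v1` (any non-empty `api_key` is accepted).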

Popular embedding models: `nomic-embed-text` (768 dimensions), `mxbai-embed-large` (1024 dimensions), and `all-minilm` (384 dimensions).

Concurrent Model Loading

Before v0.6, Ollama would unload one model to load another. v0.6+ supports running multiple models simultaneously, limited only by available VRAM.

Configure concurrency:
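Concurrency is controlled through environment variables on the `ollama serve` process. A sketch of the relevant settings (values here are illustrative):

```
# Environment for the `ollama serve` process (shell or systemd unit)
OLLAMA_MAX_LOADED_MODELS=3   # how many models may stay resident in VRAM at once
OLLAMA_NUM_PARALLEL=4        # parallel requests served per loaded model
```

Raise these only as far as your VRAM allows; a model that no longer fits will spill into system RAM and slow down dramatically.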

This is especially useful for:

  • A/B testing different models

  • Specialized models for different tasks (coding + chat)

  • Keeping frequently-used models warm in VRAM

API Usage

Chat Completion
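A minimal chat call against `/api/chat`, sketched with the standard library (model name and `BASE_URL` are placeholders for your setup):

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"

def build_messages(user_text: str, system: str = None) -> list:
    """Assemble a chat message list, optionally prefixed with a system prompt."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    return messages

def chat(model: str, messages: list) -> str:
    """POST to /api/chat with streaming disabled; return the assistant's reply."""
    payload = {"model": model, "messages": messages, "stream": False}
    req = urllib.request.Request(
        f"{BASE_URL}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

if __name__ == "__main__":
    msgs = build_messages("Why is the sky blue?", system="Answer in one sentence.")
    print(chat("llama3.2", msgs))
```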


Add "stream": false to get the complete response at once instead of streaming.

OpenAI-Compatible Endpoint
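Requests to `/v1/chat/completions` use the OpenAI request and response shapes, so existing OpenAI client code works by changing only the base URL. A standard-library sketch (model name and `BASE_URL` are illustrative):

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"

def first_choice_text(body: str) -> str:
    """Extract the assistant reply from an OpenAI-style chat completion response."""
    return json.loads(body)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    payload = {
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello"}],
    }
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(first_choice_text(resp.read().decode()))
```

With the official `openai` package, set `base_url="http://localhost:11434/v1"` and any non-empty `api_key`; no other code changes are needed.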

Streaming
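When `stream` is enabled (the default on `/api/chat`), the server returns one JSON object per line, each carrying a content fragment, with a final `"done": true` line. A sketch of consuming that stream (model name and `BASE_URL` are illustrative):

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"

def stream_tokens(lines):
    """Yield content fragments from /api/chat NDJSON stream lines until done."""
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        yield chunk["message"]["content"]

if __name__ == "__main__":
    payload = {
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Count to five."}],
        "stream": True,   # each response line is one JSON chunk
    }
    req = urllib.request.Request(
        f"{BASE_URL}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for token in stream_tokens(line.decode() for line in resp):
            print(token, end="", flush=True)
    print()
```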

Embeddings
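A common use of embeddings is semantic similarity: embed two texts and compare their vectors with cosine similarity. A sketch assuming a local `/api/embed` endpoint and the `nomic-embed-text` model:

```python
import json
import math
import urllib.request

BASE_URL = "http://localhost:11434"

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(texts):
    """Fetch embeddings for a batch of texts from /api/embed."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/embed",
        data=json.dumps({"model": "nomic-embed-text", "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]

if __name__ == "__main__":
    a, b = embed(["GPU rental marketplace", "cloud GPU hosting"])
    print(f"similarity: {cosine(a, b):.3f}")
```

Scores near 1.0 indicate closely related texts; scores near 0 indicate unrelated ones.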

Text Generation (Non-Chat)

For single-turn completions without chat formatting, use `/api/generate`, e.g. `curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'`.

Complete API Reference

All endpoints work with both http://localhost:11434 (via SSH) and https://your-http-pub.clorecloud.net (external).

Model Management

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/tags | GET | List all downloaded models |
| /api/show | POST | Get model details |
| /api/pull | POST | Download a model |
| /api/delete | DELETE | Remove a model |
| /api/ps | GET | List currently running models |
| /api/version | GET | Get Ollama version |

List Models

`curl http://localhost:11434/api/tags`

Response: a JSON object with a `models` array; each entry includes the model's name, size on disk, and modification timestamp.

Show Model Details

`curl http://localhost:11434/api/show -d '{"model": "llama3.2"}'`

Pull Model via API

`curl http://localhost:11434/api/pull -d '{"model": "llama3.2"}'`

Response: a stream of JSON status lines reporting download progress, ending with a success status.


Delete Model

`curl -X DELETE http://localhost:11434/api/delete -d '{"model": "llama3.2"}'`

List Running Models

`curl http://localhost:11434/api/ps`

Response: a `models` array listing each loaded model with its VRAM usage and unload time.

Get Version

`curl http://localhost:11434/api/version`

Response: a JSON object with a single `version` field, e.g. `{"version": "0.6.0"}`.

Inference Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/generate | POST | Text completion |
| /api/chat | POST | Chat completion |
| /api/embeddings | POST | Generate embeddings (legacy) |
| /api/embed | POST | Generate embeddings, v0.6+ (batched, OpenAI-compatible) |
| /v1/chat/completions | POST | OpenAI-compatible chat |
| /v1/embeddings | POST | OpenAI-compatible embeddings |

Custom Model Creation

Create custom models with specific system prompts via API:
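A sketch of deriving a new model from an existing one via `/api/create`. The field names (`from`, `system`) follow the current create API; older Ollama releases took a `modelfile` string instead, and the model names here are illustrative:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"

def create_model_payload(name: str, base: str, system: str) -> dict:
    """Build an /api/create request deriving a new model from an existing one."""
    return {"model": name, "from": base, "system": system}

if __name__ == "__main__":
    payload = create_model_payload(
        "support-bot", "llama3.2",
        "You are a friendly support agent. Keep answers under three sentences.",
    )
    req = urllib.request.Request(
        f"{BASE_URL}/api/create",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:   # the server streams JSON status lines during creation
            print(line.decode().strip())
```

Once created, the model appears in `/api/tags` and can be used like any pulled model.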

GPU Configuration

Check GPU Usage

Watch VRAM and utilization with `nvidia-smi`, and confirm a model is loaded on the GPU with `ollama ps` (the PROCESSOR column should read `100% GPU`).

Multi-GPU

Ollama automatically uses all available GPUs, splitting large models across them when needed. To pin it to a specific GPU, set `CUDA_VISIBLE_DEVICES`, e.g. `CUDA_VISIBLE_DEVICES=0 ollama serve`.

Memory Management

By default, an idle model is unloaded from VRAM after 5 minutes. Tune this with `OLLAMA_KEEP_ALIVE` (e.g. `30m`, `1h`, or `-1` to keep models loaded indefinitely), or per request with the `keep_alive` field.

Custom Models (Modelfile)

Create custom models with system prompts:
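A minimal Modelfile sketch; the base model, prompt, and parameter values are illustrative:

```
FROM llama3.1
SYSTEM You are a senior code reviewer. Point out bugs before style issues.
PARAMETER temperature 0.3
```

Build and run it with `ollama create code-reviewer -f Modelfile` followed by `ollama run code-reviewer`.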

Running as Service

Systemd
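The Linux installer already registers an `ollama` systemd service. To change its environment, add a drop-in override rather than editing the unit itself; a sketch (values illustrative):

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=1h"
```

Create it with `sudo systemctl edit ollama`, then apply with `sudo systemctl daemon-reload && sudo systemctl restart ollama`. Setting `OLLAMA_HOST=0.0.0.0` is required for the API to accept connections from outside the server.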

Performance Tips

  1. Use appropriate quantization

    • Q4_K_M for speed

    • Q8_0 for quality

    • fp16 for maximum quality

  2. Match model to VRAM

    • 8GB: 7B models (Q4)

    • 16GB: 13B models or 7B (Q8)

    • 24GB: 34B models (Q4)

    • 48GB+: 70B models

  3. Keep the model loaded: set `"keep_alive": -1` in API requests (or `OLLAMA_KEEP_ALIVE=-1`) so the model stays in VRAM between calls instead of reloading

  4. Fast SSD improves performance

    • Model loading and KV cache benefit from fast storage

    • Servers with NVMe SSD can achieve 2-3x better performance

Benchmarks

Generation Speed (tokens/sec)

| Model | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| --- | --- | --- | --- | --- |
| Llama 3.2 3B (Q4) | 120 | 160 | 200 | 220 |
| Llama 3.1 8B (Q4) | 60 | 100 | 130 | 150 |
| Llama 3.1 8B (Q8) | 45 | 80 | 110 | 130 |
| Mistral 7B (Q4) | 70 | 110 | 140 | 160 |
| Mixtral 8x7B (Q4) | - | 35 | 55 | 75 |
| Llama 3.1 70B (Q4) | - | - | 18 | 35 |
| DeepSeek-R1 7B (Q4) | 65 | 105 | 135 | 155 |
| DeepSeek-R1 32B (Q4) | - | - | 22 | 42 |
| Qwen2.5 72B (Q4) | - | - | 15 | 30 |
| Phi-4 14B (Q4) | - | 50 | 75 | 90 |

Benchmarks updated January 2026. Actual speeds may vary based on server configuration.

Time to First Token (ms)

| Model | RTX 3090 | RTX 4090 | A100 |
| --- | --- | --- | --- |
| 3B | 50 | 35 | 25 |
| 7-8B | 120 | 80 | 60 |
| 13B | 250 | 150 | 100 |
| 34B | 600 | 350 | 200 |
| 70B | - | 1200 | 500 |

Context Length vs VRAM (Q4)

| Model | 2K ctx | 4K ctx | 8K ctx | 16K ctx |
| --- | --- | --- | --- | --- |
| 7B | 5GB | 6GB | 8GB | 12GB |
| 13B | 8GB | 10GB | 14GB | 22GB |
| 34B | 20GB | 24GB | 32GB | 48GB |
| 70B | 40GB | 48GB | 64GB | 96GB |

GPU Requirements

| Model | Q4 VRAM | Q8 VRAM |
| --- | --- | --- |
| 3B | 3GB | 5GB |
| 7-8B | 5GB | 9GB |
| 13B | 8GB | 15GB |
| 34B | 20GB | 38GB |
| 70B | 40GB | 75GB |

Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU | VRAM | Price/day | Good For |
| --- | --- | --- | --- |
| RTX 3060 | 12GB | $0.15–0.30 | 7B models |
| RTX 3090 | 24GB | $0.30–1.00 | 13B-34B models |
| RTX 4090 | 24GB | $0.50–2.00 | 34B models, fast |
| A100 | 40GB | $1.50–3.00 | 70B models |

Prices in USD/day. Rates vary by provider — check the CLORE.AI Marketplace for current rates.

Troubleshooting

Model won't load

Usually insufficient VRAM. Check `nvidia-smi` and `ollama ps`, then try a smaller quantization (e.g. a Q4 tag instead of Q8) or a smaller model.

Slow generation

Confirm the model is actually on the GPU with `ollama ps` (look for `100% GPU`); a model that spills into system RAM runs many times slower. Reduce the context length or choose a lighter quantization if VRAM is tight.

Connection refused

Make sure the server is running (`systemctl status ollama`, or start it manually with `ollama serve`). For access from outside the server, Ollama must listen on all interfaces: set `OLLAMA_HOST=0.0.0.0`.

HTTP 502 on http_pub URL

This means the service is still starting. Wait 30-60 seconds and retry, e.g. `curl https://your-http-pub.clorecloud.net/api/version`.

Next Steps
