Ollama

The easiest way to run LLMs on your own CLORE.AI GPU server.


Server Requirements

| Parameter    | Minimum     | Recommended |
| ------------ | ----------- | ----------- |
| RAM          | 8GB         | 16GB+       |
| VRAM         | 6GB         | 8GB+        |
| Network      | 100Mbps     | 500Mbps+    |
| Startup Time | ~30 seconds | -           |


Ollama is lightweight and works on most GPU servers. For larger models (13B+), choose servers with 16GB+ RAM and 12GB+ VRAM.

Why Ollama?

  • One-command setup - No Python, no dependencies

  • Model library - Download models with ollama pull

  • OpenAI-compatible API - Drop-in replacement

  • GPU acceleration - Automatic CUDA detection

  • Multi-model - Run multiple models simultaneously

Quick Deploy on CLORE.AI

Docker Image:

Ports:

Command:
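A typical configuration, assuming the official ollama/ollama image (the Ollama API listens on port 11434 by default); the equivalent docker run command is shown for reference only:

```bash
# Hypothetical template values -- adjust to your CLORE.AI deployment form
# Docker Image: ollama/ollama:latest
# Ports:        11434 (Ollama HTTP API)
# Command:      the image's default entrypoint already runs "ollama serve"

# Equivalent standalone command on any GPU host with the NVIDIA container toolkit:
docker run -d --gpus=all -p 11434:11434 -v ollama:/root/.ollama --name ollama ollama/ollama
```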

Verify It's Working

After deployment, find your http_pub URL in My Orders and test:
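A quick check, using the placeholder hostname from this page (substitute the URL shown in My Orders):

```bash
# A JSON version string means the API is reachable
curl https://your-http-pub.clorecloud.net/api/version
```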


Accessing Your Service

When deployed on CLORE.AI, access your Ollama instance via the http_pub URL:
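A sketch of the mapping, using this page's placeholder hostname:

```bash
# On the server itself (SSH):      http://localhost:11434
# From anywhere else (http_pub):   https://your-http-pub.clorecloud.net
curl https://your-http-pub.clorecloud.net/api/tags
```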


All localhost:11434 examples below work when connected via SSH. For external access, replace with your https://your-http-pub.clorecloud.net/ URL.

Installation

Manual Installation
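If you deployed a plain Ubuntu/CUDA image rather than the Ollama image, the upstream installer from ollama.com sets everything up:

```bash
# Official install script; detects NVIDIA GPUs and CUDA automatically
curl -fsSL https://ollama.com/install.sh | sh

# Start the API server (the installer usually also registers a systemd service)
ollama serve
```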

Running Models

Pull and Run
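For example, to download a small model and chat with it interactively (model names as in the table below):

```bash
# Download the model from the Ollama library
ollama pull llama3.2

# Start an interactive chat session
ollama run llama3.2
```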

| Model             | Size  | Use Case              |
| ----------------- | ----- | --------------------- |
| llama3.2          | 3B    | Fast, general purpose |
| llama3.1          | 8B    | Better quality        |
| llama3.1:70b      | 70B   | Best quality          |
| mistral           | 7B    | Fast, good quality    |
| mixtral           | 47B   | MoE, high quality     |
| codellama         | 7-34B | Code generation       |
| deepseek-coder-v2 | 16B   | Best for code         |
| qwen2.5           | 7B    | Multilingual          |
| phi4              | 14B   | Microsoft's latest    |
| gemma2            | 9B    | Google's model        |

Model Variants
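Most library models ship in several tags, so you can choose the parameter size and quantization explicitly; the exact tags for each model are listed on its ollama.com page. For example:

```bash
# Specific parameter size
ollama pull llama3.1:8b

# Larger member of the same family
ollama pull llama3.1:70b

# Explicit quantization of the instruct variant
ollama pull llama3.1:8b-instruct-q8_0
```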

API Usage

Chat Completion
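A minimal request to the native chat endpoint (shown against localhost; swap in your http_pub URL for external access):

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ]
}'
```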


Add "stream": false to get the complete response at once instead of streaming.

OpenAI-Compatible Endpoint
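Existing OpenAI SDK or curl code can be pointed at Ollama by changing only the base URL; the API key can be any placeholder string:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```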

Streaming
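With "stream": true (the default for the native endpoints) the server returns partial responses as they are generated:

```bash
# Tokens arrive as newline-delimited JSON objects
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
  "stream": true
}'
```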

Embeddings
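The embeddings endpoint takes a single prompt string; an embedding-tuned model such as nomic-embed-text usually produces better vectors than a chat model:

```bash
# Pull an embedding model first (nomic-embed-text is one example)
ollama pull nomic-embed-text

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "CLORE.AI is a GPU rental marketplace"
}'
```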

Text Generation (Non-Chat)
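For one-shot completions without chat formatting, /api/generate takes a raw prompt:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "The three laws of robotics are",
  "stream": false
}'
```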

Complete API Reference

All endpoints work with both http://localhost:11434 (via SSH) and https://your-http-pub.clorecloud.net (external).

Model Management

| Endpoint     | Method | Description                   |
| ------------ | ------ | ----------------------------- |
| /api/tags    | GET    | List all downloaded models    |
| /api/show    | POST   | Get model details             |
| /api/pull    | POST   | Download a model              |
| /api/delete  | DELETE | Remove a model                |
| /api/ps      | GET    | List currently running models |
| /api/version | GET    | Get Ollama version            |

List Models
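Lists every model currently stored on the server:

```bash
curl http://localhost:11434/api/tags
```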

Response:
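An abridged example of the response shape (values will reflect your own models):

```
{
  "models": [
    {
      "name": "llama3.2:latest",
      "size": 2019393189,
      "modified_at": "2025-01-15T10:30:00Z"
    }
  ]
}
```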

Show Model Details
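Returns the model's parameters, template, and quantization details. Recent Ollama versions accept "model" in the request body; older ones use "name":

```bash
curl http://localhost:11434/api/show -d '{"model": "llama3.2"}'
```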

Pull Model via API
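Downloads a model through the API alone, with no SSH session needed:

```bash
curl http://localhost:11434/api/pull -d '{"model": "llama3.2"}'
```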

Response:
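The pull streams newline-delimited JSON progress objects (per-layer download lines with "total"/"completed" counters are omitted here); the final line signals completion:

```
{"status": "pulling manifest"}
{"status": "success"}
```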


Delete Model
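Removes a model from disk to free space:

```bash
curl -X DELETE http://localhost:11434/api/delete -d '{"model": "llama3.2"}'
```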

List Running Models
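Shows which models are currently loaded into VRAM:

```bash
curl http://localhost:11434/api/ps
```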

Response:
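Abridged example; size_vram is in bytes and expires_at is when the model will be unloaded:

```
{
  "models": [
    {
      "name": "llama3.2:latest",
      "size_vram": 3252000000,
      "expires_at": "2025-01-15T10:35:00Z"
    }
  ]
}
```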

Get Version
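Returns the running Ollama version:

```bash
curl http://localhost:11434/api/version
```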

Response:
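Example (your version will differ):

```
{"version": "0.5.4"}
```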

Inference Endpoints

| Endpoint             | Method | Description            |
| -------------------- | ------ | ---------------------- |
| /api/generate        | POST   | Text completion        |
| /api/chat            | POST   | Chat completion        |
| /api/embeddings      | POST   | Generate embeddings    |
| /v1/chat/completions | POST   | OpenAI-compatible chat |

Custom Model Creation

Create custom models with specific system prompts via API:
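The /api/create payload changed around Ollama 0.5; a sketch for recent versions (older releases expect a "modelfile" string instead of "from"/"system", and the model name support-bot is just an example):

```bash
curl http://localhost:11434/api/create -d '{
  "model": "support-bot",
  "from": "llama3.2",
  "system": "You are a concise, friendly support assistant."
}'
```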

GPU Configuration

Check GPU Usage
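Two quick checks over SSH: nvidia-smi for raw VRAM and utilization, and ollama ps to confirm the model is actually running on the GPU:

```bash
# GPU utilization and VRAM usage
nvidia-smi

# Loaded models; the PROCESSOR column should read "100% GPU"
ollama ps
```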

Multi-GPU

Ollama automatically uses all available GPUs. To restrict it to a specific GPU:
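Set the standard CUDA environment variable before starting the server:

```bash
# Use only the first GPU (index 0)
CUDA_VISIBLE_DEVICES=0 ollama serve
```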

Memory Management
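Memory behavior is controlled through environment variables set before ollama serve; the most useful ones on a rented server:

```bash
# Keep a model in VRAM for 1 hour after the last request (default is 5 minutes)
export OLLAMA_KEEP_ALIVE=1h

# Allow up to 2 models resident at the same time
export OLLAMA_MAX_LOADED_MODELS=2

# Serve up to 4 requests per model in parallel
export OLLAMA_NUM_PARALLEL=4

ollama serve
```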

Custom Models (Modelfile)

Create custom models with system prompts:
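A minimal Modelfile and the commands to build and run it (the pirate persona is just an example):

```bash
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM "You answer every question like a helpful pirate."
EOF

ollama create pirate -f Modelfile
ollama run pirate
```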

Running as Service

Systemd
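The official installer normally creates an ollama.service unit; if yours is missing, a minimal sketch (paths assume the default /usr/local/bin install):

```bash
sudo tee /etc/systemd/system/ollama.service <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
```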

Performance Tips

  1. Use appropriate quantization

    • Q4_K_M for speed

    • Q8_0 for quality

    • fp16 for maximum quality

  2. Match model to VRAM

    • 8GB: 7B models (Q4)

    • 16GB: 13B models or 7B (Q8)

    • 24GB: 34B models (Q4)

    • 48GB+: 70B models

  3. Keep the model loaded

    • Raise OLLAMA_KEEP_ALIVE (see Memory Management above) so the model stays in VRAM between requests instead of being reloaded from disk

  4. Fast SSD improves performance

    • Model loading and switching between models benefit from fast storage

    • Servers with NVMe SSDs can load models 2-3x faster than those with slower disks

Benchmarks

Generation Speed (tokens/sec)

| Model              | RTX 3060 | RTX 3090 | RTX 4090 | A100 40GB |
| ------------------ | -------- | -------- | -------- | --------- |
| Llama 3.2 3B (Q4)  | 120      | 160      | 200      | 220       |
| Llama 3.1 8B (Q4)  | 60       | 100      | 130      | 150       |
| Llama 3.1 8B (Q8)  | 45       | 80       | 110      | 130       |
| Mistral 7B (Q4)    | 70       | 110      | 140      | 160       |
| Mixtral 8x7B (Q4)  | -        | 35       | 55       | 75        |
| Llama 3.1 70B (Q4) | -        | -        | 18       | 35        |

Benchmarks updated January 2026. Actual speeds may vary based on server configuration.

Time to First Token (ms)

| Model | RTX 3090 | RTX 4090 | A100 |
| ----- | -------- | -------- | ---- |
| 3B    | 50       | 35       | 25   |
| 7-8B  | 120      | 80       | 60   |
| 13B   | 250      | 150      | 100  |
| 34B   | 600      | 350      | 200  |
| 70B   | -        | 1200     | 500  |

Context Length vs VRAM (Q4)

| Model | 2K ctx | 4K ctx | 8K ctx | 16K ctx |
| ----- | ------ | ------ | ------ | ------- |
| 7B    | 5GB    | 6GB    | 8GB    | 12GB    |
| 13B   | 8GB    | 10GB   | 14GB   | 22GB    |
| 34B   | 20GB   | 24GB   | 32GB   | 48GB    |
| 70B   | 40GB   | 48GB   | 64GB   | 96GB    |

GPU Requirements

| Model | Q4 VRAM | Q8 VRAM |
| ----- | ------- | ------- |
| 3B    | 3GB     | 5GB     |
| 7-8B  | 5GB     | 9GB     |
| 13B   | 8GB     | 15GB    |
| 34B   | 20GB    | 38GB    |
| 70B   | 40GB    | 75GB    |

Cost Estimate

Typical CLORE.AI marketplace rates:

| GPU           | CLORE/day | Approx USD/hr | Good For         |
| ------------- | --------- | ------------- | ---------------- |
| RTX 3060 12GB | ~100      | ~$0.02        | 7B models        |
| RTX 3090 24GB | ~150      | ~$0.03        | 13B-34B models   |
| RTX 4090 24GB | ~200      | ~$0.04        | 34B models, fast |
| A100 40GB     | ~400      | ~$0.08        | 70B models       |

Prices vary by provider. Check the CLORE.AI Marketplace for current rates. Pay with CLORE tokens for best value.

Troubleshooting

Model won't load
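Usually the model does not fit in free VRAM; check memory and fall back to a smaller or more aggressively quantized tag:

```bash
# How much VRAM is actually free?
nvidia-smi

# Try a smaller quantization of the same model (exact tags are on the model's library page)
ollama pull llama3.1:8b-instruct-q4_K_M
```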

Slow generation
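Slow output usually means part of the model was offloaded to CPU RAM; confirm it is fully on the GPU:

```bash
# PROCESSOR should read "100% GPU"; a split like "40%/60% CPU/GPU" means offloading
ollama ps
```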

Connection refused
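Make sure the server is running and listening on all interfaces (by default it binds only to 127.0.0.1):

```bash
# Is the server up at all?
curl http://localhost:11434/api/version

# If not, start it bound to all interfaces so the http_pub proxy can reach it
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```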

HTTP 502 on http_pub URL

This means the service is still starting. Wait 30-60 seconds and retry:
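A simple retry loop, assuming this page's placeholder hostname:

```bash
# Poll until the service responds (up to ~2 minutes)
for i in $(seq 1 12); do
  curl -sf https://your-http-pub.clorecloud.net/api/version && break
  echo "still starting... (attempt $i)"
  sleep 10
done
```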

Next Steps
