Overview

Run large language models (LLMs) on CLORE.AI GPUs for inference and chat applications.

| Tool | Use Case | Difficulty |
| --- | --- | --- |
|  | Easiest LLM setup | Beginner |
|  | ChatGPT-like interface | Beginner |
|  | High-throughput production serving | Medium |
|  | Efficient GGUF inference | Easy |
|  | Full-featured chat UI | Easy |
|  | Fastest EXL2 inference | Medium |
|  | OpenAI-compatible API | Medium |
|  | Fast structured generation | Medium |
|  | Hugging Face serving solution | Medium |
|  | MMLab serving toolkit | Medium |
|  | vLLM fork with extra features | Medium |
|  | Machine learning compilation | Hard |
|  | Unified API proxy | Medium |
|  | Sparse model inference | Hard |
|  | Rust-based inference engine | Medium |
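Several of the serving tools above expose an OpenAI-compatible HTTP API, so one client works across them. A minimal sketch of talking to such a server (the base URL and model name are placeholders; substitute whatever your chosen server actually serves):

```python
import json
import urllib.request

# Placeholder endpoint and model name -- adjust for your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "my-model"

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt: str) -> str:
    """Send one chat turn to an OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

Because the request and response shapes are standardized, switching between servers usually means changing only `BASE_URL` and `MODEL`.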

Model Guides

Latest & Best Models

| Model | Parameters | Best For |
| --- | --- | --- |
|  | 671B MoE | Reasoning, code, math |
|  | 671B MoE | Advanced reasoning |
|  | TBA | Next-generation DeepSeek |
|  | 0.5B-72B | Multilingual, code |
|  | TBA | Latest Qwen generation |
|  | 70B | Meta's latest 70B |
|  | TBA | Scout & Maverick variants |

Specialized Models

| Model | Parameters | Best For |
| --- | --- | --- |
|  | 6.7B-33B | Code generation |
|  | 7B-34B | Code completion |
|  | 4.7B | Fast Chinese/English |
|  | TBA | Zhipu AI latest |
|  | TBA | Moonshot AI model |
|  | 1T | Massive open-source LLM |
|  | 24B | Liquid AI model |
|  | TBA | Fast inference model |

Efficient Models

| Model | Parameters | Best For |
| --- | --- | --- |
|  | 2B-27B | Efficient inference |
|  | TBA | Google's latest compact |
|  | 14B | Small but capable |
|  | 7B / 8x7B | General purpose |
|  | 675B MoE | Enterprise-grade |
|  | TBA | Efficient Mistral variant |

GPU Recommendations

| Model Size | Minimum GPU | Recommended |
| --- | --- | --- |
| 7B (Q4) | RTX 3060 12GB | RTX 3090 |
| 13B (Q4) | RTX 3090 24GB | RTX 4090 |
| 34B (Q4) | 2x RTX 3090 | A100 40GB |
| 70B (Q4) | A100 80GB | 2x A100 |
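The rule of thumb behind these pairings: weights take roughly `parameters × bits-per-weight / 8` bytes, plus headroom for the KV cache, activations, and runtime. A rough sketch of that estimate (the 4.5 bits-per-weight figure for Q4 and the flat 2 GB overhead are assumptions, not exact values):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate in GB: weights plus a flat allowance
    for KV cache, activations, and runtime overhead."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb + overhead_gb

def fits(params_b: float, bits_per_weight: float, vram_gb: float) -> bool:
    """Does a model of this size/quant plausibly fit in the given VRAM?"""
    return estimate_vram_gb(params_b, bits_per_weight) <= vram_gb

# e.g. a 7B model at ~4.5 bits (Q4-class quant) on a 12 GB RTX 3060:
# 7 * 4.5 / 8 + 2 ~= 5.9 GB, so it fits with room to spare.
```

Treat the output as a lower bound: long contexts and large batch sizes grow the KV cache well past a flat allowance, which is why the Recommended column is a tier above the minimum.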

Quantization Guide

| Format | VRAM Usage | Quality | Speed |
| --- | --- | --- | --- |
| Q2_K | Lowest | Poor | Fastest |
| Q4_K_M | Low | Good | Fast |
| Q5_K_M | Medium | Great | Medium |
| Q8_0 | High | Excellent | Slower |
| FP16 | Highest | Best | Slowest |
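The VRAM column maps to effective bits per weight. A sketch that turns those into concrete sizes (the bits-per-weight figures are ballpark values, assumed here; block scales add overhead above the nominal bit width, and exact sizes vary by model architecture):

```python
# Approximate effective bits per weight for common GGUF quant formats.
# Ballpark figures only -- real files vary by architecture and tensor mix.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weights_gb(params_b: float, fmt: str) -> float:
    """Approximate size of the weights alone (disk or VRAM), in GB."""
    return params_b * BITS_PER_WEIGHT[fmt] / 8

# Compare formats for a 7B model:
for fmt in BITS_PER_WEIGHT:
    print(f"7B @ {fmt}: ~{weights_gb(7, fmt):.1f} GB")
```

In practice Q4_K_M is the usual starting point: it roughly quarters the FP16 footprint while keeping quality close, and you step up to Q5_K_M or Q8_0 only if VRAM allows.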
