Overview

Run large language models (LLMs) on CLORE.AI GPUs for inference and chat applications.

| Tool | Use Case | Difficulty |
| --- | --- | --- |
|  | Easiest LLM setup | Beginner |
|  | ChatGPT-like interface | Beginner |
|  | High-throughput production serving | Medium |
|  | Efficient GGUF inference | Easy |
|  | Full-featured chat UI | Easy |
|  | Fastest EXL2 inference | Medium |
|  | OpenAI-compatible API | Medium |
|  | Fast structured generation | Medium |
|  | Hugging Face serving solution | Medium |
|  | MMLab serving toolkit | Medium |
|  | vLLM fork with extra features | Medium |
|  | Machine learning compilation | Hard |
|  | Unified API proxy | Medium |
|  | Sparse model inference | Hard |
|  | Rust-based inference engine | Medium |
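Several of the serving tools above expose an OpenAI-compatible HTTP API, so one client works across them. A minimal sketch of talking to such a server (the base URL and model name are placeholders; substitute whatever your chosen server actually serves):

```python
import json
import urllib.request

# Placeholder endpoint and model name -- adjust for your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "my-model"

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt: str) -> str:
    """Send one chat turn to an OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

Because the request and response shapes are standardized, switching between servers usually means changing only `BASE_URL` and `MODEL`.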

Model Guides

Latest & Best Models

| Model | Parameters | Best For |
| --- | --- | --- |
|  | 671B MoE | Reasoning, code, math |
|  | 671B MoE | Advanced reasoning |
|  | TBA | Next-generation DeepSeek |
|  | 0.5B-72B | Multilingual, code |
|  | TBA | Latest Qwen generation |
|  | 70B | Meta's latest 70B |
|  | TBA | Scout & Maverick variants |

Specialized Models

| Model | Parameters | Best For |
| --- | --- | --- |
|  | 6.7B-33B | Code generation |
|  | 7B-34B | Code completion |
|  | 4.7B | Fast Chinese/English |
|  | TBA | Zhipu AI latest |
|  | TBA | Moonshot AI model |
|  | 1T | Massive open-source LLM |
|  | 24B | Liquid AI model |
|  | TBA | Fast inference model |

Efficient Models

| Model | Parameters | Best For |
| --- | --- | --- |
|  | 2B-27B | Efficient inference |
|  | TBA | Google's latest compact |
|  | 14B | Small but capable |
|  | 7B / 8x7B | General purpose |
|  | 675B MoE | Enterprise-grade |
|  | TBA | Efficient Mistral variant |

GPU Recommendations

| Model Size | Minimum GPU | Recommended |
| --- | --- | --- |
| 7B (Q4) | RTX 3060 12GB | RTX 3090 |
| 13B (Q4) | RTX 3090 24GB | RTX 4090 |
| 34B (Q4) | 2x RTX 3090 | A100 40GB |
| 70B (Q4) | A100 80GB | 2x A100 |
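The rule of thumb behind these pairings: weights take roughly `parameters × bits-per-weight / 8` bytes, plus headroom for the KV cache, activations, and runtime. A rough sketch of that estimate (the 4.5 bits-per-weight figure for Q4 and the flat 2 GB overhead are assumptions, not exact values):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate in GB: weights plus a flat allowance
    for KV cache, activations, and runtime overhead."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb + overhead_gb

def fits(params_b: float, bits_per_weight: float, vram_gb: float) -> bool:
    """Does a model of this size/quant plausibly fit in the given VRAM?"""
    return estimate_vram_gb(params_b, bits_per_weight) <= vram_gb

# e.g. a 7B model at ~4.5 bits (Q4-class quant) on a 12 GB RTX 3060:
# 7 * 4.5 / 8 + 2 ~= 5.9 GB, so it fits with room to spare.
```

Treat the output as a lower bound: long contexts and large batch sizes grow the KV cache well past a flat allowance, which is why the Recommended column is a tier above the minimum.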

Quantization Guide

| Format | VRAM Usage | Quality | Speed |
| --- | --- | --- | --- |
| Q2_K | Lowest | Poor | Fastest |
| Q4_K_M | Low | Good | Fast |
| Q5_K_M | Medium | Great | Medium |
| Q8_0 | High | Excellent | Slower |
| FP16 | Highest | Best | Slowest |
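The VRAM column maps to effective bits per weight. A sketch that turns those into concrete sizes (the bits-per-weight figures are ballpark values, assumed here; block scales add overhead above the nominal bit width, and exact sizes vary by model architecture):

```python
# Approximate effective bits per weight for common GGUF quant formats.
# Ballpark figures only -- real files vary by architecture and tensor mix.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weights_gb(params_b: float, fmt: str) -> float:
    """Approximate size of the weights alone (disk or VRAM), in GB."""
    return params_b * BITS_PER_WEIGHT[fmt] / 8

# Compare formats for a 7B model:
for fmt in BITS_PER_WEIGHT:
    print(f"7B @ {fmt}: ~{weights_gb(7, fmt):.1f} GB")
```

In practice Q4_K_M is the usual starting point: it roughly quarters the FP16 footprint while keeping quality close, and you step up to Q5_K_M or Q8_0 only if VRAM allows.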
