# Text Generation WebUI

Run one of the most popular LLM web interfaces, with support for all major model formats.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## Why Text Generation WebUI?

* Supports GGUF, GPTQ, AWQ, EXL2, HF formats
* Built-in chat, notebook, and API modes
* Extensions: voice, characters, multimodal
* Fine-tuning support
* Model switching on the fly

## Requirements

| Model Size | Min VRAM | Recommended |
| ---------- | -------- | ----------- |
| 7B (Q4)    | 6GB      | RTX 3060    |
| 13B (Q4)   | 10GB     | RTX 3080    |
| 30B (Q4)   | 20GB     | RTX 4090    |
| 70B (Q4)   | 40GB     | A100        |
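The table above follows a simple rule of thumb: a Q4 quantization needs roughly 0.6 GB of weights per billion parameters, plus a couple of GB of overhead for the KV cache and CUDA buffers. A rough sketch (the constants are ballpark assumptions, not measured values):

```python
def estimate_q4_vram_gb(params_billion: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a Q4-quantized model.

    Assumes ~0.6 GB of weights per billion parameters (Q4_K_M averages
    roughly 4.8 bits per weight) plus fixed overhead for KV cache and
    buffers. Ballpark figures only - measure with nvidia-smi.
    """
    return params_billion * 0.6 + overhead_gb

for size in (7, 13, 30, 70):
    print(f"{size}B (Q4): ~{estimate_q4_vram_gb(size):.0f} GB")
```

Longer contexts push the overhead up, so treat the result as a floor rather than a guarantee.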

## Quick Deploy

**Docker Image:**

```
atinoda/text-generation-webui:default-nvidia
```

**Ports:**

```
22/tcp
7860/http
5000/http
5005/http
```

**Environment:**

```
EXTRA_LAUNCH_ARGS=--listen --api
```

## Manual Installation

**Image:**

```
nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Ports:**

```
22/tcp
7860/http
5000/http
```

**Command:**

```bash
apt-get update && apt-get install -y git python3 python3-pip && \
cd /workspace && \
git clone https://github.com/oobabooga/text-generation-webui.git && \
cd text-generation-webui && \
pip install -r requirements.txt && \
python server.py --listen --api
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.

## Access WebUI

1. Wait for deployment
2. Find port 7860 mapping in **My Orders**
3. Open: `http://<proxy>:<port>`

## Download Models

### From HuggingFace (in WebUI)

1. Go to **Model** tab
2. Enter model name: `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`
3. Click **Download**

### Via Command Line

```bash
cd /workspace/text-generation-webui

# Download GGUF model
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Download specific file
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --specific-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```
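`download-model.py` pulls files from HuggingFace's standard `resolve` endpoint, so if you prefer `wget` or `curl` you can build the URL yourself. A small sketch using the repo and file from the example above:

```python
def hf_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Direct download URL for a file in a HuggingFace model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

url = hf_file_url(
    "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)
print(url)  # feed this to wget/curl, saving into the models/ folder
```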

### Recommended Models

**For Chat:**

```bash
# Llama 3.1 Instruct (8B, fast)
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Mistral Instruct (excellent)
python download-model.py bartowski/Mistral-7B-Instruct-v0.3-GGUF

# OpenHermes (great all-rounder)
python download-model.py bartowski/OpenHermes-2.5-Mistral-7B-GGUF
```

**For Coding:**

```bash
# CodeLlama
python download-model.py bartowski/CodeLlama-13B-Instruct-GGUF

# DeepSeek Coder
python download-model.py bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF
```

**For Roleplay:**

```bash
# MythoMax
python download-model.py bartowski/MythoMax-L2-13B-GGUF
```

## Loading Models

### GGUF (Recommended for most users)

1. **Model** tab → Select model folder
2. **Model loader:** llama.cpp
3. Set **n-gpu-layers:**
   * RTX 3090: 35-40
   * RTX 4090: 45-50
   * A100: 80+
4. Click **Load**
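If you are unsure how many layers fit, divide free VRAM (minus some headroom) by the per-layer cost. The constants below are assumptions for a 7B Q4 model, not measured values:

```python
def suggest_gpu_layers(free_vram_gb: float, layer_gb: float = 0.15,
                       total_layers: int = 32, headroom_gb: float = 2.0) -> int:
    """Suggest an n-gpu-layers value from free VRAM.

    layer_gb and headroom_gb are rough assumptions for a 7B Q4 model;
    larger models have more and bigger layers, so adjust accordingly.
    """
    fits = int(max(0.0, free_vram_gb - headroom_gb) / layer_gb)
    return min(fits, total_layers)

print(suggest_gpu_layers(6))   # small card: partial offload
print(suggest_gpu_layers(24))  # RTX 3090/4090: everything on GPU
```

If the model still fails to load, lower the value and reload.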

### GPTQ (Fast, quantized)

1. Download GPTQ model
2. **Model loader:** ExLlama\_HF or AutoGPTQ
3. Load model

### EXL2 (Best speed)

```bash
# Install exllamav2
pip install exllamav2
```

1. Download EXL2 model
2. **Model loader:** ExLlamav2\_HF
3. Load

## Chat Configuration

### Character Setup

1. Go to **Parameters** → **Character**
2. Create or load character card
3. Set:
   * Name
   * Context/persona
   * Example dialogue

### Instruct Mode

For instruction-tuned models:

1. **Parameters** → **Instruction template**
2. Select template matching your model:
   * Llama-2-chat
   * Mistral
   * ChatML
   * Alpaca
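The template only controls how chat turns are wrapped before they reach the model. ChatML, for example (used by OpenHermes among others), frames each turn with `<|im_start|>`/`<|im_end|>` markers; a hand-rolled sketch of what the WebUI produces:

```python
def to_chatml(messages: list[dict]) -> str:
    """Render chat messages as a ChatML prompt ending in an open assistant turn."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```

Picking the wrong template usually shows up as the model answering for both sides of the conversation or ignoring the system prompt.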

## API Usage

### Enable API

Start the server with the `--api` flag (the API listens on port 5000 by default).

### OpenAI-compatible API

```python
from openai import OpenAI

# Point the client at the WebUI's OpenAI-compatible endpoint
# (replace localhost with your http_pub URL when connecting remotely)
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="any",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

### Completions API

On recent releases the legacy native endpoint (`/api/v1/generate`) has been replaced by the OpenAI-compatible API, so plain text completion also goes through port 5000:

```python
import requests

response = requests.post(
    "http://localhost:5000/v1/completions",
    json={
        "prompt": "Write a story about",
        "max_tokens": 200,
        "temperature": 0.7
    }
)
print(response.json()["choices"][0]["text"])
```

## Extensions

### Installing Extensions

```bash
cd /workspace/text-generation-webui/extensions

# Clone the community extensions index (links to third-party extensions)
git clone https://github.com/oobabooga/text-generation-webui-extensions

# silero_tts (voice) and superboogav2 (RAG/long-term memory) ship
# with the WebUI itself - just enable them in the Session tab
```

### Enable Extensions

1. **Session** tab → **Extensions**
2. Check boxes for desired extensions
3. Click **Apply and restart**

### Popular Extensions

| Extension         | Purpose             |
| ----------------- | ------------------- |
| silero\_tts       | Voice output        |
| whisper\_stt      | Voice input         |
| superbooga        | Document Q\&A       |
| sd\_api\_pictures | Image generation    |
| multimodal        | Image understanding |

## Performance Tuning

### GGUF Settings

```
n_gpu_layers: 35    # GPU layers (more = faster)
n_ctx: 4096         # Context length
n_batch: 512        # Batch size
threads: 8          # CPU threads
```

### Memory Optimization

For limited VRAM:

```bash
python server.py --listen --n-gpu-layers 20 --no-mmap
```

### Speed Optimization

```bash
# Use llama.cpp with cuBLAS
python server.py --listen --loader llama.cpp --n-gpu-layers 50 --threads 8
```

## Fine-tuning (LoRA)

### Training Tab

1. Go to **Training** tab
2. Load base model
3. Upload dataset (JSON format)
4. Configure:
   * LoRA rank: 8-32
   * Learning rate: 1e-4
   * Epochs: 3-5
5. Start training

### Dataset Format

```json
[
  {"instruction": "Summarize this:", "input": "Long text...", "output": "Summary..."},
  {"instruction": "Translate to French:", "input": "Hello", "output": "Bonjour"}
]
```
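Malformed records are a common cause of silent training failures, so it's worth validating the file before uploading. A minimal check (the key names match the format above):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def invalid_records(records: list[dict]) -> list[int]:
    """Return the indices of records missing any required key."""
    return [i for i, r in enumerate(records) if not REQUIRED.issubset(r)]

data = json.loads("""[
  {"instruction": "Summarize this:", "input": "Long text...", "output": "Summary..."},
  {"instruction": "Translate to French:", "input": "Hello"}
]""")
print(invalid_records(data))  # the second record is missing "output"
```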

## Saving Your Work

```bash
# Save models
rsync -avz /workspace/text-generation-webui/models/ backup-server:/models/

# Save characters
rsync -avz /workspace/text-generation-webui/characters/ backup-server:/characters/

# Save LoRAs
rsync -avz /workspace/text-generation-webui/loras/ backup-server:/loras/
```

## Troubleshooting

### Model won't load

* Check VRAM usage: `nvidia-smi`
* Reduce `n_gpu_layers`
* Use smaller quantization (Q4\_K\_M → Q4\_K\_S)

### Slow generation

* Increase `n_gpu_layers`
* Use EXL2 instead of GGUF
* Enable `--no-mmap`

### Out of memory during generation

* Reduce `n_ctx` (context length)
* Use `--n-gpu-layers 0` for CPU-only inference
* Try a smaller model

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*
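Budgeting a job is just hours times rate; a quick sketch using the approximate RTX 4090 figure from the table (rates change, so plug in the live price):

```python
def session_cost(hourly_rate_usd: float, hours: float) -> float:
    """Cost of renting at a fixed hourly rate."""
    return hourly_rate_usd * hours

rtx4090 = 0.10  # approximate rate from the table above, USD/hour
print(f"4-hour session: ${session_cost(rtx4090, 4):.2f}")
print(f"one week:       ${session_cost(rtx4090, 24 * 7):.2f}")
```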

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers
