# Text Generation WebUI

Run one of the most popular LLM interfaces, with support for all major model formats.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

## Renting on CLORE.AI

1. Visit [CLORE.AI Marketplace](https://clore.ai/marketplace)
2. Filter by GPU type, VRAM, and price
3. Choose **On-Demand** (fixed rate) or **Spot** (bid price)
4. Configure your order:
   * Select Docker image
   * Set ports (TCP for SSH, HTTP for web UIs)
   * Add environment variables if needed
   * Enter startup command
5. Select payment: **CLORE**, **BTC**, or **USDT/USDC**
6. Create order and wait for deployment

### Access Your Server

* Find connection details in **My Orders**
* Web interfaces: Use the HTTP port URL
* SSH: `ssh -p <port> root@<proxy-address>`

## Why Text Generation WebUI?

* Supports GGUF, GPTQ, AWQ, EXL2, HF formats
* Built-in chat, notebook, and API modes
* Extensions: voice, characters, multimodal
* Fine-tuning support
* Model switching on the fly

## Requirements

| Model Size | Min VRAM | Recommended |
| ---------- | -------- | ----------- |
| 7B (Q4)    | 6GB      | RTX 3060    |
| 13B (Q4)   | 10GB     | RTX 3080    |
| 30B (Q4)   | 20GB     | RTX 4090    |
| 70B (Q4)   | 40GB     | A100        |
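
These figures follow a rough rule of thumb: a Q4-quantized model takes about 0.6 bytes per parameter, plus one to two gigabytes of overhead for the KV cache and runtime. A small sketch of the estimate (approximation only, not an exact formula):

```python
# Rough VRAM estimate for Q4-quantized GGUF models (rule of thumb,
# not an exact formula): ~0.6 bytes/parameter + KV-cache/runtime overhead
def q4_vram_gb(params_billion: float, overhead_gb: float = 1.5) -> float:
    return params_billion * 0.6 + overhead_gb

for size in (7, 13, 30, 70):
    print(f"{size}B Q4 ~ {q4_vram_gb(size):.0f} GB VRAM")
```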

## Quick Deploy

**Docker Image:**

```
atinoda/text-generation-webui:default-nvidia
```

**Ports:**

```
22/tcp
7860/http
5000/http
5005/http
```

**Environment:**

```
EXTRA_LAUNCH_ARGS=--listen --api
```

## Manual Installation

**Image:**

```
nvidia/cuda:12.1.0-devel-ubuntu22.04
```

**Ports:**

```
22/tcp
7860/http
5000/http
```

**Command:**

```bash
apt-get update && apt-get install -y git python3 python3-pip && \
mkdir -p /workspace && cd /workspace && \
git clone https://github.com/oobabooga/text-generation-webui.git && \
cd text-generation-webui && \
pip install -r requirements.txt && \
python server.py --listen --api
```

## Accessing Your Service

After deployment, find your `http_pub` URL in **My Orders**:

1. Go to **My Orders** page
2. Click on your order
3. Find the `http_pub` URL (e.g., `abc123.clorecloud.net`)

Use `https://YOUR_HTTP_PUB_URL` instead of `localhost` in examples below.
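
To confirm the service is up before opening a browser, here is a minimal check with Python's `requests` library (the host below is the placeholder from above; substitute your own `http_pub` URL):

```python
import requests

BASE = "https://abc123.clorecloud.net"  # placeholder: use your http_pub URL

# The Gradio UI answers with HTTP 200 once the server is ready
print(requests.get(BASE, timeout=10).status_code)
```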

## Access WebUI

1. Wait for deployment
2. Find port 7860 mapping in **My Orders**
3. Open: `http://<proxy>:<port>`

## Download Models

### From HuggingFace (in WebUI)

1. Go to **Model** tab
2. Enter model name: `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`
3. Click **Download**

### Via Command Line

```bash
cd /workspace/text-generation-webui

# Download GGUF model
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Download specific file
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --specific-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
```
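
Alternatively, downloads can be scripted with the `huggingface_hub` package (`pip install huggingface_hub`); a sketch mirroring the specific-file example above:

```python
from huggingface_hub import hf_hub_download

# Fetch one quantization file from the same repo as the CLI example
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    local_dir="/workspace/text-generation-webui/models",
)
print(path)
```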

### Recommended Models

**For Chat:**

```bash
# Llama 3.1 Instruct (8B, fast)
python download-model.py bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Mistral Instruct (excellent)
python download-model.py bartowski/Mistral-7B-Instruct-v0.3-GGUF

# OpenHermes (great all-rounder)
python download-model.py bartowski/OpenHermes-2.5-Mistral-7B-GGUF
```

**For Coding:**

```bash
# CodeLlama
python download-model.py bartowski/CodeLlama-13B-Instruct-GGUF

# DeepSeek Coder
python download-model.py bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF
```

**For Roleplay:**

```bash
# MythoMax
python download-model.py bartowski/MythoMax-L2-13B-GGUF
```

## Loading Models

### GGUF (Recommended for most users)

1. **Model** tab → Select model folder
2. **Model loader:** llama.cpp
3. Set **n-gpu-layers** (typical starting points; see the sketch after these steps):
   * RTX 3090: 35-40
   * RTX 4090: 45-50
   * A100: 80+
4. Click **Load**
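
To pick a starting value from the VRAM you actually have free, you can query `nvidia-smi`; a heuristic sketch (the per-layer cost is an assumed figure for 7-8B Q4 models at 4k context, so tune it for your model):

```python
import subprocess

def free_vram_mib(gpu: int = 0) -> int:
    """Free VRAM in MiB, queried via nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi", f"--id={gpu}",
        "--query-gpu=memory.free", "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip().splitlines()[0])

MIB_PER_LAYER = 360  # assumed cost per offloaded layer (7-8B Q4, 4k ctx)
TOTAL_LAYERS = 33    # e.g. Llama-3.1-8B: 32 blocks + output layer

suggested = min(TOTAL_LAYERS, free_vram_mib() // MIB_PER_LAYER)
print(f"Suggested n-gpu-layers: {suggested}")
```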

### GPTQ (Fast, quantized)

1. Download GPTQ model
2. **Model loader:** ExLlama\_HF or AutoGPTQ
3. Load model

### EXL2 (Best speed)

```bash
# Install exllamav2
pip install exllamav2
```

1. Download EXL2 model
2. **Model loader:** ExLlamav2\_HF
3. Load

## Chat Configuration

### Character Setup

1. Go to **Parameters** → **Character**
2. Create or load character card
3. Set:
   * Name
   * Context/persona
   * Example dialogue

### Instruct Mode

For instruction-tuned models:

1. **Parameters** → **Instruction template**
2. Select template matching your model:
   * Llama-2-chat
   * Mistral
   * ChatML
   * Alpaca
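
For reference, ChatML (used by many recent instruct models) wraps each turn in `<|im_start|>`/`<|im_end|>` markers. The WebUI builds the prompt from the selected template automatically; this sketch only illustrates the layout:

```python
# Illustrative only: the WebUI assembles this prompt for you
def to_chatml(messages: list[dict]) -> str:
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return prompt + "<|im_start|>assistant\n"  # cue the model to answer

print(to_chatml([{"role": "user", "content": "Hello!"}]))
```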

## API Usage

### Enable API

Start the server with the `--api` flag (the API listens on port 5000 by default).

### OpenAI-compatible API

```python
from openai import OpenAI

# openai>=1.0 client; the key is unused but must be non-empty
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="any",  # the server uses whichever model is currently loaded
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
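
Streaming works through the same endpoint; a short variant reusing the `client` from above:

```python
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="any",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```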

### Native API

Older builds also expose a native endpoint at `/api/v1/generate` (blocking API on port 5000, streaming on 5005); recent versions have replaced it with the OpenAI-compatible API above, so prefer that when available.

```python
import requests

response = requests.post(
    "http://localhost:5000/api/v1/generate",
    json={
        "prompt": "Write a story about",
        "max_new_tokens": 200,
        "temperature": 0.7
    }
)
print(response.json()["results"][0]["text"])
```

## Extensions

### Installing Extensions

```bash
cd /workspace/text-generation-webui/extensions

# silero_tts (voice) and superbooga/superboogav2 (RAG/long-term memory)
# ship with the WebUI - just enable them in the UI

# Third-party extensions install by cloning their repo into this folder;
# a catalog of community extensions lives at:
# https://github.com/oobabooga/text-generation-webui-extensions
```

### Enable Extensions

1. **Session** tab → **Extensions**
2. Check boxes for desired extensions
3. Click **Apply and restart**

### Popular Extensions

| Extension         | Purpose             |
| ----------------- | ------------------- |
| silero\_tts       | Voice output        |
| whisper\_stt      | Voice input         |
| superbooga        | Document Q\&A       |
| sd\_api\_pictures | Image generation    |
| multimodal        | Image understanding |

## Performance Tuning

### GGUF Settings

```
n_gpu_layers: 35    # GPU layers (more = faster)
n_ctx: 4096         # Context length
n_batch: 512        # Batch size
threads: 8          # CPU threads
```
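
These names correspond to llama.cpp loader parameters. To see what each knob does outside the UI, here is a minimal sketch using the `llama-cpp-python` bindings (illustrative only; the WebUI drives llama.cpp internally):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=35,  # layers offloaded to GPU (more = faster, more VRAM)
    n_ctx=4096,       # context window in tokens
    n_batch=512,      # prompt-processing batch size
    n_threads=8,      # CPU threads for non-offloaded layers
)
out = llm("Q: What is 2+2? A:", max_tokens=8)
print(out["choices"][0]["text"])
```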

### Memory Optimization

For limited VRAM:

```bash
python server.py --listen --n-gpu-layers 20 --no-mmap
```

### Speed Optimization

```bash
# Use llama.cpp with cuBLAS
python server.py --listen --loader llama.cpp --n-gpu-layers 50 --threads 8
```

## Fine-tuning (LoRA)

### Training Tab

1. Go to **Training** tab
2. Load base model
3. Upload dataset (JSON format)
4. Configure (see the sketch after these steps):
   * LoRA rank: 8-32
   * Learning rate: 1e-4
   * Epochs: 3-5
5. Start training
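
Under the hood, these settings map roughly onto a PEFT `LoraConfig`. A hedged sketch of the equivalent configuration (the target modules are an assumption and vary by model architecture):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                                 # LoRA rank (the 8-32 above)
    lora_alpha=32,                        # scaling factor, commonly 2*r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: Llama-style names
    task_type="CAUSAL_LM",
)
```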

### Dataset Format

```json
[
  {"instruction": "Summarize this:", "input": "Long text...", "output": "Summary..."},
  {"instruction": "Translate to French:", "input": "Hello", "output": "Bonjour"}
]
```
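
A quick standard-library sketch to generate and sanity-check a dataset in this format before uploading:

```python
import json

records = [
    {"instruction": "Summarize this:", "input": "Long text...", "output": "Summary..."},
    {"instruction": "Translate to French:", "input": "Hello", "output": "Bonjour"},
]

# Every record needs all three keys - fail fast on malformed rows
for i, rec in enumerate(records):
    missing = {"instruction", "input", "output"} - rec.keys()
    assert not missing, f"record {i} missing {missing}"

with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```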

## Saving Your Work

```bash
# Save models
rsync -avz /workspace/text-generation-webui/models/ backup-server:/models/

# Save characters
rsync -avz /workspace/text-generation-webui/characters/ backup-server:/characters/

# Save LoRAs
rsync -avz /workspace/text-generation-webui/loras/ backup-server:/loras/
```

## Troubleshooting

### Model won't load

* Check VRAM usage: `nvidia-smi`
* Reduce `n_gpu_layers`
* Use smaller quantization (Q4\_K\_M → Q4\_K\_S)

### Slow generation

* Increase `n_gpu_layers`
* Use EXL2 instead of GGUF
* Enable `--no-mmap`

### Out of memory during generation

* Reduce `n_ctx` (context length)
* Use `--n-gpu-layers 0` for CPU-only inference
* Try a smaller model

## Cost Estimate

Typical CLORE.AI marketplace rates (as of 2024):

| GPU       | Hourly Rate | Daily Rate | 4-Hour Session |
| --------- | ----------- | ---------- | -------------- |
| RTX 3060  | \~$0.03     | \~$0.70    | \~$0.12        |
| RTX 3090  | \~$0.06     | \~$1.50    | \~$0.25        |
| RTX 4090  | \~$0.10     | \~$2.30    | \~$0.40        |
| A100 40GB | \~$0.17     | \~$4.00    | \~$0.70        |
| A100 80GB | \~$0.25     | \~$6.00    | \~$1.00        |

*Prices vary by provider and demand. Check* [*CLORE.AI Marketplace*](https://clore.ai/marketplace) *for current rates.*

**Save money:**

* Use **Spot** market for flexible workloads (often 30-50% cheaper)
* Pay with **CLORE** tokens
* Compare prices across different providers

