# LitGPT

**LitGPT** is a high-performance library, built on PyTorch Lightning, for pretraining, finetuning, and deploying 20+ large language models. With 12K+ GitHub stars, it's a go-to toolkit for engineers who want clean, hackable LLM training code without the abstraction overhead of HuggingFace Transformers.

Each model in LitGPT is \~1,000 lines of clean PyTorch — no inheritance chains 10 levels deep, no magic. You can read the Llama 3 implementation end-to-end in an afternoon and modify it confidently.

{% hint style="success" %}
All examples can be run on GPU servers rented through [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What is LitGPT?

LitGPT provides production-ready implementations of state-of-the-art LLMs with a unified training interface:

* **20+ supported models** — Llama 3, Gemma 2, Mistral, Phi-3, Falcon, StableLM, and more
* **Pretrain from scratch** — full pretraining with Flash Attention, FSDP, and gradient checkpointing
* **Finetune efficiently** — full finetuning, LoRA, QLoRA, and Adapter methods
* **Serve with confidence** — built-in inference server with quantization
* **Multi-GPU support** — DDP, FSDP, tensor parallelism out of the box
* **Memory efficient** — 4-bit quantization, gradient checkpointing, activation checkpointing

***

## Server Requirements

| Component | Minimum          | Recommended       |
| --------- | ---------------- | ----------------- |
| GPU       | RTX 3090 (24 GB) | A100 80 GB / H100 |
| VRAM      | 16 GB (7B LoRA)  | 80 GB+ (70B full) |
| RAM       | 32 GB            | 64 GB+            |
| CPU       | 8 cores          | 16+ cores         |
| Storage   | 100 GB           | 500 GB+           |
| OS        | Ubuntu 20.04+    | Ubuntu 22.04      |
| Python    | 3.10+            | 3.11              |
| CUDA      | 11.8+            | 12.1+             |

### VRAM Requirements by Task

| Task              | Model       | VRAM              |
| ----------------- | ----------- | ----------------- |
| Inference (4-bit) | Llama-3 8B  | \~6 GB            |
| QLoRA finetune    | Llama-3 8B  | \~8 GB            |
| LoRA finetune     | Llama-3 8B  | \~16 GB           |
| Full finetune     | Llama-3 8B  | \~80 GB           |
| LoRA finetune     | Llama-3 70B | \~48 GB (2×A100)  |
| Full finetune     | Llama-3 70B | \~640 GB (8×A100) |
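
These figures follow a simple rule of thumb: weights cost `params × bytes per param` (2 bytes in bf16, \~0.5 bytes for 4-bit NF4), and full finetuning with AdamW adds gradients plus two optimizer states on top. A minimal sketch of the arithmetic (the helper and its fixed overhead constant are illustrative assumptions, not part of LitGPT); it gives a crude upper bound, and real runs come in lower with activation checkpointing and sharded or lower-precision optimizer states:

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float = 2.0,
                     full_finetune: bool = False) -> float:
    """Back-of-envelope VRAM estimate in GB (illustrative, not a LitGPT API)."""
    weights = params_b * bytes_per_param       # model weights
    extra = 0.0
    if full_finetune:
        extra += params_b * 2.0                # gradients in bf16
        extra += params_b * 8.0                # AdamW m and v states in fp32
    overhead = 2.0                             # activations, KV cache, CUDA context (rough)
    return weights + extra + overhead

print(f"8B inference, bf16:  ~{estimate_vram_gb(8):.0f} GB")        # ~18 GB
print(f"8B inference, 4-bit: ~{estimate_vram_gb(8, 0.5):.0f} GB")   # ~6 GB
print(f"8B full finetune:    ~{estimate_vram_gb(8, 2.0, True):.0f} GB upper bound")
```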

***

## Ports

| Port | Service                 | Notes                           |
| ---- | ----------------------- | ------------------------------- |
| 22   | SSH                     | Terminal access & file transfer |
| 8000 | LitGPT Inference Server | REST API for model serving      |

***

## Quick Start with Docker

```bash
# Pull the official LitGPT image
docker pull pytorchlightning/litgpt:latest

# Run interactive container with GPU
docker run -it --gpus all \
  -p 8000:8000 \
  -v $(pwd)/checkpoints:/checkpoints \
  -v $(pwd)/data:/data \
  pytorchlightning/litgpt:latest \
  bash

# Or run a specific command directly
docker run --gpus all \
  -v $(pwd)/checkpoints:/checkpoints \
  pytorchlightning/litgpt:latest \
  litgpt download --repo_id meta-llama/Llama-3.2-3B-Instruct
```

***

## Installation on Clore.ai

### Step 1 — Rent a Server

1. Go to [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter for **VRAM ≥ 24 GB** (RTX 3090 or better)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Open ports **22** and **8000** in your order settings
5. Select **storage ≥ 200 GB** for model weights

### Step 2 — Connect via SSH

```bash
ssh root@<server-ip> -p <ssh-port>
```

### Step 3 — Install LitGPT

```bash
# Install via pip (recommended)
pip install litgpt

# With all extras (quantization, server, etc.)
pip install 'litgpt[all]'

# Or install from source for latest features
git clone https://github.com/Lightning-AI/litgpt.git
cd litgpt
pip install -e '.[all]'
```

### Step 4 — Verify Installation

```bash
litgpt --help
```

Expected output:

```
Usage: litgpt [OPTIONS] COMMAND [ARGS]...
  
Commands:
  chat       Chat with a model
  convert    Convert model weights
  download   Download model weights
  evaluate   Evaluate a model
  finetune   Finetune a model
  generate   Generate text
  pretrain   Pretrain a model
  serve      Serve a model for inference
```

***

## Downloading Models

LitGPT downloads models from Hugging Face:

```bash
# List available models
litgpt download --list

# Download Llama 3.2 3B (requires HF token for gated models)
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --checkpoint_dir checkpoints/

# Download Mistral 7B (open access)
litgpt download \
  --repo_id mistralai/Mistral-7B-Instruct-v0.3

# Download Gemma 2 2B
litgpt download \
  --repo_id google/gemma-2-2b-it \
  --access_token your-hf-token

# Download Phi-3 (small but powerful)
litgpt download \
  --repo_id microsoft/Phi-3-mini-4k-instruct
```
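
After a download finishes, you can sanity-check that the checkpoint directory is complete before starting a long job. A small sketch (the file names match recent LitGPT releases; older versions may lay out checkpoints differently):

```python
# Sanity-check that a download completed before launching a long job
# (file names match recent LitGPT releases; adjust for your version)
from pathlib import Path

ckpt = Path("checkpoints/meta-llama/Llama-3.2-3B-Instruct")
for name in ["lit_model.pth", "model_config.yaml"]:
    path = ckpt / name
    status = f"{path.stat().st_size / 1e9:.2f} GB" if path.exists() else "MISSING"
    print(f"{name}: {status}")
```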

### Set HuggingFace Token

```bash
# For gated models (Llama, Gemma)
export HF_TOKEN=hf_your-token-here

# Or authenticate via CLI
pip install huggingface_hub
huggingface-cli login
```
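
To confirm the token works before kicking off a multi-gigabyte download, you can ask the Hub which account it belongs to, using the official `huggingface_hub` client:

```python
# Verify the HF token resolves to your account before a large download
import os
from huggingface_hub import whoami

info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")
```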

***

## Inference (Chat & Generate)

```bash
# Interactive chat
litgpt chat \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct

# Single generation
litgpt generate \
  --prompt "Explain GPU computing in simple terms" \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --max_new_tokens 200

# With temperature and sampling
litgpt generate \
  --prompt "Write a Python function to sort a list" \
  --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.3 \
  --temperature 0.7 \
  --top_p 0.9 \
  --max_new_tokens 500
```
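
Recent LitGPT releases also expose a small Python API for the same thing; a minimal sketch (check the LitGPT documentation for the exact `generate` signature in your installed version):

```python
# Minimal sketch of LitGPT's Python API (available in recent releases)
from litgpt import LLM

llm = LLM.load("checkpoints/meta-llama/Llama-3.2-3B-Instruct")
text = llm.generate(
    "Explain GPU computing in simple terms",
    max_new_tokens=200,
    temperature=0.7,
)
print(text)
```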

***

## Finetuning

### LoRA Finetuning (Recommended)

LoRA trains a small set of adapter parameters (typically 0.1–1% of total weights) while the base model stays frozen. Llama 3 8B LoRA on 10K examples takes \~2 hours on an RTX 3090 with `r=16`.
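
The 0.1–1% figure follows from the shapes involved: for a linear layer with weight shape `d_out × d_in`, a rank-`r` LoRA adds matrices `A (r × d_in)` and `B (d_out × r)`, i.e. `r·(d_in + d_out)` trainable parameters. A quick back-of-envelope (layer counts and dimensions are illustrative round numbers):

```python
# Rough LoRA trainable-parameter count for the attention projections of a
# Llama-style model (dimensions are illustrative round numbers, ignoring GQA)
d_model, n_layers, r = 4096, 32, 16

# Each adapted (d_out x d_in) linear gains r * (d_in + d_out) parameters
per_layer = r * (d_model + d_model)        # e.g. a square q_proj
lora_params = per_layer * 4 * n_layers     # q, k, v, o projections per block

print(f"LoRA params: {lora_params / 1e6:.1f}M "
      f"(~{lora_params / 8e9 * 100:.2f}% of an 8B model)")
```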

```bash
# Prepare your dataset
# Format: JSON Lines with {"instruction": "...", "input": "...", "output": "..."}
mkdir -p data
cat > data/train.json << 'EOF'
{"instruction": "What is GPU cloud computing?", "input": "", "output": "GPU cloud computing provides on-demand access to GPU hardware through the internet, enabling AI training and inference without owning physical hardware."}
{"instruction": "How do I rent a GPU on Clore.ai?", "input": "", "output": "Visit clore.ai/marketplace, filter by GPU specs, select a server, configure ports, and click rent. SSH access is provided immediately."}
EOF

# Finetune with LoRA
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 3 \
  --train.micro_batch_size 4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --out_dir out/llama-lora-finetuned

# Monitor training
# LitGPT outputs logs with loss, learning rate, and ETA
```
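
For more than a handful of examples, building the file programmatically is less error-prone than pasting a heredoc. A small sketch that writes the same JSON Lines format shown above (the record list is illustrative):

```python
# Write instruction-tuning records in the JSON Lines format LitGPT expects
import json

records = [
    {
        "instruction": "What is GPU cloud computing?",
        "input": "",
        "output": "GPU cloud computing provides on-demand access to GPU "
                  "hardware through the internet.",
    },
    # ... extend with your own examples
]

with open("data/train.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # one JSON object per line

print(f"Wrote {len(records)} examples")
```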

### QLoRA (4-bit + LoRA)

Use QLoRA to finetune larger models on limited VRAM. Llama 3.1 8B fits on a single 24 GB RTX 3090:

```bash
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.1-8B-Instruct \
  --quantize bnb.nf4 \
  --train.epochs 3 \
  --train.micro_batch_size 2 \
  --lora_r 16 \
  --lora_alpha 32 \
  --out_dir out/llama-qlora
```

### Full Finetuning

```bash
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 2 \
  --train.micro_batch_size 2 \
  --train.global_batch_size 16 \
  --out_dir out/llama-full-finetuned
```

### Multi-GPU Training

```bash
# Use FSDP across multiple GPUs
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.1-8B-Instruct \
  --devices 4 \
  --strategy fsdp \
  --train.epochs 3 \
  --out_dir out/llama-multigpu
```

***

## Serving Models (REST API)

```bash
# Start inference server
litgpt serve \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Test the API
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
```

### Python Client

```python
import requests

response = requests.post(
    "http://<server-ip>:8000/predict",
    json={
        "prompt": "Explain reinforcement learning",
        "max_new_tokens": 500,
        "temperature": 0.8,
        "top_p": 0.9,
    },
    timeout=120,  # long generations can take a while
)
response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
print(response.json()["output"])
```

***

## Pretraining from Scratch

For training a custom LLM from scratch on your own data:

```bash
# Prepare pretraining data (tokenized and chunked)
python scripts/prepare_redpajama.py \
  --source_path /data/raw_text \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --destination_path /data/tokenized

# Start pretraining
litgpt pretrain \
  --model_name Llama-3.2-3B \
  --data /data/tokenized \
  --train.micro_batch_size 4 \
  --train.max_tokens 10_000_000_000 \
  --devices 8 \
  --strategy fsdp \
  --out_dir out/my-pretrained-llm
```
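
Before committing to a long run, sanity-check the compute budget with the standard \~6·N·D FLOPs rule of thumb (N = parameters, D = training tokens). The sustained-throughput figure below is an illustrative assumption; measure your own cluster:

```python
# Back-of-envelope pretraining budget using the ~6 * N * D FLOPs rule of thumb
n_params = 3e9          # 3B-parameter model
n_tokens = 10e9         # matches --train.max_tokens 10_000_000_000
flops = 6 * n_params * n_tokens

# Assume ~150 TFLOP/s sustained per A100 in bf16 (illustrative; measure yours)
sustained_per_gpu = 0.15e15
gpus = 8
hours = flops / (sustained_per_gpu * gpus) / 3600
print(f"~{flops:.1e} FLOPs, roughly {hours:.0f} hours on {gpus} GPUs")
```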

***

## Converting and Exporting Models

```bash
# Merge LoRA weights into base model
litgpt merge_lora \
  --checkpoint_dir out/llama-lora-finetuned

# Convert to HuggingFace format for distribution
litgpt convert to_hf \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --output_dir hf_model/

# Export to GGUF format (for Ollama / llama.cpp)
# Use the llama.cpp conversion script after HF export
python llama.cpp/convert_hf_to_gguf.py hf_model/ --outfile model.gguf
```

***

## Evaluating Models

```bash
# Run MMLU benchmark
litgpt evaluate \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --tasks mmlu \
  --num_fewshot 5

# Run multiple benchmarks
litgpt evaluate \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --tasks "mmlu,hellaswag,truthfulqa_mc"
```

***

## Clore.ai GPU Recommendations

LitGPT covers three distinct workloads — inference, LoRA finetuning, and full pretraining — each with different GPU requirements.

| Workload                              | GPU            | VRAM  | Notes                                              |
| ------------------------------------- | -------------- | ----- | -------------------------------------------------- |
| Inference / chat (7–8B models)        | **RTX 3090**   | 24 GB | Fits Llama 3 8B in bf16; \~95 tok/s generation     |
| LoRA finetune (7–8B models)           | **RTX 3090**   | 24 GB | Budget pick; QLoRA keeps VRAM under 10 GB          |
| LoRA finetune (7–8B), fast iteration  | **RTX 4090**   | 24 GB | \~35% faster than 3090; reduces 2hr job to \~1.4hr |
| Full finetune (7B) or QLoRA (70B)     | **A100 40 GB** | 40 GB | 40 GB fits 7B full-precision or 70B 4-bit          |
| Full finetune (13B+) or pretrain runs | **A100 80 GB** | 80 GB | Highest throughput; \~2,800 tok/sec training on 8B |

**Recommended for most users:** RTX 3090 pair (2×24 GB = 48 GB effective with FSDP). Handles QLoRA on 70B models, or full finetune on 7B models with tensor parallelism. Cost on Clore.ai: \~$0.25/hr for two 3090s.

**For pretraining or >70B finetuning:** Use 4×A100 80GB with FSDP. LitGPT's FSDP integration handles sharding transparently — just pass `--devices 4 --strategy fsdp`.

***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce batch size
--train.micro_batch_size 1

# Enable gradient checkpointing
--train.gradient_checkpointing true

# Use QLoRA instead of LoRA
--quantize bnb.nf4

# Check GPU memory
nvidia-smi
```

### Download fails / HuggingFace 401

```bash
# Set HF token
export HF_TOKEN=hf_your-token-here
huggingface-cli login

# Or pass directly
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --access_token hf_your-token
```

### Training loss doesn't decrease

```bash
# Check your data format — must be valid JSON Lines
python -c "
import json
with open('data/train.json') as f:
    for i, line in enumerate(f):
        json.loads(line)
        if i < 3: print(f'Line {i}: OK')
print('All lines valid')
"

# Reduce learning rate
--train.lr 1e-5  # Default is often too high for small datasets

# Check data size — LoRA typically needs at least a few hundred examples
wc -l data/train.json
```

### Server port 8000 not accessible

```bash
# Verify server is listening
ss -tlnp | grep 8000

# Open firewall
ufw allow 8000/tcp

# Restart server with explicit host
litgpt serve \
  --checkpoint_dir checkpoints/... \
  --host 0.0.0.0 \
  --port 8000
```
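
If the server is listening but curl from outside still fails, a plain TCP connect from your local machine separates a network problem from an HTTP one (replace the placeholders with your Clore.ai server's public IP and mapped port):

```python
# Test raw TCP reachability of the inference port from your local machine
import socket

host, port = "<server-ip>", 8000  # your server's public IP and mapped port
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect((host, port))
    print("Port reachable; debug the HTTP layer next")
except OSError as exc:
    print(f"Cannot connect: {exc}; check firewall rules and port mapping")
finally:
    sock.close()
```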

### Multi-GPU training hangs

```bash
# Check that all GPUs are visible (NCCL hangs often show up as missing devices)
python -c "import torch; print(torch.cuda.device_count())"

# Try DDP instead of FSDP for smaller models
--strategy ddp

# Set NCCL environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # If InfiniBand is not available
```

***

## Useful Links

* **GitHub**: <https://github.com/Lightning-AI/litgpt> ⭐ 12K+
* **Documentation**: <https://lightning.ai/docs/litgpt>
* **PyTorch Lightning**: <https://lightning.ai>
* **HuggingFace Models**: <https://huggingface.co/models>
* **Discord**: <https://discord.gg/lightning-ai>
* **Clore.ai Marketplace**: <https://clore.ai/marketplace>

