# LitGPT

**LitGPT** is a high-performance library built on PyTorch Lightning for pretraining from scratch, fine-tuning, and deploying more than 20 large language models. With 12K+ GitHub stars, it is the go-to toolkit for engineers who want clean, hackable LLM training code without the abstraction overhead of HuggingFace Transformers.

Each model in LitGPT is roughly 1,000 lines of clean PyTorch code, with no ten-level inheritance chains and no magic. You can read the Llama 3 implementation end to end in an afternoon and modify it with confidence.

{% hint style="success" %}
All examples can be run on GPU servers rented through the [CLORE.AI Marketplace](https://clore.ai/marketplace).
{% endhint %}

***

## What Is LitGPT?

LitGPT provides production-ready implementations of state-of-the-art LLMs behind a unified training interface:

* **20+ supported models**: Llama 3, Gemma 2, Mistral, Phi-3, Falcon, StableLM, and more
* **Pretraining from scratch**: full pretraining with Flash Attention, FSDP, and gradient checkpointing
* **Efficient fine-tuning**: full fine-tuning, LoRA, QLoRA, and Adapter methods
* **Confident deployment**: built-in inference server with quantization
* **Multi-GPU support**: DDP, FSDP, and tensor parallelism out of the box
* **Memory efficiency**: 4-bit quantization, gradient checkpointing, activation checkpointing

***

## Server Requirements

| Component | Minimum          | Recommended       |
| --------- | ---------------- | ----------------- |
| GPU       | RTX 3090 (24 GB) | A100 80 GB / H100 |
| VRAM      | 16 GB (7B LoRA)  | 80 GB+ (70B full) |
| RAM       | 32 GB            | 64 GB+            |
| CPU       | 8 cores          | 16+ cores         |
| Storage   | 100 GB           | 500 GB+           |
| OS        | Ubuntu 20.04+    | Ubuntu 22.04      |
| Python    | 3.10+            | 3.11              |
| CUDA      | 11.8+            | 12.1+             |

### VRAM Requirements by Task

| Task               | Model       | VRAM               |
| ------------------ | ----------- | ------------------ |
| Inference (4-bit)  | Llama-3 8B  | \~6 GB             |
| LoRA fine-tuning   | Llama-3 8B  | \~16 GB            |
| Full fine-tuning   | Llama-3 8B  | \~80 GB            |
| LoRA fine-tuning   | Llama-3 70B | \~48 GB (2×A100)   |
| Full fine-tuning   | Llama-3 70B | \~640 GB (8×A100)  |
| QLoRA fine-tuning  | Llama-3 8B  | \~8 GB             |

***

## Ports

| Port | Service                 | Description                       |
| ---- | ----------------------- | --------------------------------- |
| 22   | SSH                     | Terminal access and file transfer |
| 8000 | LitGPT inference server | REST API for model serving        |

***

## Quick Start with Docker

```bash
# Pull the official LitGPT image
docker pull pytorchlightning/litgpt:latest

# Run a GPU-enabled container interactively
docker run -it --gpus all \
  -p 8000:8000 \
  -v "$(pwd)/checkpoints:/checkpoints" \
  -v "$(pwd)/data:/data" \
  pytorchlightning/litgpt:latest \
  bash

# Or run a specific command directly
docker run --gpus all \
  -v "$(pwd)/checkpoints:/checkpoints" \
  pytorchlightning/litgpt:latest \
  litgpt download --repo_id meta-llama/Llama-3.2-3B-Instruct
```

***

## Installation on Clore.ai

### Step 1: Rent a Server

1. Go to the [Clore.ai Marketplace](https://clore.ai/marketplace)
2. Filter by **VRAM ≥ 24 GB** (RTX 3090 or better)
3. Choose a **PyTorch** or **CUDA 12.1** base image
4. Open ports **22** and **8000** in your order settings
5. Select **storage ≥ 200 GB** for model weights

### Step 2: Connect via SSH

```bash
# Replace the placeholders with the address and SSH port shown in your order
ssh root@<server-ip> -p <ssh-port>
```

### Step 3: Install LitGPT

```bash
# Install via pip (recommended)
pip install litgpt

# Install all extras (quantization, server, etc.)
pip install 'litgpt[all]'

# Or install from source for the latest features
git clone https://github.com/Lightning-AI/litgpt.git
cd litgpt
pip install -e '.[all]'
```

### Step 4: Verify the Installation

```bash
litgpt --help
```

Expected output:

```
Usage: litgpt [OPTIONS] COMMAND [ARGS]...

Commands:
  chat      Chat with a model
  convert   Convert model weights
  download  Download model weights
  evaluate  Evaluate a model
  finetune  Fine-tune a model
  generate  Generate text
  pretrain  Pretrain a model
  serve     Serve a model for inference
```

***

## Downloading Models

LitGPT downloads models from Hugging Face:

```bash
# List available models
litgpt download --list

# Download Llama 3.2 3B (gated model, requires an HF token)
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --checkpoint_dir checkpoints/

# Download Mistral 7B (open access)
litgpt download \
  --repo_id mistralai/Mistral-7B-Instruct-v0.3

# Download Gemma 2 2B
litgpt download \
  --repo_id google/gemma-2-2b-it \
  --access_token your-hf-token

# Download Phi-3 (small but capable)
litgpt download \
  --repo_id microsoft/Phi-3-mini-4k-instruct
```

### Setting a HuggingFace Token

```bash
# For gated models (Llama, Gemma)
export HF_TOKEN=hf_your-token-here

# Or authenticate via the CLI
pip install huggingface_hub
huggingface-cli login
```

***

## Inference (Chat and Generation)

```bash
# Interactive chat
litgpt chat \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct

# Single generation
litgpt generate \
  --prompt "Explain GPU computing in simple terms" \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --max_new_tokens 200

# With temperature and nucleus sampling
litgpt generate \
  --prompt "Write a Python function that sorts a list" \
  --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.3 \
  --temperature 0.7 \
  --top_p 0.9 \
  --max_new_tokens 500
```

***

## Fine-Tuning

### LoRA Fine-Tuning (Recommended)

LoRA trains a small set of adapter parameters (typically 0.1–1% of the total weights) while the base model stays frozen. On an RTX 3090, LoRA fine-tuning of Llama 3 8B on 10K examples takes about 2 hours with `r=16`.
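The 0.1–1% figure can be sanity-checked with quick arithmetic. The sketch below assumes the standard LoRA formulation, in which a frozen `d_out × d_in` weight matrix gains two low-rank factors `A` (`r × d_in`) and `B` (`d_out × r`). The helper names and the 4096-wide square projection are illustrative assumptions; they are not part of LitGPT and are not read from any checkpoint.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, r: int) -> float:
    """Adapter parameters as a fraction of the frozen matrix they are attached to."""
    return lora_param_count(d_in, d_out, r) / (d_in * d_out)


# Hypothetical 4096 x 4096 projection (an assumption chosen for illustration)
for r in (8, 16):
    count = lora_param_count(4096, 4096, r)
    frac = lora_fraction(4096, 4096, r)
    print(f"r={r}: {count:,} adapter params, {frac:.2%} of the frozen matrix")
# r=8: 65,536 adapter params, 0.39% of the frozen matrix
# r=16: 131,072 adapter params, 0.78% of the frozen matrix
```

Doubling `r` doubles the adapter size. The fraction relative to the whole model is usually lower than these per-matrix numbers, because adapters are typically attached to only some of the weight matrices.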
```bash
# Prepare your dataset
# Format: JSON Lines, one {"instruction": "...", "input": "...", "output": "..."} object per line
cat > data/train.json << 'EOF'
{"instruction": "What is GPU cloud computing?", "input": "", "output": "GPU cloud computing provides on-demand access to GPU hardware over the internet, making AI training and inference possible without owning physical hardware."}
{"instruction": "How do I rent a GPU on Clore.ai?", "input": "", "output": "Visit clore.ai/marketplace, filter by GPU specs, choose a server, configure ports, then click rent. SSH access is available immediately."}
EOF

# Fine-tune with LoRA
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 3 \
  --train.micro_batch_size 4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --out_dir out/llama-lora-finetuned

# Monitor training
# LitGPT logs the loss, learning rate, and ETA
```

### QLoRA (4-bit + LoRA)

Use QLoRA to fine-tune large models on limited VRAM. Llama 3 8B runs on a single 24 GB RTX 3090:

```bash
# bitsandbytes quantization does not combine with mixed precision,
# so request true bf16 explicitly
litgpt finetune lora \
  --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --quantize bnb.nf4 \
  --precision bf16-true \
  --train.epochs 3 \
  --train.micro_batch_size 2 \
  --lora_r 16 \
  --lora_alpha 32 \
  --out_dir out/llama-qlora
```

### Full Fine-Tuning

```bash
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --data JSON \
  --data.json_path data/train.json \
  --train.epochs 2 \
  --train.micro_batch_size 2 \
  --train.accumulate_gradients 8 \
  --out_dir out/llama-full-finetuned
```

### Multi-GPU Training

```bash
# Use FSDP across multiple GPUs
litgpt finetune full \
  --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --devices 4 \
  --strategy fsdp \
  --train.epochs 3 \
  --out_dir out/llama-multigpu
```

***

## Deploying a Model (REST API)

```bash
# Start the inference server
litgpt serve \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# Test the API
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_new_tokens": 100,
    "temperature": 0.7
  }'
```

### Python Client

```python
import requests

# Replace <server-ip> with the address of your rented server
response = requests.post(
    "http://<server-ip>:8000/predict",
    json={
        "prompt": "Explain reinforcement learning",
        "max_new_tokens": 500,
        "temperature": 0.8,
        "top_p": 0.9,
    },
)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json()["output"])
```

***

## Pretraining from Scratch

To train a custom LLM from scratch on your own data:

```bash
# Prepare the pretraining data (tokenized and chunked)
python scripts/prepare_redpajama.py \
  --source_path /data/raw_text \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --destination_path /data/tokenized

# Start pretraining
litgpt pretrain \
  --model_name Llama-3.2-3B \
  --data /data/tokenized \
  --train.micro_batch_size 4 \
  --train.max_tokens 10_000_000_000 \
  --devices 8 \
  --strategy fsdp \
  --out_dir out/my-pretrained-llm
```

***

## Converting and Exporting Models

```bash
# Merge the LoRA weights into the base model
litgpt merge_lora \
  --checkpoint_dir out/llama-lora-finetuned

# Convert to HuggingFace format for distribution
litgpt convert to_hf \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --output_dir hf_model/

# Export to GGUF (for Ollama / llama.cpp)
# After the HF export, use llama.cpp's conversion script
python llama.cpp/convert_hf_to_gguf.py hf_model/ --outfile model.gguf
```

***

## Evaluating Models

```bash
# Run the MMLU benchmark
litgpt evaluate \
  --checkpoint_dir checkpoints/meta-llama/Llama-3.2-3B-Instruct \
  --tasks mmlu \
  --num_fewshot 5

# Run multiple benchmarks
litgpt evaluate \
  --checkpoint_dir out/llama-lora-finetuned/final \
  --tasks "mmlu,hellaswag,truthfulqa_mc"
```

***

## GPU Recommendations for Clore.ai

LitGPT covers three distinct workloads (inference, LoRA fine-tuning, and full pretraining), and each places different demands on the GPU.

| Workload                                  | GPU            | VRAM  | Notes                                                              |
| ----------------------------------------- | -------------- | ----- | ------------------------------------------------------------------ |
| Inference / chat (7–8B models)            | **RTX 3090**   | 24 GB | Fits Llama 3 8B in bf16; generates at \~95 tokens/s                |
| LoRA fine-tuning (7–8B models)            | **RTX 3090**   | 24 GB | Budget pick; QLoRA keeps VRAM under 10 GB                          |
| LoRA fine-tuning (7–8B), fast iteration   | **RTX 4090**   | 24 GB | \~35% faster than the 3090; cuts a 2-hour job to about 1.5 hours   |
| Full fine-tuning (7B) or QLoRA (70B)      | **A100 40 GB** | 40 GB | 40 GB fits a 7B model at full precision or a 70B model in 4-bit    |
| Full fine-tuning (13B+) or pretraining    | **A100 80 GB** | 80 GB | Highest throughput; trains an 8B model at \~2,800 tokens/s         |

**Recommended for most users:** a pair of RTX 3090s (2×24 GB = 48 GB, pooled effectively with FSDP). This handles QLoRA on 70B models, or full fine-tuning of 7B models via tensor parallelism. Two 3090s cost about \~$0.25/hour on Clore.ai.

**For pretraining or fine-tuning models larger than 70B:** use 4×A100 80GB with FSDP. LitGPT's FSDP integration handles sharding transparently; just pass `--devices 4 --strategy fsdp`.
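To see roughly where sizing guidance like this comes from, the following sketch estimates the memory taken by model weights alone at a given precision and checks it against pooled VRAM. It is a back-of-the-envelope aid, not a LitGPT feature: the function names and the 0.8 headroom factor are assumptions made for illustration, and the estimate ignores activations, gradients, optimizer state, and the KV cache, all of which add substantially to real training footprints.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_param / 8


def weights_fit(n_params_billion: float, bits_per_param: int,
                gpu_gb: float, n_gpus: int = 1, headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the pooled VRAM.

    `headroom` is a hypothetical safety margin that leaves space for
    everything this estimate does not model.
    """
    needed = weight_memory_gb(n_params_billion, bits_per_param)
    return needed <= headroom * gpu_gb * n_gpus


print(weight_memory_gb(8, 16))           # 8B model in bf16
print(weight_memory_gb(70, 4))           # 70B model in 4-bit
print(weights_fit(8, 16, gpu_gb=24))     # 8B bf16 on one 24 GB card
print(weights_fit(70, 4, gpu_gb=24, n_gpus=2))  # 70B 4-bit on two 24 GB cards
print(weights_fit(70, 16, gpu_gb=80))    # 70B bf16 on one 80 GB card
```

Under these assumptions an 8B model in bf16 needs 16 GB for weights and a 70B model in 4-bit needs 35 GB, which matches the reasoning behind pairing two 24 GB cards for 70B QLoRA. Treat a `True` result as a necessary condition only; confirm actual usage with `nvidia-smi` during a short trial run.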
***

## Troubleshooting

### CUDA Out of Memory

```bash
# Reduce the batch size
--train.micro_batch_size 1

# Enable gradient checkpointing
--train.gradient_checkpointing true

# Use QLoRA instead of LoRA
--quantize bnb.nf4

# Check GPU memory
nvidia-smi
```

### Download Fails / HuggingFace 401

```bash
# Set the HF token
export HF_TOKEN=hf_your-token-here
huggingface-cli login

# Or pass it directly
litgpt download \
  --repo_id meta-llama/Llama-3.2-3B-Instruct \
  --access_token hf_your-token
```

### Training Loss Does Not Decrease

```bash
# Check your data format: every line must be valid JSON (JSON Lines)
python -c "
import json
with open('data/train.json') as f:
    for i, line in enumerate(f):
        json.loads(line)
        if i < 3:
            print(f'Line {i}: OK')
print('All lines valid')
"

# Lower the learning rate
--train.lr 1e-5  # the default is often too high for small datasets

# Check the dataset size: LoRA typically needs at least 100–1000 examples
wc -l data/train.json
```

### Server Port 8000 Is Unreachable

```bash
# Verify that the server is listening
ss -tlnp | grep 8000

# Open the firewall
ufw allow 8000/tcp

# Restart the server with an explicit host
litgpt serve \
  --checkpoint_dir checkpoints/... \
  --host 0.0.0.0 \
  --port 8000
```

### Multi-GPU Training Hangs

```bash
# Confirm that PyTorch sees every GPU
python -c "import torch; print(torch.cuda.device_count())"

# For smaller models, try DDP instead of FSDP
--strategy ddp

# Set NCCL environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # if there is no InfiniBand
```

***

## Useful Links

* **GitHub**: [https://github.com/Lightning-AI/litgpt](https://github.com/Lightning-AI/litgpt) ⭐ 12K+
* **Documentation**:
* **PyTorch Lightning**:
* **HuggingFace Models**:
* **Discord**:
* **Clore.ai Marketplace**: [https://clore.ai/marketplace](https://clore.ai/marketplace)