# LTX-2 (Audio + Video)

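LTX-2 (January 2026) is Lightricks' second-generation video foundation model and the first open-weight model to generate **audio synchronized with video** in a single forward pass. With 19 billion parameters, it produces clips with Foley effects, ambient sound, and lip-synced speech without a separate audio model. The architecture builds on the speed advantage of the original LTX-Video while substantially expanding its capabilities.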

Renting a GPU on [Clore.ai](https://clore.ai/) is the most practical way to run a 19B-parameter model: instead of buying a $2,000 GPU, spin up a machine and start generating.

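## Key features

* **Native audio generation**: Foley effects, ambient atmosphere, and lip-synced dialogue, generated jointly with the video frames.
* **19 billion parameters**: a much larger Transformer backbone than LTX-Video v1, for sharper detail and more coherent motion.
* **Text-to-video + image-to-video**: both modes are supported, and both output audio.
* **Up to 720p resolution**: higher-fidelity output than the v1 model.
* **Joint audio-visual latent space**: a unified VAE encodes video and audio together, keeping them temporally aligned.
* **Open weights**: released under a permissive license that allows commercial use.
* **Diffusers integration**: compatible with the Hugging Face `diffusers` ecosystem.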

## Requirements

| Component  | Minimum                         | Recommended |
| ---------- | ------------------------------- | ----------- |
| GPU VRAM   | 16 GB (with offloading enabled) | 24+ GB      |
| System RAM | 32 GB                           | 64 GB       |
| Disk       | 50 GB                           | 80 GB       |
| Python     | 3.10+                           | 3.11        |
| CUDA       | 12.1+                           | 12.4        |
| diffusers  | 0.33+                           | latest      |

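**GPU recommendation for Clore.ai:** A 24 GB card such as an **RTX 4090** (~$0.5–2/day) is the minimum for comfortable 720p generation with audio. For batch workloads or faster iteration, filter the Clore.ai marketplace listings for **dual 4090** or **A6000** (48 GB).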
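
To see why the VRAM figures above matter, here is a minimal back-of-the-envelope sketch of the weight footprint. It counts only the 19B parameters at a given precision and ignores activations, the VAE, and the text encoder, so real usage will be higher. The helper names are illustrative and not part of any library.

```python
# Rough weight-memory estimate (illustrative helpers, not a library API).
# Counts parameters only; activations, VAE and text encoder are ignored.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_footprint_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def needs_offload(num_params: float, vram_gib: float, dtype: str = "bfloat16") -> bool:
    """True when the weights alone already exceed the available VRAM."""
    return weight_footprint_gib(num_params, dtype) > vram_gib


if __name__ == "__main__":
    print(f"19B weights in bf16: {weight_footprint_gib(19e9):.1f} GiB")
    for vram in (16, 24, 48):
        print(f"{vram} GB card, offload needed: {needs_offload(19e9, vram)}")
```

By this estimate the bf16 weights alone come to roughly 35 GiB, which is why 16 GB and 24 GB cards depend on offloading while a 48 GB card has some headroom.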

## Quick start

```bash
# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece
pip install imageio[ffmpeg] soundfile scipy

# Verify the GPU (the device property is total_memory, reported in bytes)
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_properties(0).total_memory // 1024**3, 'GB')"
```

## Usage examples

### Text-to-video with audio

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
import soundfile as sf

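# Load LTX-2 (confirm the exact model ID on Hugging Face before running)
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2",
    torch_dtype=torch.bfloat16,
)
# Model CPU offload manages device placement itself, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

prompt = (
    "A blacksmith hammering glowing metal on an anvil, sparks flying, "
    "rhythmic clang of hammer on steel, ambient workshop noise"
)

output = pipe(
    prompt=prompt,
    negative_prompt="silence, blurry, low quality",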
    num_frames=121,
    width=1280,
    height=720,
    num_inference_steps=40,
    guidance_scale=7.0,
    generator=torch.Generator("cuda").manual_seed(42),
)

# Export the video frames
export_to_video(output.frames[0], "blacksmith.mp4", fps=24)

# Export audio if the pipeline output carries it
if hasattr(output, "audio") and output.audio is not None:
    sf.write("blacksmith_audio.wav", output.audio, samplerate=16000)
    print("Audio saved separately. Mux with ffmpeg:")
    print("  ffmpeg -i blacksmith.mp4 -i blacksmith_audio.wav -c:v copy -c:a aac output.mp4")

print("Done: blacksmith.mp4")
```
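
If the pipeline does return audio separately, the ffmpeg step printed above can be scripted. The following is a minimal sketch, not part of the LTX-2 or diffusers API: `build_mux_command` and `mux` are hypothetical helper names, and the `-y` (overwrite) and `-shortest` flags are choices made here rather than requirements.

```python
import shutil
import subprocess


def build_mux_command(video: str, audio: str, out: str) -> list[str]:
    """Build an ffmpeg call that copies the video stream and encodes audio as AAC."""
    return [
        "ffmpeg", "-y",      # -y: overwrite the output file if it exists
        "-i", video,
        "-i", audio,
        "-c:v", "copy",      # keep the video stream untouched
        "-c:a", "aac",       # encode the WAV audio as AAC
        "-shortest",         # stop at the end of the shorter stream
        out,
    ]


def mux(video: str, audio: str, out: str) -> None:
    """Run ffmpeg; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_mux_command(video, audio, out), check=True)
```

Calling `mux("blacksmith.mp4", "blacksmith_audio.wav", "output.mp4")` after the export step would produce a single file with both streams.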

### Image-to-video with lip-synced audio

```python
import torch
from PIL import Image
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video-2",
    torch_dtype=torch.bfloat16,
)
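# Model CPU offload manages device placement itself, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

# Portrait image for lip sync; PIL's resize takes (width, height)
image = Image.open("portrait.png").resize((720, 1280))

output = pipe(
    prompt="A person speaking clearly, saying 'Welcome to the future of AI video', neutral background",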
    image=image,
    num_frames=121,
    num_inference_steps=40,
    guidance_scale=7.0,
)

export_to_video(output.frames[0], "talking_head.mp4", fps=24)
```

### Ambient scene with Foley

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

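# Moving the whole pipeline to the GPU needs more VRAM than a 24 GB card offers
# (19B parameters in bf16); on 24 GB cards use pipe.enable_model_cpu_offload() instead
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2", torch_dtype=torch.bfloat16
).to("cuda")

# Audio-rich prompt: describe the sounds explicitly
prompt = (
    "Rain falling on a tin roof in a tropical village, "
    "distant thunder rumbling, brief birdsong between thunderclaps, "
    "puddles rippling on a dirt path"
)

output = pipe(
    prompt=prompt,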
    num_frames=121,
    width=1280,
    height=720,
    num_inference_steps=40,
    guidance_scale=6.5,
)

export_to_video(output.frames[0], "rain_scene.mp4", fps=24)
```
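
The examples above request `num_frames=121` and export at `fps=24`, which works out to about five seconds of video. The sketch below shows that arithmetic and a helper for picking a frame count for other durations. It assumes frame counts of the form 8k + 1, the pattern the original LTX-Video expects; whether LTX-2 keeps that constraint is an assumption to check against the model card. Both function names are illustrative.

```python
def clip_seconds(num_frames: int, fps: int = 24) -> float:
    """Duration of a clip in seconds."""
    return num_frames / fps


def frames_for(seconds: float, fps: int = 24) -> int:
    """Frame count of the form 8k + 1 closest to the requested duration.

    The 8k + 1 pattern is carried over from LTX-Video v1 (assumption for LTX-2).
    """
    k = max(1, round(seconds * fps / 8))
    return 8 * k + 1


if __name__ == "__main__":
    print(f"121 frames at 24 fps: {clip_seconds(121):.2f} s")
    print(f"Frames for a 5 s clip: {frames_for(5)}")
```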

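## Tips for Clore.ai users

1. **Describe sounds explicitly**: LTX-2's audio branch responds to audio cues in the prompt. Phrases like "crackling firewood", "footsteps on gravel", or "crowd murmuring" produce better Foley than vague descriptions.
2. **CPU offload is required on 24 GB cards**: at 19 billion parameters, the model needs `enable_model_cpu_offload()` to fit on a 24 GB card. Budget 64 GB of system RAM.
3. **Persistent storage**: the model checkpoint is about 40 GB. Mount a Clore.ai persistent volume and set `HF_HOME` to avoid re-downloading on every container restart.
4. **Mux audio and video**: if the pipeline outputs audio separately, combine the two with `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac final.mp4`.
5. **bf16 only**: the 19B model was trained in bf16; running it in fp16 causes numerical instability.
6. **Batch in tmux**: always run jobs inside `tmux` on Clore.ai rentals so they keep running if the SSH connection drops.
7. **Check the model ID**: LTX-2 is newly released (January 2026), so verify the exact Hugging Face model ID on the [Lightricks HF page](https://huggingface.co/Lightricks) before running.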
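
The persistent-storage tip can be applied from Python as well as from the shell. This is a minimal sketch: `use_persistent_cache` is a hypothetical helper, and `/workspace/hf-cache` is a placeholder mount point, not a path Clore.ai guarantees. The Hugging Face libraries read `HF_HOME` when they are imported, so the variable has to be set first.

```python
import os
from pathlib import Path


def use_persistent_cache(cache_dir: str) -> Path:
    """Point the Hugging Face cache at cache_dir, creating it if needed."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(path)
    return path


# Call this before importing diffusers or transformers, for example:
# use_persistent_cache("/workspace/hf-cache")  # placeholder mount point
# from diffusers import LTXPipeline
```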

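## Troubleshooting

| Problem                          | Fix                                                                                                                        |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `OutOfMemoryError`               | Enable `pipe.enable_model_cpu_offload()`; make sure system RAM is ≥64 GB                                                   |
| No audio in output               | Audio generation may require an explicit flag or a newer diffusers release; check the model card for current API details   |
| Audio and video out of sync      | Re-mux with ffmpeg: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac -shortest out.mp4`                                |
| Generation is very slow          | The 19B model is compute-heavy; expect roughly 2–4 minutes for a 5-second clip on an RTX 4090                              |
| NaN outputs                      | Use `torch.bfloat16`; fp16 is not supported at this model scale                                                            |
| Disk space errors                | The model is about 40 GB; make sure ≥80 GB of disk is free before downloading                                              |
| `ModuleNotFoundError: soundfile` | `pip install soundfile`; required for exporting WAV audio                                                                  |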
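
Disk space errors are easier to catch before a 40 GB download starts. The following is a small preflight sketch using only the standard library; the function names and the 80 GB default (taken from the table above) are illustrative.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Free space on the filesystem that holds path, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path: str = ".", required_gib: float = 80.0) -> bool:
    """True when at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    print(f"Free space: {free_gib('.'):.1f} GiB, enough for LTX-2: {has_room('.')}")
```

Pointing `path` at the directory used for `HF_HOME` checks the volume where the checkpoint will actually land.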

