# LTX-2 (Audio + Video)

LTX-2 (January 2026) is Lightricks' second-generation video foundation model and the first open-weight model to produce **synchronized audio alongside video** in a single forward pass. At 19B parameters it generates clips with foley sound effects, ambient audio, and lip-synced speech without requiring a separate audio model. The architecture builds on the original LTX-Video's speed advantage while dramatically expanding capability.

Renting a GPU on [Clore.ai](https://clore.ai/) is the most practical way to run a 19B-parameter model: instead of a $2,000 GPU purchase, just spin up a machine and start generating.

## Key Features

* **Native audio generation** — foley effects, environmental ambience, and lip-synced dialogue produced jointly with video frames.
* **19B parameters** — significantly larger transformer backbone than LTX-Video v1, delivering sharper detail and more coherent motion.
* **Text-to-Video + Image-to-Video** — both modalities supported with audio output.
* **Up to 720p resolution** — higher fidelity output than the v1 model.
* **Joint audio-visual latent space** — a unified VAE encodes both video and audio, keeping them temporally aligned.
* **Open weights** — released under a permissive license for commercial use.
* **Diffusers integration** — compatible with the Hugging Face `diffusers` ecosystem.

## Requirements

| Component  | Minimum                 | Recommended |
| ---------- | ----------------------- | ----------- |
| GPU VRAM   | 16 GB (with offloading) | 24+ GB      |
| System RAM | 32 GB                   | 64 GB       |
| Disk       | 50 GB                   | 80 GB       |
| Python     | 3.10+                   | 3.11        |
| CUDA       | 12.1+                   | 12.4        |
| diffusers  | 0.33+                   | latest      |

**Clore.ai GPU recommendation:** An **RTX 4090** (24 GB, \~$0.5–2/day) is the minimum for comfortable 720p generation with audio. For batch workloads or faster iteration, filter for **dual-4090** or **A6000** (48 GB) listings on the Clore.ai marketplace.

## Quick Start

```bash
# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece
pip install imageio[ffmpeg] soundfile scipy

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_properties(0).total_memory // 1024**3, 'GB')"
```
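
LTX-2 should be run in bf16 (see tip 5 below), so before pulling \~40 GB of weights it is worth confirming the rented card actually supports it. Ampere and newer GPUs (RTX 30/40 series, A6000) report compute capability 8.x and handle bf16 natively:

```python
import torch

# bf16 needs compute capability >= 8.0 (Ampere or newer); older
# rentals will be slow at best and numerically unstable at worst
assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("bf16 supported:", torch.cuda.is_bf16_supported())
```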

## Usage Examples

### Text-to-Video with Audio

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
import soundfile as sf

# Load LTX-2 (verify the exact model ID on the Lightricks HF page; see tip 7)
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2",
    torch_dtype=torch.bfloat16,
)
# enable_model_cpu_offload() manages device placement itself;
# calling pipe.to("cuda") as well would defeat the offloading
pipe.enable_model_cpu_offload()

prompt = (
    "A blacksmith hammering glowing metal on an anvil, sparks flying, "
    "rhythmic clanging of hammer on steel, ambient workshop noise"
)

output = pipe(
    prompt=prompt,
    negative_prompt="silent, blurry, low quality",
    num_frames=121,
    width=1280,
    height=720,
    num_inference_steps=40,
    guidance_scale=7.0,
    generator=torch.Generator("cuda").manual_seed(42),
)

# Export video frames
export_to_video(output.frames[0], "blacksmith.mp4", fps=24)

# Export audio if available
if hasattr(output, "audio") and output.audio is not None:
    sf.write("blacksmith_audio.wav", output.audio, samplerate=16000)
    print("Audio saved separately — mux with ffmpeg:")
    print("  ffmpeg -i blacksmith.mp4 -i blacksmith_audio.wav -c:v copy -c:a aac output.mp4")

print("Done: blacksmith.mp4")
```
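
If you would rather mux from the same script than paste the printed command, here is a minimal sketch using `subprocess` (it assumes an `ffmpeg` binary is on the `PATH` and reuses the filenames from the example above):

```python
import subprocess

# Combine the video and audio streams into one file; -shortest trims
# the longer stream so the two stay aligned at the end
subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "blacksmith.mp4",
        "-i", "blacksmith_audio.wav",
        "-c:v", "copy",
        "-c:a", "aac",
        "-shortest",
        "blacksmith_final.mp4",
    ],
    check=True,
)
```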

### Image-to-Video with Lip-Sync Audio

```python
import torch
from PIL import Image
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video
import soundfile as sf

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video-2",
    torch_dtype=torch.bfloat16,
)
# as above, let enable_model_cpu_offload() handle device placement
pipe.enable_model_cpu_offload()

# Portrait image for lip-sync (resize takes width, height)
image = Image.open("portrait.png").convert("RGB").resize((720, 1280))

output = pipe(
    prompt="A person saying 'Welcome to the future of AI video' with clear enunciation, neutral background",
    image=image,
    num_frames=121,
    width=720,   # keep the output dimensions consistent with the input image
    height=1280,
    num_inference_steps=40,
    guidance_scale=7.0,
)

export_to_video(output.frames[0], "talking_head.mp4", fps=24)

# Save the speech track if present, then mux with ffmpeg as in the
# text-to-video example above
if hasattr(output, "audio") and output.audio is not None:
    sf.write("talking_head_audio.wav", output.audio, samplerate=16000)
```

### Ambient Scene with Foley

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
import soundfile as sf

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2", torch_dtype=torch.bfloat16
).to("cuda")  # on 24 GB cards, use pipe.enable_model_cpu_offload() instead

# Audio-rich prompt — describe sounds explicitly
prompt = (
    "Rain falling on a tin roof in a tropical village, "
    "thunder rumbling in the distance, birds briefly chirping between rolls, "
    "puddles rippling on a dirt path"
)

output = pipe(
    prompt=prompt,
    num_frames=121,
    width=1280,
    height=720,
    num_inference_steps=40,
    guidance_scale=6.5,
)

export_to_video(output.frames[0], "rain_scene.mp4", fps=24)

# Save the foley track if the pipeline returned one
if hasattr(output, "audio") and output.audio is not None:
    sf.write("rain_audio.wav", output.audio, samplerate=16000)
```
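
For batch workloads on a dual-4090 or A6000 rental, a simple seed sweep over the same prompt is an easy way to produce several takes and keep the best one. A sketch that reuses the `pipe` and `prompt` defined just above:

```python
# Sweep a few seeds; reusing the loaded pipe avoids paying the
# ~40 GB checkpoint load more than once
for seed in (0, 1, 2, 3):
    out = pipe(
        prompt=prompt,
        num_frames=121,
        width=1280,
        height=720,
        num_inference_steps=40,
        guidance_scale=6.5,
        generator=torch.Generator("cuda").manual_seed(seed),
    )
    export_to_video(out.frames[0], f"rain_scene_seed{seed}.mp4", fps=24)
```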

## Tips for Clore.ai Users

1. **Describe sounds explicitly** — LTX-2's audio branch responds to audio cues in the prompt. "Crackling fire", "footsteps on gravel", "crowd murmuring" yield better foley than vague descriptions.
2. **CPU offloading is essential** — at 19B parameters, the model needs `enable_model_cpu_offload()` on 24 GB cards. Budget 64 GB system RAM.
3. **Persistent storage** — the model checkpoint is \~40 GB. Mount a Clore.ai persistent volume and set `HF_HOME` to avoid re-downloading on every container restart (see the sketch after this list).
4. **Mux audio + video** — if the pipeline outputs audio separately, combine with: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac final.mp4`.
5. **bf16 only** — the 19B model was trained in bf16; fp16 will cause numerical instability.
6. **Batch in tmux** — always run inside `tmux` on Clore.ai rentals to survive SSH disconnects.
7. **Check model ID** — as LTX-2 is freshly released (Jan 2026), verify the exact Hugging Face model ID on the [Lightricks HF page](https://huggingface.co/Lightricks) before running.
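
A minimal sketch for tip 3, pointing the Hugging Face cache at a persistent volume (the mount path `/mnt/persistent` is a placeholder; substitute wherever your Clore.ai volume is mounted):

```python
import os

# Set the cache location BEFORE importing diffusers/transformers,
# which read HF_HOME from the environment at import time
os.environ["HF_HOME"] = "/mnt/persistent/hf"  # placeholder mount path

import torch
from diffusers import LTXPipeline

# Weights now download into the persistent volume and survive restarts
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2", torch_dtype=torch.bfloat16
)
```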

## Troubleshooting

| Problem                          | Fix                                                                                              |
| -------------------------------- | ------------------------------------------------------------------------------------------------ |
| `OutOfMemoryError`               | Enable `pipe.enable_model_cpu_offload()`; ensure ≥64 GB system RAM                               |
| No audio in output               | Audio generation may require an explicit flag or a newer diffusers release; check the model card for the latest API |
| Audio/video desync               | Re-mux with ffmpeg: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac -shortest out.mp4`      |
| Very slow generation             | 19B model is compute-heavy; \~2–4 min per 5-sec clip on RTX 4090 is expected                     |
| NaN outputs                      | Use `torch.bfloat16` — fp16 is not supported for this model scale                                |
| Disk space error                 | Model is \~40 GB; ensure ≥80 GB free disk before downloading                                     |
| `ModuleNotFoundError: soundfile` | `pip install soundfile` — needed for WAV audio export                                            |
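
If offloading alone does not clear the `OutOfMemoryError`, `diffusers` has heavier-handed switches. A hedged sketch, with `pipe` loaded as in the examples above; sequential offload is a drop-in replacement for model offload, while VAE tiling/slicing exist on most diffusers video VAEs but are an assumption for LTX-2 until the model card confirms them:

```python
# Sequential offload streams weights to the GPU layer by layer:
# much slower per step, but the VRAM footprint drops sharply.
# Use it INSTEAD of enable_model_cpu_offload(), not in addition.
pipe.enable_sequential_cpu_offload()

# Decode latents in chunks rather than all at once; guarded because
# LTX-2's VAE exposing these methods is unverified
if hasattr(pipe.vae, "enable_tiling"):
    pipe.vae.enable_tiling()
if hasattr(pipe.vae, "enable_slicing"):
    pipe.vae.enable_slicing()
```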


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.clore.ai/guides/video-generation/ltx-video-2.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
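
For example, a minimal query from Python; `requests` URL-encodes the question into the `ask` parameter, matching the `GET` shape above (the question string itself is just an illustration):

```python
import requests

# Ask the docs endpoint a specific, self-contained question
resp = requests.get(
    "https://docs.clore.ai/guides/video-generation/ltx-video-2.md",
    params={"ask": "Which Clore.ai GPUs have at least 48 GB of VRAM?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)
```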
