# LTX-2 (Audio + Video)

LTX-2 (January 2026) is Lightricks' second-generation video foundation model and the first open-weight model to produce **synchronized audio alongside video** in a single forward pass. At 19B parameters it generates clips with foley sound effects, ambient audio, and lip-synced speech without requiring a separate audio model. The architecture builds on the original LTX-Video's speed advantage while dramatically expanding capability.

Renting a GPU on [Clore.ai](https://clore.ai/) is the most practical way to run a 19B-parameter model — no $2,000 GPU purchase required, just spin up a machine and start generating.

## Key Features

* **Native audio generation** — foley effects, environmental ambience, and lip-synced dialogue produced jointly with video frames.
* **19B parameters** — significantly larger transformer backbone than LTX-Video v1, delivering sharper detail and more coherent motion.
* **Text-to-Video + Image-to-Video** — both modalities supported with audio output.
* **Up to 720p resolution** — higher fidelity output than the v1 model.
* **Joint audio-visual latent space** — a unified VAE encodes both video and audio, keeping them temporally aligned.
* **Open weights** — released under a permissive license for commercial use.
* **Diffusers integration** — compatible with the Hugging Face `diffusers` ecosystem.

## Requirements

| Component  | Minimum                 | Recommended |
| ---------- | ----------------------- | ----------- |
| GPU VRAM   | 16 GB (with offloading) | 24+ GB      |
| System RAM | 32 GB                   | 64 GB       |
| Disk       | 50 GB                   | 80 GB       |
| Python     | 3.10+                   | 3.11        |
| CUDA       | 12.1+                   | 12.4        |
| diffusers  | 0.33+                   | latest      |

**Clore.ai GPU recommendation:** An **RTX 4090** (24 GB, \~$0.5–2/day) is the minimum for comfortable 720p generation with audio. For batch workloads or faster iteration, filter for **dual-4090** or **A6000** (48 GB) listings on the Clore.ai marketplace.
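Before pulling the \~40 GB checkpoint, it is worth checking the rented machine against the RAM row of the table above. A minimal preflight sketch (POSIX-only, using `os.sysconf`; `system_ram_gb` is a hypothetical helper name):

```python
# Preflight sketch: check system RAM against the 64 GB recommendation
# before downloading the checkpoint. POSIX-only (matches a typical
# Clore.ai Linux container).
import os

def system_ram_gb():
    pages = os.sysconf("SC_PHYS_PAGES")
    page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size / 1024**3

ram = system_ram_gb()
status = "OK" if ram >= 64 else "below the 64 GB recommendation"
print(f"System RAM: {ram:.1f} GB ({status})")
```

Run this once after renting; if it reports well under 64 GB, CPU offloading may swap heavily or fail.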

## Quick Start

```bash
# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece
pip install "imageio[ffmpeg]" soundfile scipy  # quotes keep zsh from globbing the brackets

# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_properties(0).total_memory // 1024**3, 'GB')"
```

## Usage Examples

### Text-to-Video with Audio

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
import soundfile as sf

# Load LTX-2 (ensure you have the correct model ID when released)
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2",
    torch_dtype=torch.bfloat16,
)
# enable_model_cpu_offload() manages device placement itself;
# do not combine it with pipe.to("cuda")
pipe.enable_model_cpu_offload()

prompt = (
    "A blacksmith hammering glowing metal on an anvil, sparks flying, "
    "rhythmic clanging of hammer on steel, ambient workshop noise"
)

output = pipe(
    prompt=prompt,
    negative_prompt="silent, blurry, low quality",
    num_frames=121,
    width=1280,
    height=720,
    num_inference_steps=40,
    guidance_scale=7.0,
    generator=torch.Generator("cuda").manual_seed(42),
)

# Export video frames
export_to_video(output.frames[0], "blacksmith.mp4", fps=24)

# Export audio if the pipeline returns it (the attribute name and sample
# rate may differ in the released API -- check the model card)
if hasattr(output, "audio") and output.audio is not None:
    sf.write("blacksmith_audio.wav", output.audio, samplerate=16000)  # assumed rate
    print("Audio saved separately — mux with ffmpeg:")
    print("  ffmpeg -i blacksmith.mp4 -i blacksmith_audio.wav -c:v copy -c:a aac output.mp4")

print("Done: blacksmith.mp4")
```
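The ffmpeg command printed above can also be wrapped in a small helper so batch scripts can mux each clip automatically. A sketch assuming `ffmpeg` is on `PATH` (`mux` and `mux_cmd` are hypothetical names, not part of the LTX-2 API):

```python
# Hypothetical helper: mux separately-exported video and audio with ffmpeg.
import shutil
import subprocess

def mux_cmd(video_path, audio_path, out_path):
    # -shortest trims to the shorter stream, avoiding trailing silence or frames
    return ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
            "-c:v", "copy", "-c:a", "aac", "-shortest", out_path]

def mux(video_path, audio_path, out_path):
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(mux_cmd(video_path, audio_path, out_path), check=True)
```

`-c:v copy` avoids re-encoding the video stream, so muxing takes well under a second per clip.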

### Image-to-Video with Lip-Sync Audio

```python
import torch
from PIL import Image
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video-2",
    torch_dtype=torch.bfloat16,
)
# enable_model_cpu_offload() manages device placement itself;
# do not combine it with pipe.to("cuda")
pipe.enable_model_cpu_offload()

# Portrait image for lip-sync (note: a plain resize distorts the face when
# the source aspect ratio differs from 720x1280)
image = Image.open("portrait.png").convert("RGB").resize((720, 1280))

output = pipe(
    prompt="A person saying 'Welcome to the future of AI video' with clear enunciation, neutral background",
    image=image,
    num_frames=121,
    num_inference_steps=40,
    guidance_scale=7.0,
)

export_to_video(output.frames[0], "talking_head.mp4", fps=24)
```
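When the source portrait is not already 9:16, a resize-and-center-crop preserves the face's proportions instead of stretching them. A small PIL-only sketch (`fit_image` is a hypothetical helper; the 720x1280 defaults match the example above):

```python
# Sketch: scale the image so it covers the target frame, then center-crop.
import math
from PIL import Image

def fit_image(img, target_w=720, target_h=1280):
    # Scale by the larger ratio so both dimensions cover the target
    scale = max(target_w / img.width, target_h / img.height)
    resized = img.resize((math.ceil(img.width * scale),
                          math.ceil(img.height * scale)))
    # Center-crop the overhang on the longer axis
    left = (resized.width - target_w) // 2
    top = (resized.height - target_h) // 2
    return resized.crop((left, top, left + target_w, top + target_h))
```

Cropping away the edges is usually preferable to letterboxing for talking-head inputs, since the model then animates only face and background rather than black bars.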

### Ambient Scene with Foley

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video-2", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # needed on 24 GB cards for a 19B model

# Audio-rich prompt — describe sounds explicitly
prompt = (
    "Rain falling on a tin roof in a tropical village, "
    "thunder rumbling in the distance, birds briefly chirping between rolls, "
    "puddles rippling on a dirt path"
)

output = pipe(
    prompt=prompt,
    num_frames=121,
    width=1280,
    height=720,
    num_inference_steps=40,
    guidance_scale=6.5,
)

export_to_video(output.frames[0], "rain_scene.mp4", fps=24)
```
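All three examples use `num_frames=121` at 24 fps, i.e. roughly five seconds. To target other durations, a frame-count helper can encode the constraint that LTX-family pipelines expect `num_frames` of the form 8k + 1 — an assumption carried over from LTX-Video v1 that should be verified against the LTX-2 model card:

```python
# Sketch: choose a frame count for a target clip duration.
# The "8k + 1" shape is an assumption inherited from LTX-Video v1 --
# verify it against the LTX-2 model card before relying on it.
def frames_for_duration(seconds, fps=24):
    n = round(seconds * fps)
    return (n // 8) * 8 + 1

print(frames_for_duration(5))  # 121, the value used in the examples above
```

Rounding down to the nearest valid count keeps the clip at or just under the requested duration.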

## Tips for Clore.ai Users

1. **Describe sounds explicitly** — LTX-2's audio branch responds to audio cues in the prompt. "Crackling fire", "footsteps on gravel", "crowd murmuring" yield better foley than vague descriptions.
2. **CPU offloading is essential** — at 19B parameters, the model needs `enable_model_cpu_offload()` on 24 GB cards. Budget 64 GB system RAM.
3. **Persistent storage** — the model checkpoint is \~40 GB. Mount a Clore.ai persistent volume and set `HF_HOME` to avoid re-downloading on every container restart.
4. **Mux audio + video** — if the pipeline outputs audio separately, combine with: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac final.mp4`.
5. **bf16 only** — the 19B model was trained in bf16; fp16 will cause numerical instability.
6. **Batch in tmux** — always run inside `tmux` on Clore.ai rentals to survive SSH disconnects.
7. **Check model ID** — as LTX-2 is freshly released (Jan 2026), verify the exact Hugging Face model ID on the [Lightricks HF page](https://huggingface.co/Lightricks) before running.
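Tips 3 and 6 combine into a short setup script to run once per rental. A sketch assuming the persistent volume is mounted at `/workspace` (adjust `VOLUME` to your actual mount point; `generate.py` is a placeholder for your own script):

```shell
# Tip 3: point the Hugging Face cache at persistent storage so the ~40 GB
# checkpoint survives container restarts. VOLUME is an assumed mount point.
VOLUME="${VOLUME:-/workspace}"
export HF_HOME="$VOLUME/hf_cache"
mkdir -p "$HF_HOME" 2>/dev/null || echo "could not create $HF_HOME -- adjust VOLUME"
echo "HF cache directory: $HF_HOME"

# Tip 6: launch generation inside tmux so it survives SSH disconnects.
# tmux new-session -d -s ltx2 'python generate.py'
```

Remember to `export HF_HOME` in every new shell (or add it to `~/.bashrc`), otherwise `from_pretrained` falls back to the ephemeral default cache.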

## Troubleshooting

| Problem                          | Fix                                                                                              |
| -------------------------------- | ------------------------------------------------------------------------------------------------ |
| `OutOfMemoryError`               | Enable `pipe.enable_model_cpu_offload()`; ensure ≥64 GB system RAM                               |
| No audio in output               | Audio generation may require explicit flag or updated diffusers; check model card for latest API |
| Audio/video desync               | Re-mux with ffmpeg: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac -shortest out.mp4`      |
| Very slow generation             | 19B model is compute-heavy; \~2–4 min per 5-sec clip on RTX 4090 is expected                     |
| NaN outputs                      | Use `torch.bfloat16` — fp16 is not supported for this model scale                                |
| Disk space error                 | Model is \~40 GB; ensure ≥80 GB free disk before downloading                                     |
| `ModuleNotFoundError: soundfile` | `pip install soundfile` — needed for WAV audio export                                            |
