LTX-2 (Audio + Video)

Generate videos with native audio — foley, ambience, and lip-sync — using LTX-2 on Clore.ai GPUs.

LTX-2 (January 2026) is Lightricks' second-generation video foundation model and the first open-weight model to produce synchronized audio alongside video in a single forward pass. At 19B parameters it generates clips with foley sound effects, ambient audio, and lip-synced speech without requiring a separate audio model. The architecture builds on the original LTX-Video's speed advantage while dramatically expanding capability.

Renting a GPU on Clore.ai is the most practical way to run a 19B-parameter model — no $2,000 GPU purchase required, just spin up a machine and start generating.

Key Features

  • Native audio generation — foley effects, environmental ambience, and lip-synced dialogue produced jointly with video frames.

  • 19B parameters — significantly larger transformer backbone than LTX-Video v1, delivering sharper detail and more coherent motion.

  • Text-to-Video + Image-to-Video — both modalities supported with audio output.

  • Up to 720p resolution — higher fidelity output than the v1 model.

  • Joint audio-visual latent space — a unified VAE encodes both video and audio, keeping them temporally aligned.

  • Open weights — released under a permissive license for commercial use.

  • Diffusers integration — compatible with the Hugging Face diffusers ecosystem.

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 16 GB (with offloading) | 24+ GB |
| System RAM | 32 GB | 64 GB |
| Disk | 50 GB | 80 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 12.1+ | 12.4 |
| diffusers | 0.33+ | latest |

Clore.ai GPU recommendation: An RTX 4090 (24 GB, ~$0.5–2/day) is the minimum for comfortable 720p generation with audio. For batch workloads or faster iteration, filter for dual-4090 or A6000 (48 GB) listings on the Clore.ai marketplace.

Quick Start
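The exact dependency set for LTX-2 is not pinned down yet (the model is newly released), but a typical diffusers setup on a fresh Clore.ai rental looks like the sketch below; verify package versions against the model card. The `HF_HOME` path is an example — point it at your own persistent volume.

```shell
# Assumed setup for a fresh Clore.ai Ubuntu rental -- versions follow the
# Requirements table above; confirm extras against the LTX-2 model card.
pip install "torch>=2.3" --index-url https://download.pytorch.org/whl/cu124
pip install -U "diffusers>=0.33" transformers accelerate safetensors
pip install soundfile "imageio[ffmpeg]"   # WAV export + video writing

# Point the Hugging Face cache at persistent storage so the ~40 GB
# checkpoint survives container restarts (path is an example).
export HF_HOME=/workspace/hf-cache
```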

Usage Examples

Text-to-Video with Audio
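A minimal sketch of text-to-video with audio via diffusers. LTX-2 is freshly released, so the model ID (`Lightricks/LTX-2`), the call parameters, and the audio-return convention below are assumptions — confirm all of them on the Lightricks model card before running.

```python
# Sketch: text-to-video with joint audio. Model ID and the audio attribute
# name are assumptions -- verify both on the Lightricks HF model card.

def frames_for(seconds: float, fps: int = 24) -> int:
    """LTX-style pipelines expect num_frames on an 8k+1 grid (assumed here)."""
    return (round(seconds * fps) // 8) * 8 + 1

def generate(prompt: str, seconds: float = 5.0, out: str = "clip"):
    # Heavy imports live inside the function so the helper above stays
    # importable on machines without a GPU stack installed.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2",           # assumed ID -- check before running
        torch_dtype=torch.bfloat16,   # bf16 only; fp16 is unstable at 19B
    )
    pipe.enable_model_cpu_offload()   # required on 24 GB cards

    result = pipe(prompt=prompt, num_frames=frames_for(seconds),
                  height=704, width=1280)
    export_to_video(result.frames[0], f"{out}.mp4", fps=24)

    # If the pipeline returns audio separately (API not yet fixed),
    # write it out and mux with ffmpeg as described in the tips below.
    audio = getattr(result, "audios", None)
    if audio is not None:
        import soundfile as sf
        sf.write(f"{out}.wav", audio[0], 44100)

if __name__ == "__main__":
    generate("A campfire at night, logs crackling, wind rustling the trees")
```

Note the explicit sound cues ("crackling", "rustling") in the prompt — per the tips below, the audio branch keys off them.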

Image-to-Video with Lip-Sync Audio
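A sketch of image-to-video with lip-synced speech. The image-conditioning entry point for LTX-2 is not yet documented, so the `image=` parameter and the model ID are assumptions; the dimension-snapping helper assumes the model wants multiples of 32, as earlier LTX pipelines did.

```python
# Sketch: image-to-video with lip-synced dialogue. The `image=` parameter,
# the model ID, and the 32-px grid are assumptions -- check the model card.

def snap(px: int, multiple: int = 32) -> int:
    """Snap a dimension down to the model's grid (multiples of 32 assumed)."""
    return max(multiple, (px // multiple) * multiple)

def animate(image_path: str, prompt: str, out: str = "talking.mp4"):
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    image = load_image(image_path)
    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2", torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()

    # Quote the dialogue in the prompt; the audio branch lip-syncs to it.
    result = pipe(prompt=prompt, image=image,
                  height=snap(image.height), width=snap(image.width),
                  num_frames=121)
    export_to_video(result.frames[0], out, fps=24)

if __name__ == "__main__":
    animate("portrait.png",
            'A woman faces the camera and says: "Welcome to the channel!"')
```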

Ambient Scene with Foley
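For ambience, the main lever is the prompt: list the sounds you want explicitly (see the tips below). A small helper makes that deliberate; the pipeline call mirrors the other examples, with the same assumed model ID.

```python
# Sketch: ambient scene with foley. The helper appends explicit audio cues,
# which LTX-2's audio branch responds to; model ID is an assumption.

def foley_prompt(scene: str, sounds: list) -> str:
    """Append explicit audio cues to a visual description."""
    return f"{scene} Audio: {', '.join(sounds)}."

def render_ambience(out: str = "ambience.mp4"):
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    prompt = foley_prompt(
        "A rainy cobblestone street at dusk, neon signs reflected in puddles.",
        ["steady rain on stone", "distant thunder", "footsteps on gravel"],
    )
    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2", torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()
    result = pipe(prompt=prompt, num_frames=121, height=704, width=1280)
    export_to_video(result.frames[0], out, fps=24)

if __name__ == "__main__":
    render_ambience()
```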

Tips for Clore.ai Users

  1. Describe sounds explicitly — LTX-2's audio branch responds to audio cues in the prompt. "Crackling fire", "footsteps on gravel", "crowd murmuring" yield better foley than vague descriptions.

  2. CPU offloading is essential — at 19B parameters, the model needs enable_model_cpu_offload() on 24 GB cards. Budget 64 GB system RAM.

  3. Persistent storage — the model checkpoint is ~40 GB. Mount a Clore.ai persistent volume and set HF_HOME to avoid re-downloading on every container restart.

  4. Mux audio + video — if the pipeline outputs audio separately, combine with: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac final.mp4`.

  5. bf16 only — the 19B model was trained in bf16; fp16 will cause numerical instability.

  6. Batch in tmux — always run inside tmux on Clore.ai rentals to survive SSH disconnects.

  7. Check model ID — as LTX-2 is freshly released (Jan 2026), verify the exact Hugging Face model ID on the Lightricks HF page before running.
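Tip 4's mux step can also be driven from Python, which is handy in a batch script. This sketch builds the same ffmpeg command the tip shows (plus `-shortest`, as used in the troubleshooting table) and assumes `ffmpeg` is on `PATH`.

```python
# Wrapper around the ffmpeg mux command from tip 4; assumes ffmpeg is on PATH.
import subprocess

def mux_cmd(video: str, audio: str, out: str) -> list:
    """Copy the video stream, encode audio to AAC, stop at the shorter input."""
    return ["ffmpeg", "-y", "-i", video, "-i", audio,
            "-c:v", "copy", "-c:a", "aac", "-shortest", out]

def mux(video: str, audio: str, out: str) -> None:
    subprocess.run(mux_cmd(video, audio, out), check=True)
```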

Troubleshooting

| Problem | Fix |
| --- | --- |
| `OutOfMemoryError` | Enable `pipe.enable_model_cpu_offload()`; ensure ≥64 GB system RAM |
| No audio in output | Audio generation may require an explicit flag or an updated `diffusers`; check the model card for the latest API |
| Audio/video desync | Re-mux with ffmpeg: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac -shortest out.mp4` |
| Very slow generation | The 19B model is compute-heavy; ~2–4 min per 5-sec clip on an RTX 4090 is expected |
| NaN outputs | Use `torch.bfloat16` — fp16 is not supported at this model scale |
| Disk space error | The model is ~40 GB; ensure ≥80 GB free disk before downloading |
| `ModuleNotFoundError: soundfile` | `pip install soundfile` — needed for WAV audio export |
