LTX-2 (Audio + Video)
Generate videos with native audio — foley, ambience, and lip-sync — using LTX-2 on Clore.ai GPUs.
LTX-2 (January 2026) is Lightricks' second-generation video foundation model and the first open-weight model to produce synchronized audio alongside video in a single forward pass. At 19B parameters it generates clips with foley sound effects, ambient audio, and lip-synced speech without requiring a separate audio model. The architecture builds on the original LTX-Video's speed advantage while dramatically expanding capability.
Renting a GPU on Clore.ai is the most practical way to run a 19B-parameter model — no $2,000 GPU purchase required, just spin up a machine and start generating.
Key Features
Native audio generation — foley effects, environmental ambience, and lip-synced dialogue produced jointly with video frames.
19B parameters — significantly larger transformer backbone than LTX-Video v1, delivering sharper detail and more coherent motion.
Text-to-Video + Image-to-Video — both modalities supported with audio output.
Up to 720p resolution — higher fidelity output than the v1 model.
Joint audio-visual latent space — a unified VAE encodes both video and audio, keeping them temporally aligned.
Open weights — released under a permissive license for commercial use.
Diffusers integration — compatible with the Hugging Face `diffusers` ecosystem.
Requirements
| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 16 GB (with offloading) | 24+ GB |
| System RAM | 32 GB | 64 GB |
| Disk | 50 GB | 80 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 12.1+ | 12.4 |
| diffusers | 0.33+ | latest |
Clore.ai GPU recommendation: An RTX 4090 (24 GB, ~$0.5–2/day) is the minimum for comfortable 720p generation with audio. For batch workloads or faster iteration, filter for dual-4090 or A6000 (48 GB) listings on the Clore.ai marketplace.
Quick Start
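A minimal setup sketch for a fresh Clore.ai rental. The package list assumes the typical `diffusers` video stack; `soundfile` is needed for WAV audio export (see Troubleshooting below):

```shell
# Install LTX-2's Python dependencies (assumes Python 3.10+ and CUDA drivers
# are already present on the rented machine)
pip install -U torch diffusers transformers accelerate "imageio[ffmpeg]" soundfile
```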
Usage Examples
Text-to-Video with Audio
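A hedged sketch of a text-to-video call with an audio-rich prompt. The model ID `Lightricks/LTX-2` and the generic `DiffusionPipeline` entry point are assumptions — as noted in the tips below, verify the exact ID and pipeline class on the Lightricks Hugging Face page before running:

```python
# Sketch only: model ID and pipeline class are assumptions; check the
# Lightricks HF page for the released API before running on a rental.

PROMPT = (
    "A blacksmith hammers a glowing blade on an anvil, sparks flying. "
    "Audio: rhythmic metallic clangs, crackling forge fire, low workshop hum."
)
NEGATIVE = "worst quality, blurry, jittery motion, distorted audio"


def text_to_video(out_path: str = "forge.mp4") -> None:
    """Run on a rented GPU (24 GB+ VRAM); downloads ~40 GB of weights."""
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2",          # assumed model ID -- verify on HF
        torch_dtype=torch.bfloat16,  # bf16 only; fp16 is unstable at 19B
    )
    pipe.enable_model_cpu_offload()  # fits a 24 GB card with 64 GB RAM

    result = pipe(
        prompt=PROMPT,
        negative_prompt=NEGATIVE,
        width=1280,
        height=720,
        num_frames=121,              # ~5 s at 24 fps
    )
    export_to_video(result.frames[0], out_path, fps=24)
```

Note the explicit "Audio:" clause in the prompt — per the tips below, named sounds ("metallic clangs", "crackling fire") steer the audio branch far better than vague descriptions.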
Image-to-Video with Lip-Sync Audio
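A hedged sketch of image-to-video with spoken dialogue. The `image=` conditioning kwarg and the model ID are assumptions about the released API; quoting the exact line of dialogue in the prompt is what gives the lip-sync branch something concrete to align to:

```python
# Sketch only: model ID and the image-conditioning kwarg are assumptions;
# confirm against the model card.

DIALOGUE_PROMPT = (
    "A news anchor at a desk looks into the camera and says: "
    "'Good evening, and welcome to the nightly report.' "
    "Clear studio acoustics, lips synced to the speech."
)


def image_to_video(image_url: str, out_path: str = "anchor.mp4") -> None:
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()

    frame = load_image(image_url)    # still image to animate from
    result = pipe(
        prompt=DIALOGUE_PROMPT,
        image=frame,                 # assumed conditioning kwarg
        width=1280,
        height=720,
        num_frames=121,
    )
    export_to_video(result.frames[0], out_path, fps=24)
```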
Ambient Scene with Foley
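A hedged sketch pairing an ambience-heavy prompt with an ffmpeg mux step for the case where the pipeline writes audio as a separate WAV. The model ID, pipeline entry point, and the separate-audio behavior are all assumptions — check the model card for how audio is returned in the released API:

```python
# Sketch only: generation call mirrors the text-to-video example; the mux
# helper covers pipelines that emit audio separately from the video file.
import subprocess

AMBIENCE_PROMPT = (
    "A quiet forest road at dusk in steady rain. "
    "Audio: rainfall on leaves, distant thunder, occasional bird calls."
)


def generate_ambience(out_video: str = "rain.mp4") -> None:
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2", torch_dtype=torch.bfloat16  # assumed model ID
    )
    pipe.enable_model_cpu_offload()
    result = pipe(prompt=AMBIENCE_PROMPT, width=1280, height=720, num_frames=121)
    export_to_video(result.frames[0], out_video, fps=24)


def mux(video: str, audio: str, out: str = "final.mp4") -> None:
    """Combine separate audio/video: copy the video stream, encode AAC audio."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-i", audio,
         "-c:v", "copy", "-c:a", "aac", "-shortest", out],
        check=True,
    )
```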
Tips for Clore.ai Users
Describe sounds explicitly — LTX-2's audio branch responds to audio cues in the prompt. "Crackling fire", "footsteps on gravel", "crowd murmuring" yield better foley than vague descriptions.
CPU offloading is essential — at 19B parameters, the model needs `enable_model_cpu_offload()` on 24 GB cards. Budget 64 GB of system RAM.
Persistent storage — the model checkpoint is ~40 GB. Mount a Clore.ai persistent volume and set `HF_HOME` to avoid re-downloading on every container restart.
Mux audio + video — if the pipeline outputs audio separately, combine with `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac final.mp4`.
bf16 only — the 19B model was trained in bf16; fp16 will cause numerical instability.
Batch in tmux — always run inside `tmux` on Clore.ai rentals to survive SSH disconnects.
Check model ID — as LTX-2 is freshly released (January 2026), verify the exact Hugging Face model ID on the Lightricks HF page before running.
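The persistent-storage and tmux tips above can be sketched as a small setup snippet; the cache path is an example, so substitute your own Clore.ai volume's mount point:

```shell
# Cache Hugging Face downloads on persistent storage so container restarts
# don't re-fetch the ~40 GB checkpoint. PERSISTENT_DIR is an example variable;
# point it at your mounted Clore.ai volume.
export HF_HOME="${PERSISTENT_DIR:-$HOME}/hf-cache"
mkdir -p "$HF_HOME"

# Start a detached tmux session for long generation jobs; reattach later
# with: tmux attach -t ltx2
tmux new-session -d -s ltx2 || true
```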
Troubleshooting
| Problem | Fix |
| --- | --- |
| `OutOfMemoryError` | Enable `pipe.enable_model_cpu_offload()`; ensure ≥64 GB system RAM |
| No audio in output | Audio generation may require an explicit flag or updated `diffusers`; check the model card for the latest API |
| Audio/video desync | Re-mux with ffmpeg: `ffmpeg -i video.mp4 -i audio.wav -c:v copy -c:a aac -shortest out.mp4` |
| Very slow generation | The 19B model is compute-heavy; ~2–4 min per 5-second clip on an RTX 4090 is expected |
| NaN outputs | Use `torch.bfloat16` — fp16 is not supported at this model scale |
| Disk space error | The model is ~40 GB; ensure ≥80 GB free disk before downloading |
| `ModuleNotFoundError: soundfile` | `pip install soundfile` — needed for WAV audio export |