LTX-2 AI Video — FREE Low VRAM Image to Video with Audio (Wan2GP Tutorial)

Sep 2025 · 10 min read · AI Video · LTX-2 · Wan2GP · Talking Avatars

LTX-2 by Lightricks is a serious step forward in open-source AI video generation. We're talking 4K resolution at 50 frames per second with native synchronized audio — dialogue, ambient noise, and lip-sync — all generated from a single model. And thanks to Wan2GP's low-VRAM optimizations, this runs on consumer hardware.

What Makes LTX-2 Stand Out

🎬
Native Audio Generation
Synthesizes synchronized dialogue and ambient sound — not just pixels
📺
4K @ 50fps
Production-grade resolution and frame rate from an open-source model
🗺️
Depth Map Control
Use depth maps or multiple keyframes to direct camera movement precisely
⚡
3x Faster Generation
NVFP4 and BF16 formats deliver up to 3x speed vs older video models

LTX-2 also offers two distinct generation modes: a Fast mode for quick iteration and a Pro mode for cinematic-quality output. We'll run it through Wan2GP, a local AI video tool optimized for low-VRAM consumer hardware that supports LTX-2, WAN 2.2, and Flux.

One-click installer available: Patreon members can skip the manual setup with a one-click Windows installer for Wan2GP, plus a separate ComfyUI installer with pre-built text-to-video and image-to-video workflows. Links in the video description.

Manual Install — Wan2GP with LTX-2

Requirements before starting:
  • Miniconda — isolates Python environments (search "Miniconda" → anaconda.com)
  • FFmpeg — required for video processing (ffmpeg.org)
  • Git — for cloning the repository (git-scm.com)
  • Python 3.10 — newer versions conflict with required libraries
  1. Create the Conda environment. Open Anaconda Prompt and run:
conda create -n wan2gp python=3.10 -y
conda activate wan2gp
  2. Navigate to your install folder using cd, then clone the repository and enter it:
git clone https://github.com/deepbeats/wan2gp
cd wan2gp
  3. Install base dependencies:
pip install -r requirements.txt
  4. Install PyTorch with CUDA support (Windows NVIDIA GPU — use the specific command from the video description for PyTorch 2.8 with CUDA 12.8).
  5. Install Sage Attention via pre-built wheel. This step is often the trickiest on Windows. Download the pre-built wheel from the linked GitHub repository (releases page) — match it to Python 3.10 + PyTorch 2.8 + CUDA 12.8. Drop the .whl file into your wan2gp folder, then run:
pip install [sage_attention_wheel_filename].whl
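Mismatched wheel tags are the most common failure at this step, so it is worth eyeballing the filename before installing: the cp310 tag must match Python 3.10, and the CUDA/Torch markers must match your install. A minimal sketch of that check; the filename below is an illustrative placeholder, not the real release asset name:

```shell
# Substitute the wheel you actually downloaded from the releases page.
# (Illustrative filename -- the real asset name will differ.)
wheel="sageattention-2.x.x+cu128torch2.8.0-cp310-cp310-win_amd64.whl"

# cp310 in the filename means the wheel was built for Python 3.10.
case "$wheel" in
  *cp310*) echo "Python tag OK: cp310 matches Python 3.10" ;;
  *)       echo "Python tag mismatch -- get the cp310 build" >&2; exit 1 ;;
esac
```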
  6. Install Triton for Windows using the command from the video description (a Windows-compatible alternative to the standard Triton build).
  7. Launch Wan2GP:
python wgp.py

The terminal will output a local URL — paste it into your browser to open the Wan2GP Gradio interface.

First run: Wan2GP automatically downloads LTX-2 model weights on the first generation. This initial wait is long — plan for it and make sure you have enough disk space before starting.
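On Windows you can check free space on the relevant drive in Explorer; on Linux, macOS, or WSL a one-liner from inside the wan2gp folder does the same job (the weights are a large download, so check before kicking off the first generation):

```shell
# Show free space on the filesystem holding the current directory,
# in human-readable units -- run this from your wan2gp folder.
df -h .
```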

Running the Three LTX-2 Workflows

Mode 1
Text to Video
Select "Text prompt only" — write a detailed prompt and let the model generate + add audio automatically
Mode 2
Image to Video
Select "Start video with image" — upload a source image and add the camera control LoRA for natural motion
Mode 3
Talking Avatar
Image to video + upload your own voice clip — generates a lip-synced talking head from a single photo

Text to Video

  1. In the Wan2GP UI, select LTX-2 from the model dropdown.
  2. Set Control video process to either Upload audio or Generate video based on soundtrack and text prompt.
  3. Write a highly detailed prompt — LTX-2 is a production-grade model and responds much better to descriptive, specific instructions (lighting, motion, character details). Running your draft through an LLM to expand it works well.
  4. Set resolution to 720p minimum — quality degrades heavily at lower resolutions.
  5. Click Generate. On an RTX 4090, expect 1.5–4 minutes depending on prompt complexity and frame count.

Image to Video

  1. Select Start video with image at the top of the interface.
  2. Upload your source image.
  3. Critical: Download and add the LTX2-camera-control-static LoRA to your Wan2GP loras/ folder. Without this LoRA, image-to-video generations produce endless camera zooms with little subject motion.
  4. Select the LoRA in the interface, set your strength value, and generate. Image-to-video takes slightly longer than text-to-video but gives much more compositional control.
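Since a missing LoRA file silently falls back to the zoom-heavy default motion, a quick check that the file actually landed in loras/ is cheap insurance. A sketch; the .safetensors filename here is a placeholder, so match it to whatever the download is actually called:

```shell
# Stand-in for the real download -- replace with the actual LoRA file.
mkdir -p loras
touch loras/ltx2-camera-control-static.safetensors

# The check you would run before generating:
if ls loras/ | grep -qi "camera"; then
  echo "camera-control LoRA found"
else
  echo "camera-control LoRA missing -- download it before generating"
fi
```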

Talking Avatar (Lip-Sync)

  1. Keep Start video with image selected and upload your character photo.
  2. Under Control video process, switch to Generate video based on soundtrack and text prompt.
  3. Upload your voice recording in the audio slot that appears.
  4. In your prompt, describe the character's actions. If they're speaking, include the dialogue explicitly.
  5. Match frame count to audio length — if your audio clip is 5 seconds, set frames accordingly so the video doesn't cut off mid-sentence.
  6. Make sure the camera control LoRA is still active — it's essential for talking avatar generations too.
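The frame-count rule from step 5 is simple arithmetic: frames = audio seconds × frame rate. A sketch, assuming LTX-2's native 50 fps and a 5-second clip:

```shell
# frames = audio length (seconds) * frame rate (fps)
audio_seconds=5
fps=50   # LTX-2's native frame rate
frames=$((audio_seconds * fps))
echo "set frame count to at least $frames"
```

Round up rather than down, so the video never cuts off mid-sentence.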

Advanced Settings

Advanced Mode Tab

Post-Processing / Upscaling

Available in the Post-processing tab. If you use upscaling, prefer the spatial upscaler — it produces cleaner results than the temporal upscaler.

Quantized Models for Lower VRAM

The model dropdown includes distilled, GGUF, and FP4 quantized variants. If your GPU is under 12 GB VRAM, try the GGUF version — it significantly reduces memory requirements with a modest quality trade-off.

RunPod alternative: If your local GPU is too limited (for reference, this was tested on an RTX 4050 with 6 GB of VRAM), RunPod provides RTX 4090 access for a few dollars per session to run full-quality generations.

Tips for Best Results

📦 Want to skip the setup?

The Local Lab offers pre-configured AI installer packages so you can get running in minutes, not hours.

Get the Installer →