LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

What is the project about?

LatentSync is an end-to-end lip-sync framework that generates lip-synced videos directly from audio input. It uses a novel approach based on audio-conditioned latent diffusion models, leveraging the capabilities of Stable Diffusion.
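To make the core idea concrete, the following is a highly simplified, self-contained sketch of an audio-conditioned denoising loop in latent space. The toy module, tensor shapes, and sampler settings are illustrative stand-ins rather than the actual LatentSync architecture; in the real pipeline the denoised latent is decoded back to video frames by the Stable Diffusion VAE.

```python
import torch
import torch.nn as nn

class ToyAudioConditionedUNet(nn.Module):
    """Stand-in for the denoising U-Net: predicts noise from a noisy latent,
    a timestep, and audio embeddings (injected via cross-attention in the real
    model; a simple additive projection here)."""
    def __init__(self, latent_channels=4, audio_dim=384):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_channels)
        self.conv = nn.Conv2d(latent_channels, latent_channels, 3, padding=1)

    def forward(self, noisy_latent, t, audio_emb):
        # Pool the audio sequence and add it as a per-channel bias (toy conditioning).
        cond = self.audio_proj(audio_emb.mean(dim=1))[:, :, None, None]
        return self.conv(noisy_latent + cond)

@torch.no_grad()
def sample(unet, audio_emb, steps=50, shape=(1, 4, 32, 32)):
    """Plain DDPM ancestral sampling with a linear beta schedule; the audio
    embeddings condition every denoising step."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        eps = unet(x, t, audio_emb)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # in the real pipeline this latent is decoded to frames by the SD VAE

unet = ToyAudioConditionedUNet()
audio_emb = torch.randn(1, 50, 384)   # e.g. Whisper features for one audio chunk
latent = sample(unet, audio_emb)
print(latent.shape)                   # torch.Size([1, 4, 32, 32])
```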

What problem does it solve?

The project addresses the problem of generating realistic and temporally consistent lip-synced videos. Existing diffusion-based methods often suffer from temporal inconsistencies. LatentSync aims for higher quality and more stable lip movements synchronized with audio.

What are the features of the project?

  • End-to-end lip-sync generation: Directly generates video frames from audio, without relying on intermediate representations like facial landmarks.
  • Audio-conditioned latent diffusion model: Uses a diffusion model in the latent space of Stable Diffusion, conditioned on audio embeddings from Whisper.
  • Temporal REPresentation Alignment (TREPA): A novel technique to improve temporal consistency in the generated video by aligning frame representations with those from large-scale self-supervised video models (a loss sketch follows this list).
  • High-quality output: Leverages Stable Diffusion to produce visually appealing and realistic results.
  • Open-source: Includes inference code, data processing pipeline, and training code.
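The TREPA idea can be illustrated with a small sketch: a frozen video encoder (standing in for a large-scale self-supervised video model) embeds both the generated clip and the ground-truth clip, and a feature-space distance between the two representations is added to the training loss. The toy encoder, the tensor shapes, and the choice of an L2 distance here are assumptions for illustration, not the exact formulation used by LatentSync.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoEncoder(nn.Module):
    """Placeholder for a frozen, large-scale self-supervised video model
    (the actual backbone used by LatentSync is defined in its training code)."""
    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.net = nn.Conv3d(in_channels, feat_dim, kernel_size=3, padding=1)

    def forward(self, video):               # video: (B, C, T, H, W)
        feats = self.net(video)
        return feats.flatten(2).mean(-1)    # (B, feat_dim) temporal representation

def trepa_style_loss(encoder, generated, reference):
    """Align the temporal representation of generated frames with that of the
    reference frames; the encoder stays frozen, so gradients only shape the
    generator. An L2 distance in feature space is one simple choice."""
    with torch.no_grad():
        target = encoder(reference)
    pred = encoder(generated)
    return F.mse_loss(pred, target)

encoder = ToyVideoEncoder().eval()
for p in encoder.parameters():
    p.requires_grad_(False)

generated = torch.randn(2, 3, 16, 64, 64, requires_grad=True)  # fake model output
reference = torch.randn(2, 3, 16, 64, 64)                      # ground-truth clip
loss = trepa_style_loss(encoder, generated, reference)
loss.backward()
print(loss.item())
```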

What are the technologies used in the project?

  • Stable Diffusion: A powerful latent diffusion model for image generation.
  • Whisper: An audio encoder from OpenAI used to extract audio embeddings.
  • U-Net: The core architecture of the diffusion model, modified to incorporate audio conditioning via cross-attention (see the sketch after this list).
  • SyncNet: A pre-trained model used to enforce lip-sync accuracy by providing a loss function during training.
  • LPIPS (Learned Perceptual Image Patch Similarity): A perceptual loss function used to improve visual quality.
  • TREPA: The temporal representation alignment loss described above, used during training to improve frame-to-frame consistency.
  • PyTorch: Deep learning framework.
  • Gradio: For creating a user-friendly demo application.
  • PySceneDetect: For scene detection in the data processing pipeline.
  • face-alignment: For face landmark detection.

What are the benefits of the project?

  • Improved temporal consistency: TREPA significantly enhances the stability of lip movements across frames.
  • High-quality lip-sync: Achieves accurate synchronization between audio and video.
  • End-to-end approach: Simplifies the lip-sync generation process.
  • Open-source and reproducible: Provides code and resources for others to use and build upon.
  • Easy to use: Includes a Gradio app and a command-line interface for inference (a minimal Gradio sketch follows below).
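As a rough illustration of how a Gradio front end for inference might look, here is a minimal sketch. The `run_latentsync` function is a hypothetical placeholder rather than the project's real API; the repository ships its own demo app and inference scripts.

```python
import gradio as gr

def run_latentsync(video_path, audio_path):
    """Hypothetical placeholder: a real app would call the LatentSync inference
    pipeline here and return the path of the lip-synced output video. This stub
    simply echoes the input video so the demo runs end to end."""
    return video_path

demo = gr.Interface(
    fn=run_latentsync,
    inputs=[
        gr.Video(label="Input video"),
        gr.Audio(label="Driving audio", type="filepath"),
    ],
    outputs=gr.Video(label="Lip-synced result"),
    title="LatentSync demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```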

What are the use cases of the project?

  • Dubbing: Automatically generating lip-synced videos for different languages.
  • Virtual avatars: Creating realistic talking avatars for games, virtual assistants, and other applications.
  • Video editing: Modifying the lip movements in existing videos.
  • Accessibility: Improving communication for individuals with speech impairments.
  • Animation: Generating lip-sync for animated characters.
  • Content Creation: Creating engaging video content with synchronized audio and visuals.