
Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit

What is the project about?

Amphion is a toolkit designed for research and development in audio, music, and speech generation. It aims to provide a platform for reproducible research and help newcomers get started in the field.

What problem does it solve?

  • Provides a unified platform for various audio generation tasks, reducing the need to use multiple disparate tools.
  • Offers reproducible research environments and pretrained models, making it easier to replicate and build upon existing work.
  • Lowers the barrier to entry for junior researchers and engineers by providing visualizations and clear documentation.
  • Addresses the need for high-quality vocoders and consistent evaluation metrics in audio generation.
  • Facilitates the development of real-world applications by supporting large-scale dataset creation.

What are the features of the project?

  • Multiple Generation Tasks: Supports Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Accent Conversion (AC), Singing Voice Conversion (SVC), Text-to-Audio (TTA), Text-to-Music (TTM), and more.
  • State-of-the-Art Models: Includes implementations of popular and high-performing models like FastSpeech2, VITS, VALL-E, NaturalSpeech2, Jets, MaskGCT, Vevo, FACodec, Noro, and diffusion-based models.
  • Vocoders: Integrates a variety of neural vocoders, including GAN-based (MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet), flow-based (WaveGlow), diffusion-based (DiffWave), and autoregressive (WaveNet, WaveRNN) vocoders (a toy mel-to-waveform sketch follows this list).
  • Evaluation Metrics: Provides a comprehensive suite of objective metrics for assessing generated audio quality, covering F0 modeling, energy modeling, intelligibility, spectrogram distortion, and speaker similarity (a minimal distortion-style metric is sketched after this list).
  • Dataset Support: Offers unified data preprocessing for numerous open-source datasets (AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, etc.), as well as support for the Emilia dataset and its Emilia-Pipe preprocessing pipeline.
  • Visualization: Includes SingVisio, a visual analytics tool that exposes the inner workings of diffusion-based singing voice conversion models, aiding understanding and education.
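
To make the vocoder interface concrete, here is a toy, self-contained PyTorch sketch of the contract a neural vocoder fulfills: an 80-bin mel spectrogram goes in, and a waveform upsampled by the hop size (256 samples per frame here) comes out. This is an illustrative stand-in, not Amphion's HiFi-GAN implementation:

```python
import torch
import torch.nn as nn

class TinyVocoder(nn.Module):
    """Toy HiFi-GAN-style generator: transposed convolutions upsample an
    80-bin mel spectrogram by a total factor of 8 * 8 * 4 = 256 samples
    per frame. Illustrative only -- real vocoders add residual blocks,
    multi-scale discriminators, and adversarial training."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(128, 32, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # squash output into the [-1, 1] waveform range
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> waveform: (batch, 1, frames * 256)
        return self.net(mel)

mel = torch.randn(1, 80, 100)   # 100 frames of a fake mel spectrogram
wav = TinyVocoder()(mel)
print(wav.shape)                # torch.Size([1, 1, 25600])
```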
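
On the evaluation side, here is a minimal sketch of a spectrogram-distortion-style metric, using librosa to compare the log-mel spectrograms of a reference and a generated utterance. It is a simplified illustration of the idea, not Amphion's exact metric code:

```python
import librosa
import numpy as np

def log_mel_distance(ref_path: str, gen_path: str,
                     sr: int = 22050, n_mels: int = 80) -> float:
    """Mean per-frame L2 distance between log-mel spectrograms of a
    reference and a generated utterance. A crude stand-in for metrics
    like mel cepstral distortion; real pipelines align the two signals
    (e.g., with DTW) instead of truncating."""
    ref, _ = librosa.load(ref_path, sr=sr)
    gen, _ = librosa.load(gen_path, sr=sr)
    n = min(len(ref), len(gen))  # naive length alignment by truncation

    def to_logmel(y):
        return librosa.power_to_db(
            librosa.feature.melspectrogram(y=y[:n], sr=sr, n_mels=n_mels))

    diff = to_logmel(ref) - to_logmel(gen)
    return float(np.mean(np.linalg.norm(diff, axis=0)))
```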

What are the technologies used in the project?

  • Python: The primary programming language.
  • PyTorch: Deep learning framework.
  • Transformer Networks: Used in many of the supported models.
  • Generative Adversarial Networks (GANs): Used in vocoders and some generation models.
  • Variational Autoencoders (VAEs): Used in some generation models.
  • Flow-based Models: Used in vocoders and some generation models.
  • Diffusion Models: Used in some generation models and vocoders (a toy sketch of the diffusion forward process follows this list).
  • Docker: Provides containerization for easy setup and deployment.
  • CUDA: For GPU acceleration.
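
As an illustration of the last model family above, here is a minimal sketch of the closed-form forward (noising) process that DDPM-style diffusion vocoders such as DiffWave build on. The schedule values are common defaults from the diffusion literature, not Amphion's configuration:

```python
import torch

# Linear beta schedule; alpha_bar_t is the cumulative product of (1 - beta).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def noisy_sample(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)
    in a single step, returning both x_t and the noise used (the
    training target for the denoising network)."""
    eps = torch.randn_like(x0)
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return x_t, eps

x0 = torch.randn(1, 22050)   # stand-in for one second of 22.05 kHz audio
x_t, eps = noisy_sample(x0, t=500)
# Training regresses a network eps_theta(x_t, t, mel) onto eps with MSE loss;
# generation then runs the learned reverse process from pure noise.
```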

What are the benefits of the project?

  • Reproducibility: Facilitates reproducible research in audio generation.
  • Accessibility: Makes it easier for newcomers to enter the field.
  • Comprehensive: Offers a wide range of tools and models in one place.
  • State-of-the-Art: Includes implementations of cutting-edge models.
  • Extensible: Designed to support new models and tasks.
  • Open-Source: Free to use for both research and commercial purposes (MIT License).
  • Visualization: Provides tools to understand the models.

What are the use cases of the project?

  • Research: Studying and developing new audio, music, and speech generation models and techniques.
  • Education: Learning about audio generation models and their inner workings.
  • Applications: Building real-world applications such as:
    • Text-to-speech systems for accessibility or entertainment (a toy end-to-end pipeline is sketched after this list).
    • Voice cloning and modification.
    • Singing voice synthesis.
    • Audio and music generation.
    • Accent conversion.
    • Developing datasets for training.
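
For a sense of how the pieces fit together in an application, here is a hypothetical sketch of the two-stage text-to-speech pipeline (phonemizer, then acoustic model, then vocoder) that toolkits like Amphion implement. All names below are illustrative stand-ins, not Amphion's actual API:

```python
import torch

def synthesize(text, phonemizer, acoustic_model, vocoder):
    """Hypothetical glue code for a two-stage TTS pipeline:
    text -> phoneme IDs -> mel spectrogram -> waveform."""
    phones = phonemizer(text)
    with torch.no_grad():
        mel = acoustic_model(phones)   # e.g., a FastSpeech2-style model
        wav = vocoder(mel)             # e.g., a HiFi-GAN-style vocoder
    return wav

# Dummy stand-ins so the sketch runs end to end:
phonemizer = lambda s: torch.randint(0, 100, (1, len(s)))
acoustic_model = lambda ids: torch.randn(1, 80, ids.shape[1] * 5)
vocoder = lambda mel: torch.randn(1, 1, mel.shape[-1] * 256)
print(synthesize("hello world", phonemizer, acoustic_model, vocoder).shape)
```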