
Hibiki: High-Fidelity Simultaneous Speech-To-Speech Translation

What is the project about?

Hibiki is a streaming speech translation model that performs simultaneous translation, producing both text and speech output in the target language as the source speaker talks. It adapts its flow to accumulate just enough context for real-time translation.

What problem does it solve?

Offline translation systems must wait for the end of a sentence or utterance before they can begin translating. Hibiki removes that constraint by translating simultaneously as the speaker talks, making cross-lingual communication more natural and fluid.

What are the features of the project?

  • Streaming/Simultaneous Translation: Translates speech in real-time, chunk by chunk, as the speaker talks.
  • Speech and Text Output: Generates both natural-sounding speech and timestamped text translation in the target language.
  • Voice Transfer: Optionally preserves the speaker's voice characteristics in the translated speech.
  • Multistream Architecture: Based on the Moshi architecture, it models source and target speech jointly for continuous processing.
  • Controllable Fidelity: Voice transfer fidelity can be adjusted using Classifier-Free Guidance.
  • Constant Framerate: Outputs text and audio tokens at a constant 12.5 Hz, yielding a continuous audio stream (a per-frame sketch follows this list).
  • Batching Compatible: Relies on simple temperature sampling, which keeps inference compatible with batch processing.
  • Multiple Backends: Provides inference code for PyTorch, Rust, MLX (macOS), and MLX-Swift (iOS).
  • On-Device Inference: Hibiki-M, a smaller variant of the model, can run locally on smartphone hardware.
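
As a rough conceptual sketch of the per-frame loop described above, the toy Python below emits one text token and one audio token per 80 ms frame (12.5 Hz), blends conditioned and unconditioned logits with a classifier-free guidance coefficient, and uses plain temperature sampling. Every name here (`ToyDecoder`, `translate_stream`, `cfg_coef`) is illustrative and assumed, not part of the released Hibiki/moshi API.

```python
# Illustrative sketch only: a toy per-frame loop in the spirit of Hibiki's
# streaming inference (12.5 Hz frames, classifier-free guidance, temperature
# sampling). Class and function names are hypothetical, not the real API.
import numpy as np

FRAME_RATE_HZ = 12.5            # one decoding step every 80 ms of source audio
VOCAB_TEXT, VOCAB_AUDIO = 1000, 2048


class ToyDecoder:
    """Stand-in for the multistream decoder: returns random logits."""

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)

    def step(self, source_frame, conditioned: bool):
        # Real model: attends over past source/target tokens; `conditioned`
        # would toggle the voice-conditioning input used for guidance.
        return (self.rng.standard_normal(VOCAB_TEXT),
                self.rng.standard_normal(VOCAB_AUDIO))


def sample(logits, temperature, rng):
    """Temperature sampling over a vector of logits."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


def translate_stream(source_frames, cfg_coef=3.0, temperature=0.8):
    """Consume source frames one by one, emit (text_token, audio_token) pairs."""
    model, rng = ToyDecoder(), np.random.default_rng(0)
    for frame in source_frames:
        out = []
        for cond_logits, uncond_logits in zip(model.step(frame, True),
                                              model.step(frame, False)):
            # Classifier-free guidance: push logits toward the conditioned
            # prediction; a larger cfg_coef means stronger voice fidelity.
            guided = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
            out.append(sample(guided, temperature, rng))
        yield tuple(out)


if __name__ == "__main__":
    # 25 frames of fake source audio = 2 seconds at 12.5 Hz (1920 samples at 24 kHz).
    fake_frames = [np.zeros(1920) for _ in range(25)]
    for i, (text_tok, audio_tok) in enumerate(translate_stream(fake_frames)):
        print(f"frame {i}: text={text_tok} audio={audio_tok}")
```

Because one token set is produced per frame, the translated audio stream stays continuous and the latency between source speech and output remains bounded.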

What are the technologies used in the project?

  • Deep Learning: Decoder-only transformer model.
  • Moshi Architecture: Leverages the multistream architecture of Moshi.
  • Synthetic Data Generation: Uses weakly-supervised contextual alignment and alignment-aware TTS for training data.
  • MADLAD: Uses the pre-trained MADLAD machine translation system to derive word-level alignments between source and target text (a toy illustration follows this list).
  • Programming Languages: Python (PyTorch), Rust, Swift (MLX-Swift).
  • Frameworks/Libraries: PyTorch, MLX, MLX-Swift.
  • Hugging Face: Models and samples are hosted on Hugging Face.
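
To make the alignment idea concrete: for each target word, the weakly-supervised approach looks for the shortest source prefix after which a translation model is about as confident in that word as it is given the full source, and treats that position as the word's alignment point. The sketch below is a toy version of that principle with a fabricated `score_word` stand-in; the actual pipeline scores words with MADLAD and is more involved.

```python
# Toy illustration of contextual alignment (not the actual Hibiki pipeline).
# Idea: a target word is "ready" once adding more source words no longer
# improves a translation model's confidence in it. The real system scores
# words with MADLAD; here `score_word` is a fabricated stand-in.


def score_word(source_prefix: list[str], target_word: str) -> float:
    """Hypothetical stand-in for log P(target_word | source_prefix) under an MT model."""
    # Pretend the model becomes confident once the matching source word appears.
    fake_lexicon = {"bonjour": "hello", "le": "the", "monde": "world"}
    translated = {fake_lexicon.get(w) for w in source_prefix}
    return 0.0 if target_word in translated else -5.0


def align(source_words: list[str], target_words: list[str]) -> list[int]:
    """For each target word, return the smallest source-prefix length whose
    score is (nearly) as good as the score given the full source sentence."""
    positions = []
    for tgt in target_words:
        full_score = score_word(source_words, tgt)
        for k in range(1, len(source_words) + 1):
            if score_word(source_words[:k], tgt) >= full_score - 1e-6:
                positions.append(k)
                break
    return positions


if __name__ == "__main__":
    src = ["bonjour", "le", "monde"]
    tgt = ["hello", "the", "world"]
    print(align(src, tgt))  # [1, 2, 3]: each target word aligns to its source position
```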

What are the benefits of the project?

  • Real-time Communication: Enables more natural and fluid cross-lingual conversations.
  • Low Latency: Minimizes the delay between source speech and translated output.
  • High Fidelity: Produces natural-sounding speech with optional voice transfer.
  • Flexibility: Supports multiple platforms and devices.
  • Open Source: Code and models are publicly available.

What are the use cases of the project?

  • Live Interpretation: Simultaneous translation for meetings, conferences, and presentations.
  • Real-time Communication Tools: Integration into video conferencing and communication apps.
  • Accessibility: Assisting individuals with hearing impairments or language barriers.
  • Content Creation: Dubbing videos and podcasts in real-time.
  • Language Learning: Providing immediate feedback and translation during language practice.
  • On-Device Translation: Running translation locally on smartphones.