Meta unveils SeamlessM4T multimodal translation model
Meta researchers have unveiled SeamlessM4T, a pioneering multilingual and multitask model that facilitates seamless translation and transcription across both speech and text.
The internet, mobile devices, social media, and communication platforms have ushered in an era where access to multilingual content has reached unprecedented levels. SeamlessM4T aims to realise the vision of seamless communication and comprehension across languages.
SeamlessM4T offers a broad set of capabilities:
- Automatic speech recognition for nearly 100 languages
- Speech-to-text translation supporting nearly 100 input and output languages
- Speech-to-speech translation for nearly 100 input languages and 35 output languages (including English)
- Text-to-text translation for almost 100 languages
- Text-to-speech translation for nearly 100 input languages and 35 output languages (including English)
SeamlessM4T is being made available to researchers and developers under the CC BY-NC 4.0 license, embodying an ethos of open science.
Additionally, the metadata of SeamlessAlign – the largest multimodal translation dataset ever compiled, consisting of 270,000 hours of mined speech and text alignments – has been released. This facilitates independent data mining and further research within the community.
The development of SeamlessM4T addresses a long-standing challenge in the field of multilingual communication. Unlike earlier systems, which were confined by limited language coverage and reliance on separate subsystems, SeamlessM4T presents a unified model capable of comprehensively handling speech-to-speech and speech-to-text translation tasks.
Meta has built upon previous innovations – such as No Language Left Behind (NLLB) and Universal Speech Translator – to create this unified multilingual model. With impressive results on low-resource languages and consistently strong performance on high-resource ones, SeamlessM4T holds the potential to revolutionise cross-language communication.
Underpinning the model’s architecture is the multitask UnitY model, which generates both translated text and speech.
UnitY supports multiple translation tasks – including automatic speech recognition, text-to-text translation, and speech-to-speech translation – from a single model. To train it, Meta combined self-supervised speech encoders, text encoders, and a two-pass decoding process that produces translated text before discrete speech units.
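The two-pass idea behind UnitY can be sketched in miniature: a speech encoder feeds a text decoder, and a second decoder then predicts discrete acoustic units conditioned on the text decoder's hidden states (a separate vocoder would turn those units into audio). Everything below, including the class name `TinyUnitY` and all dimensions, is an illustrative toy and not Meta's implementation, which uses large pretrained Transformer components.

```python
import torch
import torch.nn as nn

class TinyUnitY(nn.Module):
    """Toy sketch of a UnitY-style two-pass speech translation model."""

    def __init__(self, dim=64, text_vocab=1000, unit_vocab=500):
        super().__init__()
        # First pass: encode speech features, decode translated text.
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(text_vocab, dim)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.text_head = nn.Linear(dim, text_vocab)
        # Second pass: decode discrete acoustic units from the text
        # decoder's states; a vocoder would map units to a waveform.
        self.unit_embed = nn.Embedding(unit_vocab, dim)
        self.unit_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.unit_head = nn.Linear(dim, unit_vocab)

    def forward(self, speech_feats, text_tokens, unit_tokens):
        enc = self.speech_encoder(speech_feats)
        text_states = self.text_decoder(self.text_embed(text_tokens), enc)
        text_logits = self.text_head(text_states)      # speech-to-text output
        unit_states = self.unit_decoder(self.unit_embed(unit_tokens), text_states)
        unit_logits = self.unit_head(unit_states)      # speech-to-speech units
        return text_logits, unit_logits

model = TinyUnitY()
speech = torch.randn(1, 50, 64)          # 50 frames of speech features
text = torch.randint(0, 1000, (1, 12))   # partial text hypothesis
units = torch.randint(0, 500, (1, 30))   # partial unit hypothesis
text_logits, unit_logits = model(speech, text, units)
```

Because the unit decoder attends to the text decoder's states rather than decoding from raw audio, the same backbone serves speech-to-text, text-to-text, and speech-to-speech tasks, which is the key to UnitY's single-model design.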
The result is a model that outperforms the previous state of the art across its supported tasks.
To ensure the accuracy and safety of the system, Meta adheres to a responsible AI framework.
Meta says it has conducted extensive research on toxicity and bias mitigation, producing a model designed to detect and reduce harmful outputs. The public release of SeamlessM4T encourages collaborative research and development in the AI community.
As the world becomes more connected, SeamlessM4T’s ability to transcend language barriers is a testament to the power of AI-driven innovation. This milestone brings us closer to a future where communication knows no linguistic limitations, enabling a world where people can truly understand each other regardless of language.
A demo of SeamlessM4T can be found here. The code, model, and data can be downloaded on GitHub.
(Image Credit: Meta AI)