Towards Massively Multilingual Universal Translation Using Semi-Supervised Learning: the work of the SEAMLESS Communication Team
Meta’s previous work includes speech-to-speech translation and the No Language Left Behind project, which aimed to provide text-to-text translation for some 200 languages. Making translation systems multilingual can improve their performance on languages for which only small amounts of training data exist, although why this happens is not fully understood.
Most existing models work only with text, and cover only a small subset of the world’s languages.
Writing in Nature, the SEAMLESS Communication Team1 addresses these challenges with key technologies that could make rapid universal translation a reality.
To train their AI model, the researchers relied on methods called self-supervised and semi-supervised learning. These approaches help a model learn from huge amounts of data without requiring humans to label the data; in this case, such labels would be accurate transcripts or translations.
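One common form of semi-supervised learning is pseudo-labelling, in which a model trained on a small labelled set assigns labels to unlabelled data itself, keeping only its most confident guesses as extra training material. The toy sketch below illustrates the idea with a nearest-centroid classifier; the data, threshold and classifier are hypothetical and much simpler than anything in the authors' actual pipeline.

```python
# Toy illustration of pseudo-labelling (NOT the authors' pipeline): a model
# trained on a small labelled set labels the unlabelled pool itself, and
# only confident predictions are kept as new training data.

def centroid(points):
    # Mean of a list of equal-length feature vectors.
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def distance(a, b):
    # Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pseudo_label(labelled, unlabelled, threshold=1.0):
    # labelled: dict mapping label -> list of vectors.
    # unlabelled: list of vectors with no labels.
    centroids = {lab: centroid(pts) for lab, pts in labelled.items()}
    newly_labelled = []
    for x in unlabelled:
        best_dist, best_lab = min(
            (distance(x, c), lab) for lab, c in centroids.items()
        )
        # Keep only confident predictions (close to a class centroid).
        if best_dist < threshold:
            newly_labelled.append((x, best_lab))
    return newly_labelled

labelled = {"en": [[0.0, 0.0], [0.1, 0.2]], "fr": [[5.0, 5.0], [4.8, 5.1]]}
unlabelled = [[0.2, 0.1], [5.2, 4.9], [2.5, 2.5]]
print(pseudo_label(labelled, unlabelled))
# The ambiguous point [2.5, 2.5] is discarded rather than mislabelled.
```

The key design choice is the confidence threshold: too loose, and the model amplifies its own mistakes; too strict, and little unlabelled data is used.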
The part of the model responsible for translation was pre-trained on a massive data set comprising 4.5 million hours of multilingual spoken audio. This pre-training helps the model to learn patterns in the data, making it easier to fine-tune the model for specific tasks without requiring large amounts of task-specific training data.
One of the SEAMLESS team’s savviest strategies involved ‘mining’ the Internet for training pairs that align across languages, such as audio snippets in one language that match subtitles in another. Starting with some data that they knew to be reliable, the authors trained the model to recognize when two pieces of content (such as a video clip and a corresponding subtitle) actually match in meaning. They applied this technique to vast amounts of internet data, collecting around 443,000 hours of audio with matching text and aligning some 30,000 hours of speech pairs, which they used to further train their model.
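This kind of mining can be pictured as a nearest-neighbour search over embeddings: each audio clip and each piece of text is mapped to a vector, and a pair is kept only when the two vectors are sufficiently similar. The sketch below shows the matching step under strong simplifying assumptions; the two-dimensional embeddings and the similarity threshold are invented for illustration, whereas the real system relies on learned multilingual encoders over web-scale corpora.

```python
# Sketch of cross-lingual pair mining via embedding similarity (hypothetical
# embeddings and threshold; the team's real system uses learned multilingual
# encoders over web-scale data).

import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mine_pairs(audio_embs, text_embs, threshold=0.9):
    # For each audio embedding, find the best-matching text embedding and
    # keep the pair only if the similarity clears the threshold.
    pairs = []
    for i, a in enumerate(audio_embs):
        best_j = max(range(len(text_embs)),
                     key=lambda j: cosine(a, text_embs[j]))
        if cosine(a, text_embs[best_j]) >= threshold:
            pairs.append((i, best_j))
    return pairs

audio = [[1.0, 0.0], [0.0, 1.0]]                  # two audio-clip embeddings
texts = [[0.1, 0.99], [0.98, 0.05], [0.7, 0.7]]   # three subtitle embeddings
print(mine_pairs(audio, texts))  # [(0, 1), (1, 0)]
```

The threshold serves the same purpose as the team's reliability check: weak matches are discarded rather than risk polluting the training set with misaligned pairs.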
The result is SEAMLESSM4T (Massively Multilingual and Multimodal Machine Translation), a system that can translate speech to speech, speech to text, and text to text. The results are described in Nature.
Meta has become one of the largest supporters of open-source language technology. Its research team was instrumental in developing PyTorch, a software library for training AI models, which is widely used by companies such as OpenAI and Tesla, as well as by many researchers around the world. Its Llama family of large language models2 can be used to build applications similar to those made possible by the model introduced here. This level of openness is a huge advantage for researchers who lack the massive computational resources needed to build these models from scratch.
Some transcription models have even been known to ‘hallucinate’5, coming up with entire phrases that were never uttered in the audio input, and this occurs more frequently for speakers who have speech impairments than for those without them (Fig. 1c). Machine-generated errors of this kind, such as a false accusation at a trial or an incorrectly prescribed drug, could cause real harm, and that harm disproportionately affects marginalized populations.
The SEAMLESS researchers quantified the toxicity associated with their model (the degree to which its translations introduce harmful or offensive language)6. This is a step in the right direction, and offers a baseline against which future models can be tested. Extra care must be taken to ensure that models translate sensitive words accurately across languages, because the performance of existing models varies wildly. Similarly, computer-vision researchers are working to improve the poor performance of image-recognition models for under-represented groups and to deter the models from making offensive predictions.
The authors also looked for gender bias in the translations produced by their model. Their analysis examined whether the model over-represented one gender when translating gender-neutral phrases into gendered languages: does “I am a teacher” in English translate to the masculine “Soy profesor” or to the feminine “Soy profesora” in Spanish? But such analyses are restricted to languages with binary masculine and feminine forms, and future audits should broaden the scope of linguistic biases studied8.
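A simple way to picture such an audit is to run gender-neutral source sentences through a translation model and count how often the output uses masculine versus feminine forms. The sketch below does exactly that for a handful of Spanish profession nouns; the word lists and the sample translations are invented for illustration, and a real audit would use far larger lexicons and actual model outputs.

```python
# Toy gender-bias audit (hypothetical word lists and sample translations):
# count how often a model renders gender-neutral English sentences with
# masculine versus feminine Spanish forms.

MASCULINE = {"profesor", "medico", "enfermero"}
FEMININE = {"profesora", "medica", "enfermera"}

def audit(translations):
    # Tally masculine and feminine profession nouns across translations.
    counts = {"masculine": 0, "feminine": 0}
    for sentence in translations:
        for word in sentence.lower().split():
            if word in MASCULINE:
                counts["masculine"] += 1
            elif word in FEMININE:
                counts["feminine"] += 1
    return counts

# Imagined model outputs for gender-neutral inputs such as "I am a teacher":
sample = ["Soy profesor", "Soy enfermera", "Soy profesor"]
print(audit(sample))  # {'masculine': 2, 'feminine': 1}
```

A perfectly unbiased model would, over many gender-neutral inputs, produce roughly balanced counts; a skew like the one above is the signal such audits look for.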
Meta’s SEAMLESSM4T goes open-source
After the success of releasing its Llama large language models, Meta will make SEAMLESSM4T open-source for other researchers who want to build on it.
The team collected millions of hours of audio recordings of speech, along with human-generated translations of that speech, from the Internet and other sources, such as United Nations archives. The authors also collected transcripts of some of the speeches.