Amazon researchers have trained BASE TTS, the largest text-to-speech AI model to date, which exhibits emergent abilities that let it speak complex sentences naturally. The model is the largest in its class, with 980 million parameters, and was trained on 100,000 hours of public domain voice data. It can handle difficult tasks such as parsing complex sentences, pronouncing foreign words correctly, and producing emotional or whispered speech. The model is still experimental, and further research is needed to determine the tipping point at which these new capabilities emerge.
These capabilities allow the model to produce natural, expressive speech without having to be explicitly trained for specific scenarios. The researchers tested three versions of BASE TTS and found that model size and the amount of training data are critical to performance. The medium and large versions outperformed existing models and were rated favorably by human listeners.
“We introduce a speech synthesis model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest text-to-speech model to date, trained on 100,000 hours of public domain speech data, achieving a new state of the art in speech naturalness. It uses an autoregressive transformer with a billion parameters that converts raw text into discrete codes (“speech codes”), followed by a convolutional decoder that converts these speech codes into waveforms incrementally, in a streamable manner. Additionally, our speech codes are built using a novel speech tokenization technique that disentangles speaker identity and compresses the codes using byte-pair encoding,” say the Amazon researchers.
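The abstract mentions that the speech codes are compressed with byte-pair encoding. As a rough illustration of that idea, and not Amazon's actual tokenizer, the following toy Python sketch applies BPE-style merges to sequences of discrete speech codes, shortening them by repeatedly replacing the most frequent adjacent pair with a new token:

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all speech-code sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_token):
    """Replace every occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_compress(sequences, vocab_size, num_merges):
    """Learn BPE merges over discrete speech codes to shorten sequences."""
    next_token = vocab_size  # new merged tokens extend the base vocabulary
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        sequences = [merge_pair(s, pair, next_token) for s in sequences]
        merges.append((pair, next_token))
        next_token += 1
    return sequences, merges

# Toy example: three "utterances" of discrete speech codes, base vocabulary of 8.
codes = [[1, 2, 3, 1, 2, 4], [1, 2, 1, 2, 7], [5, 1, 2, 3, 1, 2]]
compressed, merges = bpe_compress(codes, vocab_size=8, num_merges=4)
print(compressed)  # shorter sequences over an enlarged vocabulary
```

Shorter code sequences mean the autoregressive model has fewer steps to predict per second of audio, which is one common motivation for compressing discrete speech tokens this way.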
BASE TTS
As in recent work on language modeling, the researchers use an LLM-based approach to the TTS task (Figure 1). They consider a dataset $\mathcal{D} = \{x_i, y_i\}_{i=0}^{N}$, where $y$ is an audio sample and $x = \{x_1, \ldots, x_T\}$ is the corresponding text transcription. The audio $y = \{y_1, \ldots, y_S\}$ is represented by a sequence of $S$ discrete tokens (speech codes) learned by a separately trained speech tokenizer. They use a transformer-based autoregressive model with parameters $\phi$ to learn the joint probability of the text and audio sequences:

$$p_\phi(x, y) = p(x) \prod_{s=1}^{S} p_\phi(y_s \mid y_{<s}, x)$$
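To make this setup concrete, here is a minimal PyTorch sketch of such an LLM-style TTS model: a decoder-only transformer over the concatenation of text tokens and speech codes, trained to predict the next token. All names, dimensions and hyperparameters are illustrative assumptions, not the actual BASE TTS architecture:

```python
import torch
import torch.nn as nn

class TextToSpeechLM(nn.Module):
    """Minimal decoder-only transformer over [text tokens ; speech codes]."""

    def __init__(self, text_vocab=256, code_vocab=1024, d_model=512,
                 n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        # One shared index space: text ids first, speech codes offset after them.
        self.code_offset = text_vocab
        self.embed = nn.Embedding(text_vocab + code_vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, text_vocab + code_vocab)

    def forward(self, text_ids, code_ids):
        # Concatenate text and speech-code tokens into a single sequence.
        seq = torch.cat([text_ids, code_ids + self.code_offset], dim=1)
        T = seq.size(1)
        h = self.embed(seq) + self.pos(torch.arange(T, device=seq.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(seq.device)
        h = self.blocks(h, mask=mask)  # causal self-attention
        return self.head(h)            # next-token logits

# Each speech code is predicted from the text and the preceding codes,
# matching the autoregressive factorization above.
model = TextToSpeechLM()
text = torch.randint(0, 256, (2, 20))     # (batch, text length)
codes = torch.randint(0, 1024, (2, 100))  # (batch, speech-code length)
logits = model(text, codes)
# A cross-entropy loss on shifted targets over the speech-code region would follow.
```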
The predicted speech codes are then combined with a speaker reference and decoded into waveforms using a separately trained speech code decoder consisting of linear and convolutional layers.
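The paper's description of the decoder is limited to “linear and convolutional layers,” so the following is only a plausible sketch: an embedding and linear stage followed by transposed convolutions that upsample the code sequence to audio-rate samples. Layer sizes and upsampling factors are invented for illustration:

```python
import torch
import torch.nn as nn

class SpeechCodeDecoder(nn.Module):
    """Illustrative decoder: discrete speech codes -> upsampled waveform."""

    def __init__(self, code_vocab=1024, d_model=256, upsample_factors=(8, 8, 4)):
        super().__init__()
        self.embed = nn.Embedding(code_vocab, d_model)
        self.proj = nn.Linear(d_model, d_model)  # linear stage
        ups = []
        ch = d_model
        for f in upsample_factors:               # convolutional stages
            ups.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f,
                                          stride=f, padding=f // 2))
            ups.append(nn.LeakyReLU(0.1))
            ch //= 2
        self.upsample = nn.Sequential(*ups)
        self.out = nn.Conv1d(ch, 1, kernel_size=7, padding=3)  # waveform head

    def forward(self, codes):
        h = self.proj(self.embed(codes))  # (batch, S, d_model)
        h = h.transpose(1, 2)             # (batch, d_model, S)
        h = self.upsample(h)              # upsample in time: S -> S * 8 * 8 * 4
        return torch.tanh(self.out(h))    # audio samples in [-1, 1]

decoder = SpeechCodeDecoder()
codes = torch.randint(0, 1024, (1, 50))  # 50 predicted speech codes
wave = decoder(codes)                    # (1, 1, 50 * 256) waveform samples
print(wave.shape)
```

Because such a decoder is convolutional rather than autoregressive over individual samples, it can emit audio chunk by chunk, which is what makes a streamable design like this plausible for real-time use.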
Figure 1: Overview of BASE TTS. The speech tokenizer (1) learns a discrete representation, which is modeled by an autoregressive model (2) conditioned on the text and a reference speech sample. The speech code decoder (3) converts the predicted speech representations into a waveform.
This work represents a major advance in speech synthesis: the Big Adaptive Streamable Text-to-Speech (BASE TTS) model exhibits “emergent abilities.” BASE TTS is also “streamable,” making it suitable for real-time applications. The researchers highlighted advantages such as spontaneous speech generation and a method for improving expressiveness while keeping bandwidth low. However, recognizing potential risks such as misuse by malicious actors, they decided not to publish the model or its data. The paper was presented at the ICASSP 2024 conference and encourages future research into the emergent abilities of TTS models.
The Big Adaptive Streamable Text-to-Speech model comes in three versions with different parameter counts and amounts of training data. The largest version, BASE-large, has 980 million parameters and was trained on 100,000 hours of speech, primarily from the public domain, in English, German, Dutch and Spanish. The intermediate version, BASE-medium, has 400 million parameters and 10,000 hours of speech, while the smallest version, BASE-small, has 150 million parameters and 1,000 hours of speech.
The researchers evaluated the three models on tasks that are difficult for text-to-speech systems, such as pronouncing compound nouns, expressing emotions, handling foreign words, rendering paralinguistic cues (legible non-words such as interjections), punctuation, questions, and syntactic complexity. The results showed that BASE-medium and BASE-large significantly outperform BASE-small and existing models such as Tortoise and VALL-E on these tasks, and also receive better ratings from human listeners for the quality and naturalness of their speech.
These results suggest that model size and data volume are critical factors in the emergence of new abilities in text-to-speech models.
Regarding the advantages of BASE TTS, the researchers highlighted its ability to generate speech on the fly, making the model a suitable choice for real-time applications such as voice assistants or audiobook narration. In addition, they proposed a method for encoding and transmitting speech metadata, such as emotion, prosody and accent, in a separate, low-bandwidth stream, improving expressiveness without compromising audio quality.
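The article does not specify a wire format for this side channel, but the idea can be illustrated with a hypothetical Python sketch in which each metadata frame (emotion, prosody, accent) is compressed and timestamped separately from the audio stream, costing only a few dozen bytes per frame:

```python
import json
import struct
import zlib

def pack_metadata_frame(timestamp_ms, emotion, prosody, accent):
    """Serialize one metadata frame for a low-bandwidth side channel.

    Hypothetical layout: the audio travels on its own stream, while a few
    bytes per frame carry expressive attributes alongside it.
    """
    payload = zlib.compress(json.dumps({
        "emotion": emotion,   # e.g. "whisper", "excited"
        "prosody": prosody,   # e.g. {"pitch": -1, "rate": 0.8}
        "accent": accent,     # e.g. "en-GB"
    }).encode("utf-8"))
    # 4-byte timestamp + 2-byte payload length + compressed payload.
    return struct.pack(">IH", timestamp_ms, len(payload)) + payload

def unpack_metadata_frame(frame):
    """Recover the timestamp and expressive attributes from one frame."""
    timestamp_ms, length = struct.unpack(">IH", frame[:6])
    payload = json.loads(zlib.decompress(frame[6:6 + length]))
    return timestamp_ms, payload

frame = pack_metadata_frame(1500, "whisper", {"pitch": -1, "rate": 0.8}, "en-US")
print(len(frame), "bytes")           # a few dozen bytes per frame
print(unpack_metadata_frame(frame))
```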
Although the researchers believe their work represents a step forward for TTS technology by demonstrating the model's ability to produce natural and varied speech across different scenarios, they also acknowledge the potential risks associated with malicious use of the technology. They have therefore decided not to publish either the model or the data. The paper, titled “Big Adaptive Streamable TTS with Emergent Abilities,” was presented at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024.
The risks and successes of Voicebox, VALL-E and BASE TTS
By comparison, Meta recently introduced Voicebox, a new AI voice generation system that synthesizes spoken dialogue for a variety of potential use cases, including voice generation tasks it was not specifically trained for. Although Voicebox can produce realistic and expressive voices in six languages, Meta acknowledged the potential for misuse, such as creating deepfakes or enabling scams, and decided not to release Voicebox for now. The company emphasizes the need to balance openness with responsibility when developing such technologies.
According to Meta, Voicebox, based on a learning method called flow matching, outperforms existing text-to-speech models in voice quality and naturalness. It was trained on more than 50,000 hours of unfiltered audio, using public domain audiobook recordings and transcriptions. The researchers report that speech recognition models trained on Voicebox's synthetic speech perform almost as well as those trained on real speech, with error rates degrading by just 1%, compared with degradations of 45% to 70% for synthetic speech from existing models.
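Flow matching itself is a published, generic training technique: a network is trained to regress the velocity of a path that transports noise samples to data samples. The sketch below shows that objective in its simplest (straight-line path) form; it illustrates the general method, not Voicebox's actual model, and the tiny network and feature dimensions are placeholders:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny stand-in for the network that predicts the flow's velocity."""

    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t):
        # Condition the prediction on the time t along the path.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model, x1):
    """One training step: sample t, interpolate, regress the target velocity."""
    x0 = torch.randn_like(x1)         # noise sample
    t = torch.rand(x1.size(0), 1)     # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1       # point on the straight-line path
    target_velocity = x1 - x0         # d/dt of that path, constant here
    return ((model(x_t, t) - target_velocity) ** 2).mean()

model = VelocityNet()
speech_features = torch.randn(16, 80)  # e.g. mel-spectrogram frames
loss = flow_matching_loss(model, speech_features)
loss.backward()
```

At inference time, samples are generated by integrating the learned velocity field from noise to data, typically in far fewer steps than a diffusion sampler needs.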
Generative AI raises ethical concerns, including the risk of misuse to create deepfakes. Meta has developed classifiers to distinguish Voicebox's output from human voices, highlighting the importance of transparency in AI development. However, although Meta wants to share its research results with the community, it has no plans to make Voicebox publicly available, given the risk of the technology being exploited for malicious purposes.
In a similar context, Microsoft launched VALL-E, a speech synthesis language model trained on 60,000 hours of English speech data. VALL-E leverages Meta's EnCodec technology, an AI-based audio compression method, to generate discrete audio codec codes from text and audio prompts. Although VALL-E can reproduce the emotions, tone and even the acoustic environment of an audio sample, its use raises ethical concerns similar to those around Voicebox.
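EnCodec is available as an open-source library, so the tokenization step that VALL-E builds on can be shown directly. The snippet below, adapted from the encodec package's documented usage, converts an audio prompt into discrete codec codes; "prompt.wav" is a placeholder path:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a target bitrate.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps

# Load an audio prompt and match the model's sample rate and channel count.
wav, sr = torchaudio.load("prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

# Encode the waveform into discrete codec codes.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)  # (batch, num_quantizers, time steps)
```

Discrete codes like these are what a model such as VALL-E predicts autoregressively from text plus a short acoustic prompt, before a decoder turns them back into audio.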
In addition, ReadSpeaker has developed a dynamic runtime text-to-speech plugin for the Unreal and Unity game engines, allowing developers to create and edit speech audio signals with near-zero latency. This innovation aims to improve the accessibility of video games by providing on-screen commentary and audio descriptions, while highlighting the need to provide players with higher quality experiences in digital environments and the metaverse.
Amazon researchers' progress with the BASE TTS model in the field of speech synthesis is undoubtedly a major breakthrough. With its 980 million parameters, the model surpasses existing models in emergent abilities and can speak complex sentences naturally. Training on 100,000 hours of public domain speech data expands the system's ability to handle complex tasks such as parsing complex sentences and correctly pronouncing foreign words. However, it is important to note that the model is still experimental and requires further research to determine the limits of these new capabilities.
Overall, these technological advances open up promising perspectives, but also raise ethical and safety concerns that require particular attention. The balance between innovation and responsibility is crucial in the development and implementation of these new technologies.
Source: Amazon (1, 2)
And you?
What is your opinion on this topic?
What do you think are the specific criteria that define “natural and expressive” in speech production, and how were these criteria measured when evaluating the BASE TTS model?
How does BASE TTS's streamable generation method differ from other text-to-speech models and how does this feature affect performance in real-time applications?
The researchers emphasized the importance of model size and amount of data. What are the potential trade-offs associated with using massive models in terms of resources, energy and costs and how were these factors taken into account in the evaluation?
See also:
ReadSpeaker introduces the first cross-platform dynamic runtime text-to-speech plugin for Unreal and Unity engines that gives voice to non-player characters
VALL-E: Microsoft's text-to-speech AI can imitate any voice, including a speaker's emotions and tone of voice, from just a three-second audio sample
Meta says its new text-to-speech AI model is far too dangerous to make public, and could be used to perfect deepfakes or in scams