Microsoft has developed VALL-E, an artificial intelligence (AI) capable of imitating a voice from a sample of just three seconds. Some demonstrations are very convincing. The company is aware of the danger that arises when such a tool falls into the hands of malicious individuals.
After the “deep fake” in pictures and video, will we see the arrival of the audio “deep fake”? This has been possible since Microsoft introduced a new artificial intelligence (AI) speech synthesis model called VALL-E. Its peculiarity? From a simple three-second audio sample, it can imitate and thus simulate a person’s voice. Once it has learned a specific voice, this AI can synthesize speech in that person’s voice while preserving their timbre and emotions.
Microsoft believes that VALL-E could be used for text-to-speech applications, but also, and obviously more worryingly, for editing speech in a recording: starting from a text transcript, it would be possible to modify what a speaker appears to say. Imagine a politician’s speech altered by this artificial intelligence…
Machine learning in action
For the company, VALL-E is what it calls a “neural codec language model”, and it is based on an audio compression technology called EnCodec, which was unveiled by Meta (Facebook) last October. Unlike other speech synthesis methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic samples. It essentially analyzes the sound of a person’s voice, breaks that information into tokens using EnCodec, and uses machine learning to match the three-second sample against what it has learned.
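The pipeline described above can be sketched in a few lines of toy Python. This is a hypothetical simplification, not Microsoft’s actual code: `quantize` stands in for EnCodec (turning a waveform into discrete tokens), and `ToyCodecLM` stands in for the language model that predicts new tokens from the text and the three-second speaker prompt. All names and the token-generation logic here are invented for illustration.

```python
from dataclasses import dataclass
import random

def quantize(samples, levels=256):
    """Stand-in for EnCodec: map each audio sample in [-1, 1] to a discrete token."""
    return [min(levels - 1, int((s + 1) / 2 * levels)) for s in samples]

def dequantize(tokens, levels=256):
    """Inverse mapping: tokens back to approximate waveform samples."""
    return [t / levels * 2 - 1 for t in tokens]

@dataclass
class ToyCodecLM:
    """Stands in for the autoregressive model: given text and prompt tokens,
    emit a sequence of codec tokens (here: pseudo-random, seeded by the
    inputs so the sketch is deterministic within a run)."""
    levels: int = 256

    def generate(self, text, prompt_tokens, n=16):
        rng = random.Random(sum(prompt_tokens) + len(text))
        return [rng.randrange(self.levels) for _ in range(n)]

# Usage: a three-sample "voice prompt" plus target text yields new codec
# tokens, which a decoder would then turn back into a waveform.
prompt = quantize([0.0, 0.5, -0.25])   # the 3-second speaker sample
lm = ToyCodecLM()
tokens = lm.generate("I have to do something about it.", prompt)
audio = dequantize(tokens)
```

The key design point the sketch mirrors is that generation happens in the discrete token domain, not on raw waveforms; synthesis quality then depends on the codec’s decoder rather than on the language model alone.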
Microsoft relied on the LibriLight sound library. It contains 60,000 hours of spoken English from more than 7,000 speakers, mostly sourced from LibriVox public domain audiobooks. For VALL-E to generate a meaningful result, the voice in the three-second sample must closely match a voice in the training data.
[Audio example: “I have to do something about it.”] © VALL-E
Microsoft is aware of the danger
To convince you, Microsoft provides dozens of audio samples of the AI model in action. Some are startlingly lifelike, while others are clearly synthetic and the human ear can tell that an artificial intelligence is speaking. Impressively, VALL-E not only preserves the tone and emotions of the person speaking, but can also reproduce the environment and conditions of the recording. Microsoft gives the example of a telephone conversation, with the acoustic and frequency characteristics specific to that type of call.
When asked about the dangers of such artificial intelligence, Microsoft confirms that the source code is not available, and the company acknowledges that “it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
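The mitigation Microsoft mentions, a detector that classifies clips as genuine or synthesized, might be sketched as follows. This toy version is purely illustrative and is not Microsoft’s detector: it thresholds a single hand-picked statistic instead of using a trained classifier, and every name and number here is a hypothetical assumption.

```python
def mean_abs_diff(samples):
    """Average sample-to-sample change: a crude 'smoothness' feature."""
    return sum(abs(a - b) for a, b in zip(samples, samples[1:])) / (len(samples) - 1)

def looks_synthetic(samples, threshold=0.05):
    """Toy heuristic: flag a clip as synthetic if it is unusually smooth.
    A real detector would be a trained classifier over learned features,
    not a fixed threshold on one statistic."""
    return mean_abs_diff(samples) < threshold

# Usage on two fabricated clips: one with large jumps, one overly smooth.
noisy = [(-1) ** i * 0.3 for i in range(100)]   # alternating, large jumps
smooth = [i / 1000 for i in range(100)]          # tiny, regular steps
print(looks_synthetic(noisy))   # not flagged
print(looks_synthetic(smooth))  # flagged
```

The point of the sketch is only structural: any such detector reduces to extracting features from the audio and making a binary decision, and its usefulness depends entirely on how well those features separate real recordings from VALL-E’s output.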