Microsoft teases deepfake AI that's too powerful to release
VASA-1 framework can turn a still image and a cloned voice file into a plausible video of a person talking
Microsoft this week demoed VASA-1, a framework for creating videos of people talking from a still image, audio sample, and text script, and claims – rightly – it's too dangerous to be released to the public.
These AI-generated videos, in which people can be convincingly animated to speak scripted words in a cloned voice, are just the sort of thing the US Federal Trade Commission warned about last month, after previously proposing a rule to prevent AI technology from being used for impersonation fraud.
Microsoft's team acknowledge as much in their announcement, which explains that the technology is not being released due to ethical considerations. They insist they're presenting research on generating virtual interactive characters, not on impersonating anyone. As such, there's no product or API planned.
"Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications," the Redmond boffins state. "It is not intended to create content that is used to mislead or deceive.
"However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection."
Kevin Surace, Chair of Token, a biometric authentication biz, and frequent speaker on generative AI, told The Register in an email that while there have been prior technology demonstrations of faces animated from a still frame and cloned voice file, Microsoft's demonstration reflects the state of the art.
"The implications for personalizing emails and other business mass communication is fabulous," he opined. "Even animating older pictures as well. To some extent this is just fun and to another it has solid business applications we will all use in the coming months and years."
The "fun" of deepfakes was 96 percent nonconsensual porn, when assessed [PDF] in 2019 by cybersecurity firm Deeptrace.
Nonetheless, Microsoft's researchers suggest that being able to create realistic looking people and put words in their mouths has positive uses.
"Such technology holds the promise of enriching digital communication, increasing accessibility for those with communicative impairments, transforming education, methods with interactive AI tutoring, and providing therapeutic support and social interaction in healthcare," they propose in a research paper that does not contain the words "porn" or "misinformation."
While it's arguable that AI-generated video is not quite the same as a deepfake – the latter being defined by digital manipulation of existing footage rather than outright generation – the distinction becomes immaterial when a convincing fake can be conjured without any cut-and-paste grafting.
Asked what he makes of the fact that Microsoft is not releasing this technology to the public for fear of misuse, Surace expressed doubt about the viability of restrictions.
"Microsoft and others have held back for now until they work out the privacy and usage issues," he said. "How will anyone regulate who uses this for the right reasons?"
Surace added that there are already open source models that are similarly sophisticated, pointing to EMO. "One can pull the source code from GitHub and build a service around it that arguably would rival Microsoft's output," he observed. "Because of the open source nature of the space, regulating it will be impossible in any case."
That said, countries around the world are trying to regulate AI-fabricated people. Canada, China, and the UK, among other nations, all have regulations that can be applied to deepfakes, some of which fulfill broader political goals. Britain just this week made it illegal to create a sexually explicit deepfake image without consent. The sharing of such images was already disallowed under the UK's Online Safety Act of 2023.
In January, a bipartisan group of US lawmakers introduced the Disrupt Explicit Forged Images and Non-Consensual Edits Act of 2024 (DEFIANCE Act), a bill that creates a way for victims of non-consensual deepfake images to file a civil claim in court.
And on Tuesday, April 16, the US Senate Committee on the Judiciary, Subcommittee on Privacy, Technology, and the Law held a hearing titled "Oversight of AI: Election Deepfakes."
In prepared remarks, Rijul Gupta, CEO of DeepMedia, a deepfake detection biz, said:
[T]he most alarming aspect of deepfakes is their ability to provide bad actors with plausible deniability, allowing them to dismiss genuine content as fake. This erosion of public trust strikes at the very core of our social fabric and the foundations of our democracy. The human brain, wired to believe what it sees and hears, is particularly vulnerable to the deception of deepfakes. As these technologies become increasingly sophisticated, they threaten to undermine the shared sense of reality that underpins our society, creating a climate of uncertainty and skepticism where citizens are left questioning the veracity of every piece of information they encounter.
But think of the marketing applications. ®