Videos Remember that artificially intelligent software that could transform lifeless still images, such as portrait paintings, into moving heads? Well, you can now take a single photo or picture of someone and animate it to make them say specific words and sentences, using AI algorithms.
This machine-learning code can take a person's mouth and lip-synch it to a given spoken-word audio track, effectively forcing the subject to speak the supplied recording and say things they never actually uttered. The ways in which this could be abused to trick audiences are endless.
This new development, like the research preceding it, feeds into the hand-wringing frenzy over deepfakes, a term that describes content, whether it be images, videos, or audio, that has been doctored and twisted by machine-learning algorithms.
The internet freaked out over portraits of Mona Lisa and photos of dead celebrities like Marilyn Monroe suddenly coming to life, reanimated by the cold clammy hands of neural networks and code. Their eyes blinked, and their mouths moved, but no sound came out.
Now, researchers at the Samsung AI Center, and Imperial College London in the United Kingdom, have gone one step further. They have created fake talking heads that really can speak. Listen to Einstein discussing the wonders of science below. Yes, it’s his face and his voice, but it’s still fake, and clearly fake, nevertheless.
The audio was sourced from a recording of a speech by the E=mc2 super-boffin, and his face is from a photograph. Here’s one that’s more obviously bogus: it’s a photograph of Grigori Rasputin singing popstar Beyoncé’s smash hit Halo...
The images are pretty grainy, obviously manipulated in some way, and they’re amusing enough to not really be taken seriously. However, here’s another clip that shows why this type of technology is potentially dangerous:
Normal people like you or me can therefore be visually manipulated, and the doctoring is not always obvious. In the video above, people's faces are animated by the AI software to repeat neutral sentences such as “it’s eleven o’clock” or “I’m on my way to the meeting” with a range of facial expressions, from happy and sad to scared.
Right now, these videos, produced as a result of early academic research, are impressive from a technical standpoint, though ultimately not always entirely convincing.
However, imagine a future in which these fake computer-crafted videos are good enough to fool enough of the population to spread fake news, or doctor evidence to frame people for crimes they haven’t committed – all automatically at the press of a few buttons.
Generators and discriminators
As we've said, the output of the technology described in the team's arXiv paper, emitted this month, isn’t entirely convincing yet. The resulting video footage is low quality, and lacks small facial movements and features such as the small wrinkles that pool around the nose and lips when real people natter away. The eyes are also lacklustre.
However, considering that the model can create a talking head from just a single input image and audio file, it’s not too bad at this stage. The researchers built the software on top of a generative adversarial network (GAN) that featured one generator and three discriminator networks. This approach pitted the generator against the trio of discriminators: the generator has to produce streams of material, from input pictures and audio, that is convincing enough to get past the discriminators.
The discriminators therefore had to be taught to differentiate between real and fake videos “based on the synchrony or the presence of natural facial expressions,” according to the paper. A total of 164,109 samples taken from four datasets of people speaking were used to train the model, and 17,753 clips were used for testing.
A diagram of the different components in the model ... Image credit: Vougioukas et al.
During training, the generator took a still input picture and an audio clip, and from these two sources outputted a series of frames derived from that input snap, with each frame corresponding to a 0.2-second snippet from the input audio. In each frame, the mouth and face were slightly altered to match the associated brief audio sample.
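To make the frame-by-frame pairing concrete, here is a minimal sketch of how a clip's audio could be carved into one 0.2-second window per video frame. The 25 frames-per-second rate, the 16kHz sample rate, the zero-padding at the edges, and the function name `audio_windows` are all assumptions made for illustration, and are not taken from the paper.

```python
def audio_windows(samples, sample_rate, fps=25, window_s=0.2):
    """Return one window_s-second slice of audio per video frame.

    fps and the centring/padding scheme are illustrative assumptions.
    """
    win = int(round(window_s * sample_rate))  # samples in each window
    hop = sample_rate / fps                   # samples between frame centres
    n_frames = int(len(samples) / hop)
    half = win // 2
    # Zero-pad both ends so the first and last frames get full-length windows.
    padded = [0.0] * half + list(samples) + [0.0] * (win - half)
    windows = []
    for i in range(n_frames):
        start = int(round(i * hop))  # window is centred on frame i's timestamp
        windows.append(padded[start:start + win])
    return windows


# Three seconds of silent 16kHz audio, standing in for a real recording.
clip = [0.0] * (3 * 16000)
wins = audio_windows(clip, 16000)
print(len(wins), len(wins[0]))  # 75 windows of 3,200 samples each
```

Under these assumptions, neighbouring windows overlap heavily, because the frames sit 0.04 seconds apart while each window spans 0.2 seconds.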
Those frames were then passed into two of the discriminators, which checked the audio and lip movements were aligned; if not, the stream was rejected as fake or unrealistic, with feedback passed to the generator so that it could improve. The third, a sequence discriminator, looked at the video as a whole to see if the transitions between each frame were smooth so that the generated clip looked realistic; if not, it was rejected, and the generator informed.
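One way to picture that feedback is as a single penalty the generator tries to shrink, built from the verdicts of all three discriminators. The sketch below uses the textbook non-saturating GAN loss with equal weights; the discriminator labels `first` and `second`, the weights, and the function `generator_loss` are hypothetical, and the paper's actual loss may well differ.

```python
import math


def generator_loss(scores, weights):
    """Weighted non-saturating GAN loss for one generator facing several discriminators.

    scores maps each discriminator's name to the probability (between 0 and 1)
    it assigned to the fake clip being real. The names and weights are illustrative.
    """
    eps = 1e-12  # guards against log(0)
    return sum(weights[name] * -math.log(max(p, eps)) for name, p in scores.items())


weights = {"first": 1.0, "second": 1.0, "sequence": 1.0}

# A clip that fools all three discriminators...
fooled = generator_loss({"first": 0.9, "second": 0.9, "sequence": 0.9}, weights)
# ...versus one whose jerky frame transitions are spotted by the sequence discriminator.
caught = generator_loss({"first": 0.9, "second": 0.9, "sequence": 0.1}, weights)

print(fooled < caught)  # True: being caught by any one discriminator costs the generator
```

The point of summing the terms is that the generator cannot win by satisfying only one judge: a clip with perfect lip-synch but jittery motion is still penalised.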
Once training is complete, the GAN should be good enough to take any input image and audio and synch them up into a deepfake talking head video.
The still images and audio fed into the generator were encoded by two separate convolutional neural networks. To top it all off, there was also a noise generator in the mix to generate filler frames containing eye blinking and other facial motions.
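As a rough illustration of how those pieces might come together, the sketch below glues an image encoding, an audio encoding, and a fresh noise vector into one input for each generated frame. The vector sizes, the use of simple concatenation, and the name `frame_latent` are assumptions for illustration only; the article's source does not spell out these details.

```python
import random


def frame_latent(identity_code, audio_code, noise_dim=10, rng=random):
    """Concatenate image encoding, audio encoding, and Gaussian noise for one frame.

    All dimensions here are made up; only the three ingredients come from the text.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(identity_code) + list(audio_code) + noise


rng = random.Random(0)      # seeded so the sketch is repeatable
identity = [0.5] * 50       # stand-in for the still image's encoding
audio = [0.1] * 256         # stand-in for one audio window's encoding
latent = frame_latent(identity, audio, noise_dim=10, rng=rng)
print(len(latent))          # 316 values: 50 + 256 + 10
```

In this arrangement the image part stays fixed across a clip, the audio part changes with each window, and the noise part supplies variation, such as blinks, that the audio alone does not determine.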
“Our model is implemented in PyTorch and takes approximately a week to train using a single Nvidia GeForce GTX 1080 Ti GPU,” the researchers wrote in their paper. Fake clips of talking heads can be created in real time: a video containing about 75 frames can be generated in just half a second using a GTX 1080 Ti GPU, though it takes longer if a CPU is used.
When the researchers asked 66 people to watch 24 videos – 12 real and 12 deepfakes – people could only label them as real or fake correctly about 52 per cent of the time. “This model has shown promising results in generating lifelike videos, which produce facial expressions that reflect the speakers tone. The inability of users to distinguish the synthesized videos from the real ones in the Turing test verifies that the videos produced look natural,” the researchers concluded.
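For a sense of how close 52 per cent is to coin-flipping, here is a back-of-the-envelope check using a normal approximation to the binomial. It assumes every one of the 66 participants judged all 24 clips and treats those 1,584 judgments as independent, neither of which the source confirms, so take it as a rough guide and not as the researchers' own analysis.

```python
import math


def chance_test(n_correct, n_total, p0=0.5):
    """Two-sided z-test of an observed accuracy against chance level p0."""
    p_hat = n_correct / n_total
    se = math.sqrt(p0 * (1 - p0) / n_total)     # standard error under the null
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


n_total = 66 * 24                   # assumed: each viewer rated every clip
n_correct = round(0.52 * n_total)   # "about 52 per cent" correct
z, p = chance_test(n_correct, n_total)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under those assumptions, the z-score lands around 1.6, short of the conventional 1.96 cut-off, which is consistent with viewers doing little better than guessing.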
They hope to make their results more convincing in future with more realistic movements. At the moment, for instance, the fake talking heads can’t really move their heads much. ®