Video AI code can breathe life into portrait paintings, photos of dead celebrities, and your Facebook selfies, transforming single still images into moving and talking heads.
In one demonstration of the software's creepy abilities, the Mona Lisa, famous for its ambiguous expression, is animated just like one of the moving paintings in the Harry Potter series. She turns her head, mouths words, and even blinks. Here, see it for yourself in the video below (skip to 5:07 to see the Mona Lisa), published this week:
The technology – developed by researchers at the Samsung AI Center and the Skolkovo Institute of Science and Technology in Moscow – relies on convolutional neural networks. The goal is to get an input source image to mimic the motion of someone in a target video, so that the still picture is converted into a short clip of a talking head.
There have been lots of similar projects, so the idea isn’t particularly novel. What’s intriguing about this paper, hosted on arXiv, is that the system doesn’t require tons of training examples of a new face, and seems to work after seeing an image just once. That’s why it works with paintings like the Mona Lisa.
First, an embedder network maps facial features in an input image – the size and position of the eyes, nose, and mouth – into embedding vectors. Second, a generator network plots the facial landmarks of someone in a target video and, conditioned on those embedding vectors, synthesizes frames in which the input face copies the target’s expressions and poses. Third, a discriminator network judges the output, checking that each generated frame looks real and that the input face convincingly mimics the motion in the video.
The discriminator’s judgment is expressed as a “realism score”, which measures how closely the generated frames match the poses in the target video. Before the system is good enough to work from very few input samples, like the single Mona Lisa image, it requires extensive pre-training.
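The three-network pipeline can be caricatured in a few lines of toy Python. Everything below is invented for illustration – the dimensions, the fixed random matrices standing in for trained networks, the scoring rule – whereas the real system uses deep convolutional networks trained end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, LMK = 16, 10   # toy embedding and landmark sizes (invented)

# Fixed random matrices stand in for the three trained networks.
W_embed = rng.standard_normal((EMB, LMK))
W_gen = rng.standard_normal((LMK, EMB + LMK))
W_disc = rng.standard_normal(EMB + LMK)

def embedder(source_landmarks):
    """Map a source face (its landmark measurements) to an identity embedding."""
    return W_embed @ source_landmarks

def generator(embedding, target_landmarks):
    """Synthesise a 'frame' that keeps the source identity but adopts the
    target pose (here just a vector; in the paper, an image)."""
    return W_gen @ np.concatenate([embedding, target_landmarks])

def discriminator(frame, embedding):
    """Return a scalar realism score for the frame, conditioned on identity."""
    return float(W_disc @ np.concatenate([embedding, frame]))

source = rng.standard_normal(LMK)        # landmarks from one still image
target_pose = rng.standard_normal(LMK)   # landmarks from a video frame

identity = embedder(source)
fake_frame = generator(identity, target_pose)
score = discriminator(fake_frame, identity)
print(f"generated frame shape: {fake_frame.shape}, realism score: {score:.2f}")
```

The point of the sketch is the data flow: identity comes from the source image once, pose comes from every frame of the target video, and the discriminator's score is what the training signal is built from.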
The researchers pre-trained the model using the VoxCeleb2 dataset, a large database of celebrity talking-head videos. During this process, the same pipeline described above is carried out, except that the source and target images are simply different frames of the same video.
So, instead of getting a painting to puppet someone else from another video, the system has a ground truth it can compare itself against: it is trained until the generated frames closely match the real frames of the training video.
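That self-supervised setup can be sketched with a toy linear model. Everything here is a stand-in – each “frame” is just a short vector made of a fixed identity plus a per-frame pose, and one linear map plays the embedder – but it shows why having ground-truth frames from the same video makes training possible (the paper itself uses deep networks with perceptual and adversarial losses, not least squares):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6

# Toy 'video' of one person: every frame = fixed identity + per-frame pose.
identity = rng.standard_normal(D)
poses = rng.standard_normal((8, D))
frames = identity + poses          # 8 "frames" of the same face

# Trainable embedder: should learn to recover the identity from a frame
# and its pose (the ideal answer is simply frame - pose).
W = rng.standard_normal((D, 2 * D)) * 0.1
lr = 0.02

for step in range(500):
    i, j = rng.integers(0, 8, size=2)
    src_frame, src_pose = frames[i], poses[i]
    tgt_frame, tgt_pose = frames[j], poses[j]   # tgt_frame is ground truth

    x = np.concatenate([src_frame, src_pose])
    est_identity = W @ x
    generated = est_identity + tgt_pose   # toy 'generator': identity in new pose
    err = generated - tgt_frame           # compare against the real frame
    W -= lr * np.outer(err, x)            # gradient step on squared error

test_gen = W @ np.concatenate([frames[0], poses[0]]) + poses[1]
mse = float(np.mean((test_gen - frames[1]) ** 2))
print(f"reconstruction MSE after training: {mse:.4f}")
```

Because source and target come from the same video, the error signal is exact – no human labelling needed – which is what lets the pre-training run over a dataset as large as VoxCeleb2.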
The pre-training phase allows the model to work on inputs where there are very few examples. The results aren’t too bad when only one picture is available, but they get more realistic as more images are added.
Three different examples that use one, eight, or 32 training images. The system takes a source image (first column) and tries to map it onto the pose in the ground-truth frame (second column). The researchers compare their results (fifth column) against other models, including X2Face (third column) and Pix2PixHD (fourth column). Image credit: Zakharov et al.
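Why do more images help? One way to see it: if each image yields a noisy estimate of the same identity, averaging the per-image embeddings cancels the noise. The snippet below is a purely hypothetical illustration of that effect – the embedder, noise model, and sizes are all invented:

```python
import numpy as np

rng = np.random.default_rng(2)
true_identity = rng.standard_normal(16)

def toy_embed(image_noise):
    # Hypothetical embedder: recovers the identity plus per-image noise.
    return true_identity + image_noise

errors = {}
for k in (1, 8, 32):
    embeddings = [toy_embed(rng.standard_normal(16)) for _ in range(k)]
    avg = np.mean(embeddings, axis=0)            # one identity vector from k images
    errors[k] = float(np.linalg.norm(avg - true_identity))
    print(f"K={k:2d} images -> embedding error {errors[k]:.2f}")
```

With one image the estimate carries the full noise of that single shot; with 32, the averaged embedding sits much closer to the true identity, matching the visible jump in quality between the one-shot and 32-shot columns above.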
At the moment the results are harmless; impressive as they are, they're also pretty creepy – people would scream if they saw an oil painting suddenly come to life in a museum. What’s more worrying is the prospect of the system being applied to images of your own face.
It’s easy to imagine miscreants downloading your profile picture posted on Facebook, Twitter or Instagram to turn you into a virtual puppet to make you do and say things you haven’t done or said.
It wasn’t so long ago that internet perverts were sharing tips on the Reddit forum r/deepfakes on how to paste pictures of their favorite celebrities or ex-girlfriends or wives onto the bodies of pornstars.
The research team declined to comment on their work this week. ®