
Google's new view of the world takes two pics to make 'DeepStereo' 3D

Machine learning imagines the missing pixels in glorious Googl-o-rama

StreetView means Google owns one of the world's larger photo albums, so it's natural for the company to want to create a realistic 3D rendering of the world. That's the aim of a new bit of boffinry from the Chocolate Factory called DeepStereo.

As the group led by Googler John Flynn explain in this arXiv paper, they found that existing interpolation techniques for turning pairs of flat photos into 3D produced “unrealistic and jarring artefacts”. To get around that, the researchers applied Google's Deep Learning algorithms to “fill in the blanks” between pairs of photos.

As they say in their abstract, “pixels from neighboring views of a scene are presented to the network which then directly produces the pixels of the unseen view.”

The system analyses colour and depth information from the original “posed” images – depth is important so that the system doesn't turn a lamp-post into part of the building behind it.

The depth information is inferred from movement of objects from one frame to the next – a nearer object will move further in (for example) successive StreetView images than the object behind it.
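The parallax effect is easy to put numbers on. What follows is a minimal sketch using the textbook pinhole stereo relation (shift = focal length × camera movement ÷ depth), not anything lifted from the DeepStereo paper, whose network learns depth cues from data rather than applying a formula. The function names, the 1,000-pixel focal length and the one-metre camera movement are all illustrative assumptions.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Sideways image shift, in pixels, of a point at depth_m when the
    camera moves baseline_m sideways (idealised pinhole camera)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity(focal_px: float, baseline_m: float, shift_px: float) -> float:
    """Invert the relation above: recover depth from an observed shift."""
    if shift_px <= 0:
        raise ValueError("shift must be positive")
    return focal_px * baseline_m / shift_px


if __name__ == "__main__":
    # Assumed values: 1,000 px focal length, camera moves 1 m between frames.
    lamp_post = disparity_px(1000.0, 1.0, 5.0)    # lamp-post 5 m away
    building = disparity_px(1000.0, 1.0, 20.0)    # building 20 m away
    print(lamp_post, building)                    # 200.0 50.0
```

Under those assumptions the lamp-post shifts four times as far across the frame as the building behind it, and that difference is the signal that lets the two be told apart.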

To get things started, they worked with specially-posed pairs of photos, and then set their learning algorithms loose on much larger StreetView photo collections.

“To train our network, we used images of street scenes captured by a moving vehicle. The images were posed using a combination of odometry and traditional structure-from-motion techniques”, they write.

“The vehicle captures a set of images, known as a rosette, from different directions for each exposure. The capturing camera uses a rolling shutter sensor, which is taken into account by our camera model. We used approximately 100K of such image sets during training.”

The image processing takes a stack of images in StreetView and projects it onto a virtual “camera” in the software. Processing is divided into two “towers” – the selection tower, which estimates depth for each pixel in the image; and the colour tower, which predicts colour for interpolated pixels.

In both of these, the paper states, the training data provides the baseline to predict what the output should look like.
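The division of labour between the two towers can be sketched for a single output pixel. This is a simplified illustration, not the paper's network: it assumes the selection tower emits one score per candidate depth plane, that those scores are turned into probabilities with a softmax, and that the colour tower offers one RGB guess per plane. The names `softmax` and `blend_pixel` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-plane scores into probabilities that sum to 1."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_pixel(selection_scores, plane_colours):
    """Weight each depth plane's colour guess by how likely that plane is.

    selection_scores: one score per candidate depth plane (selection tower stand-in)
    plane_colours:    one (r, g, b) tuple per plane (colour tower stand-in)
    """
    if len(selection_scores) != len(plane_colours):
        raise ValueError("need one colour per depth plane")
    weights = softmax(selection_scores)
    return tuple(
        sum(w * colour[channel] for w, colour in zip(weights, plane_colours))
        for channel in range(3)
    )


if __name__ == "__main__":
    # Two candidate planes: a red lamp-post up close, a blue wall behind it.
    colours = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    print(blend_pixel([50.0, 0.0], colours))   # near plane dominates: almost pure red
    print(blend_pixel([0.0, 0.0], colours))    # undecided: an even mix of both
```

When the depth estimate is confident, the pixel takes its colour from one plane. When it is not, the colours smear together, which is one plausible way to picture the loss of resolution the researchers report.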

As Technology Review writes, it's not just a toy for turning StreetView into realistic 3D panoramas. The work would also be useful in teleconferencing, cinematography, virtual reality, and stop-frame animation.

The group wants to improve its system further, noting that there are still artefacts like loss of resolution and the “disappearance of thin foreground structures”.

They also want to make DeepStereo more efficient, since a single new image currently needs about 12 minutes' processing on a multicore workstation. ®

