Video A trio of researchers have trained a robot that can pick up objects it hasn’t seen before.
It’s a trivial task for humans yet an incredibly complex one for machines. When people reach out to grab a mug, it’s common sense to hold it by its handle - it doesn’t matter if the mug is upright, upside down, or tipped on its side. But for robots it’s much more difficult: they can get confused by the different orientations, or distracted by things like the background or lighting conditions.
Researchers at the Massachusetts Institute of Technology (MIT) have built a system where you can direct the robot to grasp the object at a specific point. In the experiments, the researchers play with three objects: shoes, hats, and mugs. They train the robot to grab the shoe by its tongue, the hat by its brim, and the mug by its handle.
“Many approaches to manipulation can’t identify specific parts of an object across the many orientations that object may encounter,” said Lucas Manuelli, co-author of the research paper out on arXiv and a PhD student at MIT.
Dense Object Nets
Once trained, the robot can pick up shoes by their tongues even if it hasn’t seen that exact shoe before. At the robot’s heart is a computer vision system made up of convolutional neural networks known as Dense Object Nets (DON).
First, a camera attached to the robot arm swivels around and hovers over the shoe to scan it in different orientations. This creates a video from which image stills can be analyzed. The goal is to create what the researchers call a “dense visual descriptor,” basically a fancy name for a mapping from each pixel in an image to a vector that describes it.
Next, the individual pixels from the video stills are converted to vectors that describe the object’s properties, like overall shape, orientation or color. These vectors go on to create “descriptor images”. They might appear fuzzy, but they hold information about all the different pixels that actually make up the object.
Now, the researchers can choose the pixels that correspond to the tongues of shoes in the camera’s images. The careful mapping between these camera images and the descriptor images allows the robot to move its pincers over to the shoes’ tongues or mug handles to pick them up.
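To make the matching step concrete, here is a minimal sketch of how a chosen point, say a shoe's tongue, might be located in a new descriptor image by finding the pixel whose vector is closest to a reference vector. It is an illustration only, not the researchers' code: the function name `find_best_match`, the three-number descriptors, the toy values, and the use of plain Euclidean distance are all assumptions made for the example.

```python
import math


def find_best_match(reference_descriptor, descriptor_image):
    """Return ((row, col), distance) for the pixel closest to the reference.

    descriptor_image is a grid (list of rows) in which every pixel holds a
    descriptor vector of the same length as reference_descriptor.
    """
    best_pixel, best_distance = None, math.inf
    for row_index, row in enumerate(descriptor_image):
        for col_index, descriptor in enumerate(row):
            # Euclidean distance between the two descriptor vectors.
            distance = math.dist(reference_descriptor, descriptor)
            if distance < best_distance:
                best_pixel, best_distance = (row_index, col_index), distance
    return best_pixel, best_distance


# Toy 2x3 descriptor image for a new camera view (made-up values).
new_view = [
    [(0.1, 0.9, 0.2), (0.8, 0.1, 0.1), (0.5, 0.5, 0.5)],
    [(0.0, 0.0, 1.0), (0.9, 0.2, 0.7), (0.3, 0.3, 0.3)],
]

# Hypothetical descriptor of the point picked in a reference image,
# for example the tongue of a shoe.
tongue_descriptor = (0.88, 0.22, 0.68)

pixel, distance = find_best_match(tongue_descriptor, new_view)
print(pixel)  # (1, 1)
```

A real system would search images with hundreds of thousands of pixels and would then have to turn the matched pixel into a position for the gripper, which this sketch leaves out.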
It takes about 20 minutes for the robot to train on a new object, scanning it at different angles to create descriptor images. “We observe that the descriptors are consistent despite considerable differences in color, texture, deformation, and even to some extent underlying shape. The training requirements are reasonably modest – only six instances of hats were used for training yet the descriptors generalize well to unseen hats, including a blue hat, a color never observed during training,” according to the paper.
The robot can also be trained to analyze multiple objects in the same scene. It could pick out a specific hat among a range of hats, despite never having seen those hats during training.
“In factories robots often need complex part feeders to work reliably,” said Peter Florence, lead author of the paper and a PhD student at MIT. “But a system like this that can understand objects’ orientations could just take a picture and be able to grasp and adjust the object accordingly.”
Picking up objects is just the first step of trying to get robots to actually do useful things. The next goal is to try and train the robot to pick up an object in order to perform a simple task, such as using a cloth to clean a desk. ®