It's a mug's game: Watch AI robot grab a cuppa it hasn't seen before

Machine-learning software taught how to pick up things it hasn't seen before

Video A trio of researchers have trained a robot to pick up objects it hasn’t seen before.

It’s a trivial task for humans yet an incredibly complex one for machines. When people reach out to grab a mug, it’s common sense to hold it by its handle, whether the mug is upright, upside down, or tipped on its side. Robots find it much more difficult: they can be confused by the different orientations, or distracted by things like the background or lighting conditions.

Researchers at Massachusetts Institute of Technology (MIT) have built a system where you can direct the robot to grasp the object at a specific point. In the experiments, the researchers play with three objects: shoes, hats, and mugs. They train the robot to grab the shoe by its tongue, the hat by its brim, and the mug by its handle.

“Many approaches to manipulation can’t identify specific parts of an object across the many orientations that object may encounter,” said Lucas Manuelli, co-author of the research paper out on arXiv and a PhD student at MIT.

Dense Object Nets

Once trained, the robot can pick up any shoe by its tongue, even if it hasn’t seen that exact shoe before. At the robot’s heart is a computer vision system made up of convolutional neural networks known as Dense Object Nets (DON).

First, a camera attached to the robot arm swivels around and hovers over the shoe to scan it in different orientations. This creates a video from which image stills can be analyzed. The goal is to create what the researchers call a “dense visual descriptor,” basically a fancy name for a mapping that assigns every pixel of the object its own descriptive vector.

Next, the individual pixels in the video stills are converted to vectors that describe the object’s properties, like overall shape, orientation or color. These vectors go on to create “descriptor images”. They might appear fuzzy, but they hold information about all the different pixels that actually make up the object.

Now, the researchers can choose the pixels that correspond to the tongues of shoes in the camera’s images. The careful mapping between these camera images and the descriptor images allows the robot to move its pincers over to the shoes’ tongues or mugs’ handles to pick them up.
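For the curious, the lookup described above can be sketched in a few lines of Python. This is a toy illustration, not the researchers' code: it assumes the descriptor image is a NumPy array of shape (height, width, D), and the function name `find_best_match`, the descriptor size D = 8, and the random stand-in data are all hypothetical. The idea is simply that the pixel picked by a human in one image has a descriptor, and the matching point in a new image is the pixel whose descriptor lies closest to it.

```python
import numpy as np


def find_best_match(reference_descriptor, descriptor_image):
    """Return (row, col, distance) of the pixel whose descriptor is
    nearest to the reference descriptor in Euclidean distance.

    reference_descriptor: array of shape (D,)
    descriptor_image:     array of shape (H, W, D)
    """
    # Per-pixel distance between each descriptor and the reference.
    distances = np.linalg.norm(descriptor_image - reference_descriptor, axis=-1)
    row, col = np.unravel_index(np.argmin(distances), distances.shape)
    return int(row), int(col), float(distances[row, col])


# Stand-in data: random descriptors play the role of a network's output.
rng = np.random.default_rng(0)
height, width, dim = 48, 64, 8  # sizes are arbitrary assumptions
reference_image = rng.normal(size=(height, width, dim))

# Suppose a person clicked on the shoe's tongue at pixel (10, 20).
tongue_descriptor = reference_image[10, 20]

# Fake a "new view": the same descriptors shifted 5 rows down and
# 7 columns left, with a little noise added.
new_view = np.roll(reference_image, shift=(5, -7), axis=(0, 1))
new_view = new_view + rng.normal(scale=0.01, size=new_view.shape)

row, col, distance = find_best_match(tongue_descriptor, new_view)
print(f"Best match at pixel ({row}, {col}), descriptor distance {distance:.3f}")
```

In this toy setup the match should land at pixel (15, 13), where the shifted tongue pixel ended up. A real system would still need to turn that pixel into a 3D grasp point for the arm, which this sketch leaves out.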


It takes about 20 minutes for the robot to train on a new object, scanning it at different angles to create descriptor images. “We observe that the descriptors are consistent despite considerable differences in color, texture, deformation, and even to some extent underlying shape. The training requirements are reasonably modest – only six instances of hats were used for training yet the descriptors generalize well to unseen hats, including a blue hat, a color never observed during training,” according to the paper.

The robot can also be trained to analyze multiple objects in the same scene. It could pick out a specific hat among a range of hats, despite never having seen those hats during training.

“In factories robots often need complex part feeders to work reliably,” said Peter Florence, lead author of the paper and a PhD student at MIT. “But a system like this that can understand objects’ orientations could just take a picture and be able to grasp and adjust the object accordingly.”

Picking up objects is just the first step in getting robots to actually do useful things. The next goal is to train the robot to pick up an object in order to perform a simple task, such as using a cloth to clean a desk. ®

Biting the hand that feeds IT © 1998–2022