Watch an AI robot program itself to, er, pick things up and push them around

Why can't robots just learn to do things without being told?

Vid Robots normally need to be programmed in order to get them to perform a particular task, but they can be coaxed into writing the instructions themselves with the help of machine learning, according to research published in Science.

Engineers at Vicarious AI, a robotics startup based in California, USA, have built what they call a “visual cognitive computer” (VCC), a software platform connected to a camera system and a robot gripper. Given a set of visual clues, the VCC writes a short program of instructions to be followed by the robot so it knows how to move its gripper to do simple tasks.

“Humans are good at inferring the concepts conveyed in a pair of images and then applying them in a completely different setting," the paper states.

"The human-inferred concepts are at a sufficiently high level to be effortlessly applied in situations that look very different, a capacity so natural that it is used by IKEA and LEGO to make language-independent assembly instructions."

Don’t get your hopes up, however, these robots can’t put your flat-pack table or chair together for you quite yet. But it can do very basic jobs, like moving a block backwards and forwards.

It works like this. First, an input and output image are given to the system. The input image is a jumble of colored objects of various shapes and sizes, and the output image is an ordered arrangement of the objects. For example, the input image could be a number of red blocks and the output image is all the red blocks ordered to form a circle. Think of it a bit like a before and after image.

The VCC works out what commands need to be performed by the robot in order to organise the range of objects before it, based on the ‘before’ to the ‘after’ image. The system is trained to learn what action corresponds to what command using supervised learning.

Dileep George, cofounder of Vicarious, explained to The Register, “up to ten pairs [of images are used] for training, and ten pairs for testing. Most concepts are learned with only about five examples.”

Here’s a diagram of how it works:


A: A graph describing the robot's components. B: The list of commands the VCC can use. Image credit: Vicarious AI

The left hand side is a schematic of all the different parts that control the robot. The visual hierarchy looks at the objects in front of the camera and categorizes them by object shape and colour. The attention controller decides what objects to focus on, whilst the fixation controller directs the robot’s gaze to the objects before the hand controller operates the robot’s arms to move the objects about.

The robot doesn’t need too many training examples to work because there are only 24 commands, listed on the right hand of the diagram, for the VCC controller.

And here’s the robot in action, embedded below. The physical objects laid out in front of the robot do not need to be explicitly the same as the abstract ones represented in the input and output images.

MP4 video

Zero-shot learning

Researchers constructed 546 different tasks or concepts, ranging from four lines of instructions to 23, including different numbers and sizes of objects and the color and texture of the background to see whether the robots could adapt to more realistic settings. Out of the 546 concepts, six were tested on two different robots.

One was the Baxter model from Rethink Robotics, and the other was the UR5 robo arm from Universal Robots.

“We thought of each of those concepts individually,” said George. "Then we wrote little python programs to generate multiple images corresponding to each concept."


Ding dong merrily on high. In Berkeley, the bots are singeing: Self-driving college cooler droid goes up in flames


UR5 was better at carrying out the tasks and could execute more than 90 per cent of the six concepts tested, compared to Baxter’s 70 per cent, the paper claims. Hardware mishaps, such as objects slipping out of the gripper, were common failures. The Baxter robot used by Vicarious was older, and over time the camera had blurred and its motions were less precise.

“Getting robots to perform tasks without explicit programming is one of the goals in artificial intelligence and robotics,” the researchers wrote.

Other types of methods like imitation learning have been introduced, where agents learn by copying demonstrations during the training process. These bots, however, are less likely to be able to adapt to situations beyond the scope of the demos.

Here, there are no explicit demonstrations, something the researchers describe as "zero-shot learning". The agent has to learn and encode abstract concepts and then transfer the learned model to real world scenarios.

It sounds pretty fancy, but it’s still early days and the tasks here are very rudimentary. It’ll be a while yet before any AI robots like these ones succeed in factories. Vicarious said it will be testing "some variations of the idea, in limited settings".

Broader topics

Other stories you might like

  • Is computer vision the cure for school shootings? Likely not
    Gun-detecting AI outfits want to help while root causes need tackling

    Comment More than 250 mass shootings have occurred in the US so far this year, and AI advocates think they have the solution. Not gun control, but better tech, unsurprisingly.

    Machine-learning biz Kogniz announced on Tuesday it was adding a ready-to-deploy gun detection model to its computer-vision platform. The system, we're told, can detect guns seen by security cameras and send notifications to those at risk, notifying police, locking down buildings, and performing other security tasks. 

    In addition to spotting firearms, Kogniz uses its other computer-vision modules to notice unusual behavior, such as children sprinting down hallways or someone climbing in through a window, which could indicate an active shooter.

    Continue reading
  • Cerebras sets record for 'largest AI model' on a single chip
    Plus: Yandex releases 100-billion-parameter language model for free, and more

    In brief US hardware startup Cerebras claims to have trained the largest AI model on a single device powered by the world's largest Wafer Scale Engine 2 chip the size of a plate.

    "Using the Cerebras Software Platform (CSoft), our customers can easily train state-of-the-art GPT language models (such as GPT-3 and GPT-J) with up to 20 billion parameters on a single CS-2 system," the company claimed this week. "Running on a single CS-2, these models take minutes to set up and users can quickly move between models with just a few keystrokes."

    The CS-2 packs a whopping 850,000 cores, and has 40GB of on-chip memory capable of reaching 20 PB/sec memory bandwidth. The specs on other types of AI accelerators and GPUs pale in comparison, meaning machine learning engineers have to train huge AI models with billions of parameters across more servers.

    Continue reading
  • Microsoft promises to tighten access to AI it now deems too risky for some devs
    Deep-fake voices, face recognition, emotion, age and gender prediction ... A toolbox of theoretical tech tyranny

    Microsoft has pledged to clamp down on access to AI tools designed to predict emotions, gender, and age from images, and will restrict the usage of its facial recognition and generative audio models in Azure.

    The Windows giant made the promise on Tuesday while also sharing its so-called Responsible AI Standard, a document [PDF] in which the US corporation vowed to minimize any harm inflicted by its machine-learning software. This pledge included assurances that the biz will assess the impact of its technologies, document models' data and capabilities, and enforce stricter use guidelines.

    This is needed because – and let's just check the notes here – there are apparently not enough laws yet regulating machine-learning technology use. Thus, in the absence of this legislation, Microsoft will just have to force itself to do the right thing.

    Continue reading

Biting the hand that feeds IT © 1998–2022