University, Nvidia team teaches robots to get a grip with OpenAI's CLIP

Gone are the days when you have to show a machine a gazillion images in different poses and lighting conditions


Robots powered by neural networks are frustratingly brittle. They need to see numerous demonstrations of a specific task in simulation before they can begin to execute the same actions in the physical world. A new technique, however, promises to speed up the process.

The researchers behind it – from the University of Washington in the US and Nvidia – set out to cut the long periods of time spent collecting data to teach neural-network-powered bots to recognise and manipulate objects in their environment.

A task as easy as stacking a red block on top of a blue one is complex for machines. They have to be fed lots of images of both red and blue blocks in various poses to learn their shapes and colours, and then multiple videos showing which order to stack them in. A robot also has to detect and locate these blocks before it can begin to move them around.

Ask it to do the same thing with, say, mugs, however, and its performance will probably tank. It has to be retrained all over again to recognize the new objects even though it just learned how to stack things. It's a painstaking process having to spoon-feed machines thousands of demonstrations using various combinations of objects in different environments to get them to be more robust.

The novel method described by researchers at Washington and Nvidia, however, promises to make the machines smarter. Using a system known as "CLIPort", the team was able to teach a robot gripper how to manipulate objects without having to explicitly train it to recognise the objects first.

The model is made up of two parts: CLIP, a neural network developed by OpenAI trained on images and text scraped from the internet, and a transporter network to classify pixels and to detect spatial relationships between objects. Because CLIP is already pre-trained to identify objects and describe them in text, the researchers can give instructions to the robot in text and it will automatically identify what they are referring to.
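The split between "what" and "where" can be illustrated with a toy sketch. It is not CLIPort's code and does not call CLIP: the two-number feature vectors, the `graspability` map, and the `pick_location` helper are hypothetical stand-ins, chosen only to show how a text embedding can re-weight a per-pixel spatial score so that the same scene yields different pick points for different instructions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_location(pixel_features, graspability, text_embedding):
    """Return the (row, col) whose semantic match times spatial score is highest."""
    best, best_score = None, float("-inf")
    for r, row in enumerate(pixel_features):
        for c, feature in enumerate(row):
            # "what": how well this pixel matches the instruction
            # "where": how suitable this pixel is as a grasp point
            score = cosine(feature, text_embedding) * graspability[r][c]
            if score > best_score:
                best, best_score = (r, c), score
    return best


# Hand-made stand-ins for learned embeddings (assumption, not real CLIP output).
RED, BLUE, EMPTY = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
features = [[EMPTY, RED, EMPTY],
            [BLUE, EMPTY, EMPTY]]
graspability = [[0.1, 0.9, 0.1],
                [0.8, 0.1, 0.1]]
text = {"red block": [1.0, 0.0], "blue block": [0.0, 1.0]}

print(pick_location(features, graspability, text["red block"]))   # (0, 1)
print(pick_location(features, graspability, text["blue block"]))  # (1, 0)
```

In the real system both streams are learned neural networks operating on camera images; the point here is only that the instruction changes which pixels score highly, without a separate detector for each object.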

"We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of Transporter," according to the team's paper on arXiv.

"Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures."

For example, given the command “pick all the cherries and put them in a box”, the CLIP part of the system will know what “cherries” and “box” look like. The robot doesn’t have to be trained on numerous images of cherries or boxes to know this. Roboticists can then go ahead to the second stage of the training process and just show the mechanical arm the exact motion to grip the cherries and drop them into a container one by one.

The transporter then guides the robot to imitate the action to complete the task in the real world. It can do other things too, like folding a cloth or sweeping beans without having been exposed to images of towels or coffee beans.
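One common way to learn from demonstrations like these is to treat "where should the gripper go?" as a classification over pixels, and penalise the model when it puts little probability on the pixel the human expert chose. The sketch below assumes that setup; the `Demo` record and `pixel_cross_entropy` function are hypothetical names for illustration, not taken from the CLIPort code.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """One expert demonstration: the instruction and the pixel the expert picked."""
    instruction: str
    expert_pick: tuple  # (row, col)


def pixel_cross_entropy(logits, target):
    """Cross-entropy of a softmax over all pixels against the expert's pixel."""
    flat = [value for row in logits for value in row]
    peak = max(flat)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(value - peak) for value in flat))
    r, c = target
    return log_z - logits[r][c]


demo = Demo("pick all the cherries and put them in a box", (0, 1))

# Two hypothetical model outputs for the same 2x2 image.
agrees_with_expert = [[0.0, 5.0], [0.0, 0.0]]
disagrees_with_expert = [[5.0, 0.0], [0.0, 0.0]]

loss_good = pixel_cross_entropy(agrees_with_expert, demo.expert_pick)
loss_bad = pixel_cross_entropy(disagrees_with_expert, demo.expert_pick)
print(loss_good < loss_bad)  # True
```

Training would repeat this over every demonstration and nudge the network's weights to lower the loss, which is also why the robot's skills stay tied to the actions it was shown.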

"Unlike existing object detectors, CLIP is not limited to a predefined set of object classes," Mohit Shridhar, first author of the paper and a PhD student at the University of Washington, told The Register.

"And unlike other vision-language models, it's not restricted by a top-down pipeline that detects objects with bounding boxes or instance segmentations. This allows us to forgo the traditional paradigm of training explicit detectors for cloths, pliers, chessboard squares, cherry stems, and other arbitrary things."

There are other similar systems that use pre-trained image classifiers like CLIP but they aren't trained on as many object types, Shridhar explained. The new system means that CLIPort-based robots can be fine-tuned on new chores with "very little data."

What's even more useful is that it is better at carrying out tasks it has already been taught when given objects it hasn't seen before. The robot can learn to stack a series of blocks in a specific colour order during training, then perform the same task with blocks in colours it hasn't encountered.

You can see it in action below.

Youtube Video

The downside of CLIPort, however, is that it still requires more than a hundred video demonstrations before it can perform a task fairly successfully. Some tasks are harder than others, too; putting a shape into the right hole is particularly difficult for CLIPort when it has only seen the task demonstrated with a differently shaped object.

Plus, if there's an object that CLIP hasn't been exposed to during its training process, CLIPort won't know how to recognize it either. Although the system is more robust, it's not quite general enough to know how to perform a task without having seen it done first.

"CLIPort's capabilities are only limited to the actions shown during training demonstrations. If it's trained to 'stack two blocks,' and you ask it 'make a tower of 5 blocks,' it won't know how to do so. All the verbs are also tightly linked to the training demonstrations, in the sense that they won't do anything beyond the action-skills learnt during training," Shridhar added.

CLIPort is specifically designed to keep humans in the loop, he said. A human expert has to teach the robot with demonstrations, and also provide language commands during execution. You can see the code for it here. ®

Editor's note: The headline on this article was revised to clarify that the University of Washington and Nvidia carried out this research using OpenAI's model.
