Machines making music, translating Chinese, self-driving trucks, and more

Developments for our future overlords

Roundup Welcome to this week's AI roundup. We have news on a machine learning model used by Google to make music that doesn't sound completely bad, improved translation between English and Chinese from Microsoft, and a new test bed for Waymo's self-driving trucks.

Making music with machine learning - Google researchers have developed a model that uses machine learning to make new music by mixing different melodies together.

MusicVAE is the latest model to come out of Magenta, a Google research project that focuses on using machine learning for art and music.

Sound is an interesting medium to work with, and other researchers have toyed with AI to make music. The results haven’t always been great (remember this Christmas hit complete with made-up lyrics and a creepy computer singing voice?). Neural networks may excel at learning patterns, but reproducing them in a way that sounds good to the human ear is difficult.

MusicVAE, however, doesn’t sound too bad. You can listen to a short sample in the video below.

Youtube Video

It’s not going to be a chart topper any time soon. But it’s interesting to see how the model took two different melody clips and remixed them to produce something that could pass as one of those old-school polyphonic ringtones.

It all boils down to recreating a sequence of bleeps and bloops that sounds realistic, like the notes are musically related to one another. MusicVAE focuses on something called “latent space”, where sounds are encoded and mapped to a vector space.

In order to recreate songs that sound more natural, the latent space has to fulfill three requirements, as explained in this blog post.

  • Expression: Any real example can be mapped to some point in the latent space and reconstructed from it.
  • Realism: Any point in this space represents some realistic example, including ones not in the training set.
  • Smoothness: Examples from nearby points in latent space have similar qualities to one another.

“Latent space models are capable of learning the fundamental characteristics of a training dataset and can therefore exclude these unconventional possibilities. Aside from excluding unrealistic examples, latent spaces are able to represent the variation of real data in a lower-dimensional space. This means that they can also reconstruct real examples with high accuracy,” the blog post said.
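To make the "smoothness" requirement concrete, here is a minimal sketch of how two melodies might be blended once each has been encoded as a point in latent space. It uses plain linear interpolation as a simplification, the four-dimensional vectors are made-up stand-ins, and `interpolate` is a hypothetical helper rather than anything from Magenta's code. A real system would decode each intermediate point back into notes.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` points evenly spaced on the straight line from z_a to z_b."""
    if steps < 2:
        raise ValueError("need at least two steps (the two endpoints)")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at z_a, 1.0 at z_b
        points.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points


# Hypothetical latent codes for two melodies (illustration only).
melody_a = [0.0, 1.0, -1.0, 2.0]
melody_b = [1.0, 0.0, 1.0, 0.0]

blend = interpolate(melody_a, melody_b, steps=5)
for z in blend:
    print(z)
```

If the latent space is smooth, each point along that path should decode to a melody that sounds a little less like the first clip and a little more like the second.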

The model must also have a coherent long-term structure to capture the patterns over longer sequences to create full songs. None of the examples in the playlist are sophisticated and long enough to really be considered as songs, but it’s fun to see how the computers get creative.

To do this, the researchers developed a “hierarchical decoder”. The input sounds are encoded into individual latent codes in latent space, and passed through a “conductor”.

The conductor is a recurrent neural network (RNN) that spits out an output embedding for each bar in the music. This passes through another RNN to give another set of outputs based on the embeddings created by the conductor RNN. These are then passed through a decoder to convert the code into sounds.
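The two-level idea can be sketched in a few lines. This is a toy illustration under heavy assumptions, not the Magenta implementation: the weights are random and untrained, `rnn_step` and `hierarchical_decode` are hypothetical names, and the second RNN and the final note output are folded into one loop for brevity.

```python
import math
import random


def rnn_step(state, inp, weights):
    """One toy recurrent step: tanh of a weighted sum of (state + input)."""
    mixed = [s + x for s, x in zip(state, inp)]
    return [math.tanh(sum(w * m for w, m in zip(row, mixed))) for row in weights]


def random_weights(size, rng):
    return [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]


def hierarchical_decode(z, num_bars, notes_per_bar, rng):
    size = len(z)
    conductor_w = random_weights(size, rng)  # untrained, illustration only
    decoder_w = random_weights(size, rng)

    song = []
    conductor_state = list(z)  # the conductor starts from the latent code
    for _ in range(num_bars):
        # Level 1: the conductor emits one embedding per bar.
        conductor_state = rnn_step(conductor_state, z, conductor_w)
        bar_embedding = conductor_state

        # Level 2: a second RNN, restarted for each bar, unrolls the embedding into notes.
        note_state = list(bar_embedding)
        bar = []
        for _ in range(notes_per_bar):
            note_state = rnn_step(note_state, bar_embedding, decoder_w)
            # Map the first state value from (-1, 1) onto a MIDI pitch in 48..72.
            bar.append(60 + round(12 * note_state[0]))
        song.append(bar)
    return song


rng = random.Random(0)
z = [0.3, -0.7, 0.5, 0.1]  # hypothetical latent code
song = hierarchical_decode(z, num_bars=4, notes_per_bar=8, rng=rng)
for bar in song:
    print(bar)
```

The point of the structure is that the note-level loop only ever sees its own bar's embedding, so the long-range plan for the piece has to live in the conductor.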

You can read more about how the hierarchical decoder works here. And play around with MusicVAE in your browser here (scroll down to additional resources), or go straight to the code here.

A win for Microsoft’s translation team - Microsoft researchers announced they had achieved human parity on machine translation from Chinese to English.

In other words, their neural network model can translate from Chinese to English as well as a human translator. Chinese is a notoriously difficult language to master. Researchers from Microsoft have managed this, but the model has only been tested on a small subset of news articles in the dataset. It has yet to do this for “real-time news stories”.

A team of external bilingual language consultants was hired to compare Microsoft’s results against the human translations for the test dataset.

“Machine translation is much more complex than a pure pattern recognition task. People can use different words to express the exact same thing, but you cannot necessarily say which one is better,” said Ming Zhou, assistant managing director of Microsoft Research Asia and head of the natural language processing group that worked on the project.

There isn’t a lot of detail on how their model works. But in a blog post, Microsoft explained that it was based on four techniques. The first is dual learning, where a Chinese sentence is translated to English, and then translated again from English to Chinese.

This is improved with deliberation networks, which refine the conversion by translating the same sentences repeatedly. Thirdly, joint training has the English-to-Chinese translation system translate new English sentences into Chinese in order to obtain new sentence pairs used in training. Finally, the model uses agreement regularization, where the translation can be done by reading the sentences from left to right or from right to left.
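As a rough illustration of the dual learning idea, the sketch below translates a sentence there and back and scores how much of the original survives the round trip. The word-for-word dictionaries and the `round_trip_score` helper are invented for this example; Microsoft's system uses neural networks, and the blog post does not give its actual scoring function.

```python
# Toy word-for-word "translators" (assumed for illustration; real systems are neural).
ZH_TO_EN = {"我": "I", "喜欢": "like", "音乐": "music"}
EN_TO_ZH = {"I": "我", "like": "喜欢", "music": "歌曲"}  # deliberately imperfect


def translate(words, table):
    """Look each word up in the table, marking unknown words."""
    return [table.get(w, "<unk>") for w in words]


def round_trip_score(source, forward, backward):
    """Fraction of source words recovered after translating there and back."""
    restored = translate(translate(source, forward), backward)
    matches = sum(1 for a, b in zip(source, restored) if a == b)
    return matches / len(source)


sentence = ["我", "喜欢", "音乐"]
score = round_trip_score(sentence, ZH_TO_EN, EN_TO_ZH)
print(f"round-trip score: {score:.2f}")
```

A low score signals that at least one of the two directions is losing meaning, which is the kind of feedback a dual learning setup can train on without needing extra human-translated sentence pairs.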

You can find out more about it here.

Adapt or die - Researchers from Google have published a paper that shows how evolutionary algorithms can help developers design image classifiers using its TPU2 hardware.

Massive neural networks with different components and layers are tricky and time-consuming to hand-craft. If you’re like Google and can afford to use several hundred GPUs or TPUs for a series of experiments then you can also play around with algorithms that create new neural networks. If you can’t, you’ll just have to stick to reading this paper.

The team use a regularized evolutionary algorithm to comb through a “search space” made up of different modules or “cells” that are stacked together to make image classifier models. The algorithm mutates the model by randomly connecting the cells and operations into various combinations.

“In every evolutionary cycle, we select the best of a random sample of individuals to 'reproduce'. The difference is that we also remove the oldest individual. This is similar to what happens in nature, where old individuals die,” the paper explains.

Although this regularization technique kills off old models that might still be effective, it makes the overall population more robust: the architectures that survive in the long run are the ones that still perform well after being retrained.
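The cycle described in the paper's quote (sample a few individuals, let the best one reproduce, remove the oldest) can be sketched as follows. Everything concrete here is a stand-in: the "architecture" is a list of eight bits, "accuracy" is just the fraction of ones, and the function and parameter names are hypothetical. In the real experiments, scoring an individual means training a neural network.

```python
import collections
import random


def regularized_evolution(cycles, population_size, sample_size, rng):
    def random_arch():
        return [rng.randint(0, 1) for _ in range(8)]

    def fitness(arch):
        # Toy stand-in for "train the model and measure its accuracy".
        return sum(arch) / len(arch)

    def mutate(arch):
        child = list(arch)
        i = rng.randrange(len(child))
        child[i] = 1 - child[i]  # flip one randomly chosen bit
        return child

    # The deque keeps individuals in age order: oldest on the left.
    population = collections.deque(random_arch() for _ in range(population_size))
    history = list(population)

    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=fitness)  # best of a random sample reproduces
        child = mutate(parent)
        population.append(child)
        history.append(child)
        population.popleft()  # the oldest individual dies, however good it was

    return max(history, key=fitness), population


rng = random.Random(42)
best, population = regularized_evolution(
    cycles=200, population_size=20, sample_size=5, rng=rng
)
print(best, sum(best) / len(best))
```

Removing by age rather than by score is the design choice that matters: a model that scored well once by luck cannot linger forever, so a lineage only persists if its descendants keep scoring well.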

It took 900 TPU2 chips running over 5 days to train a whopping 27,000 different models. The best architectures discovered, nicknamed AmoebaNets, apparently achieved accuracy rates comparable with the best hand-designed image classifiers on ImageNet. The best model had an accuracy of 83.1 per cent.

The paper also compares the evolutionary algorithm with alternative search methods such as reinforcement learning and random search to show that it is useful.

“Evolution is faster than reinforcement learning in the earlier stages of the search,” Esteban Real, first author of the paper and a Senior Software Engineer at Google Brain, explained in a blog post.

"This is significant because with less compute power available, the experiments may have to stop early. Moreover evolution is quite robust to changes in the dataset or search space."

Random search is less effective as the models created aren’t as accurate.

Waymo Trucks - Waymo announced it will be testing self-driving trucks to carry goods to Google’s data centers in Atlanta, Georgia.

It’s already been driving its trucks on the road over the past year in other states including California and Arizona. Human drivers are still required to sit in to monitor the truck and take control if needed.

Waymo explained in a blog post that its self-driving trucks use the same set of sensors as its self-driving minivans, often spotted in San Francisco.

“Our engineers and AI experts are leveraging the same five million miles we’ve already self-driven on public roads, plus the five billion miles we’ve driven in simulation.” ®
