Neural networks are neat at spotting and reproducing patterns in images and text – yet they still struggle when spitting out audio.
There are numerous examples of artificially intelligent software improvising fairly realistic images of people, buildings, and other objects from training material. When it comes to composing music, however, machines are way off. The melodies, if you can call them that, sound nonsensical because they lack the structure over time that normal music has.
We expect tunes to sustain a structure over a matter of minutes, whereas computers end up flitting about between styles every few seconds.
Pop songs are roughly split into verses and choruses with a repeating melody, yet that's a pattern machine-learning code cannot seem to grasp. Now, a paper by researchers at DeepMind has had a stab at explaining why.
Most research projects train a system by converting the raw sound waves into MIDI files, which the neural network is expected to recreate. This, it seems, strips away the details and nuances that are important when it comes to crafting music that sounds realistic. So instead, the DeepMind gang trained their model directly from raw audio waves, teaching it to produce raw audio waves – a move other teams are also starting to consider.
“Models that are capable of generating audio waveforms directly (as opposed to some other representation that can be converted into audio afterwards, such as spectrograms or piano rolls) are only recently starting to be explored,” the researchers explained in a writeup of their study, emitted late last week.
"This was long thought to be infeasible due to the scale of the problem, as audio signals are often sampled at rates of 16 kHz or higher."
Crucially, text-to-speech systems do not suffer from the same creative blocks as AI songwriters because words in human speech are pretty short – with sounds in the order of hundreds of milliseconds – whereas music requires structure stretching over minutes. Text-to-speech bots just have it easier than their music-generating cousins.
“Music is a complex, highly structured sequential data modality,” the DeepMind paper stated.
"When rendered as an audio signal, this structure manifests itself at various timescales, ranging from the periodicity of the waveforms at the scale of milliseconds, all the way to the musical form of a piece of music, which typically spans several minutes.
"Modeling all of the temporal correlations in the sequence that arise from this structure is challenging, because they span many different orders of magnitude."
Putting it all together
So, what did DeepMind come up with after scrutinizing other song-generating systems? In order to grok patterns in music, they designed their AI software to learn from longer snippets of audio training data. The researchers called this “[enlarging] the receptive fields.”
They did this by adding more convolutional layers to a WaveNet model. The input sound samples were taken from more than 400 hours of recorded solo piano music, from composers such as Chopin and Beethoven. These were then fed into the model via an encoder that converted the raw audio into continuous scalars, or into 256-dimensional one-hot vectors.
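The 256-way one-hot encoding mentioned above is typically produced by quantizing each audio sample into one of 256 levels. As a minimal sketch, here is how that might look using mu-law companding, the scheme used in the original WaveNet work – an assumption here, since the article doesn't name the exact quantizer:

```python
import numpy as np

def mu_law_encode(audio, channels=256):
    """Quantize waveform samples in [-1, 1] into integer bins via mu-law companding."""
    mu = channels - 1
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] onto {0, ..., 255}
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def one_hot(indices, channels=256):
    """Turn quantized sample indices into 256-dimensional one-hot vectors."""
    vectors = np.zeros((len(indices), channels), dtype=np.float32)
    vectors[np.arange(len(indices)), indices] = 1.0
    return vectors

wave = np.sin(np.linspace(0, 2 * np.pi, 8))  # toy 8-sample waveform
codes = mu_law_encode(wave)
vecs = one_hot(codes)
print(vecs.shape)  # (8, 256)
```

Mu-law spends more of its 256 levels on quiet sounds, where the ear is most sensitive, which is why it is preferred over plain linear quantization for 8-bit audio.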
It’s a computationally demanding process, and the whole training portion taxed as many as 32 GPUs. “We show that it is possible to model structure across roughly 400,000 timesteps, or about 25 seconds of audio sampled at 16 kHz. This allows us to generate samples of piano music that are stylistically consistent,” the paper stated.
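To see how stacking convolutional layers enlarges the receptive field, here is a rough calculation for WaveNet-style dilated convolutions, where the dilation doubles at each layer and resets at each block. The layer counts below are illustrative, not the paper's actual configuration:

```python
def receptive_field(n_blocks, layers_per_block, kernel_size=2):
    """Receptive field, in samples, of stacked dilated convolutions
    with dilation doubling per layer and resetting per block."""
    field = 1
    for _ in range(n_blocks):
        for layer in range(layers_per_block):
            dilation = 2 ** layer
            field += (kernel_size - 1) * dilation
    return field

# One block of ten layers already covers 1,024 samples
print(receptive_field(n_blocks=1, layers_per_block=10))  # 1024

# Adding blocks grows the field linearly; getting anywhere near the
# ~400,000-sample (25-second) horizon quoted in the paper takes many
# more layers, hence the heavy GPU bill
print(receptive_field(n_blocks=5, layers_per_block=10))  # 5116
```

Each extra layer only adds to the field what its dilation allows, so covering tens of seconds of 16 kHz audio demands either very deep stacks or tricks beyond plain dilation.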
Since the input structure only spans about 25 seconds, the generated output is only consistent “across tens of seconds” as well. Here are a few ten-second clips of what some of DeepMind's machine-made music sounds like.
Ten seconds isn’t enough to craft a catchy tune, but it’s interesting to see AI try. Slap a banging beat on it, and you've probably got next year's EDM hit. ®