Here's why AI can't make a catchier tune than the worst pop song in the charts right now

DeepMind tries to train a neural network on classical piano

Neural networks are neat at spotting and reproducing patterns in images and text – yet they still struggle when spitting out audio.

There are numerous examples of artificially intelligent software improvising fairly realistic images of people, buildings, and other objects, from training material. However, when it comes to composing music, machines are way off. The melodies, if you can call them that, sound nonsensical because they don’t have any structure over time like normal music does.

We expect tunes to sustain a structure over a matter of minutes, whereas computers end up flitting about between styles every few seconds.

Pop songs are roughly split into verses and choruses with a repeating melody, yet that's a pattern machine-learning code cannot seem to grasp. Now, a paper by researchers at DeepMind has had a stab at explaining why.

Most research projects train a system by converting the raw sound waves into MIDI files, which the neural network is expected to recreate. This, it seems, strips away the details and nuances that are important when it comes to crafting music that sounds realistic. So instead, the DeepMind gang trained their model directly from raw audio waves, teaching it to produce raw audio waves – a move other teams are also starting to consider.


“Models that are capable of generating audio waveforms directly (as opposed to some other representation that can be converted into audio afterwards, such as spectrograms or piano rolls) are only recently starting to be explored,” the researchers explained in a writeup of their study, emitted late last week.

"This was long thought to be infeasible due to the scale of the problem, as audio signals are often sampled at rates of 16 kHz or higher."

Crucially, text-to-speech systems do not suffer from the same creative blocks as AI songwriters because words in human speech are pretty short – with sounds on the order of hundreds of milliseconds – whereas music requires structure stretching over minutes. Text-to-speech bots just have it easier than their music-generating cousins.

Popular models used for text-to-speech, such as SampleRNN and WaveNet, have been explored for music generation, but none of them have been successful in capturing melodies and rhythm.

“Music is a complex, highly structured sequential data modality,” the DeepMind paper stated.

"When rendered as an audio signal, this structure manifests itself at various timescales, ranging from the periodicity of the waveforms at the scale of milliseconds, all the way to the musical form of a piece of music, which typically spans several minutes.

"Modeling all of the temporal correlations in the sequence that arise from this structure is challenging, because they span many different orders of magnitude."

Putting it all together

So, what did DeepMind come up with after scrutinizing other song-streaming systems? In order to grok patterns in music, they designed their AI software to learn from longer snippets of audio training data. The researchers called this “[enlarging] the receptive fields.”

They did this by adding more convolutional layers to a WaveNet model. The input sound samples were taken from more than 400 hours of recorded solo piano music, from composers such as Chopin and Beethoven. These were then fed into the model via an encoder that converted the raw audio into continuous scalars, or into 256-dimensional one-hot vectors.
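For the curious, here is roughly what turning a raw audio sample into a 256-dimensional one-hot vector can look like. This is a minimal sketch in Python, not DeepMind's code: it assumes 8-bit mu-law companding, the scheme described in the original WaveNet paper, and the names `mu_law_encode` and `one_hot` are made up for illustration.

```python
import math

MU = 255  # assumption: 8-bit mu-law, giving MU + 1 = 256 classes


def mu_law_encode(x, mu=MU):
    """Map an amplitude in [-1.0, 1.0] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    # Compress: quiet sounds get finer resolution than loud ones.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest class.
    return int((compressed + 1) / 2 * mu + 0.5)


def one_hot(index, size=MU + 1):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical usage: encode a single sample of amplitude 0.3.
vector = one_hot(mu_law_encode(0.3))
```

Each audio sample becomes one of 256 categories, so the network can treat predicting the next sample as a classification problem rather than guessing a real number.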

It’s a computationally demanding process, and the whole training portion taxed as many as 32 GPUs. “We show that it is possible to model structure across roughly 400,000 timesteps, or about 25 seconds of audio sampled at 16 kHz. This allows us to generate samples of piano music that are stylistically consistent,” the paper stated.
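To get a feel for those numbers, the back-of-the-envelope Python below converts timesteps to seconds and counts how many layers a plain stack of dilated convolutions would need to see that far back. It is a sketch under stated assumptions, not the paper's architecture: the kernel size of 2 and dilations doubling from 1 to 512 within each block are borrowed from the original WaveNet design, and the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate quoted in the paper


def timesteps_to_seconds(timesteps, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a number of audio samples into a duration in seconds."""
    return timesteps / sample_rate_hz


def receptive_field(dilations, kernel_size=2):
    """Samples visible to one output of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def blocks_needed(target_samples, layers_per_block=10, kernel_size=2):
    """Count repeated blocks (dilations 1, 2, 4, ...) needed to cover target_samples.

    Assumption: every block reuses the same doubling dilation schedule.
    """
    block = [2 ** i for i in range(layers_per_block)]
    dilations = []
    blocks = 0
    while receptive_field(dilations, kernel_size) < target_samples:
        dilations.extend(block)
        blocks += 1
    return blocks


print(timesteps_to_seconds(400_000))  # 25.0 seconds, matching the paper's figure
print(blocks_needed(400_000))         # blocks of ten layers under these assumptions
```

Under those assumptions, `blocks_needed(400_000)` comes out at 392 blocks of ten layers each, which hints at why simply piling on layers gets expensive fast.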

Since the input structure only spans about 25 seconds, the output generated is only consistent “across tens of seconds” as well. Here are a few ten-second clips of what some of DeepMind's machine-made music sounds like.

DeepMind Audio Clip 1

DeepMind Audio Clip 2

Ten seconds isn’t enough to craft a catchy tune, but it’s interesting to see AI try. Slap a banging beat on it, and you've probably got next year's EDM hit. ®

Biting the hand that feeds IT © 1998–2021