> The music generated by WaveNet clearly sounds like a piano, but lacks compositional structure that most people might be able to follow. I suspect a significant architectural change will be needed for music for reasons discussed in this article.
As someone who's worked a lot on procedural music, I think this is definitely true. I'm always surprised to see ML-based approaches where someone has just trained a system on a bunch of songs, and then hopes the system will produce music with a recognizable structure - even though all the training songs will have had (in general) different chord progressions, different numbers of voices or melodic lines at any given time, etc. Such approaches strike me as akin to training a system on a bunch of short stories and then hoping it will produce a new story with a recognizable plot.
It seems like it would make a lot more sense to remove these hidden dimensionalities, e.g. by annotating the source data with chord or structural information, or by training on lots of different melodies that all share the same chord progression, etc. It's hard to imagine that, even with enough layers, the network will eventually grok all these hidden details on its own.
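To make that concrete, here's a minimal sketch of what chord-annotated training data might look like; the data layout and names are my own invention rather than anything from an existing project, but the idea is that the harmonic structure becomes an explicit input instead of a hidden dimension the network has to infer:

```python
# Hypothetical sketch: pair each melody event with an explicit chord label,
# so harmonic structure is a visible feature rather than a hidden dimension.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    pitch: int       # MIDI note number, e.g. 60 = middle C
    beat: float      # position in beats from the start of the phrase
    duration: float  # length in beats
    chord: str       # annotated chord symbol active at this beat, e.g. "Cmaj"

# A fixed I-V-vi-IV progression; training many different melodies against the
# same progression keeps the harmony constant while the melodic surface varies.
PROGRESSION = ["Cmaj", "Gmaj", "Amin", "Fmaj"]

def annotate(melody, beats_per_chord=4):
    """Attach the active chord from PROGRESSION to each (pitch, beat, duration) tuple."""
    return [
        NoteEvent(pitch, beat, dur,
                  PROGRESSION[int(beat // beats_per_chord) % len(PROGRESSION)])
        for pitch, beat, dur in melody
    ]

melody_a = [(60, 0.0, 1.0), (64, 1.0, 1.0), (67, 2.0, 2.0), (67, 4.0, 2.0)]
print(annotate(melody_a))
```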
I believe the first company to create a very successful cloud-based DAW will have the greatest opportunity to vacuum up this data. If producers were pushing real, original music source data into your neural network, you'd basically eliminate the need to do any sort of waveform analysis. Everything turns into discrete numbers, which neural networks are far better at handling than the destructive noise of ordinary audio files. (At 120 BPM, music could conceivably be 2 inputs per second vs. 44,100.)
Edit: As a matter of fact, you don't even need a whole DAW, really. You just need to be able to read existing DAW files and give users a reason to upload them.
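To put that data-rate comparison in rough numbers, here's a back-of-the-envelope sketch; it assumes one input per beat, the same simplification as the 2-inputs-per-second figure above, and real sequencer data would of course be denser, just still orders of magnitude sparser than raw audio:

```python
# Back-of-the-envelope comparison: raw audio samples vs. beat-level note events
# for one minute of 120 BPM music, assuming one input per beat.
SAMPLE_RATE = 44_100  # CD-quality samples per second
BPM = 120
SECONDS = 60

audio_samples = SAMPLE_RATE * SECONDS   # 2,646,000 values per minute
beat_events = (BPM / 60) * SECONDS      # 120 values per minute (one per beat)

print(f"audio samples per minute: {audio_samples:,}")
print(f"beat-level inputs per minute: {beat_events:,.0f}")
print(f"ratio: {audio_samples / beat_events:,.0f}x")
```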
Converting sounds to numbers is certainly nontrivial, but the "ML-for-music" projects I've seen were generally already working with MIDI.
I think the deeper problem is that the musical structure that's obvious to the listener (chord progressions, modulations, etc.) is realistically going to vary too much across the training data for any ML approach to figure it out.