Converting sounds to numbers is certainly nontrivial, but the "ML-for-music" projects I've seen were generally already working with MIDI.
I think the deeper problem is that musical structure that's obvious to the listener (chord progressions, modulations, etc.) is realistically going to vary too much across the training data for any ML approach to figure out.