That would be a super cool idea, and I have thought about it! To deal with music, we may want to use an RNN, LSTM, or WaveNet instead of a GAN as the backbone generator, and then use my tl-GAN idea to find the "feature" axes for controlled generation. I would love to contribute to such projects.