Hacker News

Does this really need AI? Couldn't you just take the maximum of the fourier transform of the rendering multiplied with the audio? Maybe across multiple dimensions.
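For reference, the idea in this comment amounts to cross-correlation computed through the FFT. A minimal sketch in plain Python, assuming (and it is a big assumption) that both the rendered score and the audio have already been reduced to 1-D sequences at the same rate. The names `fft` and `best_lag` are illustrative, not from any library:

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def best_lag(template, signal):
    """Offset into `signal` where `template` correlates most strongly."""
    n = 1
    while n < len(template) + len(signal):
        n *= 2  # zero-pad so the circular correlation does not wrap around
    a = fft(list(signal) + [0.0] * (n - len(signal)))
    b = fft(list(template) + [0.0] * (n - len(template)))
    # Correlation theorem: corr = IFFT(FFT(signal) * conj(FFT(template)))
    prod = [x * y.conjugate() for x, y in zip(a, b)]
    corr = [v.real / n for v in fft(prod, inverse=True)]
    return max(range(len(signal)), key=lambda i: corr[i])

# Toy data: an exact copy of the template hidden at offset 7.
template = [1.0, -2.0, 3.0, -1.0, 2.0]
signal = [0.0] * 7 + template + [0.0] * 8
print(best_lag(template, signal))  # prints 7
```

This only works here because the signal contains an exact, unstretched copy of the template. A real performance changes tempo, dynamics and timbre, which is the objection raised in the replies.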


This is one of the variants of an area called "score following" and it's really hard. I remember about 10-15 years ago there was a craze in start-up land that left behind many corpses of start-ups who thought solving semantic music problems would be easy with FFTs, but forgot to talk to an actual MIR (music information retrieval) person or musicologist. A friend of mine was working for one about 15 years ago, and when she described their deal I was like "have you talked to a real music person? Because that is only going to get you a tiny part of the way there". Nope. And sure enough, dead in the water a couple of years later. I'm studying music cognition right now in a cross-discipline master's (music, comp sci) and it is crazy complicated.

In a nutshell: the human brain has an amazing ability to separate a garbage pail mush of audio into meaningful streams that we perceive as different event streams. This enables us to do things like notice (as my prof says) the difference in white noise between the nearby river and the rustle in the leaves from something else, like an approaching animal. Score following requires doing this to separate the aggregate noise into different instruments and into voices within each instrument, suitable for note-by-note analysis. This, it turns out, is a wicked hard problem.


That makes sense when you're dealing with raw audio and need to disentangle it. But here we can work in the other direction, using the structure of the sheet to find the right spot in the mess on the other side.

I'm sure there are further difficulties, just saying that your answer isn't entirely satisfying.


Sheet music has far less structure than you think it does.


Can you explain what you mean by that? Sheet music would seem to be incredibly structured to me. Unless we’re talking about sheet music hand scribbled on paper?


Consider the same piece of "sheet music" delivered by different conductors, e.g., Beethoven's 5th:

https://www.youtube.com/watch?v=1lHOYvIhLxo&t=24s https://www.youtube.com/watch?v=9aDEq3u5huA

Sheet music is a guide for human beings who understand "musical context" to, with much leeway, create a cohesive sound.

The sound created does not have a deterministic relationship to what's written, or else, why have any live music at all?

And, more importantly, even if it did deterministically produce the same "sound", that sound is layered for us by our prior familiarity with instruments, room geometry, etc.

We aren't deconstructing it by "mere frequency", but by meaningfully parsing out & grouping frequencies. There is no naive algorithm to do this.


I'm guessing you've never taken an academic instrumentation or orchestration course. Sheet music is like HL7, a bunch of common practices that at a passing glance look like a standard but have enough exceptions and idiomatic variance to drive you to drink.


This kind of problem has been extensively studied for years and it turns out that it is not that easy. The fundamental is not the strongest harmonic, the harmonics don't line up, what happens when you play multiple notes at the same time, etc. It is a big stinking mess.
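A toy illustration of the first point, as a sketch rather than a real pitch detector. The tone is synthetic and the partial amplitudes (0.3, 1.0, 0.5) are made up for the example. `synth` and `strongest_frequency` are hypothetical helper names:

```python
import cmath
import math

SAMPLE_RATE = 8000
N = 800  # 0.1 s of audio, so each DFT bin is 10 Hz wide

def synth(f0, amplitudes):
    """Sum of sine partials at f0, 2*f0, 3*f0, ... with the given amplitudes."""
    return [sum(a * math.sin(2 * math.pi * f0 * (h + 1) * t / SAMPLE_RATE)
                for h, a in enumerate(amplitudes))
            for t in range(N)]

def strongest_frequency(samples):
    """Frequency (Hz) of the largest-magnitude bin of a naive DFT."""
    n = len(samples)

    def magnitude(k):
        return abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                       for t, s in enumerate(samples)))

    peak_bin = max(range(1, n // 2), key=magnitude)
    return peak_bin * SAMPLE_RATE / n

# A 220 Hz note whose second partial is louder than its fundamental.
tone = synth(220.0, [0.3, 1.0, 0.5])
print(strongest_frequency(tone))  # prints 440.0, an octave above the played note
```

Taking the maximum of the spectrum reports 440 Hz for a note played at 220 Hz. Real instruments add inharmonic partials and overlapping notes on top of this.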


There's a game called Rocksmith that has been on the market for many years that basically does this in real time on live guitar playing with far more than enough accuracy to make the game fun.

And that is doing it basically per-note! This task (aligning sheet music with MIDI) would allow for global optimization approaches which should be far easier and more likely to work.
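One common form of "global optimization" for this kind of alignment is dynamic time warping (DTW). The comment does not name a method, so treat this as one possible sketch. It assumes both sides are already sequences of MIDI pitch numbers, with the performance sampled in frames so that held notes repeat. `dtw_align` is an illustrative name:

```python
def dtw_align(score, performance):
    """Align two pitch sequences with DTW; returns (total_cost, path)."""
    n, m = len(score), len(performance)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of score[:i] with performance[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(score[i - 1] - performance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # score note skipped ahead
                                 cost[i][j - 1],      # score note held longer
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which frame matches which note.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

score = [60, 62, 64, 65]                # C, D, E, F as written
performance = [60, 60, 62, 64, 64, 65]  # same notes, uneven durations
total, path = dtw_align(score, performance)
print(total, path)
```

Each `(i, j)` pair in `path` says that performance frame `j` belongs to score note `i`. The hard part discussed in this thread is getting clean pitch sequences in the first place, which this sketch takes for granted.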


Electric guitar going through a clean(ish) input is a pretty narrowly scoped problem that won't necessarily generalize.


MIDI music is far cleaner than analogue input from a guitar.


I'm no expert, but isn't that just autocorrelation? (https://en.wikipedia.org/wiki/Autocorrelation)

Autocorrelation is the algorithm used for radar (at least, that's what my professors told me). When a radar pulse bounces off objects, it comes back as different "echoes" (e.g., with objects 10 miles and 20 miles away, the echo from the 20-mile object takes 2x longer to come back). It's messy and everything.

The radar input is very messy, full of reflections, echoes, and more. But autocorrelation takes all of that information and tells you where the objects were.


Autocorrelation is a good start, IF there is only one note playing at a time, and then it SOMETIMES works. You also get multiple peaks in the autocorrelation and it is not obvious which one to choose as the fundamental frequency.
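A small sketch of the multiple-peaks problem, using a synthetic single note. The signal (a 200 Hz fundamental with a louder octave partial) and the helper names `autocorrelation` and `local_peaks` are assumptions made for the illustration:

```python
import math

SAMPLE_RATE = 8000
MAX_LAG = 100

def autocorrelation(samples, max_lag):
    """Raw autocorrelation for lags 0..max_lag over a fixed-size window."""
    window = len(samples) - max_lag  # same number of terms for every lag
    return [sum(samples[t] * samples[t + lag] for t in range(window))
            for lag in range(max_lag + 1)]

def local_peaks(values):
    """Indices of interior local maxima."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]

# One note at 200 Hz (period = 40 samples) with a louder partial at 400 Hz.
samples = [0.3 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
           + 1.0 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
           for t in range(420)]

r = autocorrelation(samples, MAX_LAG)
for lag in local_peaks(r):
    print(lag, SAMPLE_RATE / lag, round(r[lag], 2))
```

The peaks land at lags 20, 40, 60 and 80, which correspond to 400 Hz, 200 Hz, about 133 Hz and 100 Hz. The peaks at 40 and 80 are essentially the same height, so "take the biggest peak" cannot tell the true pitch from its subharmonic, and "take the first peak" picks the octave instead.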


There is an equivalent problem in audio, which is source localization. That's much simpler than note cognition.


And that's before even separating out by instrument! It's really quite incredible what goes on between the ear and the brain for this kind of thing.


Correlative methods are the wrong model by themselves but could be one part of the right model. Cognition happens by correlation, but also by denoising, by adaptive filtering, by interference cancellation, even by predictive generation. Listening is not a passive process.


Care to elaborate?



