This is a variant of an area called "score following", and it's really hard. I remember about 10-15 years ago there was a craze in start-up land that left behind many corpses of start-ups who thought solving semantic music problems would be easy with FFTs, but forgot to talk to an actual MIR (music information retrieval) researcher or musicologist. A friend of mine worked for one about 15 years ago, and when she described their deal I was like "have you talked to a real music person? Because that is only going to get you a tiny part of the way there". Nope. And sure enough, dead in the water a couple of years later. I'm studying music cognition right now in a cross-discipline master's (music, comp sci) and it is crazy complicated.
In a nutshell: the human brain has an amazing ability to separate a garbage-pail mush of audio into streams that we perceive as meaningfully distinct events. This enabled us to do things like notice (as my prof says) the difference, within what is effectively white noise, between the nearby river and the rustle in the leaves from something else, like an approaching animal. Score following requires doing the same thing: separating the aggregate sound into different instruments, and into voices within each instrument, suitable for note-by-note analysis. This, it turns out, is a wicked hard problem.
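A quick toy sketch (NumPy; nothing here models real instruments) of why that separation is ill-posed at the signal level: mixing is additive, and addition is many-to-one, so the waveform alone can't tell you which decomposition is the "real" one.

```python
import numpy as np

# Toy illustration: two idealized "instruments" summed into one mixture.
t = np.linspace(0, 1, 8000, endpoint=False)

flute = np.sin(2 * np.pi * 440 * t)        # toy "flute": pure A4
oboe = 0.5 * np.sin(2 * np.pi * 660 * t)   # toy "oboe": pure tone near E5
mixture = flute + oboe

# A completely different "decomposition" of the SAME mixture -- the sums
# are identical, so no algorithm can pick the true pair without priors
# (timbre, voice-leading, musical context) that the waveform doesn't carry.
fake_a = 0.3 * mixture
fake_b = 0.7 * mixture
assert np.allclose(fake_a + fake_b, flute + oboe)
```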
That makes sense when you're dealing with raw audio and need to disentangle it blind. But here we can work in the other direction, using the structure of the sheet music to find the right spot in the mess on the other side.
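For the curious, here's a rough sketch of one standard offline version of that idea (score-to-audio alignment via chroma features and dynamic time warping). librosa is a real library and these calls exist as written, but the toy score, FRAMES_PER_BEAT, and the file name are made up for illustration; a real score follower needs far more than this.

```python
import numpy as np
import librosa

# Toy score: (pitch class 0-11, duration in beats) -- C, E, G, C.
score_notes = [(0, 1.0), (4, 1.0), (7, 1.0), (0, 1.0)]

# Render the score into a chroma (pitch-class) template, one-hot per frame.
FRAMES_PER_BEAT = 10
cols = []
for pc, beats in score_notes:
    col = np.zeros(12)
    col[pc] = 1.0
    cols.append(np.tile(col[:, None], (1, int(beats * FRAMES_PER_BEAT))))
score_chroma = np.concatenate(cols, axis=1)      # shape (12, n_score_frames)

# Extract the same representation from the recording (hypothetical file).
y, sr = librosa.load("performance.wav")
audio_chroma = librosa.feature.chroma_cqt(y=y, sr=sr)  # (12, n_audio_frames)

# DTW finds a monotonic mapping between score frames and audio frames,
# absorbing tempo variation that a fixed clock could not.
D, wp = librosa.sequence.dtw(X=score_chroma, Y=audio_chroma, metric="cosine")

# wp is returned end-to-start; wp[:, 0] indexes score frames, wp[:, 1]
# audio frames. 512 is chroma_cqt's default hop length.
for score_frame, audio_frame in wp[::-1][:5]:
    print(f"score frame {score_frame} ~ audio time {audio_frame * 512 / sr:.2f}s")
```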
I'm sure there are further difficulties; I'm just saying your answer isn't entirely satisfying.
Can you explain what you mean by that? Sheet music would seem to be incredibly structured to me. Unless we’re talking about sheet music hand scribbled on paper?
Sheet music is a guide for human beings who understand "musical context" to create, with much leeway, a cohesive sound.
The sound created does not have a deterministic relationship to what's written; otherwise, why have any live music at all?
And, more importantly, even if it did deterministically produce the same "sound", that sound is separated into layers for us by our prior familiarity with instruments (plus room geometry, etc.).
We aren't deconstructing it by "mere frequency", but by meaningfully parsing out & grouping frequencies. There is no naive algorithm to do this.
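A concrete illustration of the "mere frequency" problem: partials of different notes collide. In the toy sketch below (idealized equal-amplitude harmonics, not real instruments), the 3rd partial of A3 and the 2nd partial of E4 land within a fraction of a hertz of each other, so one spectral peak carries energy from both notes and no per-bin rule can split it back.

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr  # 1 second of samples

def harmonic_tone(f0, n_partials=5):
    """Toy instrument: equal-amplitude partials at integer multiples of f0."""
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_partials + 1))

a3 = harmonic_tone(220.0)    # partials at 220, 440, 660, 880, 1100 Hz
e4 = harmonic_tone(329.63)   # partials at ~330, 659, 989, 1319, 1648 Hz
mixture = a3 + e4

spectrum = np.abs(np.fft.rfft(mixture))
freqs = np.fft.rfftfreq(len(mixture), 1 / sr)   # ~1 Hz bin spacing

# 220 * 3 = 660 Hz and 329.63 * 2 = 659.26 Hz are only 0.74 Hz apart --
# below the ~1 Hz resolution of a 1 s window, so the bins around 660 Hz
# hold energy from BOTH notes, merged beyond per-bin attribution.
band = (freqs > 650) & (freqs < 670)
print(freqs[band][spectrum[band].argmax()])  # a single merged peak, not two
```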
I'm guessing you've never taken an academic instrumentation or orchestration course. Sheet music is like HL7, a bunch of common practices that at a passing glance look like a standard but have enough exceptions and idiomatic variance to drive you to drink.
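One flavor of that idiomatic variance, sketched below: transposing instruments, where the written note isn't the sounding note. The offset table is a tiny assumed subset, but the transpositions shown are the standard ones.

```python
# Written -> sounding offset in semitones (assumed subset for illustration).
SEMITONE_OFFSETS = {
    "flute": 0,                # concert pitch: written == sounding
    "clarinet_in_Bb": -2,      # sounds a major second below written
    "horn_in_F": -7,           # sounds a perfect fifth below written
    "piccolo": +12,            # sounds an octave above written
}

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def sounding_pitch(written_midi: int, instrument: str) -> str:
    """Map a written MIDI note to the pitch that actually sounds."""
    midi = written_midi + SEMITONE_OFFSETS[instrument]
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

print(sounding_pitch(62, "clarinet_in_Bb"))  # written D4 -> sounds C4
print(sounding_pitch(62, "flute"))           # written D4 -> sounds D4
```

Any naive score reader that assumes written pitch equals sounding pitch is wrong for a large chunk of the orchestra, and that's before ottava lines, harmonics notation, and percussion "pitches" enter the picture.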