TuneNN: A transformer-based network model for pitch detection (github.com/tunenn)
112 points by CMLab on Dec 19, 2023 | 39 comments


Could someone fill me in on why machine learning would be necessary for pitch detection? Isn't it something that could just be solved with an FFT, or is it a much more complicated task?


Pitch is a *subjective* property, inherently tied to the complex processing humans use to perceive sounds. “Simple” physical measures like the fundamental frequency of a periodic signal are very closely related, but for real-world audio (which isn't really periodic), the relationship is more complicated.


Could you elaborate a bit more? It seems to me like the note being played would always correspond to the fundamental frequency observed. When is this not the case? Maybe as the note rings out, the fundamental frequency and first few overtones lose power, and all that's still audible are the higher overtones?


There is a nice little rabbit hole to go into: the psychoacoustics of church bells.

https://www.hibberts.co.uk/what-note-do-we-hear-when-a-bell-...

Almost all musical instruments (such as pianos, organs, orchestral instruments and the human voice) have sounds that contain a range of frequencies f, 2f, 3f, 4f and so on where f is the lowest frequency in the sound. The pitch or note we assign to the sound corresponds to the frequency f. Frequencies with this regular arrangement are called harmonic. The frequencies in the sound of bells, on the other hand, are not harmonic, and the pitch we assign to the sound of a bell is roughly an octave below the fifth partial up ordered by frequency. This partial is called the nominal, because it provides the note name of the bell. There often isn’t a frequency in the bell’s sound corresponding to the pitch we hear.


That's actually not true: perceived pitch can differ from the fundamental frequency because of psychoacoustics. E.g. you can have a "missing fundamental" - https://en.wikipedia.org/wiki/Missing_fundamental - or other effects like "sum and difference tones", which are quite popular in spectralism / spectral music.


Simple techniques like autocorrelation can still recover a missing fundamental. To answer the GP post: using neural networks for this task is overkill for simple, clean signals, but it can be desirable if you need (a) extremely high accuracy or (b) robust results when there are signal degradations like background noise.
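
For example (a quick numpy sketch, nothing from the linked project): synthesize a tone with energy only at 200, 300 and 400 Hz - i.e. a 100 Hz fundamental that is entirely absent - and autocorrelation still peaks at the 10 ms period:

    import numpy as np

    sr = 16000
    t = np.arange(sr) / sr
    # Harmonics 2f, 3f, 4f of f = 100 Hz; the fundamental itself is absent.
    x = sum(np.sin(2 * np.pi * 100 * k * t) for k in (2, 3, 4))

    # Autocorrelation via FFT (Wiener-Khinchin theorem), positive lags only.
    acf = np.fft.irfft(np.abs(np.fft.rfft(x, n=2 * len(x))) ** 2)[:len(x)]

    # Search lags between 1 ms and 20 ms (1000 Hz down to 50 Hz).
    lo, hi = sr // 1000, sr // 50
    lag = lo + np.argmax(acf[lo:hi])
    print(sr / lag)  # -> 100.0, the missing fundamental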


We may perceive the same pitch (perhaps with a different timbre) even if the fundamental frequency is missing from a tone [1].

It seems to me like Wikipedia agrees that even if the fundamental frequency is missing, we could still use some kind of FFT to find the fundamental frequency being played (and therefore the perceived pitch). I might be missing something; this is very far from my area of expertise :P

1 - https://en.wikipedia.org/wiki/Missing_fundamental


A transformer-based network model for pitch tracking of musical instruments.

The timbre of musical notes is the result of various combinations and transformations of harmonic relationships, harmonic strengths and weaknesses, instrument resonant peaks, and structural resonant peaks over time.

It utilizes the transformer-based tuneNN network model for abstract timbre modeling, supporting tuning for 12+ instrument types.


This smells like an automated summary.


It is from the submitter. I think it is intended as a submission statement.


That’s the first time I’ve seen/noticed that. I think it makes me feel better?


Not sure what you were feeling bad about, but it's also directly from the readme. OP just saw something cool on GitHub and posted it along with the project's own description, which is pretty standard on HN (I think; that's how I've done it, at least).


How does the accuracy of this compare to CREPE?

https://github.com/marl/crepe

https://github.com/maxrmorrison/torchcrepe

Does anyone know what the current state of the art is, within the Music Information Retrieval community?


CREPE generally has high latency and high error rates in instrument pitch recognition, especially for guitar. Our team will release benchmark test data and results later.


High latency - agreed, but it depends on whether a GPU is available. If one is, then theoretically CREPE could be real-time. The error rates for pitch recognition are still quite good for the full CREPE model, though. I'm interested to see the data on this claim.


Thanks. I’d love to try TuneNN! Are you releasing a pretrained model? How do I run it on a wav file?


What's the license?

What are your thoughts on PESTO which learns pitch-prediction very well with a small network, and uses a self-supervised objective?

https://arxiv.org/abs/2309.02265

https://github.com/SonyCSLParis/pesto


This is cool! The very best software-based tuning tech out there is probably in piano tuning apps; they cost hundreds of dollars or more and are specifically made to report on harmonics and other piano nuances.

Do you have any comparisons against other pitch-detection tech? Accuracy? Delay/responsiveness? I assume it's much more compute work than a hand-coded FFT-type pitch detector.

I think it's possible this would find use in the piano world if the output offers something new / something that can analyze what a piano-tuning maestro can hear and make it accessible to a mid-tier tuner.


Sounds like you know a thing or two about pitch detection... I've been working on a C implementation of YIN and PYIN (a real GPL minefield for someone wanting to provide the end result as MIT/public domain!), and I'm wondering whether it's a good choice for real-time, CPU-bound speech pitch detection, or if there are better ways. May I ask what your thoughts are on this?


Have you also considered implementing the Nebula[1] algorithm?

[1] https://github.com/Sleepwalking/nebula


I need non-GPL libraries as a reference. The problem with YIN and especially PYIN is that the MIT-licensed code I've found sometimes looks a bit too similar to earlier GPL code, and rewriting it to do the same thing with different code is fairly hard. Here I'm assuming that translating e.g. GPL Python or C++ into C would mean the license is retained.


Can you not just write it from the paper(s)? Or is that more effort than value to you?

> that translating eg. GPL Python or C++ into C would mean the license is retained

It depends a bit on what exactly "translating" means, but it could easily be a derivative work.

Honestly, in that situation I wouldn't even look at the code. You might use it to test for equivalent behavior after you have your own implementation, but only in a gross sense.
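
FWIW, the core of YIN is small enough to write from the paper alone. Here's a clean-room numpy sketch of the difference function, cumulative mean normalization, and absolute threshold (steps 2-4 of de Cheveigné & Kawahara, 2002), written from the paper's equations rather than any existing code. It omits the parabolic-interpolation step, so treat it as a starting point, not a finished detector:

    import numpy as np

    def yin_pitch(frame, sr, fmin=60.0, fmax=500.0, threshold=0.1):
        """YIN steps 2-4; frame must be longer than sr/fmin samples."""
        tau_max = int(sr / fmin)
        w = len(frame) - tau_max  # integration window size
        # Step 2: difference function d(tau).
        d = np.array([np.sum((frame[:w] - frame[tau:tau + w]) ** 2)
                      for tau in range(tau_max)])
        # Step 3: cumulative mean normalized difference d'(tau); d'(0) = 1.
        dprime = np.ones_like(d)
        cums = np.cumsum(d[1:])
        dprime[1:] = d[1:] * np.arange(1, tau_max) / np.where(cums == 0, 1, cums)
        # Step 4: the first dip below the absolute threshold wins.
        tau_min = int(sr / fmax)
        for tau in range(tau_min, tau_max - 1):
            if dprime[tau] < threshold:
                while tau + 1 < tau_max and dprime[tau + 1] < dprime[tau]:
                    tau += 1  # slide down to the local minimum
                return sr / tau
        return sr / (tau_min + np.argmin(dprime[tau_min:]))  # fallback: global min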


I think I have to look at the code when using other people's MIT-licensed code... If they have used something that's GPL, or used someone else's code that turns out to be GPL, then it becomes my problem when translating it. And I'm not smart enough to just follow a paper.


I have some code here if it interests you: https://github.com/sevagh/pitch-detection

My favorite is the McLeod Pitch Method (MPM). It runs fast enough for real-time purposes, even in a WASM example: https://github.com/sevagh/pitchlite
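
For anyone wondering what MPM does differently from plain autocorrelation: it normalizes by the signal power at each lag (the NSDF) and then takes the first "key maximum" near the height of the biggest one. A simplified numpy sketch - the real method adds parabolic interpolation and more careful peak picking:

    import numpy as np

    def mpm_pitch(frame, sr, k=0.9):
        """Simplified McLeod Pitch Method: NSDF + key-maximum picking."""
        n = len(frame)
        # Normalized square difference function (McLeod & Wyvill, 2005).
        acf = np.array([np.sum(frame[:n - tau] * frame[tau:]) for tau in range(n)])
        m = np.array([np.sum(frame[:n - tau] ** 2 + frame[tau:] ** 2)
                      for tau in range(n)])
        nsdf = 2 * acf / np.where(m == 0, 1, m)
        # Ignore the lag-0 lobe: start after the first negative-going crossing.
        start = int(np.argmax(nsdf < 0))
        peaks = [tau for tau in range(start + 1, n - 1)
                 if nsdf[tau - 1] < nsdf[tau] >= nsdf[tau + 1]]
        if not peaks:
            return None
        # First peak within k of the highest peak gives the period estimate.
        cutoff = k * max(nsdf[p] for p in peaks)
        tau = next(p for p in peaks if nsdf[p] >= cutoff)
        return sr / tau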


Ha! I've translated your YIN code, actually! Your autocorrelation is pretty cool - the GPL versions all use an additional FFT. I've been struggling with your PYIN implementation, though, because the beta distribution is copied from the GPL PYIN source; the paper just references its source code for that part, and, as you also found out, it's not a real beta distribution. I asked one of the PYIN authors (Dixon) if he'd be willing to change the license, and he forwarded my mail a week ago - I haven't heard back.

Then there's the absolute_threshold function, which is the same as in the PYIN source, where it says "using Jorgen Six'es loop construct". This "loop construct" doesn't have a clear license, because he doesn't answer the issues about it in his TarsosDSP library, and I'm not sure if I should bother him about a few lines of code. I'm assuming it's a coincidence and that it's just a normal way to find the absolute threshold. I really don't want to point fingers here; I'm being paranoid because I try to make sure I don't publish something that could get people in trouble...

So I have been staring at your code for many hours, and the YIN implementation works well. The PYIN, on the other hand... well, I necro-posted a while ago in one of your pull requests, I think ;)


It sounds like you've found it already, but the original pYIN implementation is in the VAMP plugin. Simon Dixon is my PhD supervisor, but he's quite busy. Feel free to email me questions in the meantime: j.x.riley@ the same university as Simon. There's also a Python implementation in the librosa library, which might have a better license for your purposes.
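
For reference, the librosa version is just a function call (librosa is ISC-licensed; pyin landed in version 0.8 - the file path here is only an example):

    import librosa

    # Load a clip and run librosa's pYIN over a two-octave search range.
    y, sr = librosa.load("voice.wav")
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    # f0 is a per-frame Hz track, NaN where the frame is judged unvoiced.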


Amazing! You are much more thorough about licensing issues than I am.


> And I'm not smart enough to just follow a paper

Don't sell yourself short. This is the sort of thing that is only straightforward if you have the right background.


Based on our current tests, our algorithm shows significantly higher accuracy and robustness than traditional digital signal processing algorithms such as PEF, NCF, YIN, and HPS. Our team is working diligently, and we will release benchmark test data and results in the near future.
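
(For readers who haven't met those baselines: HPS, the harmonic product spectrum, is the simplest of the four. It downsamples the magnitude spectrum by factors of 2, 3, ... and multiplies, so the harmonics pile up on the fundamental bin. A toy sketch, not the project's code:)

    import numpy as np

    def hps_pitch(frame, sr, n_harmonics=4):
        """Toy harmonic product spectrum pitch estimate."""
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        hps = spec.copy()
        for h in range(2, n_harmonics + 1):
            hps[:len(spec) // h] *= spec[::h][:len(spec) // h]
        limit = len(spec) // n_harmonics    # only bins hit by every factor
        bin_ = 1 + np.argmax(hps[1:limit])  # skip the DC bin
        return bin_ * sr / len(frame)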


That's pretty nice. Do you have any idea how it does it?


That's interesting. Can you point to one of these piano tuning apps that are $100+?


It might be worth pointing out that the banjo model is for a four-string banjo, given that the five-string banjo is the more common instrument.


Does anyone know where I should look if I want to detect specific sounds? Like a smoke alarm, a food bowl dispenser (it's very distinct), a cat meowing, a 3D printer collision, that sort of thing?



Use any model trained on the AudioSet dataset. There's one called EfficientAT, I think, that I use regularly and it's pretty reliable.


You would learn how to do this in the first & second chapters of the fast.ai course.
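
i.e. render each clip as a spectrogram image and train a stock image classifier on folders of labeled examples. A rough sketch with librosa + fastai (the sounds/ folder layout and filenames are made up):

    from pathlib import Path
    import librosa
    import matplotlib.pyplot as plt
    from fastai.vision.all import (ImageDataLoaders, vision_learner,
                                   resnet18, error_rate)

    # 1. One folder per class, e.g. sounds/smoke_alarm/*.wav, sounds/cat_meow/*.wav.
    #    Render each clip as a mel-spectrogram PNG next to the wav.
    for wav in Path("sounds").rglob("*.wav"):
        y, sr = librosa.load(wav)
        mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
        plt.imsave(wav.with_suffix(".png"), mel, origin="lower")

    # 2. Train an ordinary image classifier on the spectrogram images.
    dls = ImageDataLoaders.from_folder("sounds", valid_pct=0.2)
    learn = vision_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(3)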


To the dev: the tuner gives me an incredibly high error window with the following message. It doesn't prompt to access the mic (I think that's related). Ubuntu/KDE/Firefox:

An error occurred running the Unity content on this page. See your browser JavaScript console for more info. The error was:

    TypeError: 'microphone' (value of 'name' member of PermissionDescriptor) is not a valid value for enumeration PermissionName.
    checkPermission@https://aifasttune.com/public/web/microphone/microphone.js:3...
    _Microphone_checkPermission@https://aifasttune.com/public/web/Build/web.framework.js:10:...
    @https://aifasttune.com/public/web/Build/web.wasm:wasm-functi... (x6)
    invoke_iiii@https://aifasttune.com/public/web/Build/web.framework.js:10:...
    @https://aifasttune.com/public/web/Build/web.wasm:wasm-functi... (x7)
    unityFramework/Module._SendMessageString@https://aifasttune.com/public/web/Build/web.framework.js:10:...
    ccall@https://aifasttune.com/public/web/Build/web.framework.js:10:...
    SendMessage@https://aifasttune.com/public/web/Build/web.framework.js:10:...
    SendMessage@https://aifasttune.com/public/web/Build/web.loader.js:1:3343
    loadURL@https://aifasttune.com/public/web/game/fastGameController.js...
    i@https://aifasttune.com/assets/index-64322640.js:1:777
    setup/<@https://aifasttune.com/assets/index-64322640.js:1:611


Thank you for the error report. We will work hard to address it. Currently the model-related data is relatively large, so the issue may be related to network speed.


I got the same error on Ubuntu/GNOME/Firefox. On Chrome, I don't get an error and I'm correctly prompted for microphone access, but if I grant permission, it does not seem to pick anything up (I've used my mic successfully with other web apps).



