
You're the author of Glicol, right? I've definitely had my eye on trying it out for a while. The karplus-stress-tester is great; I'm currently using message ports because they seemed most accessible at first, but I'm happy to know there are other, better options. I've done quite a bit of hand-optimizing of the code here, and while I think there's probably juice left to squeeze, it has become apparent to me that wasm is probably my next stop.

I've written one other AudioWorklet at this point, which just runs "inference" on a single-layer RNN given a pre-trained set of weights: https://blog.cochlea.xyz/rnn.html. It has similarly mediocre performance.
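
For anyone curious, the core of that "inference" is just a plain recurrence; here's a rough numpy sketch of what the worklet computes (assuming a vanilla tanh RNN and made-up weight names; the real thing does this per-block in JS):

    import numpy as np

    def rnn_step(x_t, h_prev, w_in, w_rec, b):
        # One step of a vanilla (Elman-style) RNN: combine the current input
        # frame with the previous hidden state and squash through tanh.
        return np.tanh(x_t @ w_in + h_prev @ w_rec + b)

    def run_rnn(x, w_in, w_rec, b):
        # Run the recurrence over a whole sequence, one frame at a time.
        h = np.zeros(w_rec.shape[0])
        outputs = []
        for x_t in x:
            h = rnn_step(x_t, h, w_in, w_rec, b)
            outputs.append(h)
        return np.stack(outputs)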

Thanks for all the great tips, and for your work on Glicol!


How funny, I actually corresponded with one of the authors of the "Spiking Music..." paper when it first showed up on arXiv. I'll definitely give the amp-modeling paper a read; it looks to be right up my alley!

Now that I understand the basics of how this works, I'd like to use a (much) more efficient version of the simulation as an infinite-dataset generator and try to learn a neural operator or NeRF-like model that, given a spring-mesh configuration, a sparse control signal, and a time, can produce an approximation of the simulation in a parallel and sample-rate-independent manner. This also (maybe) opens the door to spatial audio, such that you could approximate sound-pressure levels at a particular point in time _and_ space. At this point, I'm just dreaming out loud a bit.
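
To make the "parallel and sample-rate-independent" part concrete, the interface I'm imagining looks something like this (pure sketch; `field` stands in for the hypothetical learned model):

    import numpy as np

    def render(field, mesh_config, control, listener_pos, sample_rate, duration):
        # Every output sample is an independent query of the learned field at
        # (configuration, control, time, position), so the sample rate is just
        # the spacing of the time grid and all queries can be batched in parallel.
        t = np.arange(0.0, duration, 1.0 / sample_rate)
        return field(mesh_config, control, t, listener_pos)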


This is possible but very very hard! Actually getting the model to converge on something that sounds reasonable will make you pull your hair out. It’s definitely a fun and worthwhile project though. I attempted something similar a few years ago. Good luck!


Thanks for the bug report, I'll look into this and see if I can make it better! There's definitely more optimization juice to be squeezed, I think, and it'd probably be smart to allow the number of nodes in the simulation to be adjusted.

Working with AudioWorklets (https://developer.mozilla.org/en-US/docs/Web/API/AudioWorkle...) has been really cool, and I've been surprised at what's possible, but I _haven't_ yet figured out how to get good feedback about when the custom processor node is "falling behind", i.e., not delivering the next buffer quickly enough.


Same here! Not a physics engine per se, but I've been eyeing Taichi Lang (https://github.com/taichi-dev/taichi) as a potential next stop for running this on a much larger scale.

My assumption has been that any physics engine that does soft-body physics would work in this regard, just run at a much higher sampling rate than one would normally use in a gaming scenario. This simulation is actually only running at 22,050 Hz, rather than today's standard 44,100 Hz sampling rate.
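
As a toy illustration of what "a physics engine at audio rate" means (not the actual simulation on the page, just a single damped spring stepped once per sample):

    import numpy as np

    sample_rate = 22_050                 # the simulation's audio rate
    dt = 1.0 / sample_rate

    k, damping, mass = 8.0e5, 4.0, 1.0   # stiffness, damping, mass (arbitrary)
    pos, vel = 1.0, 0.0                  # "pluck": displace the mass and let go

    samples = np.empty(sample_rate)      # one second of audio
    for i in range(sample_rate):
        force = -k * pos - damping * vel
        vel += (force / mass) * dt       # semi-implicit Euler, one step per sample
        pos += vel * dt
        samples[i] = pos                 # the displacement itself is the output signal

The interesting (and expensive) part is doing that for thousands of coupled masses instead of one, which is where something like Taichi comes in.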


Thanks for the bug report, I definitely do see now that things fall apart at certain extreme values. I'll look into it and see if I can fix it. My assumption is that it's some sort of numeric overflow or underflow resulting in NaN values, and that I'll probably just need to set the slider boundaries more carefully. Thanks again!


These are both great points and I'll use them to refine my writing on the subject, I appreciate the feedback!

Apologies if it isn't clear, but the animated gifs are meant as an illustration of the iterative encoding process, where the encoder decomposes the signal step-by-step, as in matching pursuit. I'll be sure to clarify that point.

I'll add a paragraph on compression rates/ratios, although that isn't necessarily the main focus here; codecs may compress a signal, but they might also transform it into a more useful, easy-to-understand and easy-to-manipulate representation.


I couldn't agree more. I feel that the block-coding and rasterized approaches that are ubiquitous in audio codecs (even the modern "neural" ones) are a dead end for the fine-grained control that musicians will want. They're just fine for text-to-music interfaces, of course.

I'm working on a sparse audio codec that's mostly focused on "natural" sounds at the moment, and uses some (very roughly) physics-based assumptions to promote a sparse representation.

https://blog.cochlea.xyz/sparse-interpretable-audio-codec-pa...


Interesting. I'm approaching music generation from another perspective:

https://github.com/chaosprint/RaveForce

RaveForce: an OpenAI Gym-style toolkit for music generation experiments.


I think I'm beginning to wrap my head around the way modern, "deep" state-space models (e.g., Mamba, S4) leverage polynomial multiplication to speed up very long convolutions.

I'm curious whether there are other well-known or widely used methods for approximating long convolutions, outside of overlap-add and overlap-save? I'm in the audio field and interested in learning long _FIR_ filters to describe the resonances of physical objects like instruments or rooms. Block-coding (fixed-frame-size) approaches reign supreme, of course, but have their own issues in terms of windowing artifacts, etc.

I'm definitely aware that multiplication in the (complex) frequency domain is equivalent to convolution in the time domain and that, because of the fast Fourier transform, this can yield increased efficiency. However, this still results in storing a lot of gradient information that my intuition tells me (possibly incorrectly) is full of redundancy and waste.
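
For concreteness, the equivalence I mean is just the convolution theorem; a minimal numpy check (nothing SSM-specific here):

    import numpy as np

    x = np.random.randn(2 ** 15)      # signal
    h = np.random.randn(2 ** 12)      # long FIR filter (e.g. a room response)

    # Direct time-domain convolution: O(len(x) * len(h))
    y_direct = np.convolve(x, h)

    # FFT-based convolution: O(n log n), with n covering the full output length
    n = len(x) + len(h) - 1
    y_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

    assert np.allclose(y_direct, y_fft)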

Stateful, IIR, or auto-regressive approaches are _one_ obvious answer, but this changes the game in terms of training and inference parallelization.

A couple ideas I've considered, but have not yet tried, or looked too deeply into:

- First performing PCA in the complex frequency domain, reducing the number of point-wise multiplications that must occur. Without some additional normalization up front, it's likely this would be equivalent to downsampling/low-pass filtering and performing the convolution there. The learnable filter bank would live in the PCA space, reducing the overall number of learned parameters (a rough sketch of this parameterization appears after the list).

- A compressed-sensing-inspired approach, where we sample a sparse, random subset of points from both signals and recover the full result, based on the assumption that both the convolver and the convolvee(?) are sparse in the Fourier domain. This one is pretty half-baked.
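
For the first idea, the parameterization side might look roughly like this (numpy sketch with random stand-in filters; the "fewer point-wise multiplications" part would additionally require keeping the convolution itself in the reduced space):

    import numpy as np

    n_filters, n_taps, n_components = 128, 8192, 32

    # A stand-in bank of long FIR filters we'd like to parameterize compactly.
    filters = np.random.randn(n_filters, n_taps)

    # Move to the complex frequency domain.
    spectra = np.fft.rfft(filters, axis=-1)          # (n_filters, n_taps // 2 + 1)

    # PCA via SVD on the centered complex spectra.
    mean = spectra.mean(axis=0)
    _, _, vh = np.linalg.svd(spectra - mean, full_matrices=False)
    basis = vh[:n_components]                        # (n_components, n_bins)

    # Each filter is now a handful of complex coefficients in the PCA space...
    coeffs = (spectra - mean) @ basis.conj().T       # (n_filters, n_components)

    # ...and can be approximately reconstructed when the full spectrum is needed.
    approx = coeffs @ basis + mean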

I'd love to hear about papers you've read, or thoughts you've had about this problem.


FFT convolution with overlap-and-save can have very low intermediate storage (none on a GPU with cuFFTDx, for example). And most of the time the IFFT doesn't have to happen right away; lots of processing can still be performed in the frequency domain.
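
For reference, the overlap-save structure itself is simple; a plain CPU/numpy sketch (the GPU version keeps everything resident on-chip, but the bookkeeping is the same):

    import numpy as np

    def overlap_save(x, h, n_fft=1024):
        # Long FIR filtering via overlap-save (assumes n_fft >= len(h)):
        # block-wise FFT multiply, then discard the first len(h) - 1 samples
        # of each inverse transform, which are contaminated by circular wrap.
        m = len(h)
        hop = n_fft - m + 1                           # valid output samples per block
        H = np.fft.rfft(h, n_fft)                     # filter spectrum, reused every block
        x_ext = np.concatenate([np.zeros(m - 1), x])  # prepend the "saved" overlap
        y = np.zeros(len(x) + m - 1)
        for start in range(0, len(y), hop):
            block = x_ext[start:start + n_fft]
            block = np.pad(block, (0, n_fft - len(block)))
            seg = np.fft.irfft(np.fft.rfft(block) * H, n_fft)
            n_out = min(hop, len(y) - start)
            y[start:start + n_out] = seg[m - 1:m - 1 + n_out]
        return y                                      # matches np.convolve(x, h)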

Having each of the 18k CUDA cores of an L40S perform small 128-point FFTs and, with very little sync or overlap, manage long filters... is pretty efficient by itself.

There's a lot happening in the HPC world on low-rank (what you're intuiting with PCA), sparse, and tiled operations. I have a hard time applying all of this to 'simple' signal processing, and most of it lacks nicer APIs.

I've seen lots of interesting things with 'irregular FFT' codes that work on reducing the storage space necessary for FFT intermediate results, sometimes through multi-resolution tricks.

Look up Capon filters and adaptive filtering in general; there's a whole world of tricks there too. You might need a whole lot of SVDs and matrix inversions there...

But mostly, if you're on a GPU there's a wealth of parallelism to exploit and work around the 'memory-bound' limits of FFT-based convolution. This thesis https://theses.hal.science/tel-04542844 had some discussion and numbers on the topic. Not complete, but inspiring.


The gradient information in backprop can be computed similarly to the forward pass, I think. Certainly the FFT blocks are linear, so it really comes down to the point-wise multiplication, which is pretty compact.


I'm working on building models that extract sparse and easy-to-interpret representations of musical audio. The work in this post encodes short segments of music from the MusicNet dataset as a set of events with a time-of-occurrence and a low-dimensional vector representing attack envelopes and resonances of both the instrument being played and the room in which the performance occurred. I think this representation could prove superior to current block-coding (fixed-frame sizes) and text-based generation models, at least for musicians who need fine-grained control of generated audio.
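
A rough sketch of the kind of event record the decoder works with (field names are hypothetical, just to make the shape of the representation concrete):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class AudioEvent:
        time_seconds: float       # time-of-occurrence within the segment
        event_vector: np.ndarray  # low-dimensional vector encoding the attack
                                  # envelope plus instrument/room resonance

    # A segment is then just a sparse, unordered set of such events.
    segment = [AudioEvent(0.12, np.zeros(32)), AudioEvent(0.57, np.zeros(32))]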

