
No ML frameworks implement it yet, though I'd be happy to work with people from the PyTorch/TF/JAX/CUDNN/CUTLASS/etc. teams (or volunteers) if anyone wants to make this happen.

Also, while you can get 200x compression, I do want to emphasize that there's a speed vs quality tradeoff and the results will vary by problem. We have much more careful statements in the paper about the exact problem setup, tradeoffs, etc. Also, as I've mentioned in other comments, it probably won't help too much on modern GPUs due to their acceleration of dense GEMMs but not shuffles. CPU inference is the killer app here.


IMO it would be super cool and I hope someone does it. There are a lot of interesting tradeoffs around which techniques to use for which matrix sizes and under which assumptions about read vs write ratios, what you have a training set for, whether you can fuse compression into previous ops, etc.


hm... so maybe a better place would be one of these toolkits like JAX, where the entire computation is known at optimization time, whereas a BLAS would have to do some heroic heuristics to try to fully optimize underneath the BLAS interface.


We found sparse, truncated PCA to be the most competitive baseline. We beat it by a lot (see the paper [1]), but the other big drawback is that trading off the rank vs sparsity was an ugly hyperparameter tuning problem. By ugly, I mean that the results were really sensitive to getting this right, it wasn't easy to set a priori, and it took a while to iterate on because the sparse PCA trained much more slowly than any other practical alternative.

There are situations where PCA/SVD is the right approach though. Namely, if you need very low error, our method often can't deliver it, whereas throwing away dims that explain almost no variance can. It's also just easier to implement.

[1] https://arxiv.org/pdf/2106.10860.pdf
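For concreteness, here's roughly what the (dense) truncated-SVD version of that baseline looks like in numpy; the sparse-PCA variant adds the rank-vs-sparsity knob mentioned above, and the function names here are just illustrative:

    import numpy as np

    # Factor the fixed matrix B once, offline: B ~= U_k @ SVt_k with rank k.
    def factor(B, k):
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        return U[:, :k], s[:k, None] * Vt[:k]

    # Each product A @ B then costs O(N*D*k + N*k*M) instead of O(N*D*M).
    def approx_matmul(A, U_k, SVt_k):
        return (A @ U_k) @ SVt_k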


Exactly. You can run it on sparse inputs. It's just that our implementation doesn't exploit the sparsity, so we don't claim that it will work better.


Definitely. On CPUs, you could make this 2x faster pretty easily with just another execution port for vpshufb / vtbl and a 4-bit lo/hi unpack instruction.

Though the real speedup would be allowing dense matmul ASICs to operate on 16-byte tables and 4-bit indices as operands. The reason Bolt and MADDNESS end up so fast is that they produce "sparse" representations that are still contiguous, strided arrays in memory. So the kernels and access patterns are just like those of dense GEMMs (and therefore vectorizable, etc.), but with lookup-adds instead of multiply-adds.

Hopefully-clarifying image: https://imgur.com/a/trOB69U
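In case it helps, the inner loop boils down to something like this (illustrative numpy, not our actual kernels, which vectorize the lookups with vpshufb/vtbl so a single instruction handles a whole vector of 16-entry tables):

    import numpy as np

    # codes:  (N, C) uint8 array of 4-bit codebook indices, one row per encoded input
    # tables: (C, 16) array of precomputed partial dot products with one output column
    # Both are dense, contiguous arrays, so the access pattern matches a dense GEMM's,
    # but each multiply-add becomes a lookup-add.
    def lookup_add_dot(codes, tables):
        return tables[np.arange(tables.shape[0]), codes].sum(axis=1)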


Fascinating, might try implementing this on an FPGA.


There's definitely a tradeoff between speed and accuracy. We characterize this for various problems in the paper (https://arxiv.org/pdf/2106.10860.pdf), but tl;dr is that it speeds things up more at a given level of error when there's more redundancy in your matrices.

A back-of-the-envelope calculation suggests that this won't beat tensor cores on NVIDIA GPUs. This is basically because ~half the die is an ASIC for dense (and 2:4 sparse) matmuls, with no support for the sparsity structure we induce. If 1:16 sparsity were supported, or if there were a batched warp_shuffle instruction, we'd get speedups on GPUs similar to those we see on CPUs.


Author here. Ask me anything--happy to answer questions.

Also, if you like this kind of work, you might like what I've been building for the past year: Composer [1]. It speeds up neural net training by a lot (e.g., 7x faster for ResNet-50) [2] and, in contrast to Bolt/MADDNESS, is polished, documented code you can get working in <5min.

[1] https://github.com/mosaicml/composer

[2] https://www.mosaicml.com/blog/mosaic-resnet


Thank you for your efforts. I came across your paper/code and posted it here. I was looking for a technique to cost-optimise transformer-based question answering. Presently I am using a CPU, and getting a GPU is too costly on AWS.

Since I use high-level code, I don't understand the maths completely. However, I was wondering whether your techniques could be beneficial on CPUs.

If I were to use this to improve a transformer-based architecture, what should my approach be?


Thanks for posting it!

It should be possible to get large speedups on CPUs, but the trick will be gradually approximating each of the layers in the model (see my reply to the sibling comment). It's not conceptually difficult, but it will require a fair amount of C++ work to port the code to GPUs* for training, and it will probably go slower than dense ops on modern GPUs because tensor cores don't support our memory layout.

I think of this paper as the first in a two-part series, where the next one takes these fast ops and gets them working in full neural nets. (If anyone wants to do this project, happy to coadvise you / talk about it whenever; I won't have bandwidth to do it myself for the foreseeable future).

*Someone recently started doing this as part of their master's thesis: https://github.com/joennlae/halutmatmul


Thank you, I will try to take this up. What would be the best way to reach out to you?


email. <my first name>@mosaicml.com


Are you aware of any code/projects that convert something like a fully trained ResNet-50 model into a MADDNESS-optimized, approximate model?

An approximate but speedier ResNet inference model that runs on a CPU would be useful even if it's not quite as fast/accurate as a GPU inference model, since the cost to run a GPU is currently much higher than a CPU's.


This master's thesis sort of does it for individual layers, but it doesn't have any fine-tuning yet so it completely wrecks the accuracy: https://github.com/joennlae/halutmatmul.

If someone worked on contributing this functionality to Composer [1] I'd be down to help out. I can't justify building it all on my own right now since we're 100% focused on training speedup, but I could definitely meet and talk through it, help code tricky parts, review PRs, etc.

[1] https://github.com/mosaicml/composer


There's lots of work in this area.

Quantization is a common technique. See for example https://pytorch.org/docs/stable/quantization.html
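For instance, post-training dynamic quantization of the Linear layers (one of the options on that page) is nearly a one-liner; the toy model here is just for illustration:

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 2)
    )
    # Store the Linear weights as int8 and dequantize them on the fly at inference time.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )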


Not to distract from the clearly good technical work you've done here, but why name it Bolt when there's a fintech startup with the same name? Among other brand-confusion issues (like the Chevy Bolt).


I named it this in 2017 and was only worried about name collisions with other GitHub repos and ML algorithms. Also it's a backronym for Based On Lookup Tables + sounds at least somewhat evocative of going fast, so it was the best name for an algorithm I could come up with.


I see it noted that this could speed up machine learning inference, but is there any hope of this being extended to also speed up training? I imagine that with a 100x speedup in matmuls, albeit approximate ones, one could plausibly train on a CPU.


Yes. Making this happen is another research project, but I think it would be fairly straightforward. The issue is that you can't backprop through the assignment step, so you get no gradient with respect to the input. This mandates a progressive layer-freezing strategy. I don't think it would be too hard to get working though; you'd likely just need to train for longer, or start with a pretrained model and fine-tune it as you freeze + approximate the layers.
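Schematically, that schedule would look something like this (assuming PyTorch nn.Module layers; the two helper functions are hypothetical placeholders, not an existing API):

    # Go shallowest to deepest: once a layer is approximated, no gradient flows
    # through it to anything earlier, so everything upstream must already be
    # frozen and approximated.
    def progressively_approximate(model, layers, approximate_with_lookup_tables, fine_tune):
        for layer in layers:
            approximate_with_lookup_tables(layer)   # swap in the table-lookup version
            layer.requires_grad_(False)             # freeze its (now fixed) parameters
            fine_tune(model)                        # let the remaining layers adapt to the new error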


Have you done any presentations at the big companies yet?


Nope. I'd love to though.


We have a generalization guarantee in Section 4.5. It's not especially tight though; in practice, the errors from different codebooks tend to be mostly independent, and you get nice Gaussian-like concentration. I would look at the empirical results in Section 5 to get a better feel for how it performs in practice.


So this misses a few aspects of why the method works:

- You can't actually get a speedup from the proposed approach. You'd need a lookup table with 2^b * 2^b entries to multiply two b-bit numbers, and the lookups would be much slower than just doing the multiplications.

- Most of what we do is perform fewer operations. Our lookup-adds are actually slower per op than multiply-adds because the latter have better hardware support; we just compress the matrices by so much that it's worth it. Relatedly, for sufficiently few codebooks, we're actually sublinear in the input size--i.e., we don't even read all of it (see the arithmetic sketch after this list).

- We have some lightweight machine learning in there to make the sublinearity and other approximations not suck. Getting the ML fast enough to beat a BLAS GEMM is far from trivial.
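To make the compression point concrete, here's some illustrative arithmetic (the numbers are hypothetical, not the paper's exact configuration):

    D, C = 512, 16                      # row length vs. number of codebooks
    dense_bytes = D * 4                 # 512 float32 values = 2048 bytes
    encoded_bytes = C * 4 // 8          # one 4-bit code per codebook = 8 bytes
    print(dense_bytes / encoded_bytes)  # 256.0x smaller; the encoder also reads only a
                                        # handful of dimensions per codebook, hence sublinear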


I'm actually not quite sure what you mean by breaking down into two-dimensional operations. We use operations on pairs of vectors, but nothing is assumed to be two-dimensional, and I don't think we suggest imagining anything as a plane? The vectors can be arbitrary and there's no assumption about curvature. Happy to clarify if I can help.

