This is a fascinating idea. Any real academic critique is over my head (and I hope others chime in), but some random thoughts:
- "logarithm LUT then add" seems delightfully simple, especially at low precision. I am going to have to read that paper too... (a rough sketch of how I picture it follows this list)
- The concerns about GPU style parallelism may not be as bad in "alternative" architectures. For instance, Centaur came up with a single, serial, but hilariously wide 32,768-bit SIMD core for inference: https://fuse.wikichip.org/news/3256/centaur-new-x86-server-p...
- The silicon simplification also seems relevant to Samsung's in-memory computing effort: https://www.servethehome.com/samsung-hbm2-pim-and-aquabolt-x...
- I wonder if this would be relevant to llama.cpp's CPU inference?
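For what it's worth, here is roughly how I picture "logarithm LUT then add" working at low precision. This is only my guess at the idea, not the paper's actual scheme: the 8-bit operands, the Q8 fixed-point log table, and the use of exp2() in place of a proper antilog table are all my own assumptions.

    /* Sketch: approximate x*y for small unsigned ints by looking up
       fixed-point log2 values and adding them, then converting back.
       A real design would use a second (antilog) LUT instead of exp2(). */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define FRAC_BITS 8                     /* Q8 fixed point for log2 values */

    static uint16_t log_lut[256];           /* log2(x) * 256, for x = 1..255 */

    static void build_lut(void) {
        for (int x = 1; x < 256; x++)
            log_lut[x] = (uint16_t)lround(log2((double)x) * (1 << FRAC_BITS));
    }

    /* One multiply becomes two table lookups and one integer add. */
    static uint32_t log_mul(uint8_t x, uint8_t y) {
        if (x == 0 || y == 0) return 0;
        uint32_t s = log_lut[x] + log_lut[y];       /* the "then add" part */
        return (uint32_t)lround(exp2((double)s / (1 << FRAC_BITS)));
    }

    int main(void) {
        build_lut();
        printf("37*113 ~= %u (exact %u)\n", log_mul(37, 113), 37u * 113u);
        return 0;
    }

The result is only approximate (a few counts off for the example above), which is presumably why this is pitched at low-precision inference rather than exact arithmetic.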
Well, one issue with both GPUs and CPUs that makes them bad platforms for this algorithm is that, in both, FLOPS are such an important metric for sales that multiplication is heavily subsidized in both those chip types. Huge amounts of area are dedicated to floating-point multiplication, meaning the advantage of fgemm (the name of the algorithm is the same as the name of the company) is purely one of energy.
Which is great, because if it were software it would be impossible to protect the IP. The USPTO is very clear on that, I believe in both In re Bilski and the Alice Corp. case that reached SCOTUS: algorithms need to be implemented physically, typically meaning in a chip, to be patentable. So because it needs a chip to work, it is good business; if it did not, it would be bad business. A chip provides every form of IP protection, all four forms: trade secret, copyright, patent, and even trademark. No other medium has that, to my knowledge.
So if you have a CPU or a GPU and want it to do more work in the same amount of time, this paper promises nothing, and it keeps that promise. Nonetheless I'm advancing rapidly toward the point of creating the hardware that can cut 70% off the cost of GEMM. I considered 50% off, the same thing at half the price, but it wouldn't be fair to the consumer with my economics. You see 50% discounts all the time; who cares? 70% off, you don't see that all the time. On something you actually want? Especially on a commodity, and it's still good business for me as the lowest-cost producer.
> Well, one issue with both GPUs and CPUs that makes them bad platforms for this algorithm is that, in both, FLOPS are such an important metric for sales that multiplication is heavily subsidized in both those chip types. Huge amounts of area are dedicated to floating-point multiplication, meaning the advantage of fgemm (the name of the algorithm is the same as the name of the company) is purely one of energy.
I'm having trouble understanding this. Are you saying that GPUs invest area on floating-point multipliers because FLOPS are an important marketing metric? The only thing that mattered to us was: how can we make these operations faster within the area and power constraints we have? Reducing energy consumption was thus a major goal.
I wish you luck. If I were in your shoes, I would approach NVidia or Google -- and expect to be hammered with tough questions.
> Are you saying that GPUs invest area on floating-point multipliers because FLOPS are an important marketing metric?
Yes, that is precisely what I'm saying. If I'm mistaken in saying that, that's one thing, but as far as it being what I'm saying, it very much is. It's been an important guiding principle in the project for some time now that recent chips--including FPGAs--tend to have hard IP for floating-point multiplication.
Now, spending a lot of chip area on getting more FLOPS is not necessarily a bad decision if there is no alternative for achieving fast matrix multiplication. Almost any method is sensible if no better alternative was available when the decision to use it was made. In addition, fgemm only really makes sense when matrices have more than about 1,000 elements per row or column; I'm not sure how far above 1,000 the threshold sits, but it's above that. Small matrices, and in particular small dense matrices, are still best multiplied exactly the way GPUs multiply them, with many floating-point multiplier circuits in parallel. It's not stupid in the least.
Yeah, NVidia and Google have the same business model I'm going for, Google having TPUs in its datacenters that do work that cannot be reverse engineered. Google does not sell TPUs. You can use them by sending Google the work, and you'll benefit from much lower cost and faster speed. NVidia has a similar offering, just not as well known. That's the correct business model in my analysis, and it's what fgemm will sell: sell the work.
> Google having TPUs in its datacenters that do work that cannot be reverse engineered
Help me understand: TPUs cannot be reverse engineered because the user doesn't have access to the physical device, but other devices like GPUs can?
Can you show some examples of reverse engineering of GPUs that have been performed on the basis of having physical access to the dies? Are you aware of any reverse engineering done on them using other means? How much has this reverse engineering prevented, e.g., NVidia from being financially successful? Finally, since patents are freely available to the public once they have been granted, does that nullify some concerns regarding reverse engineering?
I'm not an entrepreneur, so take this with a fistful of salt, but having worked at places like NVidia, I would never try to compete head to head with them as a startup. Very few semiconductor startups achieve any success, and the ones that do start by finding a very particular market niche where the established players aren't even trying to play.
What about us peasants who need multiplication to actually get work done instead of playing FLOPs status games? Not everyone is bottlenecked on something as specific as matrix multiplication.
Also the claims about huge amounts of area being dedicated to multiplication are false. ALU size is mostly irrelevant.
This is an interesting idea. In some sense, rotating a single point in space is only multiplying a 3-element or 4-element vector by the transformation matrix (where this idea wouldn't be useful), but rotating n points is multiplying the transformation matrix by a 3×n or 4×n matrix of points, so if the algorithm pans out, you should be able to do that kind of thing too; n can be pretty large.
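To make that concrete, here is a tiny sketch of the same point in plain C (my own illustration, not anything from the article): one rotation is a 3×3 matrix times a vector, but n points packed as the columns of a 3×n matrix turn the whole batch into a single matrix-matrix product, which is exactly the shape of work a cheaper GEMM would help with.

    /* Rotate N points about the z axis in one GEMM: Q = R * P,
       where R is 3x3 and P holds the points as columns of a 3xN matrix. */
    #include <math.h>
    #include <stdio.h>

    #define N 4                         /* number of points; could be millions */

    int main(void) {
        double a = acos(-1.0) / 2.0;    /* 90 degrees, in radians */
        double R[3][3] = {
            { cos(a), -sin(a), 0.0 },
            { sin(a),  cos(a), 0.0 },
            { 0.0,     0.0,    1.0 },
        };
        double P[3][N] = {              /* one point per column */
            { 1.0, 0.0, 1.0, 2.0 },     /* x coordinates */
            { 0.0, 1.0, 1.0, 3.0 },     /* y coordinates */
            { 0.0, 0.0, 5.0, 7.0 },     /* z coordinates */
        };
        double Q[3][N] = {{0.0}};       /* rotated points */

        /* Plain triple-loop GEMM; this is the part a faster GEMM would replace. */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < 3; k++)
                    Q[i][j] += R[i][k] * P[k][j];

        for (int j = 0; j < N; j++)
            printf("point %d -> (%.2f, %.2f, %.2f)\n",
                   j, Q[0][j], Q[1][j], Q[2][j]);
        return 0;
    }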
> A chip provides every form of IP protection, all four forms: trade secret, copyright, patent, and even trademark. No other medium has that, to my knowledge.
IANAL, but I do not believe semiconductor masks are copyrightable under US law (my limited understanding is that this is essentially due to the fact that the mask is inherently functional, and/or aspects of the merger doctrine). There is a separate sui generis mask work protection via 17 U.S.C. §§ 901-914.
Edit: Moreover, I'm unsure how you figure a chip itself is protected by trade secret, since reverse engineering an IC is not terribly difficult.
I don't know why trade secret applies, but I remember reading that it does. Perhaps in the rationale, or the preimage. It doesn't make all that much sense, come to think of it. I think Intel tried it? Intel for sure used copyright to protect chips. Hey, thanks, I did not know about 17 U.S.C. §§ 901-914.
- "logarithm LUT then add" seems delightfully simple, especially at low precision. I am going to have to read that paper too...
- The concerns about GPU style parallelism may not be as bad in "alternative" architectures. For instance, Centaur came up with a single, serial, but hilariously wide 32,768-bit SIMD core for inference: https://fuse.wikichip.org/news/3256/centaur-new-x86-server-p...
- The silicon simplification also seems relevant to Samsung's in memory computing effort: https://www.servethehome.com/samsung-hbm2-pim-and-aquabolt-x...
- I wonder if this would be relevant to llama.cpp's CPU inference?