Uh, you clearly didn't get the proof. N-bit multiplications are performed with a single addition and a single move each. The proof involves showing superiority over the number of additions in Russian Peasants.
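(For reference, the Russian Peasant baseline does one conditional addition per set bit of the multiplier, so an n-bit multiply costs up to n additions; a quick sketch, with names of my own choosing:)

    def russian_peasant(a, b):
        # classic shift-and-add: one conditional addition per bit of b,
        # so an n-bit multiplier costs up to n additions
        acc = 0
        while b:
            if b & 1:
                acc += a     # add when the current bit of b is set
            a <<= 1          # double
            b >>= 1          # halve
        return acc

    assert russian_peasant(37, 29) == 37 * 29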
Not that tiny, n not such little energy: Intel AVX-512 requires throttling the chip down to 60% of full speed when using "heavy" operations like multiplication (and also count-leading-zeroes, aka CLZ), n similar slowdowns are necessary on Xeon Phi (1.4 GHz down to 1.2 GHz for full SIMD), so it's a lot of energy for sure.
Sorting is very fast since I designed a sorting algorithm tailor-made for this problem, which is about as fast as your approach. The difference is that the numbers are 32-bit, so iterating through the entire array of all possible 24-bit mantissas (I know they're 23-bit, but the implicit leading 1 of normal (vs. denormal) values is explicit here) would be way too slow. Otherwise you'd be right: for 8-bit values you can just use counting sort, no problem. Or 14 bits, same deal. Now also notice there's a difference between the matrix size and the mantissa size; we care about the mantissa size, so it's 24 bits.
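(A minimal sketch of the counting-sort case mentioned above, assuming plain unsigned byte values; the function name is mine:)

    def counting_sort_u8(values):
        # histogram the 256 possible byte values, then emit each value
        # as many times as it occurred
        counts = [0] * 256
        for v in values:
            counts[v] += 1
        out = []
        for v, c in enumerate(counts):
            out.extend([v] * c)
        return out

    assert counting_sort_u8([3, 1, 255, 1, 0]) == [0, 1, 1, 3, 255]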
well, probably. it likely depends on how the move works; muxes aren't free and driving long wires isn't either
but the confusion in the comment you are replying to is that it thinks you are deriving a floating-point matrix multiply algorithm, when in fact you are deriving an integer matrix multiply algorithm
floating-point adds are slightly more expensive than floating-point multiplies
integer multiplies are enormously more expensive than integer adds (in power and area, though not in time)
One thing I highly recommend is trying it for yourself, just with pen and paper. Think of ten two-digit numbers under 40 for it to work nicely (just numbers under 40; 1-100 would require something like 20 numbers for it to work as well as it does in realistic examples). Write them in one line, then write them sorted on the next line, with lines connecting them to where they were before. Then underneath each number write down the difference between that number n the one before it; this is called taking the first differences. Repeat the sorting followed by taking first differences until you have only two numbers, two being an arbitrary limit. You may then pretend the row and column are the same, so expand with the same vector using the lines drawn, and prefix-sum wherever a first difference was performed.
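Here's the same procedure in rough code form, in case that's easier to follow. This is only my illustrative reading of the description above: it keeps duplicates instead of merging them (so the list stays the same length and only the values shrink), it runs a fixed number of passes instead of stopping at two numbers, and all the names are made up.

    def reduce_steps(c, steps):
        # each pass: sort (recording the permutation, i.e. the "lines"),
        # then replace the sorted values by their first differences
        # (keeping the smallest value as the first entry)
        perms = []
        for _ in range(steps):
            perm = sorted(range(len(c)), key=lambda i: c[i])
            s = [c[i] for i in perm]
            perms.append(perm)
            c = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]
        return perms, c

    def expand(perms, reduced, x):
        # multiply only the small reduced values, then walk back up:
        # a prefix sum undoes the differencing, the stored permutation undoes the sort
        prods = [r * x for r in reduced]
        for perm in reversed(perms):
            acc, sorted_prods = 0, []
            for p in prods:
                acc += p
                sorted_prods.append(acc)
            prods = [0] * len(perm)
            for j, i in enumerate(perm):
                prods[i] = sorted_prods[j]
        return prods

    c = [23, 7, 31, 14, 6, 38, 2, 19, 26, 11]      # ten numbers under 40
    perms, reduced = reduce_steps(c, steps=3)
    assert expand(perms, reduced, 5) == [ci * 5 for ci in c]

Each value costs one addition (the prefix sum) and one move (the un-permutation) per pass on the way back up, which is presumably where the "single addition and a single move" above comes from.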
Writing 'n' instead of 'and' when discussing mathematics is generally a very bad idea. Your example is a good one, but your use of abbreviations in your writing is horrid.
True. Too late to edit. Yeah, that is confusing; I guess I gotta write differently when discussing that. I just shave characters for character counts, which are often a problem, particularly on Twitter. It's not because I don't like typing out the whole word; otherwise I generally refrain from that sort of abbreviation. It is also stylistically unique--I use a unique style for the same reason as, and in vindication of, Auguste Rodin, who faced problems due to his statues being too literal.
I don’t follow the Rodin reasoning — was he actually critiqued for being too literal? I thought he was fairly unconventional when everyone else was literal. Maybe that’s what you’re referring to? His ‘fragmented’ style?
Yeah, I address that; it's a gotcha, and if the numbers are exponentially distributed the algorithm does not work. It is not universal. It depends in part on the exponent for which the numbers are exponentially distributed, n the other optimizations you use. This is the purpose of ongoing experiments. The Fibonacci series is an interesting case, since you get rid of the two largest numbers in each pass.
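(To make the Fibonacci point concrete: with the numbers already sorted, the first differences just reproduce the sequence, so if duplicates get merged each pass only removes the two largest values. A quick check:)

    # one pass over the first ten distinct Fibonacci numbers: sorting is a no-op,
    # the first differences are Fibonacci numbers again, and after merging the
    # duplicated 1 only the two largest values (55 and 89) have disappeared
    fib = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    step = sorted(set(b - a for a, b in zip(fib, fib[1:])))
    print(step)   # [1, 2, 3, 5, 8, 13, 21, 34]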
Yeah hey the paper will not be that painful to read if you can already perform the reduction steps. I'll answer further questions.
Yeah so for floating point an exponential distribution is bounded in how many elements it can contain for a given exponent, so it works out quite nicely. It does not work on bignums.
unless i fucked it up, it looks like you can insert a separate renormalization step before the sorting where you shift each number to the left by a variable amount, like a floating-point unit always does with the mantissa (except subnormals), and that seems to solve the exponential distribution problem; it always seems to get down to a single item from 10000 34-bit items in about 15 steps
no wait, it doesn't really solve it, because a vector of the first 1000 fibonacci numbers still takes 485 iterations. but the last number in that vector is a 694-bit number. it does seem to improve it enormously
i thought this might make it work much worse (because in a sense it's adding bits to the numbers: what used to be a 1-bit number might now have n-bit-wide differences with the numbers before and after it) but at least in random tests it seems to make a huge improvement
just to clarify, what i'm doing (with unsigned integers) is
    def normalize(v):
        # shift each value left until it reaches the 2**34 target,
        # like an FPU normalizing a mantissa
        for vi in v:
            while vi < 2**34:   # assumes vi > 0; a zero would never terminate
                vi *= 2
            yield vi

    def nreductions(v):
        # one reduction step per iteration: normalize, sort, yield, then difference
        while True:
            v = list(sortu(normalize(v)))
            yield v
            v = list(diffs(v))
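(sortu and diffs aren't spelled out here; if sortu is read as sort-plus-deduplicate, which is what makes the vector shrink, and diffs as the consecutive differences of the already-sorted list, they'd be roughly:)

    def sortu(v):
        # sort ascending and drop duplicates, so the vector shrinks each pass
        return sorted(set(v))

    def diffs(v):
        # consecutive differences of an already-sorted list (all positive after sortu)
        return (b - a for a, b in zip(v, v[1:]))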
with 256-bit numbers and a 2**256 normalization target it seems to typically be about 30 or 40 reduction steps, not sure if those qualify as bignums to you
the shifts of course have to be undone in the other direction, just like the permutations, but i don't think that's a problem?
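e.g. the bookkeeping might look something like this (just a sketch, names are placeholders): record how far each value was shifted during normalization, and shift the matching product back down at the end, which is exact because (c << k) * x == (c * x) << k

    def normalize_with_shifts(v, target=2**34):
        # like normalize() above, but also record how far each value was shifted,
        # so the matching product can be shifted back down afterwards
        out, shifts = [], []
        for vi in v:
            k = 0
            while 0 < vi < target:
                vi *= 2
                k += 1
            out.append(vi)
            shifts.append(k)
        return out, shifts

    def unshift(products, shifts):
        # (c << k) * x == (c * x) << k, so a right shift recovers c * x exactly
        return [p >> k for p, k in zip(products, shifts)]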
(oh, now i see that in §3.1 'alignment' you are already doing something like this, except that you're shifting right to reduce the number of duplicates and eliminate one extra bit of differencing per iteration, not left to reduce the dynamic range of the data. for smallish numbers that seems to be roughly as effective, but left-shift normalizing works a lot better than right-shift aligning for 256-bit numbers)
i haven't tried doing any actual vector multiplies with this algorithm yet so if i did fuck it up i wouldn't have noticed
this is a pretty exciting algorithm, thanks for sharing