Uh, you clearly didn't get the proof. N-bit multiplications are performed with a single addition and a single move each. The proof involves showing superiority over the number of additions in Russian Peasants.
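(For reference, the Russian Peasant baseline does one conditional addition per set bit of the multiplier, so an n-bit multiply costs up to n additions; a quick sketch, with names of my own choosing:)

    def russian_peasant(a, b):
        # classic shift-and-add: one conditional addition per bit of b,
        # so an n-bit multiplier costs up to n additions
        acc = 0
        while b:
            if b & 1:
                acc += a     # add when the current bit of b is set
            a <<= 1          # double
            b >>= 1          # halve
        return acc

    assert russian_peasant(37, 29) == 37 * 29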
Not that tiny, n not such little energy: Intel AVX-512 requires throttling the chip down to 60% of full speed when using "heavy" operations like multiplication (and also count-leading-zeroes, aka CLZ), n similar slowdowns are necessary on Xeon Phi (1.4 GHz down to 1.2 GHz for full SIMD), so it's a lot of energy for sure.
Sorting is very fast since I designed a sorting algorithm tailor-made for this problem, which is about as fast as your approach. The difference is that the numbers are 32-bit, so iterating through the entire array of all possible 24-bit mantissas (I know they're 23-bit, but the implicit leading 1 of normal (vs. denormal) values is explicit here) would be way too slow. Otherwise you'd be right: for 8-bit values you can just use counting sort, no problem. Or 14 bits, same deal. Now also notice there's a difference between the matrix size and the mantissa size; we care about the mantissa size, so it's 24 bits.
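(A minimal sketch of the counting-sort case mentioned above, assuming plain unsigned byte values; the function name is mine:)

    def counting_sort_u8(values):
        # histogram the 256 possible byte values, then emit each value
        # as many times as it occurred
        counts = [0] * 256
        for v in values:
            counts[v] += 1
        out = []
        for v, c in enumerate(counts):
            out.extend([v] * c)
        return out

    assert counting_sort_u8([3, 1, 255, 1, 0]) == [0, 1, 1, 3, 255]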
well, probably. it likely depends on how the move works; muxes aren't free and driving long wires isn't either
but the confusion in the comment you are replying to is that it thinks you are deriving a floating-point matrix multiply algorithm, when in fact you are deriving an integer matrix multiply algorithm
floating-point adds are slightly more expensive than floating-point multiplies
integer multiplies are enormously more expensive than integer adds (in power and area, though not in time)
One thing I highly recommend is trying it for yourself, just with pen and paper. Think of ten two-digit numbers under 40 for it to work nicely (just numbers under 40; 1-100 would require something like 20 numbers for it to work as well as it does in realistic examples). Write them in one line, then write them sorted on the next line, with lines connecting them to where they were before. Then underneath each number write down the difference between that number n the one before it; this is called taking the first differences. Repeat the sorting followed by taking first differences until you have only two numbers, two being an arbitrary limit. You may then pretend the row and column are the same, so expand with the same vector using the lines drawn, and prefix-sum wherever a first difference was performed.
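Here's the same procedure in rough code form, in case that's easier to follow. This is only my illustrative reading of the description above: it keeps duplicates instead of merging them (so the list stays the same length and only the values shrink), it runs a fixed number of passes instead of stopping at two numbers, and all the names are made up.

    def reduce_steps(c, steps):
        # each pass: sort (recording the permutation, i.e. the "lines"),
        # then replace the sorted values by their first differences
        # (keeping the smallest value as the first entry)
        perms = []
        for _ in range(steps):
            perm = sorted(range(len(c)), key=lambda i: c[i])
            s = [c[i] for i in perm]
            perms.append(perm)
            c = [s[0]] + [s[j] - s[j - 1] for j in range(1, len(s))]
        return perms, c

    def expand(perms, reduced, x):
        # multiply only the small reduced values, then walk back up:
        # a prefix sum undoes the differencing, the stored permutation undoes the sort
        prods = [r * x for r in reduced]
        for perm in reversed(perms):
            acc, sorted_prods = 0, []
            for p in prods:
                acc += p
                sorted_prods.append(acc)
            prods = [0] * len(perm)
            for j, i in enumerate(perm):
                prods[i] = sorted_prods[j]
        return prods

    c = [23, 7, 31, 14, 6, 38, 2, 19, 26, 11]      # ten numbers under 40
    perms, reduced = reduce_steps(c, steps=3)
    assert expand(perms, reduced, 5) == [ci * 5 for ci in c]

Each value costs one addition (the prefix sum) and one move (the un-permutation) per pass on the way back up, which is presumably where the "single addition and a single move" above comes from.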
Writing 'n' instead of 'and' when discussing mathematics is generally a very bad idea. Your example is a good one, but your use of abbreviations in your writing is horrid.
True. Too late to edit. Yeah, that is confusing; I guess I gotta write differently when discussing that. I just shave characters for character counts, which are often a problem, particularly on Twitter. It's not because I don't like typing out the whole word; otherwise I generally refrain from that sort of abbreviation. It is also stylistically unique--I use a unique style for the same reason as, and in vindication of, Auguste Rodin, who faced problems due to his statues being too literal.
I don’t follow the Rodin reasoning — was he actually critiqued for being too literal? I thought he was fairly unconventional when everyone else was literal. Maybe that’s what you’re referring to? His ‘fragmented’ style?
Yeah, I address that; it's a gotcha, and if the numbers are exponentially distributed the algorithm does not work. It is not universal. It depends in part on the exponent for which the numbers are exponentially distributed, n the other optimizations you use. This is the purpose of ongoing experiments. The Fibonacci series is an interesting case, since you get rid of the two largest numbers in each pass.
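(To make the Fibonacci point concrete: with the numbers already sorted, the first differences just reproduce the sequence, so if duplicates get merged each pass only removes the two largest values. A quick check:)

    # one pass over the first ten distinct Fibonacci numbers: sorting is a no-op,
    # the first differences are Fibonacci numbers again, and after merging the
    # duplicated 1 only the two largest values (55 and 89) have disappeared
    fib = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    step = sorted(set(b - a for a, b in zip(fib, fib[1:])))
    print(step)   # [1, 2, 3, 5, 8, 13, 21, 34]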
Yeah hey the paper will not be that painful to read if you can already perform the reduction steps. I'll answer further questions.
Yeah so for floating point an exponential distribution is bounded in how many elements it can contain for a given exponent, so it works out quite nicely. It does not work on bignums.
unless i fucked it up, it looks like you can insert a separate renormalization step before the sorting where you shift each number to the left by a variable amount, like a floating-point unit always does with the mantissa (except subnormals), and that seems to solve the exponential distribution problem; it always seems to get down to a single item from 10000 34-bit items in about 15 steps
no wait, it doesn't really solve it, because a vector of the first 1000 fibonacci numbers still takes 485 iterations. but the last number in that vector is a 694-bit number. it does seem to improve it enormously
i thought this might make it work much worse (because in a sense it's adding bits to the numbers: what used to be a 1-bit number might now have n-bit-wide differences with the numbers before and after it) but at least in random tests it seems to make a huge improvement
just to clarify, what i'm doing (with unsigned integers) is
    def normalize(v):
        # shift each value left until it reaches the 2**34 target,
        # like an FPU normalizing a mantissa
        for vi in v:
            while vi < 2**34:   # assumes vi > 0; a zero would never terminate
                vi *= 2
            yield vi

    def nreductions(v):
        # one reduction step per iteration: normalize, sort, yield, then difference
        while True:
            v = list(sortu(normalize(v)))
            yield v
            v = list(diffs(v))
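(sortu and diffs aren't spelled out here; if sortu is read as sort-plus-deduplicate, which is what makes the vector shrink, and diffs as the consecutive differences of the already-sorted list, they'd be roughly:)

    def sortu(v):
        # sort ascending and drop duplicates, so the vector shrinks each pass
        return sorted(set(v))

    def diffs(v):
        # consecutive differences of an already-sorted list (all positive after sortu)
        return (b - a for a, b in zip(v, v[1:]))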
with 256-bit numbers and a 2**256 normalization target it seems to typically be about 30 or 40 reduction steps, not sure if those qualify as bignums to you
the shifts of course have to be undone in the other direction, just like the permutations, but i don't think that's a problem?
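e.g. the bookkeeping might look something like this (just a sketch, names are placeholders): record how far each value was shifted during normalization, and shift the matching product back down at the end, which is exact because (c << k) * x == (c * x) << k

    def normalize_with_shifts(v, target=2**34):
        # like normalize() above, but also record how far each value was shifted,
        # so the matching product can be shifted back down afterwards
        out, shifts = [], []
        for vi in v:
            k = 0
            while 0 < vi < target:
                vi *= 2
                k += 1
            out.append(vi)
            shifts.append(k)
        return out, shifts

    def unshift(products, shifts):
        # (c << k) * x == (c * x) << k, so a right shift recovers c * x exactly
        return [p >> k for p, k in zip(products, shifts)]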
(oh, now i see that in §3.1 'alignment' you are already doing something like this, except that you're shifting right to reduce the number of duplicates and eliminate one extra bit of differencing per iteration, not left to reduce the dynamic range of the data. for smallish numbers that seems to be roughly as effective, but left-shift normalizing works a lot better than right-shift aligning for 256-bit numbers)
i haven't tried doing any actual vector multiplies with this algorithm yet so if i did fuck it up i wouldn't have noticed
this is a pretty exciting algorithm, thanks for sharing