Hacker News
FizzBuzz – SIMD Style (morling.dev)
64 points by gunnarmorling on March 11, 2021 | 23 comments



> there is no modulo operation.. So what could we do to solve the FizzBuzz task?

You don't need a modulo instruction when the divisor is known at compile time.

Good compilers routinely implement integer division and modulo as a sequence of faster instructions like multiplication and bit shifts. Here are some examples: https://godbolt.org/z/4noTxP

Works for SIMD just fine. Take a look at what the clang auto-vectorizer did: https://godbolt.org/z/q4bcfj See what happened to the code? It turned into a bunch of AVX2 instructions, and the loop is gone.


If anyone is interested in knowing how that trick works: if you take a 32x32-bit multiplication problem "a x b", you get a 64-bit result. If you drop the lowest 32 bits, you're performing the operation floor(a x b / 4294967296).

Turns out you can represent a great many division problems in terms of a x (b / 4294967296), where b is a precomputed constant approximating 4294967296 / divisor.

For example, a / 2 coincides exactly with a x 2147483648 / 4294967296. a / 3 coincides approximately with a x 1431655765 / 4294967296.

------

That's the basis of the trick. The devil is in the details however. You'll find that floor(a x b / 4294967296) is not exactly equivalent to integer division: some values are "off" slightly. A few adds, or subtracts can correct for this error.

Some values are "perfect" and don't need correction. The compiler will build off of these perfect values and combine them with adds and bitshifts.
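As a concrete sketch of the bit-perfect variant: the standard magic constant for unsigned division by 3 is 0xAAAAAAAB = ceil(2^33 / 3), combined with a 33-bit shift (class and method names here are mine):

```java
public class MagicDiv3 {
    // Division by 3 with no division instruction: multiply by the
    // "magic" constant 0xAAAAAAAB == ceil(2^33 / 3), then shift right by 33.
    // Bit-perfect for every non-negative 32-bit input; the product fits in
    // a 64-bit long because a < 2^31 and the constant is < 2^32.
    static int div3(int a) {
        return (int) (((long) a * 0xAAAAAAABL) >>> 33);
    }
}
```

Note how the floor constant 1431655765 from above is the "imperfect" one: 3 x 1431655765 = 4294967295, which still floors to 0 after the 32-bit shift. Rounding the constant up and widening the shift by one bit removes that error.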

-----

Before compilers were good, embedded engineers were basically expected to know how to do this. Most embedded CPUs had no division instruction (including ARMv7 and older, but also the 8051 and other venerable microprocessors).

The "imperfect" division was probably the most used back then. Compilers, however, have to produce bit-perfect results (exactly equivalent to a division instruction). Embedded engineers could often decide that "imperfect division" was good enough.


Good comment; however, for a / 2 and other powers of two, modern compilers instead optimize division into shifts and modulo into a bitwise AND. Both are very fast instructions, faster than integer multiplication.

Even JIT compilers do these optimizations: https://sharplab.io/#gist:ce7f7677cbb9f43efe74aeeadb75deb1
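For non-negative operands the power-of-two transformation is easy to write out by hand. A small sketch (divisor 8 picked arbitrarily, names mine):

```java
public class Pow2Div {
    // For non-negative a and a power-of-two divisor d = 1 << k:
    //   a / d  ==  a >>> k         (shift instead of divide)
    //   a % d  ==  a & (d - 1)     (mask instead of modulo)
    // Negative signed dividends need a small correction, which is the
    // extra instruction or two compilers emit in the signed case.
    static int div8(int a) { return a >>> 3; }
    static int mod8(int a) { return a & 7; }
}
```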


My favorite is still the TensorFlow version:

https://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/


> interviewer: How far are you intending to take this?

>

> me: Oh, just two layers deep -- one hidden layer and one output layer

Actual LOL.

Thanks for sharing!


Any FizzBuzz example almost feels too silly to deserve a serious comment. But here goes.

As you noticed, the FizzBuzz pattern has a period of 15. So if you want to go fast, just go directly to a 15x vectorized unroll with unaligned stores.

Assuming 32-byte vectors like with AVX2, you initialize two vector registers v0, v1 to hold the pattern of 15 4-byte values (the 16th value doesn't matter) corresponding to 0 through 15. You also have corresponding delta vector registers d0, d1 whose entries are +15 for the "pass-through" values and 0 for the -1/-2/-3 sentinel values. The loop kernel is just this:

    store_unaligned(p, v0)
    store_unaligned(p + 8, v1)
    v0 += d0
    v1 += d1
    p += 15
And of course you still need the remainder loop.

You could remove the unaligned stores with a monstrous lcm(15, 16) = 240x unroll (16 shifted instances of the above kernel) but you wouldn't be able to keep all the necessary loop state in registers.


Your version, written in Java (assuming static imports of IntStream.range and Arrays.stream, plus the -1/-2/-3 sentinel constants):

  private static final int FIZZ = -1, BUZZ = -2, FIZZ_BUZZ = -3;

  private static final int[] VALUES = range(0, 16).map(FizzBuzz::fizzbuzz).toArray();
  private static final int[] DELTAS = stream(VALUES).map(v -> v < 0? 0: 15).toArray();

  private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

  private static int fizzbuzz(int i) {
    return i % 15 == 0? FIZZ_BUZZ: i % 5 == 0? BUZZ: i % 3 == 0? FIZZ: i;
  }

  public static int[] simdFizzBuzz(int length) {
    var result = new int[length];

    var v1 = IntVector.fromArray(SPECIES, VALUES, 0);
    var v2 = IntVector.fromArray(SPECIES, VALUES, 8);
    var d1 = IntVector.fromArray(SPECIES, DELTAS, 0);
    var d2 = IntVector.fromArray(SPECIES, DELTAS, 8);

    var i = 0;
    var upperBound = length - length % 15;
    for(; i < upperBound; i += 15) {
      v1.intoArray(result, i);
      v2.intoArray(result, i + 8);
      v1 = v1.add(d1);
      v2 = v2.add(d2);
    }
    for(; i < length; i++) {
      result[i] = fizzbuzz(i);
    }
    return result;
  }
quite cool indeed


There's a small bug in your code. The loop kernel writes out 16 entries despite only advancing by 15 entries, so your upper bound calculation needs adjustment. I'm also worried the JVM's compiler won't be able to eliminate the array bounds checks because of how you calculate the upper bound with the modulus. Writing it like this should fix both issues:

    // We need i + 15 < length to keep the last written entry in bounds.
    for (; i < length - 15; i += 15)


Yes, right: if length is a multiple of 15, there is an issue.

And there is no primitive for masking on AVX2, those are only available on AVX-256.

if we want to minimize the iterations of the post loop, it should be

  var upperBound = length % 15 == 0? length - 15: length - length % 15;
but as you said, I'm not sure the VM will eliminate the bounds checks in that case, so your solution seems to be the best

  var upperBound = length - 15


> And there is no primitive for masking on AVX2, those are only available on AVX-256.

Sure there is: a masking MOV in the original AVX (and thus in AVX2), VMASKMOV. It works for both masked loads and stores.

https://www.felixcloutier.com/x86/vmaskmov

Notably: "Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero."

A nitpick: there's no such thing as AVX-256, you probably meant AVX-512.


Thanks, I'd overlooked that:

    var mask = VectorMask.fromLong(SPECIES, 0b0111_1111);

    var i = 0;
    var upperBound = length - length % 15;
    for(; i < upperBound; i += 15) {
      v1.intoArray(result, i);
      v2.intoArray(result, i + 8, mask);
      ...


Yeah, that looks really neat, thanks for sharing that idea and even putting it together in Java here! I'll add it to the repo (https://github.com/gunnarmorling/simd-fizzbuzz) in a bit, and run the JMH benchmark for it, for the fun of it.


I guess you can also prefill result[i] = i and then apply a fizzbuzz mask to avoid the addition altogether


Oh, wait: the original code receives an input array, while the version above loops from zero to n, which is something else.


I actually have one use for FizzBuzz!

I have used FizzBuzz as an introduction to wheels. Any repeating sequence will do, but everyone knows FizzBuzz. It is a great example of that because you replace a simple algorithm with one that is also simple and about twice as fast in my language of choice (using circular, singly linked lists for the wheel).

The article does a similar thing, but using masks to achieve the speedup.
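Not the commenter's actual code, but a minimal sketch of the idea, with a plain array standing in for the circular list (one precomputed revolution of period 15; names mine):

```java
import java.util.ArrayList;
import java.util.List;

public class FizzBuzzWheel {
    // One revolution of the wheel: the output shape for i = 1..15,
    // where null means "emit the number itself".
    private static final String[] WHEEL = new String[15];
    static {
        for (int i = 1; i <= 15; i++) {
            WHEEL[i - 1] = i % 15 == 0 ? "FizzBuzz"
                         : i % 3 == 0  ? "Fizz"
                         : i % 5 == 0  ? "Buzz"
                         : null;
        }
    }

    static List<String> fizzbuzz(int n) {
        List<String> out = new ArrayList<>(n);
        int w = 0; // position on the wheel
        for (int i = 1; i <= n; i++) {
            String s = WHEEL[w];              // no divisibility tests in the loop
            out.add(s != null ? s : Integer.toString(i));
            if (++w == 15) w = 0;             // the wheel rolls around
        }
        return out;
    }
}
```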


What is "wheel" in this context? The main one I can think of in programming is Python packages. I even considered you were talking about literal wheels! But it sounds like you're talking about some sort of data structure, like a ring buffer?


I call it a wheel because I was first exposed to them in wheel factorisation: https://en.m.wikipedia.org/wiki/Wheel_factorization


> I call it a wheel

Call what a wheel? What is the "it" here?

It doesn't seem to be a standard term but since you went to the trouble of posting a comment surely you were hoping it would be understood. Are you talking about modulo arithmetic? Or hierarchical timing wheels (the top hit for "wheel data structure")?


Any kind of circular data that repeats itself. In the case of wheel factorisation, it is a wheel "rolling" over the number line.

Any data structure will do. Either a circular one or one with O(1) access time. It is the concept used in the article with a vector mask repeating every 15 elements.

It is a technique. I have just always used a circular list for it :)


And the link I posted explains it just fine. Anyway, I actually wrote some code to generate wheels not too long ago: https://git.sr.ht/~bjoli/wheel-utils


OK, thanks. I think your concept of a wheel could be summarised as "use of a ring buffer for modulo arithmetic". (Of course, like you said, a "ring buffer" doesn't have to be a linked list with start and end links connected - an array that you just index carefully works too.)


The author should consider some feature engineering. I suspect divisibility by all numbers from 1 to 1000 would really help. Not sure which are the best features, but tensorflow is still really efficient with large inputs so I'd just add them all and see what happens


> The pattern of expected -1, -2, and -3 values repeats every 15 input values.

That's cheating!! :D



