> there is no modulo operation.. So what could we do to solve the FizzBuzz task?
You don’t need a modulo instruction when the divisor is known at compile time.
Good compilers routinely implement integer division and modulo as a sequence of faster instructions like multiplication and bit shifts. Here are some examples: https://godbolt.org/z/4noTxP
Works for SIMD just fine. Take a look at what clang's auto-vectorizer did: https://godbolt.org/z/q4bcfj See what happened to the code? It turned into a bunch of AVX2 instructions, and the loop is gone.
If anyone is interested in knowing how that trick works: if you take a 32x32-bit multiplication problem "a x b", you get a 64-bit result. If you drop the lowest 32 bits, you're performing the operation floor(a x b / 4294967296).
Turns out that you can represent a great many division problems in terms of a * (b / 4294967296).
For example, a / 2 coincides with a * 2147483648 / 4294967296, and a / 3 coincides with a * 1431655765 / 4294967296.
------
That's the basis of the trick. The devil is in the details however. You'll find that floor(a x b / 4294967296) is not exactly equivalent to integer division: some values are "off" slightly. A few adds or subtracts can correct for this error.
Some values are "perfect" and don't need correction. The compiler will build off of these perfect values and combine them with adds and bitshifts.
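For the curious, here's a quick Java sketch of both cases: the a / 3 multiplier quoted above turns out to be one of the "imperfect" values, and bumping it by one (the corrected value is my own addition, not from the parent comment) makes it exact for non-negative 32-bit inputs.

```java
public class MulShiftDiv {
    // floor(a * m / 2^32): widen to 64 bits, multiply, keep the high half.
    static long mulHigh(long a, long m) {
        return (a * m) >>> 32;
    }

    public static void main(String[] args) {
        // a / 2 via m = 2147483648 (2^31): exact for all 32-bit a.
        System.out.println(mulHigh(100, 2147483648L)); // prints 50
        // Imperfect: m = 1431655765 = floor(2^32 / 3) undershoots for a = 3.
        System.out.println(mulHigh(3, 1431655765L));   // prints 0, not 1
        // Corrected: m = 1431655766 = ceil(2^32 / 3) is exact for all
        // non-negative 32-bit a, because 3 * m = 2^32 + 2 and the error
        // term 2a / 2^32 stays below 1 in that range.
        for (long a = 0; a <= 100_000; a++) {
            if (mulHigh(a, 1431655766L) != a / 3) throw new AssertionError();
        }
        System.out.println("exact for 0..100000");
    }
}
```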
-----
Before compilers were good, embedded engineers were basically expected to know how to do this. Most embedded CPUs had no divide instruction (including ARMv7 and older, but also the 8051 and other venerable microprocessors).
Back then, "imperfect" division was probably the most common approach. Compilers, however, have to produce bit-perfect results (exactly equivalent to a division operator), whereas embedded engineers could often decide that "imperfect division" was good enough.
Good comment; however, for a / 2 and other power-of-two divisors, modern compilers instead optimize division into shifts and modulo into a bitwise AND. Both are very fast instructions, faster than integer multiplication.
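Concretely, something like this sketch (assuming non-negative values; signed division needs an extra correction step, which compilers also emit):

```java
public class PowerOfTwoDiv {
    // For non-negative a, division by 2^k is a right shift
    // and modulo by 2^k is a bitwise AND with (2^k - 1).
    static int divPow2(int a, int k) { return a >> k; }
    static int modPow2(int a, int k) { return a & ((1 << k) - 1); }

    public static void main(String[] args) {
        int a = 100;
        System.out.println(divPow2(a, 1) + " " + modPow2(a, 1)); // prints "50 0"
        System.out.println(divPow2(a, 4) + " " + modPow2(a, 4)); // prints "6 4"
    }
}
```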
Any FizzBuzz example almost feels too silly to deserve a serious comment. But here goes.
As you noticed, the FizzBuzz pattern has a period of 15. So if you want to go fast, just go directly to a 15x vectorized unroll with unaligned stores.
Assuming 32-byte vectors like with AVX2, you initialize two vector registers v0, v1 to hold the pattern of 15 4-byte values (the 16th value doesn't matter) corresponding to 1 through 15. You also have corresponding delta vector registers d0, d1 whose entries are +15 for the "pass-through" values and 0 for the -1/-2/-3 sentinel values. The loop kernel is just this:
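A scalar Java model of that kernel (vector registers replaced by plain arrays, so this is only a sketch of the data flow; names and the scalar tail are illustrative):

```java
public class FizzBuzzKernel {
    // Pattern for values 1..15: positive = the number itself,
    // -1 = "Fizz", -2 = "Buzz", -3 = "FizzBuzz".
    static final int[] PATTERN = {
        1, 2, -1, 4, -2, -1, 7, 8, -1, -2, 11, -1, 13, 14, -3
    };

    public static void fill(int[] result) {
        int[] v = PATTERN.clone(); // current "vector" values
        int[] d = new int[15];     // deltas: +15 for numbers, 0 for sentinels
        for (int j = 0; j < 15; j++) d[j] = v[j] > 0 ? 15 : 0;

        int i = 0;
        // Kernel: store the current 15 values, then add the deltas.
        for (; i < result.length - 15; i += 15) {
            System.arraycopy(v, 0, result, i, 15);
            for (int j = 0; j < 15; j++) v[j] += d[j];
        }
        // Scalar tail for the remainder.
        for (; i < result.length; i++) {
            int n = i + 1;
            result[i] = n % 15 == 0 ? -3 : n % 3 == 0 ? -1
                      : n % 5 == 0 ? -2 : n;
        }
    }
}
```

The real vectorized version does the store and add with two 8-lane vector ops each instead of the inner loops.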
You could remove the unaligned stores with a monstrous lcm(15, 16) = 240x unroll (16 shifted instances of the above kernel) but you wouldn't be able to keep all the necessary loop state in registers.
There's a small bug in your code. The loop kernel writes out 16 entries despite only advancing by 15 entries, so your upper bound calculation needs adjustment. I'm also worried the JVM's compiler won't be able to eliminate the array bounds checks because of how you calculate the upper bound with the modulus. Writing it like this should fix both issues:
// We need i + 15 < length to keep the last written entry in bounds.
for (; i < length - 15; i += 15)
Notably: "Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero."
A nitpick: there's no such thing as AVX-256, you probably meant AVX-512.
var mask = VectorMask.fromLong(SPECIES, 0b0111_1111);
var i = 0;
var upperBound = length - length % 15;
for (; i < upperBound; i += 15) {
    v1.intoArray(result, i);
    v2.intoArray(result, i + 8, mask);
    ...
Yeah, that looks really neat, thanks for sharing that idea and even putting it together in Java here! I'll add it to the repo (https://github.com/gunnarmorling/simd-fizzbuzz) in a bit, and run the JMH benchmark for it, for the fun of it.
I have used FizzBuzz as an introduction to wheels. Any repeating sequence will do, but everyone knows FizzBuzz. It is a great example because you replace a simple algorithm with one that is just as simple and about twice as fast in my language of choice (using circular, singly linked lists for the wheel).
The article does a similar thing, but using masks to achieve the speedup.
What is "wheel" in this context? The main one I can think of in programming is Python packages. I even considered you were talking about literal wheels! But it sounds like you're talking about some sort of data structure, like a ring buffer?
It doesn't seem to be a standard term but since you went to the trouble of posting a comment surely you were hoping it would be understood. Are you talking about modulo arithmetic? Or hierarchical timing wheels (the top hit for "wheel data structure")?
Any kind of circular data that repeats itself. In the case of wheel factorisation, it is a wheel "rolling" over the number line.
Any data structure will do: either a circular one or one with O(1) access time. It is the concept used in the article with a vector mask repeating every 15 elements.
It is a technique. I have just always used a circular list for it :)
And the link I posted explains it just fine. Anyway, I actually wrote some code to generate wheels not too long ago: https://git.sr.ht/~bjoli/wheel-utils
OK, thanks. I think your concept of a wheel could be summarised as "use of a ring buffer for modulo arithmetic". (Of course, like you said, a "ring buffer" doesn't have to be a linked list with start and end links connected - an array that you just index carefully works too.)
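That summary could be sketched in Java like this — a 15-entry array used as the wheel, with a wrapping index so the hot loop never computes n % 3 or n % 5 (names are illustrative; the `fizzBuzz` helper is just for lookups):

```java
public class FizzBuzzWheel {
    // One full revolution of the wheel: null means "print the number".
    static final String[] WHEEL = {
        null, null, "Fizz", null, "Buzz", "Fizz", null, null,
        "Fizz", "Buzz", null, "Fizz", null, null, "FizzBuzz"
    };

    // Direct lookup for a single value.
    public static String fizzBuzz(int n) {
        String s = WHEEL[(n - 1) % WHEEL.length];
        return s != null ? s : Integer.toString(n);
    }

    public static void main(String[] args) {
        // Roll the wheel with a wrapping index -- no modulo in the loop.
        int idx = 0;
        for (int n = 1; n <= 30; n++) {
            String s = WHEEL[idx];
            System.out.println(s != null ? s : Integer.toString(n));
            if (++idx == WHEEL.length) idx = 0;
        }
    }
}
```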
The author should consider some feature engineering. I suspect divisibility by all numbers from 1 to 1000 would really help. Not sure which are the best features, but tensorflow is still really efficient with large inputs so I'd just add them all and see what happens