Arduino has always been a naked cash grab disguised as a "hacker-friendly nonprofit." The gross margin on their boards is >90%, and yeah, the software is mostly a ripoff of Wiring.
The software was based on Processing. It was never a secret, just open source working as intended. It doesn't look the same any more.
A non-profit is still a business. Success is necessary for existence.
Think about the number of companies that have been created to make, or heavily specialize in, Arduino clones and accessories without having to pay Arduino a cent, because the designs were intentionally open-sourced. It doesn't sound like a naked cash grab to me.
Great businesses can still be cash grabs. In fact, many of them are. Very few of those dress themselves up as a nonprofit (aside from Arduino and OpenAI).
1. I guess the most time-consuming part is the multiplication, right? What kind of FFT do you use? Schönhage-Strassen, multi-prime NTT, ..? Is it implemented via floating-point numbers or integers?
2. Not sure if you encountered this, but do you have any advice for small mulmod (multiplication reduced by a prime modulus)? By small I mean machine-word size (i.e. preferably 64 bits).
3. For larger modulus, what do you use? Is it worth precomputing the inverse by, say, Newton iteration or is it faster to use asymptotically slower algorithms? Do you use Montgomery representation?
4. Does the code use any kind of GCD? What algorithm did you choose?
5. Now this is a somewhat broad question, but could you perhaps compare the traditional algorithms implemented sequentially (e.g. GMP) and algorithms suitable to run on GPUs? I mean, does it make sense to use, say, a quadratic algorithm amenable to parallel execution, rather than an asymptotically faster (and sequential) algorithm?
3. Answered in point 1: we use IBDWT and get the modular reduction for free through the circular convolution. This works nicely for a Mersenne modulus.
4. GCD is not used in PRP, but it is used in P-1 (Pollard's P-1 algorithm). We use GMP's GCD on the CPU (as it's a very rare operation, and GMP on the CPU is fast enough). I understand the GCD as implemented in GMP is subquadratic (quasi-linear up to log factors), which is good.
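To make the CPU-side step concrete, here is a rough, self-contained sketch of stage-1 P-1 ending in that GCD, using plain GMP throughout (toy exponent 67 and a made-up B1 bound of my choosing; in reality only the final mpz_gcd is the analogous part, since the powering runs on the GPU):

    #include <gmp.h>
    #include <stdio.h>

    int main(void) {
        const unsigned long p = 67, B1 = 2700;   /* toy exponent and toy stage-1 bound */

        mpz_t N, E, x, g;
        mpz_inits(N, E, x, g, NULL);

        mpz_ui_pow_ui(N, 2, p);                  /* N = 2^67 - 1 (known composite) */
        mpz_sub_ui(N, N, 1);

        /* stage-1 exponent E: product of all prime powers <= B1 */
        mpz_set_ui(E, 1);
        for (unsigned long q = 2; q <= B1; q++) {
            int prime = 1;
            for (unsigned long d = 2; d * d <= q; d++)
                if (q % d == 0) { prime = 0; break; }
            if (!prime) continue;
            for (unsigned long qk = q; qk <= B1; qk *= q)
                mpz_mul_ui(E, E, q);
        }

        mpz_set_ui(x, 3);
        mpz_powm(x, x, E, N);                    /* x = 3^E mod N (the GPU's job in reality) */
        mpz_sub_ui(x, x, 1);
        mpz_gcd(g, x, N);                        /* the rare GCD, done with GMP on the CPU */

        /* a prime factor q of N shows up whenever q-1 is B1-smooth; here
           193707721 - 1 == 2^3 * 3^3 * 5 * 67 * 2677, so the gcd should be nontrivial */
        gmp_printf("gcd = %Zd\n", g);

        mpz_clears(N, E, x, g, NULL);
        return 0;
    }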
5. At our problem size it does not make sense to use a quadratic algorithm instead of an N*log(N) one; we absolutely need the N*log(N) provided by convolution/FFT.
2. This was discussed at some length over the years on mersenneforum.org [1]. There is a lot of wisdom stored there (though it can be hard to find) and many smart & helpful people, so feel free to ask there. A baseline sketch follows the short list below. This is an operation of interest because it's:
- involved in TF (Trial Factoring), including on GPUs
- involved in NTT transforms ("integer FFTs")
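To make question 2 concrete: with gcc/clang the baseline 64-bit mulmod is just (uint64_t)((unsigned __int128)a * b % m), and the usual way to avoid the slow 128/64 division for a fixed odd modulus is Montgomery multiplication. A minimal textbook sketch, assuming the modulus is odd and below 2^63 so none of the 128-bit intermediates overflow (this is generic advice, not what any particular GIMPS code does):

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned __int128 u128;          /* gcc/clang extension */

    /* -m^-1 mod 2^64 for odd m, via Newton iteration: each step doubles the
       number of correct low bits (3 -> 6 -> 12 -> 24 -> 48 -> 96) */
    static uint64_t neg_inv64(uint64_t m) {
        uint64_t inv = m;                    /* an odd m is its own inverse mod 8 */
        for (int i = 0; i < 5; i++)
            inv *= 2 - m * inv;
        return -inv;
    }

    /* Montgomery reduction: for t < m*2^64 (and m < 2^63), returns t * 2^-64 mod m */
    static uint64_t redc(u128 t, uint64_t m, uint64_t minv) {
        uint64_t q = (uint64_t)t * minv;     /* chosen so that t + q*m == 0 mod 2^64 */
        uint64_t r = (uint64_t)((t + (u128)q * m) >> 64);
        return r >= m ? r - m : r;
    }

    /* a*b mod m for a, b already in Montgomery form (i.e. x*2^64 mod m) */
    static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t m, uint64_t minv) {
        return redc((u128)a * b, m, minv);
    }

    int main(void) {
        uint64_t m = (1ULL << 61) - 1;       /* an odd modulus below 2^63 */
        uint64_t minv = neg_inv64(m);
        uint64_t a = 0x123456789abcdefULL % m;
        uint64_t b = 0xfedcba987654321ULL % m;

        uint64_t aM = (uint64_t)(((u128)a << 64) % m);            /* into Montgomery form */
        uint64_t bM = (uint64_t)(((u128)b << 64) % m);
        uint64_t c  = redc(mont_mul(aM, bM, m, minv), m, minv);   /* and back out */

        uint64_t ref = (uint64_t)(((u128)a * b) % m);             /* the simple, slower mulmod */
        printf("%s\n", c == ref ? "match" : "mismatch");
    }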
1. Yes, the core of the algorithm is the modular squaring. The squaring is similar to a multiplication, of course. In general, fast multiplication is implemented via convolution, via FFTs, which results in an N*log(N) time complexity for the multiplication.
What we do is modular squaring iterations:
x := x^2 mod M,
where M == 2^p - 1, i.e. M is the Mersenne number that we test.
Realize that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform" (IBDWT) [1], introduced by Crandall/Fagin, to turn this into a convolution of a convenient size of N "words", where each word carries about 18 bits, so N ~= p/18. For example, our prime of interest M52 was tested with an FFT of size 7.5M == 1024 * 15 * 512.
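To see the 2^p == 1 reduction at miniature scale, here is a toy CPU sketch with single machine words (obviously not the real FFT-based code, just the same identity):

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned __int128 u128;   /* gcc/clang extension, for the 62-bit square */

    /* reduce mod M = 2^p - 1 by folding: since 2^p == 1 (mod M), the bits above
       position p can simply be added back onto the low bits */
    static uint64_t mod_mersenne(u128 x, int p, uint64_t M) {
        while (x > M)
            x = (x & M) + (x >> p);
        return (uint64_t)(x == M ? 0 : x);
    }

    int main(void) {
        const int p = 31;                       /* M31 = 2^31 - 1, small enough to fit */
        const uint64_t M = (1ULL << p) - 1;

        /* PRP iteration x := x^2 mod M, starting from 3, done p times */
        uint64_t x = 3;
        for (int i = 0; i < p; i++)
            x = mod_mersenne((u128)x * x, p, M);

        /* Fermat-style PRP check: if M is prime then 3^(2^p) == 9 (mod M);
           any other result proves M composite */
        printf("M%d is %s\n", p, x == 9 ? "a probable prime" : "composite");
    }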
The FFT is a double-precision (FP64) floating-point FFT. Depending on the FFT size we can make use of about 18 bits per FFT "word", where a "word" corresponds to one FP64 value.
Some tricks involved up to this point: one FFT size halving plus the modular reduction for free thanks to the IBDWT, and another FFT size halving from turning the real input/output values into complex numbers in the FFT.
The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits the FFT of size N = A*B into sub-FFTs of size A, one element-wise multiplication by about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used:
7.5M == 1024 * 15 * 512.
We implement one FFT of size 1024 or 512 per workgroup. These are usually radix-4 FFTs, with transpositions using LDS (Local Data Share, the per-workgroup local memory in OpenCL).
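Here is a small CPU-side sketch of that decomposition with only two factors, N = A*B; the sub-DFTs are naive O(n^2) just to keep it short, whereas the real kernels are radix-4 FFTs working in local memory:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PI 3.14159265358979323846

    /* naive reference DFT: out[k] = sum_j in[j*stride] * exp(-2*pi*i*j*k/n) */
    static void dft(const double complex *in, double complex *out, int n, int stride) {
        for (int k = 0; k < n; k++) {
            double complex s = 0;
            for (int j = 0; j < n; j++)
                s += in[j * stride] * cexp(-2 * PI * I * j * k / n);
            out[k] = s;
        }
    }

    /* DFT of size N = A*B as: A-point DFTs, twiddle factors, then B-point DFTs */
    static void matrix_fft(const double complex *x, double complex *X, int A, int B) {
        int N = A * B;
        double complex *t   = malloc(N * sizeof *t);   /* intermediate, laid out [k1][n2] */
        double complex *col = malloc(A * sizeof *col);
        double complex *row = malloc(B * sizeof *row);

        /* step 1+2: an A-point DFT down each column n2 (input index B*n1 + n2),
           then multiply by the twiddle factor W_N^(n2*k1) */
        for (int n2 = 0; n2 < B; n2++) {
            dft(x + n2, col, A, B);
            for (int k1 = 0; k1 < A; k1++)
                t[k1 * B + n2] = col[k1] * cexp(-2 * PI * I * n2 * k1 / N);
        }
        /* step 3: a B-point DFT along each row k1; results land at X[k1 + A*k2] */
        for (int k1 = 0; k1 < A; k1++) {
            dft(t + k1 * B, row, B, 1);
            for (int k2 = 0; k2 < B; k2++)
                X[k1 + A * k2] = row[k2];
        }
        free(t); free(col); free(row);
    }

    int main(void) {
        enum { A = 8, B = 6, N = A * B };
        double complex x[N], X1[N], X2[N];
        for (int i = 0; i < N; i++) x[i] = rand() % 100;   /* arbitrary test data */

        dft(x, X1, N, 1);           /* direct O(N^2) DFT */
        matrix_fft(x, X2, A, B);    /* same transform via the A*B split */

        double err = 0;
        for (int i = 0; i < N; i++) err += cabs(X1[i] - X2[i]);
        printf("total difference: %g\n", err);   /* rounding noise only: the two agree */
    }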
After the inverse FFT we also need to do carry propagation, which properly turns the convolution into a multi-word multiplication.
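In miniature, the carry step looks like this (toy base-10, non-negative digits; the real code uses ~18-bit balanced digits and wraps the final carry around thanks to the Mersenne modulus):

    #include <stdint.h>
    #include <stdio.h>

    /* after the inverse FFT and rounding, each "big digit" of the convolution
       output can exceed the word base; a carry pass turns the digit vector back
       into a proper little-endian multi-word number */
    static void carry_propagate(int64_t *digit, int n, int64_t base) {
        int64_t carry = 0;
        for (int i = 0; i < n; i++) {
            int64_t v = digit[i] + carry;
            carry    = v / base;      /* assumes non-negative digits; balanced  */
            digit[i] = v % base;      /* (signed) digits need a floor division  */
        }
        /* with a Mersenne modulus a final carry would wrap back onto digit[0],
           again because 2^p == 1; here we just leave zero digits at the top */
    }

    int main(void) {
        /* squaring 123 in base 10 by convolution: the digits {3,2,1} convolved
           with themselves give {9,12,10,4,1}; the carries turn that into 15129 */
        int64_t d[8] = {9, 12, 10, 4, 1, 0, 0, 0};
        carry_propagate(d, 8, 10);
        for (int i = 7; i >= 0; i--)
            printf("%lld", (long long)d[i]);
        printf("\n");                 /* prints 00015129, i.e. 123^2 */
    }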
For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data does not need to transit through "global memory" (VRAM) anymore but stays local to the workgroup, which is a large gain.
So, to recap:
- multiplication via convolution
- convolution via FP64 FFT, achieving about 18 bits per FP64 word
- modular reduction for free through IBDWT
Thanks for the reply! If you don't mind me asking more: what do you use for the polynomial GCD? Apparently it is quite fast; do you use some standard algorithm implemented well, or is there some kind of algorithmic improvement? Is it described somewhere, say in a paper or a book? Have you tried to benchmark it against NTL, for example? Thanks again!
No, this doesn't help at all. It will only stop the header from being parsed twice if it's #included twice in the same .c file, not if it's #included from two different .c files.
In STB-style headers the implementation is inside a separate ifdef/endif block which is only activated at a single include point in the whole project. At all other include points the compiler only parses the public API declarations and skips the implementation block (which is so much faster than parsing and compiling the implementation that, even in large projects, the difference compared to a header containing only the API declarations doesn't matter much).
For instance, in stb_image.h everything above the implementation block is just "cheap" public API declarations.
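For anyone who hasn't seen the pattern, a generic sketch (hypothetical mylib names, not the literal stb_image.h text) looks like this:

    /* mylib.h -- hypothetical single-header library in the STB style */

    /* public API declarations: cheap to parse, seen at every include point */
    #ifndef MYLIB_H_INCLUDED
    #define MYLIB_H_INCLUDED
    int mylib_do_thing(int x);
    #endif /* MYLIB_H_INCLUDED */

    /* implementation block: compiled only where the macro was defined beforehand */
    #ifdef MYLIB_IMPLEMENTATION
    int mylib_do_thing(int x) {
        return x * 2;
    }
    #endif /* MYLIB_IMPLEMENTATION */

Exactly one .c file in the project defines MYLIB_IMPLEMENTATION before including the header; every other include point only pays for parsing the declarations.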
While I'm a big fan of keeping declarations and implementations as close as possible, I find that the need to explicitly #define STB_SOMETHING_IMPLEMENTATION before including the header somehow spoils the simplicity of the solution. At this point, I'd rather use a single file that acts as either a .h or a .c based on __INCLUDE_LEVEL__ (a GCC extension, not sure if available for MSC).
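A sketch of that trick, in case it's unfamiliar (hypothetical names; GCC/Clang only, where __INCLUDE_LEVEL__ is 0 for the file being compiled and greater when the file is being #included):

    /* mylib.c -- hypothetical example: one file serving as both header and
       translation unit, dispatching on __INCLUDE_LEVEL__ */
    #if __INCLUDE_LEVEL__ > 0
    /* being #included: expose only the declarations */
    int mylib_do_thing(int x);
    #else
    /* being compiled directly as a .c file: provide the implementation */
    int mylib_do_thing(int x) {
        return x * 2;
    }
    #endif

The file is added to the build once as a translation unit, and every other file #includes it to get just the declarations.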
Could you be more specific? I think for that to work one would also need the upper half of the 64x64 multiplication, and `vpmullq` provides only the lower half. You could break one 64x64 multiplication into four 32x32 multiplications (i.e. emulate the full 64x64 = 128-bit multiplication), but I was under the impression that this was slow.
I assume that as you say, whoever used this instruction was using it for multiplying 32-bit numbers.
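For reference, the scalar version of that emulation looks like this; a vector version has to do the same dance per lane, which is part of why it tends to be slow:

    #include <stdint.h>
    #include <stdio.h>

    /* the high 64 bits of a 64x64 product, built from four 32x32 products --
       roughly what has to happen per SIMD lane when only 32-bit (or low-half
       64-bit) multiplies are available */
    static uint64_t mulhi64(uint64_t a, uint64_t b) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
        uint64_t lo_lo = a_lo * b_lo;
        uint64_t lo_hi = a_lo * b_hi;
        uint64_t hi_lo = a_hi * b_lo;
        uint64_t hi_hi = a_hi * b_hi;
        /* add the two middle terms without losing their carry into the top */
        uint64_t mid = (lo_lo >> 32) + (uint32_t)lo_hi + (uint32_t)hi_lo;
        return hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
    }

    int main(void) {
        uint64_t a = 0xdeadbeefcafebabeULL, b = 0x123456789abcdef1ULL;
        uint64_t ref = (uint64_t)(((unsigned __int128)a * b) >> 64);   /* gcc/clang check */
        printf("%s\n", mulhi64(a, b) == ref ? "match" : "mismatch");
    }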
On AMD Zen 4 and Intel Cannon Lake or newer (when AVX-512 is supported), the fastest method to multiply big numbers is to use the IFMA instructions, which reuse the floating-point multipliers to generate 104-bit products of 52-bit numbers.
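A hedged example of the IFMA route using intrinsics (requires AVX-512 IFMA hardware and the matching compiler flag; the operands must already be split into limbs of at most 52 bits, which is why IFMA-based bignum code stores numbers in radix 2^52):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* build with something like -mavx512f -mavx512ifma on a supporting CPU */
    int main(void) {
        /* eight independent products per vector; operands must fit in 52 bits */
        uint64_t a0 = (1ULL << 51) + 12345, b0 = (1ULL << 50) + 67890;
        __m512i a = _mm512_set1_epi64((long long)a0);
        __m512i b = _mm512_set1_epi64((long long)b0);
        __m512i zero = _mm512_setzero_si512();

        /* 52x52 -> 104-bit products, accumulated onto zero: low / high 52 bits */
        __m512i lo = _mm512_madd52lo_epu64(zero, a, b);
        __m512i hi = _mm512_madd52hi_epu64(zero, a, b);

        uint64_t lo_lane[8], hi_lane[8];
        _mm512_storeu_si512((__m512i *)lo_lane, lo);
        _mm512_storeu_si512((__m512i *)hi_lane, hi);

        /* reassemble lane 0 and compare against a plain 128-bit multiply */
        unsigned __int128 prod = ((unsigned __int128)hi_lane[0] << 52) | lo_lane[0];
        unsigned __int128 ref  = (unsigned __int128)a0 * b0;
        printf("%s\n", prod == ref ? "match" : "mismatch");
    }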
Do you think the very fast division on M1 has any implications for 128/64 narrowing division as well? Do you know of a faster way than the method by Moller and Granlund? Do you plan to include 128/64 division in libdivide?
And I asked this question before but the parent post got flagged so I'm trying once more: at the very bottom of the Labor of Division (Episode V) post [1], is it really possible for the second `qhat` (i.e. `q0`) to be off by 2? Do you have any examples of that?
I haven't yet read the Moller and Granlund paper, but its narrowing division using a precomputed reciprocal would be a natural fit for libdivide. (libdivide does have a narrowing divide, but it is Algorithm D-based.)
Regarding the second question, it is possible to be off by 2. Consider (base 10) 500 ÷ 59. The estimated quotient qhat is 50 ÷ 5 = 10, but the true digit is 8. So if our partial remainder is 50, we'll be off by 2 in the second digit.