Arduino has always been a naked cash grab disguised as a "hacker-friendly nonprofit." The gross margin on their boards is >90%, and yeah, the software is mostly a ripoff of Wiring.
The software was based on Processing. It was never a secret, just open source working as intended. It doesn't look the same any more.
A non-profit is still a business. Success is necessary for existence.
Think about the number of companies that have been created to make, or heavily specialize in, Arduino clones and accessories without having to pay Arduino a cent, because the designs were intentionally open-sourced. It doesn't sound like a naked cash grab to me.
Great businesses can still be cash grabs. In fact, many of them are. Very few of those dress themselves up as a nonprofit (aside from Arduino and OpenAI).
1. I guess the most time-consuming part is the multiplication, right? What kind of FFT do you use? Schönhage-Strassen, multi-prime NTT, ..? Is it implemented via floating-point numbers or integers?
2. Not sure if you encountered this, but do you have any advice for small mulmod (multiplication reduced by a prime modulus)? By small I mean machine-word size (i.e. preferably 64 bits).
3. For larger modulus, what do you use? Is it worth precomputing the inverse by, say, Newton iteration or is it faster to use asymptotically slower algorithms? Do you use Montgomery representation?
4. Does the code use any kind of GCD? What algorithm did you choose?
5. Now this is a somewhat broad question, but could you perhaps compare the traditional algorithms implemented sequentially (e.g. GMP) and algorithms suitable to run on GPUs? I mean, does it make sense to use, say, a quadratic algorithm amenable to parallel execution, rather than an asymptotically faster (and sequential) algorithm?
3. Answered in point 1: we use IBDWT and get the modular reduction for free through the circular convolution. This works nicely for a Mersenne modulus.
4. GCD is not used in PRP, but it is used in P-1 (Pollard's P-1 algorithm). We use GMP's GCD on the CPU (as it's a very rare operation, and GMP on the CPU is fast enough). I understand the GCD as implemented in GMP is subquadratic (quasi-linear up to log factors), which is good.
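To make the CPU-side step concrete, here is a rough, self-contained sketch of stage-1 P-1 ending in that GCD, using plain GMP throughout (toy exponent 67 and a made-up B1 bound of my choosing; in reality only the final mpz_gcd is the analogous part, since the powering runs on the GPU):

    #include <gmp.h>
    #include <stdio.h>

    int main(void) {
        const unsigned long p = 67, B1 = 2700;   /* toy exponent and toy stage-1 bound */

        mpz_t N, E, x, g;
        mpz_inits(N, E, x, g, NULL);

        mpz_ui_pow_ui(N, 2, p);                  /* N = 2^67 - 1 (known composite) */
        mpz_sub_ui(N, N, 1);

        /* stage-1 exponent E: product of all prime powers <= B1 */
        mpz_set_ui(E, 1);
        for (unsigned long q = 2; q <= B1; q++) {
            int prime = 1;
            for (unsigned long d = 2; d * d <= q; d++)
                if (q % d == 0) { prime = 0; break; }
            if (!prime) continue;
            for (unsigned long qk = q; qk <= B1; qk *= q)
                mpz_mul_ui(E, E, q);
        }

        mpz_set_ui(x, 3);
        mpz_powm(x, x, E, N);                    /* x = 3^E mod N (the GPU's job in reality) */
        mpz_sub_ui(x, x, 1);
        mpz_gcd(g, x, N);                        /* the rare GCD, done with GMP on the CPU */

        /* a prime factor q of N shows up whenever q-1 is B1-smooth; here
           193707721 - 1 == 2^3 * 3^3 * 5 * 67 * 2677, so the gcd should be nontrivial */
        gmp_printf("gcd = %Zd\n", g);

        mpz_clears(N, E, x, g, NULL);
        return 0;
    }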
5. At our problem size it does not make sense to use a quadratic algorithm instead of an N*log(N) one; we absolutely need the N*log(N) provided by convolution/FFT.
2. This was discussed at some length over the years on mersenneforum.org [1]. There is a lot of wisdom stored there (though it can be hard to find) and many smart & helpful people, so feel free to ask there. A baseline sketch follows the short list below. This is an operation of interest because it's:
- involved in TF (Trial Factoring), including on GPUs
- involved in NTT transforms ("integer FFTs")
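To make question 2 concrete: with gcc/clang the baseline 64-bit mulmod is just (uint64_t)((unsigned __int128)a * b % m), and the usual way to avoid the slow 128/64 division for a fixed odd modulus is Montgomery multiplication. A minimal textbook sketch, assuming the modulus is odd and below 2^63 so none of the 128-bit intermediates overflow (this is generic advice, not what any particular GIMPS code does):

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned __int128 u128;          /* gcc/clang extension */

    /* -m^-1 mod 2^64 for odd m, via Newton iteration: each step doubles the
       number of correct low bits (3 -> 6 -> 12 -> 24 -> 48 -> 96) */
    static uint64_t neg_inv64(uint64_t m) {
        uint64_t inv = m;                    /* an odd m is its own inverse mod 8 */
        for (int i = 0; i < 5; i++)
            inv *= 2 - m * inv;
        return -inv;
    }

    /* Montgomery reduction: for t < m*2^64 (and m < 2^63), returns t * 2^-64 mod m */
    static uint64_t redc(u128 t, uint64_t m, uint64_t minv) {
        uint64_t q = (uint64_t)t * minv;     /* chosen so that t + q*m == 0 mod 2^64 */
        uint64_t r = (uint64_t)((t + (u128)q * m) >> 64);
        return r >= m ? r - m : r;
    }

    /* a*b mod m for a, b already in Montgomery form (i.e. x*2^64 mod m) */
    static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t m, uint64_t minv) {
        return redc((u128)a * b, m, minv);
    }

    int main(void) {
        uint64_t m = (1ULL << 61) - 1;       /* an odd modulus below 2^63 */
        uint64_t minv = neg_inv64(m);
        uint64_t a = 0x123456789abcdefULL % m;
        uint64_t b = 0xfedcba987654321ULL % m;

        uint64_t aM = (uint64_t)(((u128)a << 64) % m);            /* into Montgomery form */
        uint64_t bM = (uint64_t)(((u128)b << 64) % m);
        uint64_t c  = redc(mont_mul(aM, bM, m, minv), m, minv);   /* and back out */

        uint64_t ref = (uint64_t)(((u128)a * b) % m);             /* the simple, slower mulmod */
        printf("%s\n", c == ref ? "match" : "mismatch");
    }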
1. Yes, the core of the algorithm is the modular squaring. The squaring is similar to a multiplication, of course. In general, fast multiplication is implemented via convolution, via FFTs, which results in an N*log(N) time complexity for the multiplication.
What we do is modular squaring iterations:
x := x^2 mod M,
where M == 2^p - 1, i.e. M is the Mersenne number that we test.
Realize that working modulo 2^p - 1 means that 2^p == 1, which corresponds to a circular convolution of size p bits. We use the "Irrational Base Discrete Weighted Transform" (IBDWT) [1], introduced by Crandall/Fagin, to turn this into a convolution of a convenient size of N "words", where each word carries about 18 bits, so N ~= p/18. For example, our prime of interest M52 was tested with an FFT of size 7.5M == 1024 * 15 * 512.
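To see the 2^p == 1 reduction at miniature scale, here is a toy CPU sketch with single machine words (obviously not the real FFT-based code, just the same identity):

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned __int128 u128;   /* gcc/clang extension, for the 62-bit square */

    /* reduce mod M = 2^p - 1 by folding: since 2^p == 1 (mod M), the bits above
       position p can simply be added back onto the low bits */
    static uint64_t mod_mersenne(u128 x, int p, uint64_t M) {
        while (x > M)
            x = (x & M) + (x >> p);
        return (uint64_t)(x == M ? 0 : x);
    }

    int main(void) {
        const int p = 31;                       /* M31 = 2^31 - 1, small enough to fit */
        const uint64_t M = (1ULL << p) - 1;

        /* PRP iteration x := x^2 mod M, starting from 3, done p times */
        uint64_t x = 3;
        for (int i = 0; i < p; i++)
            x = mod_mersenne((u128)x * x, p, M);

        /* Fermat-style PRP check: if M is prime then 3^(2^p) == 9 (mod M);
           any other result proves M composite */
        printf("M%d is %s\n", p, x == 9 ? "a probable prime" : "composite");
    }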
The FFT is a double-precision (FP64) floating-point FFT. Depending on the FFT size we can make use of about 18 bits per FFT "word", where a "word" corresponds to one FP64 value.
Some tricks involved up to this point: one FFT size halving plus the modular reduction for free thanks to the IBDWT, and another FFT size halving from turning the real input/output values into complex numbers in the FFT.
The FFT implementation that we found appropriate for GPUs is the "matrix FFT", which splits the FFT of size N = A*B into sub-FFTs of size A, one element-wise multiplication by about A*B twiddle factors, and sub-FFTs of size B. In practice we split the FFT into three dimensions, e.g. for M52 we used:
7.5M == 1024 * 15 * 512.
We implement one FFT of size 1024 or 512 per workgroup. These are usually radix-4 FFTs, with transpositions using LDS (Local Data Share, the per-workgroup local memory in OpenCL).
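Here is a small CPU-side sketch of that decomposition with only two factors, N = A*B; the sub-DFTs are naive O(n^2) just to keep it short, whereas the real kernels are radix-4 FFTs working in local memory:

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PI 3.14159265358979323846

    /* naive reference DFT: out[k] = sum_j in[j*stride] * exp(-2*pi*i*j*k/n) */
    static void dft(const double complex *in, double complex *out, int n, int stride) {
        for (int k = 0; k < n; k++) {
            double complex s = 0;
            for (int j = 0; j < n; j++)
                s += in[j * stride] * cexp(-2 * PI * I * j * k / n);
            out[k] = s;
        }
    }

    /* DFT of size N = A*B as: A-point DFTs, twiddle factors, then B-point DFTs */
    static void matrix_fft(const double complex *x, double complex *X, int A, int B) {
        int N = A * B;
        double complex *t   = malloc(N * sizeof *t);   /* intermediate, laid out [k1][n2] */
        double complex *col = malloc(A * sizeof *col);
        double complex *row = malloc(B * sizeof *row);

        /* step 1+2: an A-point DFT down each column n2 (input index B*n1 + n2),
           then multiply by the twiddle factor W_N^(n2*k1) */
        for (int n2 = 0; n2 < B; n2++) {
            dft(x + n2, col, A, B);
            for (int k1 = 0; k1 < A; k1++)
                t[k1 * B + n2] = col[k1] * cexp(-2 * PI * I * n2 * k1 / N);
        }
        /* step 3: a B-point DFT along each row k1; results land at X[k1 + A*k2] */
        for (int k1 = 0; k1 < A; k1++) {
            dft(t + k1 * B, row, B, 1);
            for (int k2 = 0; k2 < B; k2++)
                X[k1 + A * k2] = row[k2];
        }
        free(t); free(col); free(row);
    }

    int main(void) {
        enum { A = 8, B = 6, N = A * B };
        double complex x[N], X1[N], X2[N];
        for (int i = 0; i < N; i++) x[i] = rand() % 100;   /* arbitrary test data */

        dft(x, X1, N, 1);           /* direct O(N^2) DFT */
        matrix_fft(x, X2, A, B);    /* same transform via the A*B split */

        double err = 0;
        for (int i = 0; i < N; i++) err += cabs(X1[i] - X2[i]);
        printf("total difference: %g\n", err);   /* rounding noise only: the two agree */
    }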
After the inverse FFT we also need to do carry propagation, which properly turns the convolution into a multi-word multiplication.
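In miniature, the carry step looks like this (toy base-10, non-negative digits; the real code uses ~18-bit balanced digits and wraps the final carry around thanks to the Mersenne modulus):

    #include <stdint.h>
    #include <stdio.h>

    /* after the inverse FFT and rounding, each "big digit" of the convolution
       output can exceed the word base; a carry pass turns the digit vector back
       into a proper little-endian multi-word number */
    static void carry_propagate(int64_t *digit, int n, int64_t base) {
        int64_t carry = 0;
        for (int i = 0; i < n; i++) {
            int64_t v = digit[i] + carry;
            carry    = v / base;      /* assumes non-negative digits; balanced  */
            digit[i] = v % base;      /* (signed) digits need a floor division  */
        }
        /* with a Mersenne modulus a final carry would wrap back onto digit[0],
           again because 2^p == 1; here we just leave zero digits at the top */
    }

    int main(void) {
        /* squaring 123 in base 10 by convolution: the digits {3,2,1} convolved
           with themselves give {9,12,10,4,1}; the carries turn that into 15129 */
        int64_t d[8] = {9, 12, 10, 4, 1, 0, 0, 0};
        carry_propagate(d, 8, 10);
        for (int i = 7; i >= 0; i--)
            printf("%lld", (long long)d[i]);
        printf("\n");                 /* prints 00015129, i.e. 123^2 */
    }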
For performance we merge a few logical kernels that are invoked in succession into a single big kernel, where possible. The main advantage of doing so is that the data does not need to transit through "global memory" (VRAM) anymore but stays local to the workgroup, which is a large gain.
So, to recap:
- multiplication via convolution
- convolution via FP64 FFT, achieving about 18 bits per FP64 word
- modular reduction for free through IBDWT
Thanks for the reply! If you don't mind me asking more: what do you use for the polynomial GCD? Apparently it is quite fast; do you use some standard algorithm implemented well, or is there some kind of algorithmic improvement? Is it described somewhere, say in a paper or a book? Have you tried to benchmark it against NTL, for example? Thanks again!
No, this doesn't help at all. It will only stop the header from being parsed twice if it's #included twice in the same .c file, not if it's #included from two different .c files.
In STB-style headers the implementation is inside a separate ifdef/endif block which is only activated at a single include point in the whole project. At all other include points the compiler only parses the public API declarations and skips the implementation block (which is so much faster than parsing and compiling the implementation that, even in large projects, the difference compared to a header containing only the API declarations doesn't matter much).
For instance, in stb_image.h everything above the implementation block is just "cheap" public API declarations.
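For anyone who hasn't seen the pattern, a generic sketch (hypothetical mylib names, not the literal stb_image.h text) looks like this:

    /* mylib.h -- hypothetical single-header library in the STB style */

    /* public API declarations: cheap to parse, seen at every include point */
    #ifndef MYLIB_H_INCLUDED
    #define MYLIB_H_INCLUDED
    int mylib_do_thing(int x);
    #endif /* MYLIB_H_INCLUDED */

    /* implementation block: compiled only where the macro was defined beforehand */
    #ifdef MYLIB_IMPLEMENTATION
    int mylib_do_thing(int x) {
        return x * 2;
    }
    #endif /* MYLIB_IMPLEMENTATION */

Exactly one .c file in the project defines MYLIB_IMPLEMENTATION before including the header; every other include point only pays for parsing the declarations.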
While I'm a big fan of keeping declarations and implementations as close as possible, I find that the need to explicitly #define STB_SOMETHING_IMPLEMENTATION before including the header somehow spoils the simplicity of the solution. At this point, I'd rather use a single file that acts as either a .h or a .c based on __INCLUDE_LEVEL__ (a GCC extension, not sure if available for MSC).
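A sketch of that trick, in case it's unfamiliar (hypothetical names; GCC/Clang only, where __INCLUDE_LEVEL__ is 0 for the file being compiled and greater when the file is being #included):

    /* mylib.c -- hypothetical example: one file serving as both header and
       translation unit, dispatching on __INCLUDE_LEVEL__ */
    #if __INCLUDE_LEVEL__ > 0
    /* being #included: expose only the declarations */
    int mylib_do_thing(int x);
    #else
    /* being compiled directly as a .c file: provide the implementation */
    int mylib_do_thing(int x) {
        return x * 2;
    }
    #endif

The file is added to the build once as a translation unit, and every other file #includes it to get just the declarations.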
Could you be more specific? I think for that to work one would also need the upper half of the 64x64 multiplication, and `vpmullq` provides only the lower half. You could break one 64x64 multiplication into four 32x32 multiplications (i.e. emulate the full 64x64 = 128-bit multiplication), but I was under the impression that this was slow.
I assume that as you say, whoever used this instruction was using it for multiplying 32-bit numbers.
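For reference, the scalar version of that emulation looks like this; a vector version has to do the same dance per lane, which is part of why it tends to be slow:

    #include <stdint.h>
    #include <stdio.h>

    /* the high 64 bits of a 64x64 product, built from four 32x32 products --
       roughly what has to happen per SIMD lane when only 32-bit (or low-half
       64-bit) multiplies are available */
    static uint64_t mulhi64(uint64_t a, uint64_t b) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
        uint64_t lo_lo = a_lo * b_lo;
        uint64_t lo_hi = a_lo * b_hi;
        uint64_t hi_lo = a_hi * b_lo;
        uint64_t hi_hi = a_hi * b_hi;
        /* add the two middle terms without losing their carry into the top */
        uint64_t mid = (lo_lo >> 32) + (uint32_t)lo_hi + (uint32_t)hi_lo;
        return hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
    }

    int main(void) {
        uint64_t a = 0xdeadbeefcafebabeULL, b = 0x123456789abcdef1ULL;
        uint64_t ref = (uint64_t)(((unsigned __int128)a * b) >> 64);   /* gcc/clang check */
        printf("%s\n", mulhi64(a, b) == ref ? "match" : "mismatch");
    }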
On AMD Zen 4 and Intel Cannon Lake or newer (when AVX-512 is supported), the fastest method to multiply big numbers is to use the IFMA instructions, which reuse the floating-point multipliers to generate 104-bit products of 52-bit numbers.
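A hedged example of the IFMA route using intrinsics (requires AVX-512 IFMA hardware and the matching compiler flag; the operands must already be split into limbs of at most 52 bits, which is why IFMA-based bignum code stores numbers in radix 2^52):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* build with something like -mavx512f -mavx512ifma on a supporting CPU */
    int main(void) {
        /* eight independent products per vector; operands must fit in 52 bits */
        uint64_t a0 = (1ULL << 51) + 12345, b0 = (1ULL << 50) + 67890;
        __m512i a = _mm512_set1_epi64((long long)a0);
        __m512i b = _mm512_set1_epi64((long long)b0);
        __m512i zero = _mm512_setzero_si512();

        /* 52x52 -> 104-bit products, accumulated onto zero: low / high 52 bits */
        __m512i lo = _mm512_madd52lo_epu64(zero, a, b);
        __m512i hi = _mm512_madd52hi_epu64(zero, a, b);

        uint64_t lo_lane[8], hi_lane[8];
        _mm512_storeu_si512((__m512i *)lo_lane, lo);
        _mm512_storeu_si512((__m512i *)hi_lane, hi);

        /* reassemble lane 0 and compare against a plain 128-bit multiply */
        unsigned __int128 prod = ((unsigned __int128)hi_lane[0] << 52) | lo_lane[0];
        unsigned __int128 ref  = (unsigned __int128)a0 * b0;
        printf("%s\n", prod == ref ? "match" : "mismatch");
    }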
Do you think the very fast division on M1 has any implications for 128/64 narrowing division as well? Do you know of a faster way than the method by Moller and Granlund? Do you plan to include 128/64 division in libdivide?
And I asked this question before but the parent post got flagged so I'm trying once more: at the very bottom of the Labor of Division (Episode V) post [1], is it really possible for the second `qhat` (i.e. `q0`) to be off by 2? Do you have any examples of that?
I haven't yet read the Moller and Granlund paper, but its narrowing division using a precomputed reciprocal would be a natural fit for libdivide. (libdivide does have a narrowing divide, but it is Algorithm D-based.)
Regarding the second question, it is possible to be off by 2. Consider (base 10) 500 ÷ 59. The estimated quotient qhat is 50 ÷ 5 = 10, but the true digit is 8. So if our partial remainder is 50, we'll be off by 2 in the second digit.