128 bit address space is overkill, but I wonder if a hardware-backed 128 bit integer type would be useful for bit twiddling - or whether spreading the bits over two 64-bit integers is 'good enough', assuming that both halves reside in the same cache line and it's unlikely that a single bit twiddling operation needs to affect both halves.
I guess I will find out soon because I started a home computer emulator experiment in Zig where I'm essentially mapping mainboard wires to bits in a wide integer, and where 64 bits definitely won't be enough, but 128 bits will most likely cover most target systems. Very curious about the x86-64 and ARM compiler output.
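For reference, here is a minimal C sketch of the "two 64-bit halves" approach (the struct and helper names are made up for illustration, this is not the Zig code above); note that setting or testing a single bit only ever touches one half:

```
#include <stdint.h>

/* Hypothetical 128-bit "wire state" kept as two 64-bit halves; most
   single-bit operations only ever touch one of them. */
typedef struct { uint64_t lo, hi; } wires128;

static inline void wires_set(wires128 *w, unsigned bit) {
    if (bit < 64) w->lo |= (uint64_t)1 << bit;
    else          w->hi |= (uint64_t)1 << (bit - 64);
}

static inline int wires_get(const wires128 *w, unsigned bit) {
    return bit < 64 ? (int)((w->lo >> bit) & 1)
                    : (int)((w->hi >> (bit - 64)) & 1);
}
```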
I am actually currently working on a CPU that has an auxiliary unit with a 521-bit integer type (although many instructions are 512-bit). These have some interesting effects, but I can't say they are tremendously useful for bit hacks, at least not at the cost of the hardware to support them. Multiplication at that width is a very costly operation, for example; even an addition or a leading-zero count takes significant area. Vector units are about as good as you will get for most of this stuff.
This unit, by the way, is intended primarily for cryptography (521 bits is the size of a well-known Mersenne prime, 2^521 - 1). Cryptographic operations are the only place I have ever seen an integer wider than 64 bits in the wild, and most people use bignum libraries for those.
The same pressures apply to 128-bit cores. It will make a lot of hardware a lot bigger and more complicated.
128 bits of address space would be useful if you're building an architecture with one level of storage.
And there actually has been one family of computers that used 128-bit pointers: IBM S/38 and its successors. They have a machine-independent instruction set that is then compiled down to actual machine code, and that instruction set uses 128-bit pointers for future-proofing.
There have been a few research OSes designed around the concept of "all memory is persistent". IIRC this allows some simplifications - for example, no need for a file system: the copy-from-disk-to-RAM step is pointless when there's no difference between RAM & disk.
This didn't catch on as the tech to make it practical wasn't there. And existing OSes were 'good enough' due to memory-mapped files, smart caching techniques etc. Plus boatloads of software using that.
But if all background storage were treated as one giant RAM, I could see some cloud/AI big boys crossing that 64-bit size boundary. Or some science/engineering projects like CERN.
That said: it's the apps, really. If OS + biggest app working sets easily fit in a 64-bit address space, then why throw 2x the bits at it? And disk <-> RAM transfers (+filesystems, cache etc) are a solved problem.
I thought this was where Optane was going to take us, back to the days of "your memory is also your persistent storage", like an old PDP-8 with core memory...
> 128 bit address space is overkill, but I wonder if a hardware-backed 128 bit integer type would be useful for bit twiddling - or whether spreading the bits over two 64-bit integers is 'good enough', assuming that both halves reside in the same cache line and it's unlikely that a single bit twiddling operation needs to affect both halves.
Yeah, but can compilers actually do the necessary magic when encountering something like:
x = (y & ((1 << 101) | (1 << 102))) >> 10;
My impression was always that the vector extensions are good for SIMD operations, but not "wide integer" operations, but I might be wrong of course (e.g. is bit-shifting across "lanes" even possible?)
It's not really a compiler issue, though. SIMD is meant to apply an operation pointwise across multiple "lanes" in a single instruction. You can't have lane interdependence in the result.
SSE2 adds the PSRL/PSLL operators, which are basically i128 shift operators on vector registers (i.e., shift continues between lanes), so you can pretty easily map i128 to vector registers if you're only doing and/or/xor/shifts.
No, it doesn't shift across 64-bit boundaries. Take a look at the gcc output in your link:
```
movdqa  xmm0, XMMWORD PTR [rdi]      ; load the 128-bit value
movdqa  xmm1, xmm0                   ; keep a copy
psrlq   xmm0, 10                     ; shift each 64-bit half right by 10
psrldq  xmm1, 8                      ; move the high half into the low lane (whole-register byte shift)
psllq   xmm1, 54                     ; recover the 10 bits that should have crossed the 64-bit boundary
por     xmm0, xmm1                   ; stitch the halves back together
movdqa  xmm1, XMMWORD PTR .LC0[rip]
```
That's a lot of psr and psl instructions for a "single 128-bit wide shift"...
On 64-bit computers you've been able to use 128-bit unsigned ints as a gcc extension for a long time now, and the bit-twiddling stuff works exactly as you'd expect. The relevant types are __int128 / unsigned __int128.
Clang has something similar, and my understanding is that C23's _BitInt will let you do this too.
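For example, a minimal sketch using that extension (the values are arbitrary, just to show that masks and shifts above bit 63 behave as expected):

```
#include <stdio.h>

int main(void) {
    /* GCC/Clang extension; arbitrary test values. */
    unsigned __int128 one  = 1;
    unsigned __int128 y    = (one << 101) | (one << 102) | 0x1234;
    unsigned __int128 mask = (one << 101) | (one << 102);
    unsigned __int128 x    = (y & mask) >> 10;   /* the expression from above */

    /* printf has no 128-bit format, so print the two 64-bit halves. */
    printf("%016llx%016llx\n",
           (unsigned long long)(x >> 64),
           (unsigned long long)(x & 0xFFFFFFFFFFFFFFFFull));
    return 0;
}
```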
Yes, but compilers seem to disagree on whether they use a pair of 64-bit registers, or an SSE register under the hood (reusing link from a reply): https://godbolt.org/z/xEqrx5dY4
while the back-of-the-napkin math is surprisingly not terrible, of course it's a log scale so Mar's Law is relevant: "Everything is linear if plotted log-log with a fat magic marker".
Mashey was really talking about workstations, not PCs. of course, the line is blurred or non-existent now. what if we look at x86ish PCs?
presume that in 1995, a nicely appointed high-end PC was a 486DX-33 (w/487) and 16MB of ram. that requires 24 bits of physical address. using the 2-bits-every-3-years estimate (1.5 years per bit), we find that we need more than 32 bits around 1995 + (3/2 × (32 − 24)) = 2007.
AMD64 came out in 2003/4. but my own recollection is it wasn't really the sunset for 32 bit PCs until just about that timeframe (2007 ish). so that's not too far off.
now apply forward to when we "use up" 48 bits (or 52). and it would be around 2031-2037. possible?
now the other thing is that (imho) a nicely appointed PC today in 2024 is 64GB (36 bits) vs. 16MB (24 bits) in 1995. does this track? not really. the 2 bits every 3 years would predict that we'd want 64GB machines around 2013. that's not really realistic. and it'd predict that we'd want ~16TB PCs by 2025.
it seems apparent that the exponential growth that was happening in the 1980s and 1990s has either slowed substantially or is no longer exponential for perhaps the last 20ish years.
i think the lesson (and for Moore's law too) is that apparent exponential growth is not going to continue forever.
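for what it's worth, the projection above boils down to a one-liner; a tiny sketch (assuming the 24-bits-in-1995 starting point and 1.5 years per bit):

```
#include <stdio.h>

/* Back-of-the-napkin projection: 24 address bits in 1995,
   growing at 2 bits every 3 years (i.e. 1.5 years per bit). */
static int year_needed(int bits) {
    return 1995 + (3 * (bits - 24)) / 2;
}

int main(void) {
    const int widths[] = { 32, 36, 44, 48, 52 };
    for (int i = 0; i < 5; i++)
        printf("%2d bits -> ~%d\n", widths[i], year_needed(widths[i]));
    return 0;   /* prints 2007, 2013, 2025, 2031, 2037 */
}
```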
> For many years DRAM gets 4X larger every 3 years, or 2 bits/3 years.
This trend in DRAM scaling stopped quite a while ago, though I don't know exactly when. I think it was before 4 GB became common in desktop computers.
At some point we'll run out of atoms in the solar system for the physical memory ;)
(a large virtual address space is of course useful on its own, for instance for never reusing a memory address, but we're nowhere close to scratching the 64-bit barrier - e.g. AFAIK x86-64 CPUs are limited to a 48-bit virtual and 52-bit physical address range).
Exponential growth. 640k is obviously not enough. Nor is double that. Or double that. Or double that. But keep doubling long enough, and eventually the length of time before you need to do it again will be considerable.
We might be there. And 128 bits is a lot of bits. You know what bit width you need to represent twice as many states as a 64 bit value?
It kinda has been advancing, but the advancement has been happening on the SIMD/SSE/AVX side of things (up to 512 bits now).
Regarding the non-SIMD side of x86, what are good driving needs to process 128-bit values among the normal instruction stream?
- 128 bits is 16 bytes, and that's a decent maximum length for many textual identifiers. These can be loaded and compared with single instructions, and possibly processed without ever touching intermediate memory (see the sketch after this list).
- It would be super convenient to load a GUID/UUID or IPv6 in a single register/instruction I guess. Would Intel get an `RDUUIDV4` instruction and be able to generate them natively?
- A bigger space for `mmap()` could have interesting possibilities.
- 64 bits of additional randomization available for memory paging could make security techniques like ASLR a bit better.
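For the UUID/IPv6 case, here's a minimal sketch of the scalar approach using the __int128 extension mentioned elsewhere in the thread (function names like `uuid_equal` are made up for illustration):

```
#include <stdint.h>
#include <string.h>

/* Treat a UUID or IPv6 address as a single 128-bit integer
   (GCC/Clang __int128 extension). */
typedef unsigned __int128 u128;

static u128 load_u128(const uint8_t bytes[16]) {
    u128 v;
    memcpy(&v, bytes, sizeof v);    /* raw bytes, native byte order */
    return v;
}

static int uuid_equal(const uint8_t a[16], const uint8_t b[16]) {
    /* One logical comparison; typically compiles to a couple of 64-bit
       compares or a single vector compare depending on the target. */
    return load_u128(a) == load_u128(b);
}
```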
Data-level parallel processing (or SIMD vector width) is a different thing to address space width. If you want to see some _really_ wide units, look at GPUs.
For UUIDs, 16-byte short strings and IPv6, there's no real reason the SIMD units couldn't do the work there. (Granted the existing vector units may be a bit short on features for working "across" lanes - I'm not sure how capable they are at dealing with null-terminated C strings for example).
In principle at least (and allowing for some snags around memory alignment), C++ std::strings with the short string optimisation (which stores the string's data inside the string object itself, if it's shorter than a certain number of bytes, so on the stack for a local string) can already be loaded into vector registers and indeed never materialised in memory at all. How much this happens in practice I wouldn't like to say, but it's not that hard to roll your own stringid_16 or whatever with conversion operators.
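And a rough sketch of the SIMD variant in C with SSE2 intrinsics - comparing a 16-byte identifier in one vector compare (again, `id16_equal` is just an illustrative name):

```
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Compare two 16-byte identifiers with a single vector compare. */
static int id16_equal(const uint8_t a[16], const uint8_t b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned 16-byte load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);                 /* 0xFF for each equal byte */
    return _mm_movemask_epi8(eq) == 0xFFFF;              /* all 16 bytes matched? */
}
```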
You don't need 128 bits for memory addressing, but for raw processing - yes, and in fact 128 bits is far less than we're already using! If you look at https://github.com/ggerganov/llama.cpp you'll see this line:
> AVX, AVX2 and AVX512 support for x86 architectures
You are assuming each atom represents one bit. But each atom could represent 2^N states (charge, spin, location, etc.), in which case you could store N × 2^166 bits on Earth!
I was referring to addressing each atom individually. Indeed, you could possibly store more than 1 bit of information at each address. Maybe call it an "atombyte".
Similarly in computer RAM, we don't address each "bit" individually but actually each "byte" (8 bits). Or maybe we address each word?