IEEE Spectrum in the early 80's had an article on minimal instruction sets. The author made a salient point that 7 instructions did most of the heavy lifting, and another 7 did most of the rest that was required. He also highlighted that one mistake made with instruction sets was not separating the addressing modes from the instruction itself.
There have also been various projects looking at micro-instruction based machines and noting that extremely complex instruction sets could be designed so that the actual requirements of a programmer or project could be placed in microcode. One example was an instruction to search a tree structure for a required value (as a single machine instruction).
Not that I can do anything about it, my conclusion over nearly 4 decades is that we have failed to take advantage of the increasingly denser silicon to make higher level machines, in particular machines that recognise the difference between an instruction and a data element.
We can write very high-level software languages, but they all have to run on extremely low-level hardware. Too much commodity and not enough variability.
I have managed to rummage around and find the article. It was actually in IEEE Micro, May 1982: "A Unique Microprocessor Instruction Set" by Dennis A. Fairclough.
The article looked at the statistical usage of instructions and placed them into 8 broad groups: Data Movement, Program Modifying, Arithmetic, Compare, Logical, Shift, Bit, and I/O & Misc.
The result of the analysis was as follows:
For groups Data Movement and Program Modifying, 1 Instruction - MOVE [Cumulative usage 75%]
For group Arithmetic, 4 Instructions - ADD, SUB, MULT, DIV [Cumulative usage 87.5%]
For group Compare, 1 previous Instruction - SUB [Cumulative usage 93.75%]
For group Logical, 3 Instructions - AND, OR, XOR [Cumulative usage 96.88%]
For group Shift, 1 Instruction - SHIFT [Cumulative usage 98.44%]
For group Bit, 1 Instruction - MOVEB [Cumulative usage 99.22%]
For group I/O & Misc, it depends on whether I/O is memory mapped or otherwise, so either 0 or 1 Instruction.
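As an aside, the cumulative figures above follow a neat pattern: MOVE covers 75%, and each further group covers roughly half of whatever usage remains. A few lines of Python reproduce the numbers, with only the 75% base and the group order taken from the list above:

    # Reproduce the cumulative-usage figures: MOVE covers 75%, and each further
    # group covers roughly half of whatever usage remains.
    groups = ["MOVE", "ADD/SUB/MULT/DIV", "SUB (compare)", "AND/OR/XOR", "SHIFT", "MOVEB"]
    cumulative = 75.0
    for name in groups:
        print(f"{name:18s} cumulative {cumulative:.2f}%")
        cumulative += (100.0 - cumulative) / 2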
Some possible extended instructions included INC, DEC, I/DBRC and MOVEM.
The address fields (and in some cases additional flags) determine source and destination, etc.
So, in relation to the VLIW question, the instruction length is determined more by the addressing modes allowed than by the specific instruction itself (which is encoded in say 4 bits).
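To make that concrete, here is a toy encoding (entirely hypothetical field widths, not Fairclough's actual layout) where a 4-bit opcode is dwarfed by the two operand descriptors, each carrying its own addressing mode:

    # Toy instruction word: a 4-bit opcode plus two operand descriptors, each with
    # a 3-bit addressing mode and a 5-bit register/displacement field. Only 4 of
    # the 20 bits name the operation; the rest describe where the data lives.
    def encode(opcode, src_mode, src_reg, dst_mode, dst_reg):
        word = opcode & 0xF
        word = (word << 3) | (src_mode & 0x7)
        word = (word << 5) | (src_reg & 0x1F)
        word = (word << 3) | (dst_mode & 0x7)
        word = (word << 5) | (dst_reg & 0x1F)
        return word

    MOVE, REG_DIRECT, REG_INDIRECT = 0x0, 0b000, 0b001
    print(f"{encode(MOVE, REG_DIRECT, 3, REG_INDIRECT, 7):020b}")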
Are you talking about Very Long Instruction Words (VLIW)? IIRC, that's the concept of having the scheduling done in the compiler. Instead of the compiler spitting out op codes and operands, each part of the instruction space controls specific functional units. So if you stick something in the adder, the documentation might specify the next x instructions need to be nops so it can complete.
The idea being that if scheduling is done at compile time, you can update the compiler and update the scheduling/performance. Unfortunately, creating compilers for these systems was incredibly difficult, hence the failure of IA-64/EPIC.
There also used to be the Transmeta processors which were x86 compatible and did the translation to VLIW in software (CMS - Code Morphing Software).
I used to have a hp thin client with a Transmeta Efficeon where I installed Debian and it worked remarkably well for a 2004 processor, handily beating some Atoms and VIA Nanos in real-life workloads. Frankly I don't know what led to its demise.
> Not that I can do anything about it, my conclusion over nearly 4 decades is that we have failed to take advantage of the increasingly denser silicon to make higher level machines, in particular machines that recognise the difference an instruction and a data element.
That's basically the CISC concept (https://en.wikipedia.org/wiki/Complex_instruction_set_comput...) all over again. The problem still being: which instructions do you choose to implement in hardware and how do you make a compiler smart enough to recognize when to use them? Remember, CISC went out of style because all of those high level instruction weren't being used when processor manufacturers generated statistics on commonly used software.
This is where the programmer or even compiler writer should have control, by allowing access to a micro-programmable base that supplies a standard set of instructions. The capability of adding instructions as needed would be under the control of the machine administrator. For most tasks, this additional capability will not be needed, but those who need high optimisation can then pursue it with aplomb.
Especially for those who are writing VM's for their respective high-level languages. As with anything, security of the hardware would need to be considered.
One set of minicomputer hardware that I used to program on decades ago (Honeywell DPS6-92's) had a micro-programmable cpu. We didn't use it, as the base level hardware and associated software were sufficient for our needs. But I had heard that various organisations had incorporated their own instructions when needed.
The choice of normal instructions can be quite small, but the ability to put in user defined instructions would allow certain models of thought to be tested and/or developed at a hardware level.
> ...allowing access to a micro-programmable base...
What you're probably looking for is something like Forin's eMIPS (extensible MIPS) project that includes a block of FPGA cells as part of the execution pipeline to allow creation of new instructions as needed.
The knee-jerk answer upon seeing the title was: "Too many".
So many of these instructions I never see anyone use, ever.
No doubt there are some folks using them, but as mere mortals, do the rest of us really need all these features to control our small amount of commodity hardware? As a user, I have modest goals. Is it not true that Torvalds wrote his kernel with a similarly modest goal in mind: control over his own commodity computer?
The situation resembles that of an overcomplex software program where a majority of the features are unused by an even larger majority of its users. In other words, the depth of features benefits only the very few people who use them.
Given the choice between several alternatives with differing levels of features, I tend to opt for software that is less featureful and hence simpler. Call me simple-minded if you wish. The same goes for processors, although when it comes to hardware how much choice do we really have as end users? (Hobbyist boards excluded.)
For a taste of some non-x86 assembler, I enjoyed experimenting with a MIPS simulator still found at spimsimulator.sourceforge.net. I can report that the non-GUI portion at least still compiles relatively cleanly on BSD. This simulator has been mentioned on HN several times.
I have no problem with using a processor with fewer instructions even if I have to sacrifice something by making that choice -- I leave it to the experts to detail those sacrifices and why I would be a fool to make them. NB: I am already a fool so it may not be worth the effort.
How many HN readers have tried RISC-V? A poll for those who have not: will RISC-V inspire you to purchase a new computer?
I often come across problems, like intermittent graphical glitches, in Ubuntu Linux applications and think to myself "maybe I'll try fixing that bug". I start digging casually, and find that the real issue is not in the application but somewhere in the graphics stack. Or maybe the kernel. Oh no, it's in a closed-source driver. Then I despair and start thinking thoughts like "wouldn't it be nice to rewrite all this from first principles? That way I could get it right!"
Then I do some more research and find that all this flaky software is built on proprietary, minimally-documented hardware with its own stack of bugs, except those bugs will never get fixed because the IP is top-secret and the only ten people who understand it have already moved on to build the next product.
So RISC-V/lowRISC is enticing because it promises an architecture that is powerful enough to be more than a toy or academic exercise, and also fully open from the ground up, which the public can iterate on and finally fix bugs - or at least understand them.
(Yes I know, I'm mixing complaints about GPUs into a CPU conversation here...)
I'm also encouraged by the slowdown in Moore's Law - alternative architectures have historically been steamrollered by Intel's phenomenal process engineering capability. If process nodes reach a plateau, the mad miniaturization march of the last 50 years will pause for breath and let a much wider and deeper variety of hardware hackers get involved.
Keep in mind that the numbers presented in the article are large because the majority of them come from a nearly cartesian product; if you consider the ALU-ish operations alone, there's the operation (add, sub, adc, sbb, and, or, cmp, xor, etc.), addressing mode (reg, mem[imm], mem[reg], mem[reg+reg*scale], mem[reg+imm], mem[reg*scale+imm], mem[reg+reg*scale+imm]), width and type (int8, int16, int32, int64, float32, float64, float80, plus vector variants: int8x8, int16x4, int32x2, int64x1, float32x4, ...).
From the list above there are already roughly 8x8x16 = 1024 "instructions", which within an order of magnitude correlates nicely with the estimates given in the article. The rest (probably dozens at most) are mainly for OS-oriented system management, and special operations done in hardware like AES (without which the equivalent software implementation would be orders of magnitude slower).
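For what it's worth, the ballpark is easy to reproduce from the lists above (they are not exhaustive, so only the order of magnitude matters):

    from itertools import product

    # The rough cartesian product from the comment above (not an exhaustive list).
    operations = ["add", "sub", "adc", "sbb", "and", "or", "cmp", "xor"]
    addressing_modes = [
        "reg", "mem[imm]", "mem[reg]", "mem[reg+imm]", "mem[reg+reg*scale]",
        "mem[reg*scale+imm]", "mem[reg+reg*scale+imm]",
    ]
    widths = [
        "int8", "int16", "int32", "int64", "float32", "float64", "float80",
        "int8x8", "int16x4", "int32x2", "int64x1", "float32x4",
    ]

    print(len(list(product(operations, addressing_modes, widths))))  # 8*7*12 = 672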
The main "simplification" which RISCs have done is restricted the operand types, so that e.g. most if not all the ALU ops must use register addressing modes, and all the other addressing modes (of which there are not many) are restricted to memory-register move instructions. That turns parts of the cartesian product into a sum, reducing the instruction count by the above measures, but IMHO is the wrong thing to do since now it means all software has to contain more instructions to do what the hardware would otherwise be able to figure out (and possibly optimise execution of) in a CISC. For example, x86 can express mem[reg + reg * scale + imm] = reg + mem[reg + reg * scale + imm] in a single instruction (decoded into multiple uops, which can be scheduled into whatever hardware resources are available) while a RISC would require several.
I'd be more interested in seeing the maximum number of individual opcodes Clang and GCC can output for x86_64. I suspect there are many they don't use because they're either obsolete, unnecessary, or for very specific operations such as those used by video codecs.
Prediction: RISC-V will at best catch on in some niche markets. If RISC-V is lucky, it will maybe displace MIPS.
--
Usual P.S.: The ISA has become less and less relevant to processor performance since the mid 90s (when Intel and AMD introduced micro-ops to the x86 world by dynamic translation) and is now less relevant than ever before. What is relevant is developer mindshare, good compilers, availability of optimized algorithms, and manufacturers pouring billions into creating better scheduling and execution for µops.
I have RISC-V hardware (got a handful of HiFive1 boards) and they're definitely the fastest arduino compatibles to date. If SiFive made an SoC for Chromebooks (maybe in collaboration with a large manufacturer like Samsung), I would buy them and suggest them to friends. I would love to have a big and wide RISC-V desktop workstation (couple hundred gigs of ram and a couple dozen cores) once all the prerequisites are in place, but that stuff takes more time it seems. The Shakti folks over at IIT Madras seem like they'll be the first ones to tape out RISC-V workstations, servers, and HPC nodes.
Privilege separation models are finicky, so it's probably not a good idea to rush that.
I think that if we can all throw our weight behind RISC-V, the availability, diversity, and good functioning of computer platforms will improve drastically.
My favorite legacy x86 instructions are the BCD opcodes like AAA (Ascii Adjust after Addition). They are apparently now slower than doing the conversion and operations yourself, but are kept in there for compatibility with the original 8086.
They likely decode to the same sequence of uops internally; I benchmarked them and they're basically the same speed as the equivalent sequence of simpler operations, but a lot shorter. See this item for more information:
> No doubt there are some folks using them, but as mere mortals, do the rest of us really need all these features to control our small amount of commodity hardware?
They aren't for mere mortals though, they are for compilers.
>will RISC-V inspire you to purchase a new computer?
Not directly, but whatever in it inspired the lowRISC team will likely have me buying one of their boards. Mostly for the minion cores on them, which meet a need I have.
From my understanding, we can't push sequential instructions per second much higher than we already have.
So the instructions have to do more each (CISC), or we have to do a lot of instructions in parallel. Maybe RISC could shine in massively parallel processing units.
Although the Intel CPUs have a CISC instruction set, internally the instructions are converted to RISC-like uOPs in the early instruction decode stage. So a CISC instruction that increments a memory location is converted into uOPs that load from memory into an internal register, increment that register, and then store it back to memory. These days, the uOPs are so powerful that they do the opposite at times, like merging adjacent compare and branch instructions into one uOP.
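A purely conceptual sketch of that decode step (not how any real decoder works, just the split of a memory-destination operation into load/ALU/store uOPs):

    # Conceptual only: a memory-destination ALU instruction expands into
    # load / ALU / store micro-ops on an internal temporary register.
    def decode_to_uops(insn):
        op, dst, src = insn                  # e.g. ("add", "[rbx]", "rax")
        if dst.startswith("["):              # memory destination -> split it up
            return [("load", "tmp0", dst), (op, "tmp0", src), ("store", dst, "tmp0")]
        return [insn]                        # register forms pass through as one uOP

    print(decode_to_uops(("add", "[rbx]", "rax")))   # three uOPs
    print(decode_to_uops(("add", "rcx", "rdx")))     # one uOP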
LLVM's answer to this question is about 14,600. This comes from treating memory and registers operands as different (i.e., add %rax,%rbx and add %rax,(%rbx) are two different instructions, although note that add %rax,%fs:16(%rbx,%rdi,4) is counted as the same instruction as add %rax,(%rbx)).
I will also point out that LLVM's list is not exhaustive--it's definitely missing a few operand types for a few instructions (e.g., nop %rax).
This is the wrong way to go about it. x86 mnemonics bear little resemblance to the encoded binary machine code. For example, what x86 lumps into a single "mov" mnemonic is actually half a dozen different underlying instructions, e.g. load, store, reg-reg, and a few special cases.
It's the wrong question too. Perhaps what you're looking for is: "What is the number of combinations in which an instruction can be decoded?" This would need to lump together all the multi-bit fields (such as immediates) as one (or a few, if there are special values). This would be a measure of the expressivity of the instruction set, and somewhat a measure of the encoding efficiency: how much of the execution units' inputs it can cover.
It's a much easier thing to answer for most of the "RISC"-oriented architectures, e.g. ARM 32 and 64 bit. It's basically the set of valid binary encodings of instructions, compressing together all the immediate values.
All the different instructions are listed in intel's instruction reference manuals. What is not very wise about this article is that he's using AT&T syntax. It adds one pointless level between the actual assembly and the programmer, and it has these syntax differences compared to reference manuals, which makes it frustrating to use (both read and write).
I made a Python package which documents x86(-64) instructions in a ready-to-use way: https://github.com/Maratyszcza/Opcodes (also `opcodes` on PyPI). With this package, it's easy to collect ISA stats, e.g.
    import opcodes.x86_64
    isa = opcodes.x86_64.read_instruction_set()
    print(sum(len(instruction.forms) for instruction in isa))
    >>> 6020
As another example, here is the number of instruction forms (e.g. mnemonic name + operand types) over time on Intel CPUs: http://imgur.com/a/AVPcq
Maybe I should add that sometimes very different instructions get assigned to the same mnemonic. Like the venerable mov. Moving data between typical general purpose registers is very different from moving data to and from debug registers and control registers; they have different opcodes, yet the same mnemonic. You won't find them in typical programs because they need Ring 0.
I quite like RISC-V; it's simple, elegant and efficient. Undergraduate students can quite easily put together competitive in-order cores. That itself is a testament to it. No wonder, considering Patterson and Hennessy literally wrote the book on computer architecture!
In the undergraduate level computer architecture class I just finished, it was pretty much stated as fact that the simpler design and fixed instruction format of RISC based architectures makes them much more suitable for pipelining, and hence leads to better performance than those based on CISC. Why are x86/64 processors used for most high performance applications then?
Because your course is based on outdated materials.
In the 80s and 90s academia and high-end computing were taken in by a RISC illusion. RISC architectures were simple and clean and easy to implement at higher performance than any CISC processor of the time, so people assumed that of course RISC would be the future, right? Well, no. CISC, or, to be more truthful, x86 was a mass-market item with competing manufacturers with deep pockets (like Intel and AMD).
In the mid 90s both Intel and AMD broke RISC's neck (though it was perhaps not immediately clear what a groundbreaking innovation this was): they introduced processors that translated x86 instructions to other instructions (µops) that were then scheduled and executed. Turns out, this is pretty efficient for a number of reasons (1- you get existing software 2- x86 code can be really dense; RISC had instruction bloat issues 3- decoding insns is only a very small part of what a CPU does and consumes only a minority of the transistors 4- you can change the µops, they are an implementation detail, to best suit the exact core at hand; something impossible for RISC). This allowed x86 all the benefits of what RISC previously had for itself.
By ~2000 most RISC architectures were dead (Alpha, SGI, ...) with only POWER and SPARC surviving in their niche markets.
In summary: The importance of the ISA was vastly overestimated by researchers, who assumed that this was a blocking issue to make x86 go faster. In the end, x86 won, because x86 had the bigger market with more money in it.
> By ~2000 most RISC architectures were dead (Alpha, SGI, ...) with only POWER and SPARC surviving in their niche markets.
You forgot ARM, which is a major omission.
> In the end, x86 won, because x86 had the bigger market with more money in it.
It's not the end yet. :) x86's dominance in the future is hardly assured, and in fact if you measure by total number of microprocessors sold, x86 is a small niche--300-400 million sales per year compared to 15 billion ARM chips sold.
Only the condition code was removed. Otherwise the parent is right and there really is an enormous amount of instructions. Fortunately less than intel, but still a lot.
So? Samsung also sells a lot more smartphones than Apple, yet Apple makes far more profit from them. You can't pay R&D with revenue.
There are probably more PowerPC chips sold via laser printers and such than x86 chips. Conversely, 8051 derivatives probably outnumber all of these. So what. Meaningless argument.
Even with the complexity of Thumb-1, Thumb-2, and NEON, I'd give the edge to ARM in simplicity. At least the instruction encodings for ARM make sense, even today. x86 instruction decoding is absurd.
More importantly, ARM has a way to massively reduce its decoding complexity, via AArch64. AArch64 is a separate RISC instruction set, with room explicitly left in the design for dropping 32-bit ARM in the future (32-bit support is optional, though in practice today everyone includes it). ARM is effectively in the process of making a transition to a more RISC-like ISA, unlike Intel which has no realistic way to get there.
> Even with the complexity of Thumb-1, Thumb-2, and NEON, I'd give the edge to ARM in simplicity. At least the instruction encodings for ARM make sense, even today. x86 instruction decoding is absurd.
If you write down the x86 instruction encoding in octal, it makes a lot more sense:
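A small illustration of why: the ModRM byte is a 2-3-3 bit split, so in octal each digit lines up with one field (a quick sketch using 0xD8, the ModRM byte of "add eax, ebx"):

    # x86 ModRM is mod(2 bits).reg(3 bits).rm(3 bits), so octal digits map to fields.
    def modrm_fields(byte):
        return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

    modrm = 0xD8                 # from "add eax, ebx" (01 D8)
    print(oct(modrm))            # 0o330: mod=3 (register direct), reg=3 (ebx), rm=0 (eax)
    print(modrm_fields(modrm))   # (3, 3, 0)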
> Why are x86/64 processors used for most high performance applications then?
Because (1) you can overcome those issues by throwing enormous manpower at them; (2) the business success of Windows, coupled with Intel's patent portfolio to keep out competitors, historically generated enough revenue to fund that manpower.
(There are signs that this may not sustain itself forever, with Intel's failure in mobile and the performance disappointment of Kaby Lake, but regardless of how well the strategy will continue to perform in the future, it worked for decades.)
Regarding single-threaded performance it just seems increasingly likely that we finally hit the wall of what the current technology is able to support. I don't expect major advances here until other manufacturing tech is used.
I have a naive understanding, but the compilers had to do a lot more work with the RISC systems. Also, is anything really your classic RISC any longer? There are lots of new instructions, and instructions that take a lot longer than 1 cycle. Anecdotally, when looking at the assembly for Intel or ARM, they are often very close.
> Anecdotally when looking at the assembly for Intel or Arm, they are often very close.
First: Modern ARM chips (i.e. what you will typically find in a mobile phone) support three instruction sets: A32, T32 (Thumb-2 on modern ARM chips) and A64 (if you don't know what they mean, google it up). So it hardly makes sense to talk of "ARM assembly" if you don't specify which instruction set you mean. I know there is UAL, which in some sense unifies the instructions of A32 and T32 on assembly level.
There are very central differences that you will observe when looking at the disassembly of all ARM instruction sets vs. x86-32 or x86-64.
- Typical x86 ALU instructions have 2 operands. Typical ARM ALU instructions will have 3. This is very typical for CISC vs. RISC instruction sets.
Side note: A compiler often does register allocation by graph coloring. If 3-operand instructions are used this is an NP-hard problem. On the other hand if 2-operand instructions are used, we obtain a chordal graph, for which the graph coloring problem can be solved in polynomial time: https://en.wikipedia.org/w/index.php?title=Chordal_graph&old...
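A rough sketch of the coloring idea itself (a hand-made toy interference graph, not derived from real code): nodes are virtual registers, an edge means two values are live at the same time, and colors are physical registers. On a chordal graph, greedy coloring in the reverse of a perfect elimination ordering is optimal.

    # Toy interference graph: v0-v1-v2 form a triangle, v3 only conflicts with v2.
    interference = {
        "v0": {"v1", "v2"},
        "v1": {"v0", "v2"},
        "v2": {"v0", "v1", "v3"},
        "v3": {"v2"},
    }

    def greedy_color(graph, order):
        colors = {}
        for node in order:
            taken = {colors[n] for n in graph[node] if n in colors}
            colors[node] = next(c for c in range(len(graph)) if c not in taken)
        return colors

    # ["v2", "v1", "v0", "v3"] is the reverse of a perfect elimination ordering,
    # so the greedy pass uses the minimum number of colors (3, due to the triangle).
    print(greedy_color(interference, ["v2", "v1", "v0", "v3"]))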
- x86 ALU instructions also accept a memory operand both as source and destination (side note: if we add a register to the value at a memory address, on modern x86 cores this is mapped to 3 µops: load value, add, store value). On the other hand (typical for RISC) ARM is a load-store architecture. So ALU instructions don't accept memory address contents as operands - instead one has to load the content of the address into a register first, do the ALU instruction and use a store instruction to store the result in memory.
Side note: On x86 "{alu_instruction} r/m, r" (e.g. add [edx], eax) is atomic with respect to interrupts, i.e. when an interrupt occurs either this instruction has been executed completely or not at all. You want to make it also atomic with respect to code executing on other cores? Just add a LOCK prefix.
- Even in 32 bit mode ARM already offers many registers (16 GPRs), and in A64 even 32. x86-32 only offers 8, and x86-64 16. Having more, and more orthogonal, registers is typical for RISC vs. CISC. Thus ARM's calling convention uses registers for passing parameters (again very typical for RISC) and uses a link register for the return address. x86-32 instead typically passes the return address and parameters on the stack (I know that under 32 bit Windows there also exists the __fastcall calling convention that uses registers for passing parameters, too. But this is an exception). But admittedly, under x86-64 the Linux and Windows calling conventions also pass parameters in registers.
- At least A32 and T32 cannot directly call functions that are too far away from the current instruction. So the linker has to insert veneers, which make this possible. One of course will not find veneers on x86.
- It is not easy to access memory addresses relative to the instruction pointer on x86-32 (under x86-64 it is easily possible), while on ARM it is. So ARM assembly code will often contain small constant values next to the function. One hardly sees such code on x86-32 (though on x86 it is not really necessary, since there are instructions that put an arbitrary constant into a register; A32 and T32 on the other hand only allow encoding certain kinds of constants, based on a rather complicated scheme that I will not explain here - a rough sketch follows below; so this trick is rather helpful for writing ARM code).
TLDR: Better look again at the disassembly carefully.
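For the curious, the A32 constant scheme alluded to above is "an 8-bit value rotated right by an even amount"; a rough checker for whether a given 32-bit constant can be encoded directly:

    # A32 data-processing immediates: an 8-bit value rotated right by 0, 2, ..., 30.
    def encodable_a32_immediate(value):
        value &= 0xFFFFFFFF
        for rot in range(0, 32, 2):
            # rotating the value left by rot undoes a ROR by rot
            rotated = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
            if rotated < 256:
                return True
        return False

    print(encodable_a32_immediate(0xFF000000))  # True  (0xFF ROR 8)
    print(encodable_a32_immediate(0x00012345))  # False (needs a literal pool, or movw/movt)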
Also non-ancient ARM chips add the movw and movt instructions so you can load a 32 bit constant simply in 2 insns rather than having to play games with sequences of arithmetic insns or use a constant pool.
I accept this point and admit to being a little bit biased, since I learned a lot about ARM disassembly by reverse engineering code that runs on an ARMv5 ("ancient" by your standards) core.
Why are x86/64 processors used for most high performance applications then?
Well, you should start from this and then work backwards because it's just an empirical fact that x86_64 is fast and that RISC isn't. There are many reasons for that, none of them fair, but in the end, it's still just a fact.
Yes, RISC-V is simple+elegant. If that's what you want, great. True, RISC-V is open source. If that's what you need, awesome. Now weigh those qualitative features against the cold hard $1B/year that Intel invests in making the mess that is x86 run fast. Add to that the billions that ARM+Samsung+NVidia+Apple+... spend on making ARM fast. It's not a fair fight. It really isn't.
But know this. If RISC had anything tangible, any fundamental advantage, then we'd be living in a RISC world by now and we aren't (Patterson+Ditzel was 37 years ago). If you want to think worse is better, sure. Whatever. If you want to call ARMv8 MIPS-ish, that's your constitutional right.
Skylake is a superscalar, speculative, out-of-order, renaming, hyperthreaded multicore beast. You may program it in x86 but those instructions get translated+cached as μops, very wide microinstructions.
BTW, anyone who thinks that μops are actually RISC under the hood severely needs to re-read The Case for the Reduced Instruction Set Computer and then say, Inside Nehalem [1] or Micro-operation cache: A power aware frontend for variable instruction length ISA [2]; microprogramming (vertical+horizontal) predates RISC by decades [3]. These μops are wide, 150+ bits. Ain't nothing reduced about that.
In 2017, believing in RISC is only slightly more acceptable than believing in Mill. Moreover, I say this as someone who went to Berkeley and read Patterson+Ditzel in 252. I should preach the religion.
In one of the Mill talks, I've long since forgotten which, someone asks Ivan Godard a question about RISC. His response is something like "there was a brief window in the eighties where if you had a RISC machine you could get the whole computer onto one chip." I don't have enough experience to know if this is right, but it strikes me as a nice clean explanation. It also explains why x86 won since then, because a couple generations later it was possible to get the whole computer onto one chip with x86 as well.
(I may have misquoted badly, in particular I'm not sure about the dates)
Your dates sound about right. The eighties. Mead+Conway was just out and fabs of a certain size became accessible. The question then was what could you do with these fabs and transistor budgets? Berkeley+Stanford did RISC+MIPS. Clark did the Geometry Engine at Stanford+SGI.
So what could you do with X transistors? But then X became stupid large. The 68000 (1979) had 40,000 transistors. Now the Apple A10 has 3.3-billion transistors. So you can imagine that architectural design assumptions dating from 1980 will need to be revisited.
Where did you get that figure? I'm curious because the figure I've seen tossed around is 68,000 transistors (the story going that this is where the model number came from).
It's much more likely (like infinitely more likely) that the 68000 name stems from the earlier 6800 8-bit product line. There's nothing definitive saying 68,000 transistors and I've read both 40,000 and 68,000. I believe that 68,000 was just marketing. FWIW, Motorola also employed 68,000 people in 1980 (approximately):
This was also the very brief time when memory was actually faster than the core, making it possible to contemplate things like large fixed-width instruction encodings, since the main bottleneck with the early CISCs was instruction decoding and not fetch bandwidth; it is unlikely that, had this period not existed, RISC would have developed in the way that it had.
It's ironic that Ditzel went from Berkeley RISC to Bell Labs and helped develop CRISP [1]. This was the invention of the Decoded Instruction Cache which solved the fetch/decode bandwidth problem. You could have a CISC and not have to worry about decode bandwidth. You could have a wide μop and not have to worry about fetch bandwidth.
My own understanding is more simple... people care mostly about application availability, and x86 was chosen as the primary platform for Microsoft Windows, so anything that didn't support the x86 instruction set was at a disadvantage. These days with interpreted or JIT compilation, things are less certain, but in those days running things on bare metal mattered.
Yes, RISC-V is simple+elegant. If that's what you want, great. True, RISC-V is open source. If that's what you need, awesome. Now weigh those qualitative features against the cold hard $1B/year that Intel invests in making the mess that is x86 run fast.
I wonder how hard it would be for Intel to substitute an alternative decoder in place of the current x86/x64. Given the existing micro- and macro-fusion capabilities, is there anything that would prevent them from reusing the loop streamer and decoded instruction cache and entire back-end for a different instruction set? While there are probably legal issues with ARM or Power emulation (are there?), it seems like it wouldn't be too hard for them to quickly put together a high-performing hyperthreaded superscalar out-of-order alternative for RISC-V should they ever want to.
BTW, anyone who thinks that μops are actually RISC under the hood severely needs to re-read ...
Could you expand on this? I take you to be saying that Intel µops aren't really the same as classic RISC at all, but you seem to be the only person in this thread that holds this position. The majority appear to be saying straight-out that current Intel is actually RISC under the hood. Personally, I suspect that you are right, and that it would be useful for you to make this claim more forcefully. Or maybe I'm the only one that believes this, and I'm misinterpreting you?
Microprogramming is a technology which goes back to the 50s. CISCs were microprogrammed. The VAX was microprogrammed. It was the status quo circa 1980. RISC as a term really dates to 1980 (although the ideas go back to the 801 and Cray's machines).
The big idea of RISC was to avoid microprogramming and have a simple to decode ISA which was directly executed by a short simple pipeline. ld/st + lots of registers + C compiler. Awesome.
Indeed in Patterson+Ditzel's version of Luther's 95 theses, they say:
Microprogrammed control allows the implementation of complex architectures more cost-effectively than hardwired control
They say this in the section Reasons For Increased Complexity. RISC is revolting from that complexity, from microprogramming, so how can μops be RISC if that's what they were revolting from?
I think people say this because no one knows what microprogramming is anymore. You might read one paper in a graduate architecture seminar. And no one but no one writes microprograms.
The decode stage of Haswell translates add RAX,RBX into a '150b' μop. If you read Agner Fog you'll know how many μops there are, what the latencies are, etc. These are empirically determined properties of the μops, but that's it. You know the Decoded ICache characteristics. But you don't know the ISA.
After all that's done, you're left with maybe a 14-pipestage data path. That's not RISC. Complicated instructions (AAA) get interpreted. That's not RISC. Multiple functional units can calculate 2+3 operand effective addresses. That's not RISC. This all is REALLY not RISC.
μops are micro-instructions, not RISC instructions.
BTW, the alternate decoder kinda wouldn't work so well because the microarchitecture, functional units, datapath, caches and registers are really set up for x86. Folks wanted something like that at Transmeta, a Java processor, but the HW was designed for x86.
It may be worth mentioning that the BCD suite of instructions are not supported in x64. But if you are looking for an example of modern microcoded x64 to avoid, BTR/BTC/BTS with a memory operand is a prime example. Which does bring up the point that talking about "x86/x64" can be misleading in a similar way as discussions of the "C/C++" language.
the alternate decoder kinda wouldn't work so well because the microarchitecture, functional units, datapath, caches and registers are really set up for x86.
Are there specific areas where you see problems? For example, I'd think the existing abstraction between limited number of architectural registers and the hundreds of physical registers would work fine for alternative ISA's. And I don't immediately see why caching wouldn't work unaltered.
Folks wanted something like that at Transmeta, a Java processor, but the HW was designed for x86.
I searched for more information about what became of Transmeta after I posted last night, but didn't find much about the technical issues they encountered. Do you know if there is a good post-mortem?
While searching, I did come across a couple interesting CPU's that are going the other way:
The main problem I see mentioned when going the other way (ARM emulating x86) is dealing with "flags". Since recent Intel processors already support "flagless" variants of most of the instructions (SARX, MULX, etc.), this doesn't seem like it would be insurmountable.
BTS with a memory operand. Now that would be bad. What compiler people learned was to avoid the CISC-iest instructions in favor of RISCy ld/st+registers instructions. What Intel learned was to make RISCy instructions fast. However after translation to uops, the instruction encoding itself just didn't matter.
Also, register renaming helps a LOT; you don't need 32 direct registers when 16 registers renamed onto 180 internal registers will more than do. Register renaming predates RISC by 15 years (Tomasulo). I'd almost say it's anti-RISC.
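A toy illustration of the renaming idea (nothing like the real hardware, just the data flow):

    from itertools import count

    # Every write to an architectural register gets a fresh physical register,
    # so a later write no longer clobbers an earlier one.
    def rename(instructions):
        fresh = count()
        mapping = {}                                   # architectural -> physical
        renamed = []
        for dst, srcs in instructions:
            srcs = [mapping.get(s, s) for s in srcs]   # read the current mappings
            mapping[dst] = f"p{next(fresh)}"           # allocate a new physical reg
            renamed.append((mapping[dst], srcs))
        return renamed

    # r1 = r2+r3 ; r1 = r4+r5 -- after renaming, the two writes are independent
    print(rename([("r1", ["r2", "r3"]), ("r1", ["r4", "r5"])]))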
BTW, Elbrus + Transmeta are kinda joined at the hip. Babayan consulted at Sun with Ditzel and is now at Intel. Half the Transmeta people went to NVidia (eventually) and did an x86 before switching to ARM (Denver).
Eventually people will see RISC as the provisional idea it is. Register renaming is a Great Idea. Uop translation is a Great Idea. RISC is provisional.
> The main problem I see mentioned when going the other way (ARM emulating x86) is dealing with "flags".
I see the much larger problem in ARM emulating the memory model of x86, which gives much stronger guarantees on ordering and synchronization than the weak memory model used by ARM processors.
Nicely said, though I think the Mill deserves more credit. Assuming that the team doesn't die or simply get acqui(patent)hired, they seem more likely to have material impact than RISC-V, which exists simply to swap out ARM.
To me the most interesting part is that Mill has attracted a quite amazing amount of supportive interest from people who are talented, experienced and knowledgeable. Essentially a who's who of comp.arch...
Why did VHS prevail when Betamax was a superior format? Why did TCP/IP come out on top when other more capable protocols existed at the time? Why did the FAT filesystem end up becoming so wildly popular?
x86 took over because of a number of factors, and over time improvements made it more RISC like internally. Read up on modern CPU design to see how RISC, CISC or whatever is just an API of sorts to the internal workings of the CPU. The two philosophies have become so blurred few people even use the term RISC or CISC any more. It's irrelevant.
VHS won because Betamax initially couldn't fit an entire movie into one tape. VHS's quality was inferior, but it was superior in that it didn't require changing tapes partway through. By the time Betamax was able to fit a movie onto one tape, it was too late.
Often when an "inferior" technology wins, it's because you're focusing too narrowly, and the winner is outright better in some way.
For x86, Intel's process and design expertise overcame the shortcomings of the ISA, and then some. You're not necessarily buying x86, but rather buying the best silicon that happens to speak x86.
I don't think you're disagreeing with me here. The "worse is better" principle often boils down to a few fundamentals: "A functioning program is always better than a non-functioning one" and "an understandable product or specification is always better than one that cannot be understood".
Intel also made, and probably still makes, RISC chips. The i960 and i860 are just two examples. https://en.wikipedia.org/wiki/Intel_i860 They never took off on any appreciable scale.
> You're not necessarily buying x86, but rather buying the best silicon that happens to speak x86.
Or more specifically, you're buying the best chip that runs your existing software and works with the toolset you're familiar with.
This is why Intel's Itanium project was doomed from the day they announced it. Nobody was going to re-write everything to work with their new instruction set.
It's notable that Apple managed to go from 68K to PPC to x86 to x64 almost seamlessly, but they did that at the expense of backwards compatibility. You can run many Windows apps from the 3.1 days in current versions of Windows without emulation, but you can't run Mac software from the 68K days without it. Two different approaches.
The Intel world is ruled, if not held hostage to backwards compatibility concerns. "Better" means "more traditional".
> You can run many Windows apps from the 3.1 days in current versions of Windows without emulation
Only on 32 bit versions of Windows (and even this only sometimes works these days). But note that the main reason why you cannot run Win16 applications on 64 bit Windows is not:
- that you cannot run 16 bit applications in long mode (you can - look up the details in the Intel documentation if you don't believe it; just generate a suitable segment descriptor)
- Virtual 8086 is not supported in Long Mode (it is indeed not - but this is only important for applications running in real mode (e.g. DOS applications) and thus is not of importance for Win16 applications)
as is often claimed on the internet, but has to do with Windows' architecture:
"Note that 64-bit Windows does not support running 16-bit Windows-based applications. The primary reason is that handles have 32 significant bits on 64-bit Windows. Therefore, handles cannot be truncated and passed to 16-bit applications without loss of data. Attempts to launch 16-bit applications fail with the following error: ERROR_BAD_EXE_FORMAT.".
I didn't really mean to disagree. (Although I know that's usually the default assumption!) A lot of people don't realize that VHS had real advantages besides just adoption, and I mostly just wanted to elaborate on that a bit.
On paper Beta was better, and that's why the broadcast industry used it almost exclusively. For consumers VHS was good enough for their needs, and the EP mode in particular, where you could record multiple hours of television on one cassette (6?) made it far superior.
VHS looked like garbage, Beta always looked better, but VHS was cheaper, more convenient, and ubiquitous.
You're right about the length thing being a hassle. Friends who had Beta decks always had to jockey tapes in the middle of a movie, not unlike later when you had to flip a laserdisc. It was always a point of ridicule. Looks great, if not amazing, but you were always X minutes away from having to get up and change tapes.
VHS could handle even the longest movies with ease, though often the quality would suffer accordingly.
Actually Beta wasn't last, as it wasn't a two-way race. Philips VCR, Video 2000, LaserDisc, RCA SelectaVision, and more. The initial cost of a Beta player or recorder was generally higher than VHS, and Sony waited a long time to license the tech. JVC licensed VHS early and widely.
Sony had actually bet on one hour initially because TV shows were mostly half an hour or an hour. This left movies and sporting events mainly as the issues with recording time. They eventually brought out longer tapes, but it was a combination of factors up to that point that kept Beta from leading.
Now, compare and contrast MiniDisc vs. CD, Memory Stick vs. SD, UMD vs. Download vs. SD, and where Sony actually won with Blu-ray vs. HD-DVD. MiniDisc went away. Memory Stick mostly went away. UMD went away. But where Sony widely licensed the superior technology early on, it won.
In addition, we seem to be going back to a more CISC-like model. RISC was appealing when you could have simple instructions and take advantage of that to get high clock speeds. Now, that avenue is gone. Now, many CPUs are adding special instructions (for example AES or SHA acceleration instructions) or wide instructions (SSE, AVX, etc). We are likely to see more in the future, and may see specialized instructions to speed up neural nets.
Neural nets are honestly best suited to a ZISC type architecture... In any case, specialized deep learning chips should be on their way.
Not to toot my own horn here, but we're one of the hopeful deep learning chip startups, I'd be happy to answer any more questions or perhaps elaborate on why I think ZISC is better for neural nets.
I don't actually think that the Google TPU is the paragon of deep learning processor architecture, there are a lot of things that I don't think are done very well.
In any case, that being said, the TPU is less of a CISC and much more like a lightweight RISC control mechanism.
I don't really agree. What is labelled a "CISC instruction set" here would more usually just be called a control interface. But maybe I'm missing some sort of context here. (Contrasting it with "ZISC" for marketing purposes?)
It's almost always better to have a dedicated chip implement some algorithm, if it makes economic sense to do so. From hard drive controllers to Google's TPU.
Perhaps it's a confusion of terminology, because most people I know would call it an ASIP. It has the flexibility to implement any linear operation after all.
I wouldn't consider heterogeneous computing and vector instructions "going back to a CISC-like model". RISC is actually continuing a slow but steady march forward, with the simplicity of AArch64 being a notable step along the way.
I tend to think of "CISC" as complex instruction encodings, smaller register files, complex sets of addressing modes, lack of orthogonality in instructions, etc. None of these aspects are gaining in popularity for new designs. The only one I can think of that gained any popularity is Thumb-like compressed instruction sets, which aren't so much "CISC" as "don't spend opcode space for no reason".
AMD and Intel can afford to spend large amounts of money on design because x64 chips are sold at high margins in high volume. If you are interested in the economics of the chip business, [0] is a great Usenet post by John Mashey[1].
It is also worth noting that Intel (and to a lesser extent AMD) has had a fair amount of advantage by being further ahead in chip fabrication technology; having more transistors has played a major role in how Intel kept its status.
You are correct. RISC did win. Modern x86 micro architectures are RISC under the hood. CISC instructions are implemented in microcode. https://en.wikipedia.org/wiki/Microcode
This is just looking at it from an assembly language point of view. Isn't it even more complicated at the actual byte level? For example, are an instruction, and then the same instruction but with a width prefix, different?
I'm not sure I would say more complicated, but you're completely right that things are a bit different at the byte level. For example, there are certain prefixes that can be added indefinitely. Because of that there is a maximum instruction length x86 CPUs will decode, which I believe is 15 bytes - no valid instruction will go over that without adding unneeded prefixes, but without such a maximum the instruction set would technically be infinite in length.
That said, as far as the CPU is concerned (and keep in mind, x86 CPUs are stupid complicated, so none of what I'm about to describe would really tell you how a particular x86 executes a particular instruction; this applies better to simpler CPUs), an instruction with multiple widths is generally treated the same way for all of the widths. It decodes the opcode identifier for the instruction, and then would go on to decode the width from a separate part of the instruction. (Though I mean, commonly simple CPUs don't even support multiple widths.) The hardware that handles that would commonly be the same or very similar. Debating whether or not that makes them "different" is just a matter of definitions, I think. But at the end of the day they are generally treated very similarly.
It's also worth noting that assembly languages aren't quite a one-to-one mapping. It's fairly common for the assembler to substitute an equivalent instruction for one you used in cases where it knows it is better, and it is also common that an instruction doesn't actually exist on the CPU, but the mnemonic for it exists for convenience. But this does mean that the assembly language can actually represent more instructions than the CPU actually has, meaning the 'byte level' version can in some ways be less complicated, depending on your POV. I think for x86 this likely doesn't really make much of a difference though because the number probably isn't extremely high.
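To see the width-prefix point at the byte level, here is a quick look with the Capstone disassembler (a third-party package, pip install capstone; the byte sequences are hand-assembled): the same ADD opcode byte (0x01) with different prefixes decodes as three differently sized instructions.

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    samples = {
        "no prefix (32-bit)":    bytes([0x01, 0xD8]),        # add eax, ebx
        "0x66 prefix (16-bit)":  bytes([0x66, 0x01, 0xD8]),  # add ax, bx
        "REX.W prefix (64-bit)": bytes([0x48, 0x01, 0xD8]),  # add rax, rbx
    }
    for label, code in samples.items():
        for insn in md.disasm(code, 0):
            print(f"{label:24s} {insn.bytes.hex():8s} {insn.mnemonic} {insn.op_str}")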
On the Motorola 68000, the MOVE instruction is used to move data from one location (memory or a register) to another location (memory or register). The size of the move is encoded in the instruction (two bits, only three patterns are legal). The instruction will affect the flags (set the Zero and Negative flags, clear the Carry and Overflow flags) based on value moved, except if the destination is an address register, in which case, the flags are not affected (and only two of the four bit patterns for size are valid).
In fact, Motorola defined two instructions because of this, MOVE and MOVEA (move to address register). So, is this really one instruction (since both follow basically the same pattern)? Two instructions (because one sets the flags, the other doesn't)? Five (size and flags)?
Also, the invalid bits (for size) are used to indicate other instructions, so there's a bit of overlap in decoding instructions.
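If I remember the encoding correctly, the size field sits in bits 13-12 of the MOVE opcode word (with the 0b00 pattern belonging to other instructions entirely); a tiny sketch of just that piece of the decode:

    # 68000 MOVE size field (bits 13-12): 01 = byte, 11 = word, 10 = long.
    SIZE = {0b01: "byte", 0b11: "word", 0b10: "long"}

    def move_size(opcode_word):
        if (opcode_word >> 14) != 0b00:      # MOVE family lives in the top 00xx space
            return None
        return SIZE.get((opcode_word >> 12) & 0b11)

    print(move_size(0x303C))   # "word"  (move.w #imm, d0)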
Great read, thanks! Followup question: what's the distribution of transistors per instruction?
Taking your number that there are ~2k instructions in Haswell, and given that Haswell has ~1.4 billion transistors (http://www.anandtech.com/show/7003/the-haswell-review-intel-...), that works out to roughly 700,000 transistors per instruction on average. My guess is the majority of transistors go to things like the cache, and then there is duplication across cores, so the number is clearly much lower than that, but do you have any sense of what it costs to add an instruction in terms of number of transistors?
Probably zero. The decoder isn't instruction specific, it's just something that translates the incoming ops into micro-ops internally. Even then it's pretty abstract. Is an adder circuit specific to an add operation? It's probably used for a lot of things.
This is not a useful metric, surely? As you say most of the die is cache.
What an instruction costs depends mostly on what you want to do, and how many bits wide it is. Multipliers are expensive. Then there's considerations of pipeline depth and branch prediction.
There's something to be said about the gigantic number. The machine needs to direct itself to the location of the data, and those methods aren't interchangeable to the machine or to people, unless performance-irrelevant machine code becomes a major use case.
Nobody knew software could be so complicated. Seriously, though, the world would be a better place if Motorola would have evolved the 68000 architecture to win the pc market.
The Intel architecture manufacturers privilege backwards compatibility, so I am not sure if there's a limit where it would be good to sacrifice some of that compatibility in exchange for performance or simplicity.
I mean, for example I have not seen a single person talking about how to integrate AMD's 3DNow! in their software. Mostly because AMD adopted SSE, but Intel didn't adopt 3DNow!, so people use SSE... as a simple example. So you end up with a vestigial set of unused instructions... with the associated cost, since it's not for free in terms of design, implementation, manufacturing, etc.
> I mean, for example I have not seen a single person talking about how to integrate AMD's 3DNow! in their software. Mostly because AMD adopted SSE, but Intel didn't adopt 3DNow!, so people use SSE... as a simple example. So you end up with a vestigial set of unused instructions... with the associated cost, since it's not for free in terms of design, implementation, manufacturing, etc.
Because of this reason AMD decided to drop support for 3DNow! in more recent processors:
"However, the instruction set never gained much popularity, and AMD announced on August 2010 that support for 3DNow would be dropped in future AMD processors, except for two instructions (the PREFETCH and PREFETCHW instructions). The two instructions are also available in Bay-Trail Intel processors.".
Also look under "Processors supporting 3DNow" (emphasis mine):
"All AMD processors after K6-2 based on K6, Athlon, Athlon 64 and Phenom architecture families. Not supported in Bulldozer, Bobcat and Zen architecture processors and their derivates."
Amateur/hobbyist programmer here. Is there a good bottom up explanation of computing that you know of? i.e. something that starts at the hardware level (how a CPU works) and then moves up through the abstractions layers until you get to a top level language like JS running inside a browser?
I've been looking for something like this for a while, mainly to help explain computers to lay people that are interested in the details.
Thanks! that surely looks interesting and I will check it out in more detail.
However I was looking for something more summarized and using metaphors for the layman.
This game is rather similar in spirit to "The Elements of Computing Systems", which I recommended above, but more summarized (though it only goes up to the CPU level, as opposed to the Nisan-Schocken book).
Any count based on mnemonics is off by 1 because of the CMPSD mnemonic. This mnemonic maps to both a string instruction and a SIMD (floating point) instruction. You probably want to count them separately. It's funny that this mnemonic is mentioned in the article, but the author didn't notice that it actually points to 2 families of instructions, not 1!
Forgive my ignorance, but I don't see a link to the actual "instr-count" program and I don't see a GitHub link anywhere. What does the program do? How does it divine the instruction count?
>All the numbers in this blog post have been obtained through a small program making use of our awesome C++11 library for working with x86-64 assembly. Just like the library, my program is available as open source on GitHub.