Never having written an assembler for a proper ISA, I was under the impression that assemblers for real CPUs are extremely simple until you start writing them.
One of my hobbies in the 1980s was writing assemblers, because most of the existing commercial offerings were pretty bad. I started out writing my own utterly terrible (though fast) assemblers, and after several years and my fifth or sixth try I had one that was fast, very usable and that people liked a lot. It was shipped as a component of our company's devkits and wasn't "commercial" in the sense of a standalone product, but I still count it a commercial success.
Things that make assemblers useful -
- A real macro language. The C-preprocessor does not count.
- Support of the manufacturer's mnemonics and syntax. Unless the manufacturer's official syntax is "AT&T / Unix assembler syntax" then that doesn't count, either. Parsing addressing modes is often painful -- the 68000 was a bear that took several days to get right -- but telling your users "Oh, just use this alternate syntax . . . documented where? Umm..." is lots more difficult.
- Listings and cross-referencing. Maybe this was a function of the era when I was writing these, but none of the assemblers I wrote output a listing at first (addresses and generated bytes along with the program text), and my initial users were reluctant to use them. When I added listings to my last effort -- it took a day or two IIRC -- the tool suddenly became usable in their eyes.
- Speed. No reason these things can't crunch through a million lines a second.
(For my own game carts, I would print out a complete listing every couple of days because these were really helpful in debugging. At the end of a typical project I'd have a couple five-foot-high stacks of fanfold paper, which I'd have shredded. I'll point out that while my floppy-based copies of my games' sources have gone walkabout, I still have the most recent complete assembly listings in binders).
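To make the listing idea concrete, here is a minimal sketch of a listing formatter in Python. The record layout `(address, code_bytes, source_text)` and the column widths are assumptions made for illustration, not the format of any assembler mentioned in this thread.

```python
def format_listing(records):
    """Render (address, code_bytes, source_text) records as a listing.

    Each output line shows the address, the generated bytes in hex,
    and the original source text, in that order.
    """
    lines = []
    for address, code, source in records:
        hex_bytes = " ".join(f"{b:02X}" for b in code)
        # Pad the byte column to 8 characters (room for three bytes).
        lines.append(f"{address:04X}  {hex_bytes:<8}  {source}")
    return "\n".join(lines)


if __name__ == "__main__":
    program = [
        (0x8000, bytes([0xA9, 0x10]), "LDA #$10"),
        (0x8002, bytes([0x8D, 0x00, 0x02]), "STA $0200"),
    ]
    print(format_listing(program))
```

A real listing would also carry line numbers, macro expansions and a cross-reference table, but the core is just this pairing of addresses, bytes and source text.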
I've thought about reviving my hobby . . . but modern CPUs are much more complicated than the 68Ks and 8-bit wonders of decades ago, and life is too damned short to be writing assembly anyway.
Assemblers are one of those things that's very simple for the simple cases, and then when you add more complex cases you start to think it's actually very hard until you go back and add the proper abstractions, which would have seemed needlessly complicated for the simple cases.
So people doing something trivial think they're easy (like this example), people trying to do a bit more think they're really hard, and people doing an entire assembler think they're easy again.
I might try doing this as a learning project. Can you offer any advice as to what those abstractions are? I'd appreciate some advice to avoid the "going back" part.
Most simple cases of instructions seem straightforward to just go ahead and emit the bytes for. Then when you start to use more addressing modes you realise it gets to be a lot of code and it turns out there's a common pattern for everything.
Here's a concrete example from an industrial assembler.
Almost every simple instruction boils down to this helper method, which is parameterised by a bunch of flags and can then deal with all of them.
Ideally, the common fields and patterns noted by the manufacturer in the ISA description would be implemented, and then the remainder of the logic can be (relatively) simple table-lookup.
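As a rough illustration of "common fields plus table lookup", here is a sketch in Python. It uses the 6502 as the example ISA (my choice, not something the comments above specify), where one family of instructions shares the opcode bit layout `aaabbb01`: `aaa` picks the operation and `bbb` picks the addressing mode. The function and table names are hypothetical.

```python
# Operation field (aaa) for the 6502 instructions that share the aaabbb01 layout.
OPERATIONS = {
    "ORA": 0, "AND": 1, "EOR": 2, "ADC": 3,
    "STA": 4, "LDA": 5, "CMP": 6, "SBC": 7,
}

# Addressing mode name -> (bbb field, operand size in bytes).
MODES = {
    "(zp,x)": (0, 1),
    "zp":     (1, 1),
    "imm":    (2, 1),
    "abs":    (3, 2),
    "(zp),y": (4, 1),
    "zp,x":   (5, 1),
    "abs,y":  (6, 2),
    "abs,x":  (7, 2),
}


def encode_group_one(mnemonic, mode, operand):
    """Encode one instruction from the aaabbb01 family as bytes."""
    aaa = OPERATIONS[mnemonic]
    bbb, size = MODES[mode]
    if mnemonic == "STA" and mode == "imm":
        # The one irregular slot in this family: there is no store-immediate.
        raise ValueError("STA has no immediate form")
    opcode = (aaa << 5) | (bbb << 2) | 0b01
    # 6502 operands are stored little-endian after the opcode byte.
    return bytes([opcode]) + operand.to_bytes(size, "little")


if __name__ == "__main__":
    print(encode_group_one("LDA", "imm", 0x10).hex())    # a910
    print(encode_group_one("STA", "abs", 0x0200).hex())  # 8d0002
```

Eight operations times eight modes collapse into two small tables and one special case, which is the kind of abstraction that looks like overkill when you only need `LDA #imm`.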
I wrote a runtime assembler for a small subset of amd64. I had basically zero experience with assembly and I was still able to write what I needed in half a day. The hard part is understanding how the extra bytes work, which I did by running small bits of assembly through the GNU assembler and then calling objdump on the output.
I think a RISC ISA might be more difficult to start with, and supporting things like AVX might be hard. I only needed support for some basic instructions though. The amd64 manual is actually pretty good. Overall, it was much easier than I expected it to be.
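For anyone wondering what those "extra bytes" look like, here is a small Python sketch, assuming they are mainly the REX prefix and the ModRM byte. It encodes only a 64-bit register-to-register `mov` using the `REX.W + 89 /r` form; the function name and register table are hypothetical, and this is not the runtime assembler described above.

```python
# 64-bit general-purpose registers in hardware encoding order (0..15).
REGS = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]


def encode_mov_reg_reg(dst, src):
    """Encode `mov dst, src` for two 64-bit registers (opcode 0x89 /r)."""
    d = REGS.index(dst)
    s = REGS.index(src)
    # REX prefix is 0100WRXB: W=1 selects 64-bit operands,
    # R extends the ModRM reg field, B extends the ModRM rm field.
    rex = 0x40 | 0x08 | ((s >> 3) << 2) | (d >> 3)
    # ModRM: mod=11 (register direct), reg=source, rm=destination.
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)
    return bytes([rex, 0x89, modrm])


if __name__ == "__main__":
    print(encode_mov_reg_reg("rax", "rbx").hex())  # 4889d8
    print(encode_mov_reg_reg("r8", "rax").hex())   # 4989c0
```

Checking output like this against the GNU assembler and objdump, as described above, is a quick way to catch a swapped reg/rm field.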
I think RISC would actually be far easier, since there are far fewer irregular instructions (like you get with x86) and you could leverage the fixed instruction size to not have to worry about some things.
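A sketch of what the fixed instruction size buys you, using RISC-V's R-type format as the example (my choice of ISA, not one named in this thread). Every instruction is one 32-bit word with fields at fixed bit positions, so encoding is a handful of shifts. The function names are hypothetical.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into one 32-bit instruction word."""
    return (
        (funct7 << 25) | (rs2 << 20) | (rs1 << 15)
        | (funct3 << 12) | (rd << 7) | opcode
    )


def assemble_add(rd, rs1, rs2):
    """Encode `add rd, rs1, rs2` (opcode 0x33, funct3=0, funct7=0)."""
    word = encode_r_type(0x00, rs2, rs1, 0x0, rd, 0x33)
    return word.to_bytes(4, "little")


def assemble_sub(rd, rs1, rs2):
    """Encode `sub rd, rs1, rs2`; it differs from add only in funct7."""
    word = encode_r_type(0x20, rs2, rs1, 0x0, rd, 0x33)
    return word.to_bytes(4, "little")


if __name__ == "__main__":
    print(assemble_add(3, 1, 2).hex())  # b3812000
    print(assemble_sub(3, 1, 2).hex())  # b3812040
```

With every instruction four bytes long, the address of the Nth instruction is simply the base address plus 4 * N, so label resolution does not depend on how earlier instructions were encoded.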
Honestly, a low-LOC assembler is a bit of a :shrug: for me. Compiling a higher-level language down to assembly is where things get ridiculously complex; assemblers are basically just translation mappers. Some weird things happen in them, but the vast majority of the weirdness is covered elsewhere. Writing an assembler is certainly a neat experiment, but at uni I found it much more interesting to try to express complex high-level concepts in assembly. I came to appreciate just how ridiculously helpful compilers are.