Hacker News
Dirty Assembly Tricks in NES Assembly (2013) (andrewkelley.me)
113 points by valarauca1 on Oct 3, 2014 | 26 comments



Assembler programmers of that vintage were crazy. I did a fair amount of reverse engineering of Honda engine computers[0] from the late '80s and early '90s, and this kind of thing was the status quo. These things had 256 _bytes_ of memory, and they used every single one, some of them multiplexed to do multiple things.

There are some interesting, digestible bits in the serial interrupt handler[1]

[0]: https://github.com/benogle/obd0vtec_dev/blob/master/src/stoc... [1]: https://github.com/benogle/obd0vtec_dev/blob/master/src/stoc...


Here's my favorite, from the Atari 2600.

The graphics chip has two "missile" objects that act like simple sprites that are only one bit wide. They are each enabled or disabled by writing 1 or 0 to a certain bit of a certain hardware register of the graphics chip.

That register is laid out in hardware with that certain bit at bit 1. Why there, why not the low or high bit? Because that corresponds to the location of the zero flag in the flags register of the 6502 CPU.

So the missile objects can be quickly enabled or disabled by setting the stack pointer to the address of that register, doing a compare on the object coordinates, and pushing the flags register. Both missiles can be done in sequence because the registers are adjacent and the push instruction decrements the stack pointer to set up for the next one.

All to save two instructions on rendering the missiles each scanline. With a nice bonus that the code is time-invariant without branching.


2600 programming is an exercise in crazy.

- The parallel port chip has direction bits that control whether a line is an input or an output. This register is read/write, so you can use it as temporary memory when you don't care about reading or writing I/O (which is most of the time).

- Lots of code doesn't need a stack; it doesn't call anything, and it can just jump to where it needs to be at the end. Now we can use the S register for something else.

Many, many other tricks. I used some of them a few years ago on an embedded system that needed to fit into 1920 bytes. I enlisted a cow-orker into the effort and we were both cackling away. [Later, the hardware guys offered us a different chip with twice the memory, but at that point we were too close to shipping to make any changes, and besides, it wouldn't have been as fun... :-) ]


2600 programming is alive in the demo scene. This was just created in 2014 and is completely insane:

http://youtu.be/04Wk9Oi_Fsk


I'd be interested to know if they limited themselves to 4K ROM (standard size of a 2600 cartridge).


That and much much more is discussed here by the authors:

http://xayax.net/bang!/


One of my side projects is very similar: I built a recursive descent disassembler for Gameboy that I'm working on building into a decompiler. I took a slightly different approach - rather than generate assembly, I generate a call graph - each instruction is a node with zero or more potential following nodes.

The key word, though, is 'potential' - as the author mentions, one of the issues with disassembly is that you sometimes hit code that jumps into garbage (dynamic jumps with insufficiently constrained inputs, or code that should've been unreachable), and sometimes you just hit an infinite loop waiting for an interrupt. It really needs some input/interactivity to determine likely entry points and dead ends/unreachable code.

One nice thing you can do while you have the call graph, though, is propagate constraints (basically type inference/data flow analysis) to get an idea of where dynamic jumps may go, which ones might be under-constrained, and which branch targets are likely dead ends. Haven't gotten as far as implementing it yet, but there's a paper on it here: http://www.cs.rhul.ac.uk/home/kinder/papers/phdthesis.pdf


I've done something vaguely along these lines in the disassembler which is part of my Chip8 IDE[1]. I take advantage of the fact that the ISA is register-rich (16 general-purpose registers) but individual registers are only 8 bits wide, making it feasible to propagate value sets per register. This allows me to perform fairly precise analysis for situations like jump tables, identifying unreachable entries and code/data overlap. Like most static analysis techniques it still runs into brick walls for code which relies heavily on self-modification.

[1] https://github.com/JohnEarnest/Octo


With enough ingenuity, the tricks in assembler can be automatically translated. I work for a company that converts legacy languages like COBOL and mainframe assembly language to Java or .NET. The founder has a bunch of programs which analyze the code.

In the case of mainframe assembly there are many weird tricks. Self-modifying code or jumping into the middle of an instruction are common. There is even one program where the code jumps into a constant to cause an abend. The translation process works like a charm.

My job is to get the translated output working in Java and compare it to the original to make sure the translation works. A few things need to be tweaked by hand, but every once in a while we'll get a series of programs where the original coder uses the same tricks over and over again, and a new rule has to be added to the translation tools. Some programmers used all kinds of odd tricks. I can tell when a program was written by the same person, just by the style of coding. It's a fun job.


Awesome.


Yet people get very upset these days when one performs a trick in a high-level language as an optimization. It's heartbreaking to see how much folks utilized every resource they had back then; today, we just waste it. Computers are getting faster, but we are not seeing the results because we have excess. Constraints can be very good; they make folks resourceful.


These kinds of tricks actually are being done: by the compiler (as long as you give it enough information and don't try to outsmart it) - they're exactly what gets put into optimization passes. Trying to do these optimizations by hand generally turns into a time sink that results in buggy, unmaintainable code, which is, pretty much by definition, too low level to take advantage of current or future compiler optimizations.


If you ever hit a crash in CLR or JVM code and get a full, detailed stacktrace; or if you've ever decompiled an Objective-C binary and ended up with pretty much exactly the source code it was compiled from–that's the result of not crunching programs to death in the name of optimizing every spare cycle.

And this isn't just for developers; this bijection between source and binary enables things like Dropbox hooking the OSX Finder without the Finder explicitly supporting a plug-in system, or Windows Error Reporting sending you reports when your shipped app crashes in production on some client's system.

And either way, I'd much rather my code had proven-safe JIT tricks done to it by the runtime system it's loaded on, than that those tricks be burned into it by the compiler (where they will then become gradually more outdated.) Or, worse yet, that I burn them in there myself by hand-editing the resulting binary, and then have to do the same thing every time I recompile.


See my post above. These tricks are still alive in many areas; cost-constrained embedded systems (where people have knife-fights over penny resistors) and very well trafficked code in operating systems or network code (think software TLB refill on the Xbox 360, or fast path TCP). Runtime systems for languages like Java. Oh, lots of places.

I'll bet that many high-end routers pay very close attention to instruction clocks on their main datapaths.

Optimization is not dead. Pointless optimization should be dead, but isn't.


Generally, people get upset either because the trick is unnecessary (the code is not performance sensitive to begin with) or because the trick is counterproductive (modern CPUs are so complex that it's easy to make performance much worse if you naively "improve" your code). Only a few fools get upset if you use a necessary trick that actually gets the job done.


I'd argue that the results we are seeing are more features for the same development cost.


Every time I read something like this I think back to the story of Mel in the jargon file. http://www.outpost9.com/reference/jargon/jargon_49.html


Me too :-)

Love that story.


> Similarly, there are instances where the program jumps into the middle of an instruction.

Reminds me of the not-so-extreme, but still considered alternately beautiful and horrendous, Duff's Device in C, where one entangles a switch/case with a do/while loop:

http://en.wikipedia.org/wiki/Duff%27s_device


The "middle of an instruction" trick has been known for a long time in the demoscene where size is highly constrained, and makes an appearance in many of the tiniest intros. Some packers also do similar things, but more in an attempt to obfuscate than optimise.

I vaguely remember it being discussed in an obscure demoscene magazine article (20+ years ago) about size-optimisation techniques, where it was called a "bridge". I also had some experience working with a disassembler/decompiler that could detect such tricks and output a multicolumn disassembly.

The skill demonstrated by many groups in the demoscene, and their results, continue to amaze and inspire me, particularly on platforms like the NES and C64 - tiny systems orders of magnitude smaller and slower than today's PCs, yet these people can make them do things that most would think require far more powerful hardware.


Pushing an address and "returning" was the only (sane) way to have an indirect jump on the Z80, so it wasn't even considered a hack.

There was also this trick of using the last byte of RAM as the entry to the interrupt handler, so as to use the "free" array of 0xFF that sat in ROM as the handler table. The last byte was a relative jump because luckily the first byte of ROM was negative in U2, so the handler jumped back into RAM after wrapping around into ROM.


Oh man, you reminded me of LDIR tricks, with loaders (tape loaders!) decoding stuff in tight manually written decoding loops... and with the final decode, overwriting the decoding loop itself.

Bam, you're in the game.


I remember these too. The really good loops could shovel bits from audio to memory and do stuff like drawing the loaded image in a fancy way or display the load progress.


For those who didn't realize, this link is a subtopic of a much larger blog post, which goes into more of the NES architecture.


What a cool project. The assembly tricks the original programmers did are interesting, but this post walks through the entire process of building these compiled games. Wow!

Thanks for sharing. I will have to read this again carefully after I learn some Go myself.


Great hacks, and a really great article, packed with information but easy to read. I hope Andrew will consider writing some language tutorials or programming guides, he has a gift for communicating his enthusiasm as well as his observations.



