A Proper x86 Assembler in Haskell Using the Escardó-Oliva Functional (vmchale.com)
90 points by Smaug123 5 months ago | 19 comments



I know I might be barking up the "just a showcase" tree, but what's the memory complexity of this? E.g., a nice property of multipass assemblers is that they only need to keep a symbol table in memory rather than the whole program. Not really a concern for modern systems with large memories, but those neat Haskell tricks usually carry some extra O(n)s, on top of the (avoidable) extra constant factors from hidden linked lists and immutability. Cf. the infamous Haskell quicksort: much shorter and clearer, but also not really a quicksort.


One pass, but... I'm willing to bet there is some transitional internal state that has to be looped over to do this, subsuming the "pass" inside lazy evaluation. You have to know the indeterminate "where" to jump to, which depends on subsequent state. So you preserve execution state, keep calculating, and come back at the end to fill in the gaps.


Yes - it's pretty opaque to trace through, but you can see in `valid` that, for a given move history, a "good" test is performed on each instruction in the assembly listing - a full pass. This is invoked, asymptotically at least, once per jump per level of the search tree explored in `bigotimes`. Each level i of the search tree has a history of i jumps already decided to be short or near and branches n-i times; both of those average out to n/2, so the tree searches ~n^2 nodes, with each node doing n iterations over the list of instructions to check that all jumps are valid, for a total running time of O(n^3).
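
(A quick sanity check of that counting argument as a Haskell sketch - the names are mine, not from the article:)

    -- A node at depth i has i jumps already decided and branches over
    -- the n - i remaining ones; each node pays one O(n) `valid` scan.
    nodes, work :: Int -> Int
    nodes n = sum [n - i | i <- [0 .. n - 1]]  -- n(n+1)/2, i.e. ~n^2 nodes
    work  n = n * nodes n                      -- ~n^3 instruction checks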

(With some modifications to the code as shown, anyway: `lookup` is linear but used in such a way that it can be replaced with a constant-time vector index; instruction lists can be compressed to just jumps & labels with byte counts following, avoiding the duplicate invocations of `p` in `arginf`; etc.)

I'd be dubious that there's an algorithm to find the true minimum in less than cubic time, but there are plenty of "good enough" approaches: most branches will be definitely short or definitely not; backward jumps can be resolved when encountered; forward jumps can be written long and replaced with a short jump plus no-op padding when the target turns out to be within short-jump distance.
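
A minimal sketch of that last trick (my own code, not from the article): reserve the 5-byte near-jump encoding for every forward jump, and if the target lands within short range, patch in a 2-byte short jump plus NOPs so no other offset has to move:

    import Data.Bits (shiftR)
    import Data.Word (Word8)

    -- site = offset of the jump, target = offset of its destination,
    -- both measured with the jump occupying its reserved 5-byte slot.
    patchForward :: Int -> Int -> [Word8]
    patchForward site target
      | fits8 dShort = [0xEB, toW8 dShort, 0x90, 0x90, 0x90]  -- short jmp + NOP pad
      | otherwise    = 0xE9 : le32 dNear                      -- 5-byte near jmp
      where
        dShort  = target - (site + 2)  -- displacement from end of short jmp
        dNear   = target - (site + 5)  -- displacement from end of near jmp
        fits8 d = d >= -128 && d <= 127
        toW8    = fromIntegral
        le32 d  = [fromIntegral (d `shiftR` s) | s <- [0, 8, 16, 24]]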


Yes, that seems to be what the aptly named "tardis" monad does, but the article notes that it isn't sufficient because of the minutiae of how jump instructions themselves vary in length on x86.
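
(For reference, a minimal sketch of that two-way flow, assuming the tardis package's Control.Monad.Tardis API - my example, not the article's. Forwards state carries the offset so far; backwards state carries the bytes still to come:)

    import Control.Monad.Tardis

    -- Annotate each instruction size with its offset (known from the
    -- past) and the bytes remaining after it (known from the future).
    annotate :: [Int] -> [(Int, Int, Int)]
    annotate sizes = fst $ runTardis (mapM step sizes) (0, 0)
      where
        step sz = do
          offset <- getPast        -- bytes before this instruction
          sendFuture (offset + sz)
          modifyBackwards (+ sz)   -- tell earlier instructions about us
          rest <- getFuture        -- bytes after this instruction
          return (sz, offset, rest)

    -- annotate [3, 1, 2] == [(3,0,3), (1,3,2), (2,4,0)]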

In an act of brilliance that surely proves beyond all doubt that Haskell's main reason for existing is to upstage the esolang scene, the author solves the problem by trotting out a special-purpose constraint satisfaction algorithm. (You can imagine what kind of computational complexity this implies. The creators of this algorithm note that it can enumerate all tic-tac-toe games in just over a second!)



> The creators of this algorithm note that it can enumerate all tic-tac-toe games in just over a second!

This is achievable in most languages, no?


That `bigotimes` function the article talks about is doing something like tree search with backtracking - lazy evaluation isn't really relevant as far as I can see. I don't think the statement that it's "one pass" is meant to imply a closed-form solution or anything; the point is that you can write your code as plain single-pass logic and this generic functional tool does the rest.
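
(For the curious, the machinery is small. This is roughly the shape of the selection-function product from the Escardó-Oliva paper, reconstructed from memory, so treat it as a sketch rather than gospel:)

    -- A selection function picks a move that optimises an outcome function.
    type J r x = (x -> r) -> x

    overline :: J r x -> (x -> r) -> r
    overline e p = p (e p)

    -- Combine a selection function for the first move with a family of
    -- selection functions for the remaining moves; this is where the
    -- backtracking tree search comes from.
    otimes :: J r x -> (x -> J r [x]) -> J r [x]
    otimes e0 e1 p = a0 : a1
      where
        a0 = e0 (\x0 -> overline (e1 x0) (\x1 -> p (x0 : x1)))
        a1 = e1 a0 (\x1 -> p (a0 : x1))

    -- The product over a list of history-dependent selection functions.
    bigotimes :: [[x] -> J r x] -> J r [x]
    bigotimes []       = const []
    bigotimes (e : es) =
      e [] `otimes` \x -> bigotimes [\xs -> d (x : xs) | d <- es]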


It's a showcase for Haskell and well worth reading. I should be less snide.

Nowadays "multipass" means Mila Jovovich but it used to mean "another complex language with massive implications we deal with by divide-and-conquer-through-the-filesystem" and in the case of the York ada compiler that was some amazing number like 10 or more passes.

There was a fashion for running your C code instrumented and then recompiling it after runtime branch-choice evidence was collected. I think the DEC Alpha OSF/1 compiler did it optionally.


> There was a fashion for running your C code instrumented and then recompiling it after runtime branch-choice evidence was collected. I think the DEC Alpha OSF/1 compiler did it optionally.

This is called profile-guided optimization (PGO), and modern compilers (well, GCC and LLVM) have it.
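
For reference, with GCC this is the -fprofile-generate / -fprofile-use pair (Clang also has -fprofile-instr-generate / -fprofile-instr-use); a sketch of the usual dance, with app.c and training-input as placeholders:

    gcc -O2 -fprofile-generate app.c -o app   # instrumented build
    ./app < training-input                    # run to collect profile data
    gcc -O2 -fprofile-use app.c -o app        # rebuild using the profile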

I mostly think it's bad and not statistically sound: given profile data, the compiler is both overly trusting of it (no error bars) and can't find much useful to do with it (because there's no way to hint a modern out-of-order CPU).


Or you could just be pessimistic in your offset calculation and assume that all "in between" unconditional jumps are 5 bytes long. I know, I know, not proper. But I doubt that in real code this simpler algorithm would make you miss many opportunities for 2-byte jumps.
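
A sketch of why that's safe (my reasoning, not from the comment above): compute offsets once with every unconditional jump sized at 5 bytes; shrinking any jump afterwards only moves code closer together, so a displacement that already fits in a signed byte stays in range:

    -- site and target are offsets under the pessimistic 5-byte layout.
    canBeShort :: Int -> Int -> Bool
    canBeShort site target = d >= -128 && d <= 127
      where d = target - (site + 2)  -- relative to end of a 2-byte short jmp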


Yeah, this is a fun problem. This is where the majority of my work went when writing an x86 virus that lived in the NOP areas between functions. Compilers produced those blocks of NOPs after function bodies to align the following function's address to a 16-byte boundary. I chained all of those cavities together with jumps, distributing as much code as would fit and then putting a jump to the next cavity at the end. The trick is that you could fit more instructions (3 more bytes, since a short jump is 2 bytes against a near jump's 5) if you had a shorter jump.

I still want to write my own assembler some day.


This is basically irrelevant now that better ISAs like RISC-V have fixed instruction lengths (2 or 4 bytes), so the fancy algorithm here isn't necessary.


That fancy algorithm is relevant to RISC-V (and, in fact, to most fixed-length ISAs) because loading an immediate into a register takes one or two instructions depending on the immediate; you surely want to elide a redundant LUI instruction when you can. Of course such redundant instructions do no harm by themselves, but that applies equally to x86: the algorithm is an optimization either way.
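
(Concretely, this is the standard `li` expansion: a 32-bit constant needs LUI+ADDI in general, but a sign-extended 12-bit value fits in a single ADDI. A minimal sketch of the decision:)

    -- True when materialising the constant needs a LUI as well as an ADDI,
    -- i.e. when it doesn't fit in a sign-extended 12-bit immediate.
    needsLui :: Int -> Bool
    needsLui imm = imm < -2048 || imm > 2047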


As a result of RISC-V existing, all x86 processors have ceased to exist or be produced.


Accurate, if said sometime in the future rather than today.


There are still people making Z80 machines today, so no.


This same problem applies to RISC-V with the C extension, because the J and JAL instructions have a larger range (±1 MiB) than the C.J and C.JAL instructions (±2 KiB).


Having a fixed instruction length doesn't make the need to load large constants magically disappear; they just get split across multiple instructions. If anything, RISC-V might be worse. See also https://maskray.me/blog/2021-03-14-the-dark-side-of-riscv-li....


ARM would have been a better example, because the number of people who care about RISC-V is a rounding error compared to x86 or ARM.



