I've written a (mostly) fully-featured x86 assembler.
There are two gotchas that most people starting out probably won't realize until it's too late.
1) Encoding instructions on x86 can be more challenging than they first appear. Understanding how this works and fits together will save you a lot of headaches.
2) Forward jumps are a pain. Any reasonable assembler will provide symbolic labels for jumps and other operations. Because x86 instructions are variable length, you can't simply count up the lines to the label and multiply by some fixed amount. It doesn't work that way. Instructions can range from one byte to 15 bytes. What this means to you is that you'll need to design a multi-pass assembler. You allocate a fixed amount of space for the jump instruction and come back later to "patch" that jump instruction with the correct byte offset. Labels simply become keys in a dictionary that record the byte offset into the assembled binary data.
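To make the patching idea concrete, here's a minimal sketch of two-pass backpatching. The instruction format and helper names are mine, not from the assembler described above; it only handles `jmp rel32` (opcode E9) to keep the slot size fixed:

```python
# Minimal sketch of label backpatching. Pass 1: emit code, record label
# offsets, and reserve a fixed-size slot for every jump. Pass 2: patch
# the reserved slots with the real displacements.

def assemble(instrs):
    code = bytearray()
    labels = {}   # label name -> byte offset into `code`
    fixups = []   # (offset of reserved slot, target label) to resolve later
    for op in instrs:
        if op[0] == 'label':
            labels[op[1]] = len(code)
        elif op[0] == 'jmp':
            code.append(0xE9)            # JMP rel32
            fixups.append((len(code), op[1]))
            code += b'\x00\x00\x00\x00'  # reserve 4 bytes, patch later
        elif op[0] == 'bytes':
            code += op[1]
    for at, label in fixups:
        rel = labels[label] - (at + 4)   # rel32 is relative to end of insn
        code[at:at + 4] = rel.to_bytes(4, 'little', signed=True)
    return bytes(code)
```

Because every jump gets the same 5-byte slot, a single fixup pass suffices; the size-optimization question discussed further down is exactly what happens once you let the slot size vary.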
Other things to make sure you understand: big/little endian, LSB/MSB, how two's-complement works, etc.
One surprising aspect of writing an assembler is that you learn to decode instructions by simply looking at a hex dump of the binary output. You almost feel like Neo in the Matrix, in a more nerdy, less awesome sort of way. You start thinking of assembler and macros as merely a convenience over writing out the hex (or binary) of the instructions you want. And on top of that, you see C as a convenience over writing assembler.
Also worth noting there are "short" and "near" jumps, which trade a shorter encoding against a longer destination range; if the target is a forward reference, not yet known at the time the jump is assembled, then nearly all assemblers will use the long variant, because they don't yet know whether the destination will be close enough. One notable exception is fasm, which starts off "optimistic" about jump sizes, and then repeatedly widens only the ones which didn't quite make it, until all the offsets fit. Here's a series of very detailed posts about that from its author:
Glancing at the posts, I have a question: do they defend against the pathological cases that require O(n) passes, and therefore at least naively have O(n^2) or worse behavior? Let's say that a jump can be encoded with either an 8-bit signed offset in 2 bytes, or a 32-bit signed offset in 5 bytes. (Supposedly x86 has a 16-bit offset version, but that uses the same opcode as the 32-bit offset, and seems unsupported in 64-bit mode.) Consider the following:
L0: jmp M0
L1: jmp M1
L2: jmp M2
...
L25: jmp M25
M0: jmp N0
M1: jmp N1
...
M25: jmp N25
N0: jmp O0
...
...
Z0: jmp LAST
Z1: jmp LAST
...
Z25: jmp LAST
[more than 128 bytes of unrelated stuff]
LAST:
If all the jmps are encoded with 5 bytes, then the offset from e.g. L0 to M0 is 130 bytes, which is over the limit of 127 you can get from an 8-bit signed offset. The optimistic approach will discover in the first pass that all the Z jmps have to be 5 bytes. Pass 2, assuming it goes left to right, will see that everything up to Y24 still has offset ≤ 127 (even "Y24: jmp Z24" could be 127 bytes or 130 depending on how Y25's jump is encoded), but will find that "Y25: jmp Z25" has an offset of 130 and must be 5 bytes. Pass 3 will discover that Y24 needs to be 5 bytes too, and discover nothing else. Pass 4 will discover the sole fact that Y23 needs to be 5 bytes, and so on. If the total number of jmps here is n, then it will take close to n passes to fix up all the jmps. (If the passes go from right to left, or even alternate the direction or randomize... I think it's still possible to construct an example where pessimizing Y10 kicks Y20 over the edge, which kicks Y9 over the edge, which triggers pessimizing Y21, etc.)
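The cascade can be reproduced with a toy simulator (my own construction, not code from fasm or any real assembler). Each pass recomputes addresses from the start-of-pass sizes and widens every short jump whose rel8 displacement no longer fits; widening a group of jumps can push a jump in front of them out of range, which is only discovered on the next pass:

```python
SHORT, NEAR = 2, 5  # byte sizes of the two jmp encodings

def relax_optimistic(items, label_at):
    """items: list of ('jmp', target_label) or ('bytes', n).
    label_at: label name -> index into items (the label sits just before
    that index; len(items) means the very end of the stream).
    Start every jump SHORT; each pass, widen any jump whose 8-bit
    displacement no longer fits. Returns (sizes, number_of_passes)."""
    sizes = [SHORT if kind == 'jmp' else arg for kind, arg in items]
    passes = 0
    while True:
        passes += 1
        addr, a = [], 0           # addresses from start-of-pass sizes
        for s in sizes:
            addr.append(a)
            a += s
        addr.append(a)            # address of the end-of-stream position
        changed = False
        for i, (kind, arg) in enumerate(items):
            if kind != 'jmp' or sizes[i] == NEAR:
                continue
            disp = addr[label_at[arg]] - (addr[i] + sizes[i])
            if not -128 <= disp <= 127:
                sizes[i] = NEAR   # widen; never shrink back
                changed = True
        if not changed:
            return sizes, passes
```

A small instance of the layered construction: one jump targeting a label that sits just past 26 jumps, each of which jumps over >128 bytes of padding. Pass 1 widens the 26 inner jumps; only then does the outer jump's displacement grow from 52 to 130 bytes, so pass 2 widens it, and pass 3 confirms the fixpoint.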
This isn't a flaw unique to the optimistic approach, by the way—I could construct a similar version where all the L, M, etc. chains go from 0 to 62, and the "more than 128 bytes of unrelated stuff" becomes "0 bytes of unrelated stuff"; then, in the pessimistic approach, once the Zs have settled down, the subsequent passes will discover, one at a time, that Y62 only has a 126-byte offset and can be encoded with 2 bytes, then that Y61 can be similarly encoded, and so on.
I'm sure this doesn't actually happen in practice, unless you're compiling malicious code off the internet (in which case there are likely worse vulnerabilities inside the compiler)... but I'm curious if assemblers have nonetheless seriously addressed the problem.
One of the posts talks about NUMBER_OF_PASSES; this suggests, but I don't think the post explicitly states, that the defense mechanism is "arbitrarily limiting the number of passes". Do you know if fasm and others actually do this? (Looks like nasm has a command-line switch to turn off multi-pass branch offset optimization.)
There's a standard technique for this that is as old as variable-length ISA's. Yes, you've identified the fundamental problem with the optimistic approach.
You start with a pessimistic approach and assume maximum-length instructions for every jump. On each pass, you restrict yourself to reducing the lengths of instructions: each pass shrinks only those instructions which currently reach their targets, and recomputes only those jump distances. Yes, you can construct pathological cases where you only get to compact one instruction at a time.
It doesn't matter. The goal of the assembler is not to produce the optimal encoding; it is to produce a valid encoding. The program was valid at every step of the way, so after running N passes, you just stop and emit the program you got.
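A sketch of that shrink-only discipline (a toy model, not any real assembler's code). The key property is monotonicity: shrinking an instruction can only decrease other displacements, so the program is valid after every pass, and capping the pass count just leaves some jumps wider than strictly necessary:

```python
SHORT, NEAR = 2, 5  # byte sizes of the two jmp encodings

def relax_pessimistic(items, label_at, max_passes=8):
    """items: list of ('jmp', target_label) or ('bytes', n).
    label_at: label name -> index into items; len(items) = end of stream.
    Start every jump NEAR (always valid); each pass, shrink only those
    jumps whose displacement already fits in 8 bits. Stop at a fixpoint
    or at max_passes -- either way the result is a valid encoding."""
    sizes = [NEAR if kind == 'jmp' else arg for kind, arg in items]
    for _ in range(max_passes):
        addr, a = [], 0           # addresses from start-of-pass sizes
        for s in sizes:
            addr.append(a)
            a += s
        addr.append(a)
        changed = False
        for i, (kind, arg) in enumerate(items):
            if kind != 'jmp' or sizes[i] == SHORT:
                continue
            # would this jump fit if it alone were made SHORT?
            disp = addr[label_at[arg]] - (addr[i] + SHORT)
            if -128 <= disp <= 127:
                sizes[i] = SHORT  # shrink; never widen back
                changed = True
        if not changed:
            break
    return sizes
```

Setting `max_passes=0` demonstrates the "just stop and emit" point: you get the all-NEAR encoding, which is bigger than ideal but still correct.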
Oddly enough, it's the fixed-length RISC ISAs that have a bigger problem with this. A RISC-V branch usually only has 12 bits of jump distance available. That's good enough most of the time, but 5% of the time (varying widely with the program) the target isn't in range. To handle those cases, you have to emit a longer-distance unconditional jump, and invert the logic of the parent branch to hop over the long jump instead of falling through to the 'else' label. Essentially, it's a 64-bit encoding of a branch instruction with 20 bits of jump distance. Pretty sure the RISC-V assembler assumes the case of 12-bit distances in 32-bit instructions at first.
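The inversion looks roughly like this (a sketch with made-up labels; real RISC-V toolchains do this transparently as part of branch relaxation):

```
    # before: conditional branch, 12-bit offset field (~±4 KiB reach)
    beq  a0, a1, far_target

    # after relaxation: invert the condition and hop over an
    # unconditional jump, whose 20-bit offset field reaches ~±1 MiB
    bne  a0, a1, skip
    j    far_target
skip:
```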
Eventually, there is a pass limit for optimistic assemblers as well. But when the limit is reached, instead of emitting a program that is less than optimal, you just return an error, because you didn't have a valid program that you could emit at all.
For multiple passes, is it ever possible that the next pass shrinks the size of the code? Off the top of my head I don't see how this is possible, but I've always wondered, since it would then suggest you could end up bouncing back and forth on every pass unless you explicitly avoid it somehow... can that happen?
It could, depending on how you initially encode the forward jmp instructions. If you encode them all with a 16-bit offset and your assembler optimizes, then it could shrink the code to use the 8-bit jmp variant, for example.
The way I got around this issue entirely was to encode all jumps as the 16-bit variant. Then, during the 2nd pass, I would check for overflow and throw an error and halt if the jmp was no longer within range. I had a simple type-declaration syntax for "byte", "word", and "dword" that you could use to coerce a label to a specific size (and thus control the instruction variant you want the assembler to use).
So the assembler is actually a simple 2-pass assembler. The 2nd pass locks the code size and errors out if the 1st pass assumptions and types do not hold.
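The second-pass overflow check described above might look something like this (a sketch; the function and parameter names are mine, not from that assembler):

```python
def patch_rel16(code, at, target, source_end):
    """Second-pass fixup for a jump that pass 1 assembled with a 16-bit
    displacement slot at code[at:at+2]. The code size is locked by now,
    so if the real displacement overflows the assumed width, the only
    honest option left is to error out and halt."""
    disp = target - source_end   # displacement is from end of instruction
    if not -0x8000 <= disp <= 0x7FFF:
        raise OverflowError(f"jump displacement {disp} does not fit in 16 bits")
    code[at:at + 2] = disp.to_bytes(2, 'little', signed=True)
```

The "byte"/"word"/"dword" coercion syntax mentioned above would simply select which width this check enforces for a given label.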
Otherwise, as userbinator mentioned, your problem becomes a backtracking/constraint problem. It's an interesting CS problem, but not so fun when all you want is a working assembler.
edit: Should point out that the initial jmp encoding may depend on operand mode (i.e. 16/32/64-bit instruction mode). It's been a while and I can't remember all the details there, plus my assembler was written a few years before 64-bit x86 became mainstream, so YMMV.