367-page cc-by-nc-sa textbook; though the license permits modification, no source format seems to be provided, just a pdf. it covers mostly introductory userland programming, though there's a chapter on interrupt handling. on skimming the table of contents i don't see anything about page tables, tlbs, atomics, the amd64 memory consistency model, running before sdram is enabled, iommus, mtrrs, or simd. none of those things are necessary to write a compiler or debug most compiled code, except maybe simd
this book also looks like it could be a very solid base for a course that did cover one or more of those additional topics, with some supplementary material
it probably teaches more than everything i know about amd64 assembly, except the most important thing, which is that arm assembly is much better
I'm sure many architectures are nicer than amd64, but that's what the CPU in my laptop uses, as well as my Steam Deck. If I want to get into low-level hacking it's easiest to start with the hardware I have, no?
While Arm assembly is in general much nicer, there are niche applications, like big-number arithmetic, where Arm still lacks some of the instructions the AMD64 ISA includes, which makes the latter much more convenient for that kind of work.
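For example, multi-word addition leans on x86's flag-carry chain. A minimal sketch, assuming x86-64 and gcc or clang, using the _addcarry_u64 intrinsic from <immintrin.h>:

    #include <immintrin.h>

    /* 256-bit add: out = a + b, carry chained through the flags;
       gcc/clang compile this down to an add/adc sequence */
    unsigned char add256(unsigned long long out[4],
                         const unsigned long long a[4],
                         const unsigned long long b[4])
    {
        unsigned char c = 0;
        for (int i = 0; i < 4; i++)
            c = _addcarry_u64(c, a[i], b[i], &out[i]);
        return c;   /* the final carry out */
    }

(To be fair, AArch64 does have adds/adcs; the bigger AMD64 wins for bignums are mulx and the adcx/adox pair, which let two independent carry chains run at once.)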
Yep, RISC-V is supposed to become a haven for assembly-written programs, since it's a vendor-neutral ISA standard: the same code will "work" on CPUs from many vendors.
The main pitfall when writing assembly is abuse of a preprocessor. Being dependent on the grotesquely, absurdly complex and massive compilers out there is one thing, but moving that dependency onto a complex preprocessor is not much better. So it's an issue that calls for caution and care.
I hope all RISC-V CPU vendors pay the silicon real-estate price for 64 bits, because writing 64-bit RISC-V code once and being able to "run-ish" it everywhere, from "embedded" through workstations to servers, would be wonderful.
I'm currently writing x86_64 assembly; the manual "register-ization" of some code paths is done, ready for an easy RISC-V port. Not to mention RISC-V has double the number of general-purpose registers, and some code paths will really benefit from the additional register space (Intel plans to follow that route with its APX extension).
Of course, this will be mostly micro-arch-agnostic assembly code, as is already the case for the x86_64 version, with maybe some simple or very generic static "optimizations" (cache-line awareness, alignment, register-ization, leveraging some instruction fusions, etc.). Worst case, code paths get adapted (not rewritten from scratch) to fit a particular micro-arch better, selected with a runtime switch or at install time... if really needed. I guess correct code will become very important, maybe more so than fast code (which "should" become true for hardware design too, if the performance penalty is not too high).
RISC-V will still need compiler support, though, if only for legacy code, and sometimes hand-polishing ("humanizing") the assembly output of compiled programs can be useful.
All of that depends on whether RISC-V succeeds, which will require highly performant implementations across the board: good micro-archs on the best silicon processes.
RISC-V is not perfect, but it's a more than good enough modern ISA, and the real fragmentation risk is 32-bit/64-bit code paths, even though some care was taken to make 32-bit <-> 64-bit code adaptation easy.
Mistakes will be made (micro-archs with critical bugs), so it won't happen overnight.
dunno, maybe? ch32v003 is rv32e, so rv64 seems out of the question. more likely than supporting rv64 on ch32v003 and its successors would be supporting an rv32i mode on rv64 processors, which only costs a few gates. or, for write-once-runnish-anywhere, as/400-style/android-style compilation to native code at installation time, or nvidia-style compilation to native code at load time, or transmeta-style/qemu-style compilation to native code at i$-miss time
the preprocessor thing depends on what it is that bothers you about the grotesque compilers. a macro preprocessor can be pretty small; gpm (basically unix m6) was reputedly 250 machine instructions
historically 'worse is better' and 'the innovator's dilemma' suggest that adoption of an innovation like risc-v depends more on conquering the low end of the market long before the high end. but then there's the tesla roadster...
risc-v is nicer than amd64, but between ldm/stm, shifted indexing, postincrement/preincrement, and conditional instructions, arm assembly is just about c
Besides media codecs and embedded microcontrollers, what are major uses of writing raw assembly language these days? I worked in game engines for quite a while and everyone I know of there sticks to intrinsics.
Most people who need assembly are going to want to only write assembly-embedded-in-C, and learn the GCC constraint specifiers. This of course assumes that there aren't sufficient intrinsics (these days, there are a lot of intrinsics exposed! [1]). Note that you can specify registers to be used to store a variable without writing any asm, if for some reason the register allocator is confused by what you're doing.
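For example, a minimal sketch of extended asm with constraint specifiers, assuming x86-64 and gcc or clang:

    #include <stdint.h>

    /* rotate left by a variable count: "+r" is a read-write register
       operand, "c" pins the count to rcx (rol takes its count in cl),
       and "cc" says the flags get clobbered */
    static inline uint64_t rotl64(uint64_t x, uint64_t n)
    {
        __asm__("rolq %%cl, %0" : "+r"(x) : "c"(n) : "cc");
        return x;
    }

The no-asm register pinning is the `register uint64_t x __asm__("r12");` syntax, though its guarantees are narrower than they look (see downthread).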
Developing a compiler from scratch is the other significant use for writing it. Of course it is quite common to need to read it.
This is a notable distinction: writing assembly is a different skill from reading it in a disassembly. Reverse engineering, malware analysis, etc. don't inherently require you to be able to write asm, although it certainly helps.
asm goto is really useful in this context. It means you can pick the branch instructions and define the exact control flow graph you want. Intrinsics, normal C, inline asm defining basic blocks and asm goto defining the CFG lets you emit exactly the instructions you want, albeit with somewhat challenging syntax.
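For instance, a sketch assuming x86-64 and a gcc or clang new enough for asm goto (the function is just illustrative):

    /* branch to a C label from inline asm: bt copies the selected
       bit into the carry flag, jc transfers control to "set" */
    static int bit_is_set(unsigned long word, int bit)
    {
        asm goto("btq %1, %0\n\t"
                 "jc %l[set]"
                 : /* no outputs: asm goto only gained them in gcc 11 */
                 : "r"(word), "r"((unsigned long)bit)
                 : "cc"
                 : set);
        return 0;
    set:
        return 1;
    }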
Pinning local variables to registers didn't work in llvm a couple of years ago (which seems consistent with the gcc docs), but it does work at the boundaries of inline asm, and that's generally enough. I'd like a pin-register intrinsic, something like `u64 pin(u64, enum reg)`, where the compile-time-constant enum names the register and the semantics are a no-op apart from constraining the register allocator, but nothing like that seems to be readily available in gcc/clang.
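The boundary pattern that does work is a local register variable feeding an asm statement, e.g. a raw syscall (a sketch, assuming the x86-64 linux ABI):

    /* write(2) via raw syscall: the ABI wants the number in rax and
       the arguments in rdi/rsi/rdx; syscall clobbers rcx and r11 */
    static long sys_write(int fd, const void *buf, unsigned long len)
    {
        register long rax __asm__("rax") = 1;        /* SYS_write */
        register long rdi __asm__("rdi") = fd;
        register long rsi __asm__("rsi") = (long)buf;
        register long rdx __asm__("rdx") = len;
        __asm__ volatile("syscall"
                         : "+r"(rax)
                         : "r"(rdi), "r"(rsi), "r"(rdx)
                         : "rcx", "r11", "memory");
        return rax;                                  /* count or -errno */
    }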
I don't have a good answer to constraining instruction scheduling.
On reflection it's all somewhat more horrible than it needs to be; perhaps inline compiler IR is a better idea.
Media codecs are a microcosm of routines that want more care and attention than compilers are yet able to give. Some of RAD's decompression kernels are written in assembly, for instance, and a particular unicode transcoding routine was 20% faster in assembly than in c with intrinsics. An anecdote I heard: it was appreciably faster to run the linux version of postgresql in a translation layer under solaris than to run the native version, because the linux c library's strings functions were written in assembly. A particular binary search routine I rewrote in assembly (from c) was about 4x faster; I would estimate that about half of that was from an improved algorithm, and the other half was from the choice of language. (Of course, somebody else later made a c implementation that was a bit faster, owing to an algorithmic improvement...)
More prosaically, getting compilers to generate branchless code reliably is difficult: there's no intrinsic for CMOV* and the like, and the builtins that should act as a hint don't work[1].
The default probability of any branch, absent further information, should be 0.5, so I wouldn't expect that builtin to do anything here. Direct your ire instead at llvm, whose __builtin_unpredictable is explicitly supposed to do this but doesn't.
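One workaround is to force the cmov yourself; a minimal sketch, assuming x86-64 and GNU-style inline asm:

    #include <stdint.h>

    /* cond ? a : b with no branch: test sets ZF when cond == 0,
       and cmove then overwrites a with b only in that case */
    static inline uint64_t select_u64(uint64_t cond, uint64_t a, uint64_t b)
    {
        __asm__("test %1, %1\n\t"
                "cmove %2, %0"
                : "+r"(a)
                : "r"(cond), "r"(b)
                : "cc");
        return a;
    }

The cost is that the compiler can no longer see through it, so it only pays off when the branch really is unpredictable.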
as you know, a lot of day-to-day use of assembly nowadays is writing compilers and debugging compiled programs, though i also got surprisingly good performance in httpdito from a fork-per-client web server http://canonical.org/~kragen/sw/dev3/httpdito-readme
dan bernstein makes the argument that, as computers get faster, we use them on bigger problems, which means that computer performance is increasingly dominated by small inner loops, which is precisely the situation where it becomes more rational to put effort into hand-optimizing your small inner loops than to hack on the compiler to hopefully speed up all parts of the program, just as it was in the 01960s for different reasons
a different way to attack that problem in many cases is to write a domain-specific compiler from a domain-specific language to machine code, as thompson's regexp engine did, and as verilog compilers do. but i'm not sure how you speed up a media codec that way
bernstein has also written a fair bit of assembly to eliminate timing side-channel leaks from cryptographic code
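the c-level idiom that assembly guards looks like this (a sketch; the catch is that a compiler is always free to turn it back into branches, which is exactly why the real thing gets written in assembly):

    #include <stddef.h>
    #include <stdint.h>

    /* compare two secrets without data-dependent branches: the
       running 'or' of the xors is nonzero iff any byte differs */
    int ct_memeq(const uint8_t *a, const uint8_t *b, size_t n)
    {
        uint8_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc |= a[i] ^ b[i];
        return acc == 0;
    }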
Operating systems and hardware drivers are big ones. Compilers do a lot of silent stuff behind the scenes that can interfere when bare metal is involved.
Not exactly what you're asking, but reading and manipulating asm is critical in several parts of infosec: reversing, vuln analysis, exploit development, and so on. The same goes for somewhat related fields such as game hacks and cracking. To some extent this also includes writing raw asm.
I don’t write it but I read it almost every day while working on high performance C/C++ code. Reading the generated assembly almost always gives me hints on how to optimize the code better
Freestanding nolibc Linux applications. They don't link in the so-called "startfiles", so there's nothing there to bring your program from its ELF entry point to your actual main function. I wrote assembly code to collect the process parameters, like the arguments, environment and auxiliary vector, and pass them to a C function of my choice.
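A minimal sketch of that kind of entry stub for x86-64 Linux (assumed target; `start_c` is just my illustrative name; build with something like gcc -nostartfiles -static):

    /* the kernel starts the process with argc on top of the stack,
       then argv (NULL-terminated), then envp (NULL-terminated),
       then the auxiliary vector */
    __asm__(".globl _start\n"
            "_start:\n"
            "  movq %rsp, %rdi\n"    /* hand the raw stack to C */
            "  andq $-16, %rsp\n"    /* re-align per the SysV ABI */
            "  call start_c\n"
            "  movq %rax, %rdi\n"    /* exit status */
            "  movq $60, %rax\n"     /* SYS_exit */
            "  syscall\n");

    long start_c(long *sp)
    {
        long   argc = sp[0];
        char **argv = (char **)(sp + 1);
        char **envp = argv + argc + 1;  /* auxv follows envp's NULL */
        (void)argc; (void)argv; (void)envp;
        /* ... set things up, call the real main ... */
        return 0;
    }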
Conservative garbage collectors. Scanning the native stack for pointers can be done in C but isn't quite enough since there might be pointers in registers. So I wrote assembly code to spill all the registers onto the stack prior to scanning.
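A portable approximation of the register-spilling part (a sketch: `scan_range` is hypothetical, the stack is assumed to grow down, and real collectors use per-architecture asm or getcontext instead):

    #include <setjmp.h>

    void scan_range(void *lo, void *hi);   /* hypothetical scanner */

    /* setjmp has to save the callee-saved registers into the jmp_buf,
       which lives in this frame, so a conservative scan that starts
       at &regs will see any pointers that were sitting in them */
    void gc_scan_stack(void *stack_bottom)
    {
        jmp_buf regs;
        setjmp(regs);
        scan_range(&regs, stack_bottom);
    }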
General data compression/decompression can also benefit from asm level tuning, including generic compression such as Huffman/zlib and more specific compression like animation compression. MSVC is also a lot worse at vectorization than GCC or Clang and can more easily be beaten with asm.
I would agree that direct asm is very rare these days in game engines outside of third-party libraries. There can be significant gains with tuned asm but some combination of intrinsics and ISPC is usually good enough. But it is far more useful to be able to _read_ assembly, for debugging in an optimized build or analyzing release crashes.
Another area that demands performance and can pay for it is databases. Proprietary databases have hand-written assembly in their hot paths, tuned for the exact processor they're running on.
I have always wondered why they are written in assembly. Is it "just" to guarantee the exact shape, size and contents of the payload or are there other reasons?
Size is very important, but when exploiting memory-copying errors, removing null bytes is key to keep the copy from terminating early. Additionally, you may be creating/modifying a new stack in some cases.
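The classic illustration, in gas syntax with encodings from the Intel manual (illustration only, not meant to be executed):

    /* two ways to put 59 in rax; the first encoding embeds null
       bytes and would cut a strcpy-style overflow short */
    __asm__("movq $59, %rax\n"    /* 48 c7 c0 3b 00 00 00 */
            "xorl %eax, %eax\n"   /* 31 c0                */
            "movb $59, %al\n");   /* b0 3b - no null bytes */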
Since this book targets Ubuntu, I'm assuming the Linux toolchain supports Intel syntax now; I guess that also makes it a more "portable skill" across different systems.
x86 assembly has two main forms: Intel's own syntax, which is also used by Microsoft's MASM and other DOS/Windows assemblers, and the AT&T syntax traditionally used by Unix-type operating systems, which resembles PDP-11 assembly. Notably, source and destination operands are flipped between the two, amongst other differences.
Generally AT&T syntax is seen as a bit weird in the x86 realm, and some assemblers popular on Unix, like NASM, use Intel syntax.
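The GNU assembler accepts both via directives, so the contrast fits in one snippet (a minimal illustration, assuming x86-64):

    __asm__(".intel_syntax noprefix\n"
            "mov rax, [rbx+8]\n"      /* destination first, bare register names */
            ".att_syntax prefix\n"
            "movq 8(%rbx), %rax\n");  /* source first, %-prefixes, size suffix */

(gcc also takes -masm=intel if you want Intel-flavored compiler output.)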
Hopefully not offtopic: if I'd like to learn/use asm on Apple's ARM Macs, what would be a good start?
In very short, I'd be interested in, at some point, improving the performance of the code for the apps that my company creates.
[1]: https://americati.com/dat/assembler.html