367-page cc-by-nc-sa textbook; though the license permits modification, no source format seems to be provided, just a pdf. it covers mostly introductory userland programming, though there's a chapter on interrupt handling. on skimming the table of contents i don't see anything about page tables, tlbs, atomics, the amd64 memory consistency model, running before sdram is enabled, iommus, mtrrs, or simd. none of those things are necessary to write a compiler or debug most compiled code, except maybe simd
this book also looks like it could be a very solid base for a course that did cover one or more of those additional topics, with some supplementary material
it probably teaches more than everything i know about amd64 assembly, except the most important thing, which is that arm assembly is much better
I'm sure many architectures are nicer than amd64, but that's what the CPU in my laptop uses, as well as my Steam Deck. If I want to get into low-level hacking it's easiest to start with the hardware I have, no?
While Arm assembly is in general much nicer, there are niche applications, like big-number arithmetic, where Arm still lacks some of the instructions the AMD64 ISA includes, which makes the latter much more convenient for that kind of work.
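For example, multi-word addition leans on x86's flag-carry chain. A minimal sketch, assuming x86-64 and gcc or clang, using the _addcarry_u64 intrinsic from <immintrin.h>:

    #include <immintrin.h>

    /* 256-bit add: out = a + b, carry chained through the flags;
       gcc/clang compile this down to an add/adc sequence */
    unsigned char add256(unsigned long long out[4],
                         const unsigned long long a[4],
                         const unsigned long long b[4])
    {
        unsigned char c = 0;
        for (int i = 0; i < 4; i++)
            c = _addcarry_u64(c, a[i], b[i], &out[i]);
        return c;   /* the final carry out */
    }

(To be fair, AArch64 does have adds/adcs; the bigger AMD64 wins for bignums are mulx and the adcx/adox pair, which let two independent carry chains run at once.)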
Yep, RISC-V is supposed to become a haven for assembly-written programs, since it's a vendor-neutral ISA standard: the same code will "work" on CPUs from many vendors.
The main pitfall when writing assembly is abuse of a preprocessor. Being dependent on the grotesquely, absurdly complex and massive compilers out there is one thing, but moving that dependency onto a complex preprocessor is not much better. So it's an issue that calls for caution and care.
I hope all RISC-V CPU vendors pay the silicon real-estate price for 64 bits, because writing 64-bit RISC-V code once and being able to "run-ish" it everywhere, from "embedded" through workstations to servers, would be wonderful.
I'm currently writing x86_64 assembly; the manual "register-ization" of some code paths is done, ready for an easy RISC-V port. Not to mention RISC-V has double the number of general-purpose registers, and some code paths will really benefit from the additional register space (Intel plans to follow that route with its APX extension).
Of course, this will be mostly micro-arch-agnostic assembly code, as is already the case for the x86_64 version, with maybe some simple or very generic static "optimizations" (cache-line awareness, alignment, register-ization, leveraging some instruction fusions, etc.). Worst case, code paths get adapted (not rewritten from scratch) to fit a particular micro-arch better, selected with a runtime switch or at install time... if really needed. I guess correct code will become very important, maybe more so than fast code (which "should" become true for hardware design too, if the performance penalty is not too high).
RISC-V will still need compiler support, though, if only for legacy code, and sometimes hand-polishing ("humanizing") the assembly output of compiled programs can be useful.
All of that depends on whether RISC-V succeeds, which will require highly performant implementations across the board: good micro-archs on the best silicon processes.
RISC-V is not perfect, but it's a more than good enough modern ISA, and the real fragmentation risk is 32-bit/64-bit code paths, even though some care was taken to make 32-bit <-> 64-bit code adaptation easy.
Mistakes will be made (micro-archs with critical bugs), so it won't happen overnight.
dunno, maybe? ch32v003 is rv32e, so rv64 seems out of the question. more likely than supporting rv64 on ch32v003 and its successors would be supporting an rv32i mode on rv64 processors, which only costs a few gates. or, for write-once-runnish-anywhere, as/400-style/android-style compilation to native code at installation time, or nvidia-style compilation to native code at load time, or transmeta-style/qemu-style compilation to native code at i$-miss time
the preprocessor thing depends on what it is that bothers you about the grotesque compilers. a macro preprocessor can be pretty small; gpm (basically unix m6) was reputedly 250 machine instructions
historically 'worse is better' and 'the innovator's dilemma' suggest that adoption of an innovation like risc-v depends more on conquering the low end of the market long before the high end. but then there's the tesla roadster...
risc-v is nicer than amd64, but between ldm/stm, shifted indexing, postincrement/preincrement, and conditional instructions, arm assembly is just about c
Besides media codecs and embedded microcontrollers, what are major uses of writing raw assembly language these days? I worked in game engines for quite a while and everyone I know of there sticks to intrinsics.
Most people who need assembly are going to want to only write assembly-embedded-in-C, and learn the GCC constraint specifiers. This of course assumes that there aren't sufficient intrinsics (these days, there are a lot of intrinsics exposed! [1]). Note that you can specify registers to be used to store a variable without writing any asm, if for some reason the register allocator is confused by what you're doing.
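For example, a minimal sketch of extended asm with constraint specifiers, assuming x86-64 and gcc or clang:

    #include <stdint.h>

    /* rotate left by a variable count: "+r" is a read-write register
       operand, "c" pins the count to rcx (rol takes its count in cl),
       and "cc" says the flags get clobbered */
    static inline uint64_t rotl64(uint64_t x, uint64_t n)
    {
        __asm__("rolq %%cl, %0" : "+r"(x) : "c"(n) : "cc");
        return x;
    }

The no-asm register pinning is the `register uint64_t x __asm__("r12");` syntax, though its guarantees are narrower than they look (see downthread).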
Developing a compiler from scratch is the other significant use for writing it. Of course it is quite common to need to read it.
This is a notable distinction: writing assembly is a different skill from reading it in a disassembly. Reverse engineering, malware analysis, etc. don't inherently require you to be able to write asm, although it certainly helps.
asm goto is really useful in this context. It means you can pick the branch instructions and define the exact control flow graph you want. Intrinsics, normal C, inline asm defining basic blocks and asm goto defining the CFG lets you emit exactly the instructions you want, albeit with somewhat challenging syntax.
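For instance, a sketch assuming x86-64 and a gcc or clang new enough for asm goto (the function is just illustrative):

    /* branch to a C label from inline asm: bt copies the selected
       bit into the carry flag, jc transfers control to "set" */
    static int bit_is_set(unsigned long word, int bit)
    {
        asm goto("btq %1, %0\n\t"
                 "jc %l[set]"
                 : /* no outputs: asm goto only gained them in gcc 11 */
                 : "r"(word), "r"((unsigned long)bit)
                 : "cc"
                 : set);
        return 0;
    set:
        return 1;
    }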
Pinning local variables to registers didn't work in llvm a couple of years ago (which seems consistent with the gcc docs), but it does work at the boundaries of inline asm, and that's generally enough. I'd like a pin-register intrinsic, something like `u64 pin(u64, enum reg)`, where the compile-time-constant enum names the register and the semantics are a no-op apart from constraining the register allocator, but nothing like that seems to be readily available in gcc/clang.
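The boundary pattern that does work is a local register variable feeding an asm statement, e.g. a raw syscall (a sketch, assuming the x86-64 linux ABI):

    /* write(2) via raw syscall: the ABI wants the number in rax and
       the arguments in rdi/rsi/rdx; syscall clobbers rcx and r11 */
    static long sys_write(int fd, const void *buf, unsigned long len)
    {
        register long rax __asm__("rax") = 1;        /* SYS_write */
        register long rdi __asm__("rdi") = fd;
        register long rsi __asm__("rsi") = (long)buf;
        register long rdx __asm__("rdx") = len;
        __asm__ volatile("syscall"
                         : "+r"(rax)
                         : "r"(rdi), "r"(rsi), "r"(rdx)
                         : "rcx", "r11", "memory");
        return rax;                                  /* count or -errno */
    }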
I don't have a good answer to constraining instruction scheduling.
On reflection it's all somewhat more horrible than it needs to be; perhaps inline compiler IR is a better idea.
Media codecs are a microcosm of routines that want more care and attention than compilers are yet able to give. Some of RAD's decompression kernels are written in assembly, for instance, and a particular unicode transcoding routine was 20% faster in assembly than in c with intrinsics. An anecdote I heard: it was appreciably faster to run the linux version of postgresql in a translation layer under solaris than to run the native version, because the linux c library's strings functions were written in assembly. A particular binary search routine I rewrote in assembly (from c) was about 4x faster; I would estimate that about half of that was from an improved algorithm, and the other half was from the choice of language. (Of course, somebody else later made a c implementation that was a bit faster, owing to an algorithmic improvement...)
More prosaically, getting compilers to generate branchless code reliably is difficult: there's no intrinsic for CMOV* and the like, and the builtins that should act as a hint don't work[1].
The default probability of any branch, absent further information, should be 0.5, so I wouldn't expect that builtin to do anything here. Direct your ire instead at llvm, whose __builtin_unpredictable is explicitly supposed to do this but doesn't.
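One workaround is to force the cmov yourself; a minimal sketch, assuming x86-64 and GNU-style inline asm:

    #include <stdint.h>

    /* cond ? a : b with no branch: test sets ZF when cond == 0,
       and cmove then overwrites a with b only in that case */
    static inline uint64_t select_u64(uint64_t cond, uint64_t a, uint64_t b)
    {
        __asm__("test %1, %1\n\t"
                "cmove %2, %0"
                : "+r"(a)
                : "r"(cond), "r"(b)
                : "cc");
        return a;
    }

The cost is that the compiler can no longer see through it, so it only pays off when the branch really is unpredictable.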
as you know, a lot of day-to-day use of assembly nowadays is writing compilers and debugging compiled programs, though i also got surprisingly good performance in httpdito from a fork-per-client web server http://canonical.org/~kragen/sw/dev3/httpdito-readme
dan bernstein makes the argument that, as computers get faster, we use them on bigger problems, which means that computer performance is increasingly dominated by small inner loops, which is precisely the situation where it becomes more rational to put effort into hand-optimizing your small inner loops than to hack on the compiler to hopefully speed up all parts of the program, just as it was in the 01960s for different reasons
a different way to attack that problem in many cases is to write a domain-specific compiler from a domain-specific language to machine code, as thompson's regexp engine did, and as verilog compilers do. but i'm not sure how you speed up a media codec that way
bernstein has also written a fair bit of assembly to eliminate timing side-channel leaks from cryptographic code
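the c-level idiom that assembly guards looks like this (a sketch; the catch is that a compiler is always free to turn it back into branches, which is exactly why the real thing gets written in assembly):

    #include <stddef.h>
    #include <stdint.h>

    /* compare two secrets without data-dependent branches: the
       running 'or' of the xors is nonzero iff any byte differs */
    int ct_memeq(const uint8_t *a, const uint8_t *b, size_t n)
    {
        uint8_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc |= a[i] ^ b[i];
        return acc == 0;
    }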
Operating systems and hardware drivers are big ones. Compilers do a lot of silent stuff behind the scenes that can interfere when bare metal is involved.
Not exactly what you're asking, but reading and manipulating asm is critical in several parts of infosec: reversing, vuln analysis, exploit development, and so on. The same goes for somewhat related fields such as game hacks and cracking. To some extent this also includes writing raw asm.
I don’t write it but I read it almost every day while working on high performance C/C++ code. Reading the generated assembly almost always gives me hints on how to optimize the code better
Freestanding nolibc Linux applications. They don't link in the so-called "startfiles", so there's nothing there to bring your program from its ELF entry point to your actual main function. I wrote assembly code to collect the process parameters, like the arguments, environment and auxiliary vector, and pass them to a C function of my choice.
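A minimal sketch of that kind of entry stub for x86-64 Linux (assumed target; `start_c` is just my illustrative name; build with something like gcc -nostartfiles -static):

    /* the kernel starts the process with argc on top of the stack,
       then argv (NULL-terminated), then envp (NULL-terminated),
       then the auxiliary vector */
    __asm__(".globl _start\n"
            "_start:\n"
            "  movq %rsp, %rdi\n"    /* hand the raw stack to C */
            "  andq $-16, %rsp\n"    /* re-align per the SysV ABI */
            "  call start_c\n"
            "  movq %rax, %rdi\n"    /* exit status */
            "  movq $60, %rax\n"     /* SYS_exit */
            "  syscall\n");

    long start_c(long *sp)
    {
        long   argc = sp[0];
        char **argv = (char **)(sp + 1);
        char **envp = argv + argc + 1;  /* auxv follows envp's NULL */
        (void)argc; (void)argv; (void)envp;
        /* ... set things up, call the real main ... */
        return 0;
    }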
Conservative garbage collectors. Scanning the native stack for pointers can be done in C but isn't quite enough since there might be pointers in registers. So I wrote assembly code to spill all the registers onto the stack prior to scanning.
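A portable approximation of the register-spilling part (a sketch: `scan_range` is hypothetical, the stack is assumed to grow down, and real collectors use per-architecture asm or getcontext instead):

    #include <setjmp.h>

    void scan_range(void *lo, void *hi);   /* hypothetical scanner */

    /* setjmp has to save the callee-saved registers into the jmp_buf,
       which lives in this frame, so a conservative scan that starts
       at &regs will see any pointers that were sitting in them */
    void gc_scan_stack(void *stack_bottom)
    {
        jmp_buf regs;
        setjmp(regs);
        scan_range(&regs, stack_bottom);
    }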
General data compression/decompression can also benefit from asm level tuning, including generic compression such as Huffman/zlib and more specific compression like animation compression. MSVC is also a lot worse at vectorization than GCC or Clang and can more easily be beaten with asm.
I would agree that direct asm is very rare these days in game engines outside of third-party libraries. There can be significant gains with tuned asm but some combination of intrinsics and ISPC is usually good enough. But it is far more useful to be able to _read_ assembly, for debugging in an optimized build or analyzing release crashes.
Another area that demands performance and can pay for it is databases. Proprietary databases have hand-written assembly in their hot paths, tuned for the exact processor they're running on.
I have always wondered why they are written in assembly. Is it "just" to guarantee the exact shape, size and contents of the payload or are there other reasons?
Size is very important, but when exploiting memory-copying errors, removing null bytes is key to keep the copy from terminating early. Additionally, you may be creating/modifying a new stack in some cases.
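The classic illustration, in gas syntax with encodings from the Intel manual (illustration only, not meant to be executed):

    /* two ways to put 59 in rax; the first encoding embeds null
       bytes and would cut a strcpy-style overflow short */
    __asm__("movq $59, %rax\n"    /* 48 c7 c0 3b 00 00 00 */
            "xorl %eax, %eax\n"   /* 31 c0                */
            "movb $59, %al\n");   /* b0 3b - no null bytes */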
Since this book targets Ubuntu, I'm assuming the Linux toolchain supports Intel syntax now; I guess that also makes it a more "portable skill" across different systems.
x86 assembly has two main forms: Intel's own syntax, which is also used by Microsoft's MASM and other DOS/Windows assemblers, and the AT&T syntax traditionally used by Unix-type operating systems, which resembles PDP-11 assembly. Notably, source and destination operands are flipped between the two, amongst other differences.
Generally AT&T syntax is seen as a bit weird in the x86 realm, and some assemblers popular on Unix, like NASM, use Intel syntax.
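The GNU assembler accepts both via directives, so the contrast fits in one snippet (a minimal illustration, assuming x86-64):

    __asm__(".intel_syntax noprefix\n"
            "mov rax, [rbx+8]\n"      /* destination first, bare register names */
            ".att_syntax prefix\n"
            "movq 8(%rbx), %rax\n");  /* source first, %-prefixes, size suffix */

(gcc also takes -masm=intel if you want Intel-flavored compiler output.)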
Hopefully not offtopic: if I'd like to learn/use asm on Apple's ARM Macs, what would be a good start?
In very short, I'd be interested in, at some point, improving the performance of the code for the apps that my company creates.
[1]: https://americati.com/dat/assembler.html