
There are multiple axes of "best". The simplest, most portable, and most reproducible way to interact with a GPU is direct pixel access. Sometimes that's not fast enough, of course, but that's mainly when you're suffering from uncontrollable fits of interactivity. Most of the time, the best solution to that problem is to redesign your user interface to require less interaction: https://worrydream.com/MagicInk/

> The ubiquity of frustrating, unhelpful software interfaces has motivated decades of research into “Human-Computer Interaction.” In this paper, I suggest that the long-standing focus on “interaction” may be misguided. For a majority subset of software, called “information software,” I argue that interactivity is actually a curse for users and a crutch for designers, and users’ goals can be better satisfied through other means.

But yeah if you're playing an FPS you probably want to talk to your GPU through command buffers rather than pixel buffers.




There's going to be a compatibility-performance tradeoff here, to be sure, though the compatibility issue is going to be more with "very old platforms" and the performance issue is going to be more with "very high resolutions on very high refresh rates". So it's a question of whether you want to produce something that works well on current and past hardware vs. works well on current and future hardware, with some allowance for "can't please everybody".

I don't consider scrolling a large page to be an "uncontrollable fit of interactivity", but it's going to struggle to stay smooth using a single, simple linear array of pixels that's manipulated solely by the CPU. If you can at least work with multiple pixel buffers and operate on them somewhat abstractly, so that even basic operations can be pushed down to the GPU even if you don't work directly with command buffers, that will go a long way toward bridging the gap between past and future, at least for 2D interfaces.


The compatibility issue is mostly going to be with future platforms that subtly change the semantics of the interfaces you're using or whose drivers have different bugs than the drivers you tested on. To take a trivial example, most GPUs don't bother to implement IEEE 754 gradual underflow.

I think you're wrong about struggling to stay smooth scrolling a large page. Maybe it was true on the original iPhone in 02007? Or it's true of complex multilayered translucent vector art with a fixed background? But it's not true of things like text with inline images.

Let's suppose that scrolling a large page involves filling a 4K pixel buffer, 3840×2160, with 32-bit color. If you have an in-memory image of the page, this is just 2160 memcpys of the appropriate 15360-byte pixel line; you're going to be memcpy-bandwidth-limited, because figuring out where to copy the pixels from is a relatively trivial calculation by comparison. On the laptop I'm typing this on (which incidentally doesn't have a 4K screen) memcpy bandwidth to main memory (not cache) is 10.8 gigabytes per second, according to http://canonical.org/~kragen/sw/dev3/memcpycost.c. The whole pixel buffer you're filling is only 33.2 megabytes, so this takes 3.1 milliseconds. (Of one CPU core.) Even at 120fps this is less than half the time required.
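For concreteness, here's roughly what that inner loop might look like in C. The buffer layout and names are illustrative assumptions on my part, not code from memcpycost.c:

    /* Sketch: scroll by blitting the visible window out of a pre-rendered
       page image held in main memory.  Names and layout are illustrative. */
    #include <stdint.h>
    #include <string.h>

    enum { SCREEN_W = 3840, SCREEN_H = 2160 };   /* 4K, 32-bit pixels */

    /* page: page_h scan lines of SCREEN_W pixels; fb: the output buffer.
       scroll_y: index of the first visible page scan line. */
    void blit_viewport(uint32_t *fb, const uint32_t *page,
                       int page_h, int scroll_y)
    {
        for (int y = 0; y < SCREEN_H; y++) {
            int src = scroll_y + y;
            if (src < 0 || src >= page_h) {
                memset(fb + (size_t)y * SCREEN_W, 0,
                       SCREEN_W * sizeof *fb);     /* past the page: blank */
            } else {
                memcpy(fb + (size_t)y * SCREEN_W,
                       page + (size_t)src * SCREEN_W,
                       SCREEN_W * sizeof *fb);     /* one 15360-byte line */
            }
        }
    }

At 10.8 gigabytes per second of memcpy bandwidth, the 33.2 megabytes of copying this does per frame comes out to about 3 milliseconds, matching the estimate above.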

(For a large page you might want to not keep all your JPEGs decompressed in RAM, re-decoding them as required, but this is basically never done on the GPU.)

But what if the page is full of text and you have to rerender the visible part from a font atlas every frame? That's not quite as fast on the CPU, but it's still not slow enough to be a problem.

If you have a tree of glyph-index strings with page positions in memory already, finding the glyph strings that are on the screen is computationally trivial; perhaps in a 16-pixel-tall font, 2160 scan lines is 135 lines of text, each of which might contain five or six strings, and so you just have to find the 600 strings in the tree that overlap your viewport. Maybe each line has 400 glyphs in it, though 60 would be more typical, for a total of 55000 glyphs to draw.

We're going to want to render one texel per pixel to avoid fuzzing out the letters, and by the same token we can, I think, presuppose that the text is not rotated. So again in our inner loop we're memcpying, but this time from the font atlas into the pixel buffer. Maybe we're only memcpying a few pixels at a time, like an average of 8, so we end up calling memcpy once per glyph scan line, 55000×16 ≈ 900k times per frame, requiring on the order of 10 million instructions, which is on the order of an extra millisecond. So maybe instead of 3 milliseconds your frame time is 4 milliseconds.
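Sketching that per-glyph loop (illustrative names, opaque glyphs, no clipping or blending, and not the actual code in propfont.c):

    /* Sketch: copy one glyph's scan lines out of a font atlas into the
       frame buffer.  Assumes 32-bit pixels and no clipping. */
    #include <stdint.h>
    #include <string.h>

    struct glyph { int atlas_x, atlas_y, w, h; };  /* location in the atlas */

    void draw_glyph(uint32_t *fb, int fb_stride,
                    const uint32_t *atlas, int atlas_stride,
                    const struct glyph *g, int dst_x, int dst_y)
    {
        for (int row = 0; row < g->h; row++) {     /* ~16 rows for a 16px font */
            memcpy(fb + (size_t)(dst_y + row) * fb_stride + dst_x,
                   atlas + (size_t)(g->atlas_y + row) * atlas_stride
                         + g->atlas_x,
                   (size_t)g->w * sizeof *fb);     /* ~8 pixels per call */
        }
    }

At 16 rows per glyph, that's exactly the 55000×16 ≈ 900k memcpy calls estimated above.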

(It might actually be faster instead of slower, because the relevant parts of the font atlas are probably going to have a high data cache hit rate, so memcpy can go faster than 10 gigs a second.)

I did test something similar to this in http://canonical.org/~kragen/sw/dev3/propfont.c, which runs on one core of this laptop at 84 million glyphs per second (thus about 0.7ms for our hypothetical 55000-glyph screenful) but it's doing a somewhat harder job because it's word-wrapping the text as it goes. (It's using a small font, so it takes less memcpy time per glyph.)

So maybe scrolling a 4K page might take 4 milliseconds per screen update on the CPU. If you only use one core. I would say it was "struggling to stay smooth" if the frame rate fell below 30fps, which is 33 milliseconds per frame. So you have almost an order of magnitude of performance headroom. If your window is only 1920×1080, you have 1½ orders of magnitude of headroom, 2 orders of magnitude if you're willing to use four cores.


I did some basic tests with SDL3 and SDL3_ttf, using only surfaces in CPU memory and with acceleration disabled, on my 2560p 144Hz monitor and the copying was never a bottleneck. I was concretely able to achieve an average of 3ms per frame, well under the 144Hz budget of 6.9ms per frame, to scroll a pre-rendered text box with a small border in a fullscreen window. Even at 4K resolution (though that monitor is only 60Hz), I was seeing 5-6 ms per frame, still good enough for 144Hz and leaving lots of time to spare for 60Hz. I think this certainly proves that smoothly scrolling a text box, at least with a powerful desktop computer, is not an issue using only direct pixel access.

The bigger issue, though, may be rendering the text in the first place. I'm not sure how much the GPU can help there, though it is at least possible with SDL3_ttf to pass off some of the work to the GPU; I may test that as well.


> The bigger issue, though, may be rendering the text in the first place. I'm not sure how much the GPU can help there, though it is at least possible with SDL3_ttf to pass off some of the work to the GPU; I may test that as well.

The font rendering gets slow if you re-render the glyphs regularly. This becomes a challenge if you render anti-aliased glyphs at sub-pixel offsets, which makes the cost of caching them really high.

If you keep things on pixel boundaries, caching them is cheap, and so you just render each glyph once at a given size, unless severely memory constrained.

For proportional text or if you add support for ligatures etc. it can get harder, but I think for most scenarios your rendering would have a really high cache hit ratio unless you're very memory constrained.
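A minimal sketch of what such a cache could look like when glyphs are snapped to pixel boundaries, so the key is just (glyph index, pixel size); everything here is illustrative rather than taken from the terminal's actual code:

    /* Sketch: a glyph cache keyed on (glyph index, pixel size), which is
       enough when glyphs always land on whole-pixel boundaries.  Rendering
       at sub-pixel offsets would force the offset into the key, multiplying
       the number of cached bitmaps. */
    #include <stdint.h>
    #include <stdlib.h>

    struct cached_glyph {
        uint32_t glyph_index;
        uint16_t px_size;
        int      w, h;
        uint8_t *coverage;          /* 8-bit coverage bitmap, w*h bytes */
        struct cached_glyph *next;  /* chaining for hash collisions */
    };

    #define CACHE_BUCKETS 1024
    static struct cached_glyph *cache[CACHE_BUCKETS];

    static unsigned bucket(uint32_t glyph_index, uint16_t px_size)
    {
        return (glyph_index * 2654435761u ^ px_size) % CACHE_BUCKETS;
    }

    /* Stands in for the real TTF rasterizer (e.g. something like skrift). */
    extern uint8_t *rasterize(uint32_t glyph_index, uint16_t px_size,
                              int *w, int *h);

    struct cached_glyph *get_glyph(uint32_t glyph_index, uint16_t px_size)
    {
        unsigned b = bucket(glyph_index, px_size);
        for (struct cached_glyph *g = cache[b]; g; g = g->next)
            if (g->glyph_index == glyph_index && g->px_size == px_size)
                return g;                   /* the common case: cache hit */

        struct cached_glyph *g = malloc(sizeof *g);
        g->glyph_index = glyph_index;
        g->px_size = px_size;
        g->coverage = rasterize(glyph_index, px_size, &g->w, &g->h);
        g->next = cache[b];
        cache[b] = g;
        return g;
    }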

My terminal is written in Ruby, and uses a TTF engine in Ruby, and while it's not super-fast, the font rendering isn't in the hot path in normal use, so while speeding up my terminal rendering is somewhere on my todo list (far down), the font rendering isn't where I'll be spending time...

Even the worst case of rendering a full screen of text in 4k at a tiny font size after changing font size (and so throwing away the glyph cache) is pretty much fast enough.

I think this is pretty much the worst case scenario you'll run into on a modern system - Ruby isn't fast (though much faster than it was) - and running a pure Ruby terminal with a pure Ruby font renderer with a pure Ruby X11 client library would only get "worse" if I go crazy enough to write a pure Ruby X11 server as well (the thought has crossed my mind).

If I were to replace any of the Ruby with a C extension, the inner rendering loop that constructs spans of text that reuse the same attributes (colors, boldness etc) and issues the appropriate X calls would be where I'd focus, but I think that too can be made substantially faster than it currently is just by improving the algorithm used.
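For illustration, that kind of run-length grouping over a line of terminal cells might look something like this; the cell layout and emit_span() are assumptions of mine, not the terminal's real data structures:

    /* Sketch: walk a line of cells and issue one draw call per run of cells
       sharing the same attributes, instead of one call per cell. */
    #include <stdint.h>

    struct cell { uint32_t ch; uint32_t fg, bg; uint8_t bold; };

    /* Assumed to issue a single X drawing request covering cols [start, end). */
    extern void emit_span(const struct cell *cells, int start, int end);

    static int same_attrs(const struct cell *a, const struct cell *b)
    {
        return a->fg == b->fg && a->bg == b->bg && a->bold == b->bold;
    }

    void draw_line(const struct cell *cells, int ncols)
    {
        int start = 0;
        for (int i = 1; i <= ncols; i++) {
            if (i == ncols || !same_attrs(&cells[start], &cells[i])) {
                emit_span(cells, start, i);   /* one call per attribute run */
                start = i;
            }
        }
    }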


I think it's okay for glyph generation to be slow as long as it doesn't block redraw and immediate user feedback such as scrolling. While you can make the problem easier by throwing more horsepower at it, I think that to actually solve it you need to design the software so that redraw doesn't wait for glyph generation. It's a case where late answers are worse than wrong answers.

I had forgotten or didn't know that you'd also written a pure Ruby replacement for Xlib! That's pretty exciting! I'm inclined to regard X-Windows as a mistake, though. I think display servers and clients should communicate through the filesystem, by writing window images and input events to files where the other can find them. Inotify is also a botch of an API, but on Linux, inotify provides deep-submillisecond latency for filesystem change notification.
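For what it's worth, the plumbing for the notification half of that kind of file-based protocol is small. A minimal sketch on Linux, with a made-up directory where clients would drop window images (only the inotify calls are real API; the protocol is hypothetical):

    /* Sketch: a display server waiting for a client to finish writing a new
       window image file, using inotify. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    int main(void)
    {
        int fd = inotify_init1(0);
        if (fd < 0) { perror("inotify_init1"); return 1; }

        /* Watch the (hypothetical) directory where clients drop images. */
        if (inotify_add_watch(fd, "/tmp/displayfs", IN_CLOSE_WRITE) < 0) {
            perror("inotify_add_watch");
            return 1;
        }

        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);   /* blocks until an event */
            if (n <= 0) break;
            for (char *p = buf; p < buf + n; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                if (ev->len)
                    printf("client wrote %s; recomposite now\n", ev->name);
                p += sizeof *ev + ev->len;
            }
        }
        return 0;
    }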


For the glyph regeneration, individual characters are more than fast enough - TrueType is actually quite simple to rasterize [1] (if you ignore things like hinting, which you increasingly might as well on 4k displays etc.; also: if you ignore emojis, which involve an embedded subset of SVG in the font file... eww). It's really just that if you have a screenful of previously unseen glyphs you'd get a very brief slowdown. You could warm the cache if you wanted, but in practice I can increase/decrease the font size with a screenful of text in my terminal without it being slow enough to be worth optimizing more.

> I had forgotten or didn't know that you'd also written a pure Ruby replacement for Xlib!

That one is not all me. I've just filled in a bunch of blanks[2], mostly by specifying more packets after the original maintainer disappeared. I keep meaning to simplify it, as while it works well, I find it unnecessarily verbose. I'm also tempted to bite the bullet and write the code to auto-generate the packet handling from the XML files used for XCB.

I think there are large parts of X11 that are broken, but the more I look at my stack, and at how little of X modern clients actually use, the more tempted I am to try to write an X server as well, and see how much cruft I could strip away if I just implement what is needed to run the clients I care about (you could always run Xvnc or Xephyr or similar if you want to run some other app).

That would make it plausible to then separate the rendering backend and the X protocol implementation, and toy with simpler/cleaner protocols...

[1] https://github.com/vidarh/skrift

[2] https://github.com/vidarh/ruby-x11


> I think it's okay for glyph generation to be slow as long as it doesn't block redraw and immediate user feedback such as scrolling

Incidentally, last night I loaded a page containing https://news.ycombinator.com/item?id=44061550 in Fennec on my phone, and at some point when I scrolled to where some superscripts were in view, they were briefly displayed as gray boxes. My inference is that Fennec had loaded the font metrics so it could do layout but didn't do glyph rasterization until the glyphs were in view or nearly so.


Yeah, the difficulty with glyph caching IMO is handling things like combining diacritics. Really, you'd need to do proper Unicode grapheme cluster segmentation [1] to even decide what a valid cache key is in the first place, at least if you intend on supporting all major languages. But if you only want to support most languages, you could get by without it, or just with Unicode normalization [2].

[1]: https://unicode.org/reports/tr29/

[2]: https://unicode.org/reports/tr15/


If you were short on CPU, you could handle "normal" combining diacritics like 0̩́ in a variety of ways, including just alpha-compositing several glyphs into the same pixels every time you redraw, and (except for emoji!) you could compute each scan line of a text layer as 8-bit-deep pixelwise coverage first, opening up the possibility of compositing each pixel with bytewise max() rather than alpha-compositing, before mapping those coverages onto pixel colors. But I think the high nibble of the above discussion is that there's quite a bit of performance headroom.
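A sketch of that coverage-plane idea (illustrative names, no gamma correction): stamp glyph coverage into an 8-bit scan line with a bytewise max(), then map coverage onto colors in a second pass.

    #include <stdint.h>

    /* Stamp one glyph scan line into the coverage plane; overlapping glyphs
       (e.g. combining diacritics) just take the darker of the two coverages. */
    void stamp_coverage(uint8_t *cov_line, const uint8_t *glyph_line,
                        int x, int w)
    {
        for (int i = 0; i < w; i++)
            if (glyph_line[i] > cov_line[x + i])
                cov_line[x + i] = glyph_line[i];
    }

    /* Map a coverage scan line onto pixels: per-channel blend between the
       background and foreground colors. */
    void shade_line(uint32_t *out, const uint8_t *cov_line, int w,
                    uint32_t fg, uint32_t bg)
    {
        for (int i = 0; i < w; i++) {
            uint32_t a = cov_line[i], px = 0;
            for (int shift = 0; shift < 24; shift += 8) {
                uint32_t f = (fg >> shift) & 0xff, b = (bg >> shift) & 0xff;
                px |= (((b * (255 - a) + f * a) / 255) & 0xff) << shift;
            }
            out[i] = px;
        }
    }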


Thanks for checking me on that!

Yeah, text rendering can get arbitrarily difficult—if you let it. Rotated and nonuniformly scaled text, Gaussian filters for drop shadows, TrueType rasterization and hinting, overlapping glyphs, glyph selection in cases where there are multiple candidate glyphs for a code point, word wrap, paragraph-filling optimization, hyphenation, etc. But I think that most of those are computations you can do less often than once per frame, still in nearly linear time, and computing over kilobytes of data rather than megabytes.





