
Anything is possible to fix; the question is why bother. Every fix cuts into the benefit of compatibility. The fundamental model of a wrapping/scrolling teletype isn't a good fit for the way we use computers today. (It does make sense if you work in a real text mode console. Then you are really avoiding all the complexity of a graphics stack by using hard-coded capabilities your hardware provides.)

A simple flat array of pixels seems like a much more timeless mental model to build durable software on top of. You don't have to wonder how different computers will react to a write just off the bottom right of the screen, and so on.




This isn't meant to detract from the broader point about the limitations of terminals, but a simple array of pixels is among the least efficient ways to interact with modern GPUs, especially if it doesn't support rectangular copy operations. The best way to interact with a GPU today and for the foreseeable future is through command buffers, not direct pixel access per se.


There are multiple axes of "best". The simplest, most portable, and most reproducible way to interact with a GPU is direct pixel access. Sometimes that's not fast enough, of course, but that's mainly when you're suffering from uncontrollable fits of interactivity. Most of the time, the best solution to that problem is to redesign your user interface to require less interaction: https://worrydream.com/MagicInk/

> The ubiquity of frustrating, unhelpful software interfaces has motivated decades of research into “Human-Computer Interaction.” In this paper, I suggest that the long-standing focus on “interaction” may be misguided. For a majority subset of software, called “information software,” I argue that interactivity is actually a curse for users and a crutch for designers, and users’ goals can be better satisfied through other means.

But yeah if you're playing an FPS you probably want to talk to your GPU through command buffers rather than pixel buffers.


There's going to be a compatibility-performance tradeoff here, to be sure, though the compatibility issue is going to be more with "very old platforms" and the performance issue is going to be more with "very high resolutions on very high refresh rates". So it's a question of whether you want to produce something that works well on current and past hardware vs. works well on current and future hardware, with some allowance for "can't please everybody".

I don't consider scrolling a large page to be an "uncontrollable fit of interactivity" but it's going to struggle to stay smooth using a single, simple linear array of pixels that's manipulated solely by the CPU. If you can at least work with multiple pixel buffers and operate on them at least somewhat abstractly so that even basic operations can be pushed down to the GPU, even if you don't work directly with command buffers, that will go a long way to bridging the gap between past and future, at least for 2D interfaces.
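
To sketch what I mean by "operate on them somewhat abstractly": something like the illustrative interface below, where the same rectangular fill/copy calls could be serviced by a CPU loop today or pushed down to a GPU blit tomorrow. The names are made up for illustration, not from any existing library.

    /* Programs talk to pixel buffers through a small operation table, so the
       backend can be a CPU memcpy loop or a GPU command-buffer blit. */
    #include <stdint.h>

    typedef struct PixelBuf PixelBuf;   /* opaque: may live in RAM or VRAM */

    typedef struct {
        PixelBuf *(*create)(int w, int h);
        void (*fill)(PixelBuf *dst, int x, int y, int w, int h, uint32_t rgba);
        void (*copy)(PixelBuf *dst, int dx, int dy,
                     const PixelBuf *src, int sx, int sy, int w, int h);
        void (*present)(PixelBuf *src);            /* put it on the screen */
    } PixelOps;

    /* Scrolling then only needs rectangular copies, which either backend can
       do; backends must handle overlapping copies (or double-buffer). */
    static void scroll_up(const PixelOps *ops, PixelBuf *fb,
                          int width, int height, int lines, uint32_t bg)
    {
        ops->copy(fb, 0, 0, fb, 0, lines, width, height - lines);
        ops->fill(fb, 0, height - lines, width, lines, bg);
        ops->present(fb);
    }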


The compatibility issue is mostly going to be with future platforms that subtly change the semantics of the interfaces you're using or whose drivers have different bugs than the drivers you tested on. To take a trivial example, most GPUs don't bother to implement IEEE 754 gradual underflow.

I think you're wrong about struggling to stay smooth scrolling a large page. Maybe it was true on the original iPhone in 02007? Or it's true of complex multilayered translucent vector art with a fixed background? But it's not true of things like text with inline images.

Let's suppose that scrolling a large page involves filling a 4K pixel buffer, 3840×2160, with 32-bit color. If you have an in-memory image of the page, this is just 2160 memcpys of the appropriate 15360-byte pixel line; you're going to be memcpy-bandwidth-limited, because figuring out where to copy the pixels from is a relatively trivial calculation by comparison. On the laptop I'm typing this on (which incidentally doesn't have a 4K screen) memcpy bandwidth to main memory (not cache) is 10.8 gigabytes per second, according to http://canonical.org/~kragen/sw/dev3/memcpycost.c. The whole pixel buffer you're filling is only 33.2 megabytes, so this takes 3.1 milliseconds. (Of one CPU core.) Even at 120fps this is less than half the time required.

(For a large page you might want to not keep all your JPEGs decompressed in RAM, re-decoding them as required, but this is basically never done on the GPU.)
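As a sketch of the copy loop being costed here (illustrative, not the memcpycost.c benchmark itself): one memcpy per visible scan line out of an in-memory page image into a 4K framebuffer, starting at the current scroll offset.

    #include <stdint.h>
    #include <string.h>

    enum { FB_W = 3840, FB_H = 2160 };           /* 4K, 32-bit pixels */

    /* Copy the visible window of a tall pre-rendered page into the framebuffer.
       page is page_h rows of FB_W pixels; scroll_y is the first visible row. */
    void blit_scrolled(uint32_t *fb, const uint32_t *page, int page_h, int scroll_y)
    {
        for (int y = 0; y < FB_H; y++) {
            int src_y = scroll_y + y;
            if (src_y >= page_h) break;          /* past the end of the page */
            /* 3840 pixels * 4 bytes = 15360 bytes per row, 2160 rows per frame:
               ~33 MB, i.e. ~3 ms at ~10.8 GB/s of memcpy bandwidth. */
            memcpy(fb + (size_t)y * FB_W, page + (size_t)src_y * FB_W,
                   FB_W * sizeof(uint32_t));
        }
    }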

But what if the page is full of text and you have to rerender the visible part from a font atlas every frame? That's not quite as fast on the CPU, but it's still not slow enough to be a problem.

If you have a tree of glyph-index strings with page positions in memory already, finding the glyph strings that are on the screen is computationally trivial; perhaps in a 16-pixel-tall font, 2160 scan lines comes to 135 lines of text, each of which might contain five or six strings, so you just have to find the 600 strings in the tree that overlap your viewport. Maybe each line has 400 glyphs in it, though 60 would be more typical, for a total of 55000 glyphs to draw.

We're going to want to render one texel per pixel to avoid fuzzing out the letters, and by the same token we can, I think, presuppose that the text is not rotated. So again in our inner loop we're memcpying, but this time from the font atlas into the pixel buffer. Maybe we're only memcpying a few pixels at a time, like an average of 8, so we end up calling memcpy 55000×16 ≈ 900k times per frame, requiring on the order of 10 million instructions, which is on the order of an extra millisecond. So maybe instead of 3 milliseconds your frame time is 4 milliseconds.

(It might actually be faster instead of slower, because the relevant parts of the font atlas are probably going to have a high data cache hit rate, so memcpy can go faster than 10 gigs a second.)
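Here's a sketch of that inner loop (illustrative; the glyph record and atlas layout are made up): one small memcpy per glyph scan line out of the font atlas into the frame.

    #include <stdint.h>
    #include <string.h>

    typedef struct {          /* illustrative glyph record, not from a real library */
        int atlas_x, atlas_y; /* top-left of the glyph in the atlas */
        int w, h;             /* glyph size in pixels, e.g. ~8 x 16 */
    } Glyph;

    /* Copy one glyph from the atlas (atlas_stride pixels per row) to (dst_x, dst_y).
       One memcpy per scan line: ~16 calls for a 16-pixel-tall font, so a screenful
       of ~55,000 glyphs is on the order of 900k small memcpys per frame. */
    void draw_glyph(uint32_t *fb, int fb_stride,
                    const uint32_t *atlas, int atlas_stride,
                    const Glyph *g, int dst_x, int dst_y)
    {
        for (int row = 0; row < g->h; row++) {
            memcpy(fb + (size_t)(dst_y + row) * fb_stride + dst_x,
                   atlas + (size_t)(g->atlas_y + row) * atlas_stride + g->atlas_x,
                   (size_t)g->w * sizeof(uint32_t));
        }
    }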

I did test something similar to this in http://canonical.org/~kragen/sw/dev3/propfont.c, which runs on one core of this laptop at 84 million glyphs per second (thus about 0.7ms for our hypothetical 55000-glyph screenful) but it's doing a somewhat harder job because it's word-wrapping the text as it goes. (It's using a small font, so it takes less memcpy time per glyph.)

So scrolling a 4K page might take 4 milliseconds per screen update on the CPU, if you only use one core. I would say it was "struggling to stay smooth" if the frame rate fell below 30fps, which is 33 milliseconds per frame. So you have almost an order of magnitude of performance headroom. If your window is only 1920×1080, you have 1½ orders of magnitude of headroom, 2 orders of magnitude if you're willing to use four cores.


I did some basic tests with SDL3 and SDL3_ttf, using only surfaces in CPU memory and with acceleration disabled, on my 2560p 144Hz monitor and the copying was never a bottleneck. I was concretely able to achieve an average of 3ms per frame, well under the 144Hz budget of 6.9ms per frame, to scroll a pre-rendered text box with a small border in a fullscreen window. Even at 4K resolution (though that monitor is only 60Hz), I was seeing 5-6 ms per frame, still good enough for 144Hz and leaving lots of time to spare for 60Hz. I think this certainly proves that smoothly scrolling a text box, at least with a powerful desktop computer, is not an issue using only direct pixel access.
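For reference, the test was roughly the shape of the sketch below (a simplified illustration, not the exact code I ran): blit a tall pre-rendered surface into the window surface at a moving offset and time each frame. It assumes SDL3's software-surface path (SDL_GetWindowSurface / SDL_BlitSurface / SDL_UpdateWindowSurface); error handling and the SDL3_ttf text rendering are omitted, and the window size is just an example.

    #include <SDL3/SDL.h>

    int main(void)
    {
        SDL_Init(SDL_INIT_VIDEO);
        SDL_Window *win = SDL_CreateWindow("scroll test", 2560, 1440, 0);
        SDL_Surface *screen = SDL_GetWindowSurface(win);

        /* A tall pre-rendered "page" (in the real test, text rendered with
           SDL3_ttf; here just an empty surface of the same shape). */
        SDL_Surface *page = SDL_CreateSurface(screen->w, screen->h * 20,
                                              SDL_PIXELFORMAT_XRGB8888);

        for (int y = 0; y + screen->h <= page->h; y += 4) {  /* scroll 4 px/frame */
            Uint64 t0 = SDL_GetTicksNS();
            SDL_Rect src = { 0, y, screen->w, screen->h };
            SDL_BlitSurface(page, &src, screen, NULL);       /* CPU copy */
            SDL_UpdateWindowSurface(win);                    /* present */
            SDL_Log("frame: %.2f ms", (SDL_GetTicksNS() - t0) / 1e6);
        }

        SDL_DestroySurface(page);
        SDL_DestroyWindow(win);
        SDL_Quit();
        return 0;
    }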

The bigger issue, though, may be rendering the text in the first place. I'm not sure how much the GPU can help there, though it is at least possible with SDL3_ttf to pass off some of the work to the GPU; I may test that as well.


> The bigger issue, though, may be rendering the text in the first place. I'm not sure how much the GPU can help there, though it is at least possible with SDL3_ttf to pass off some of the work to the GPU; I may test that as well.

The font rendering gets slow if you re-render the glyphs regularly. That becomes a challenge if you render anti-aliased glyphs at sub-pixel offsets, since that makes the cost of caching them really high.

If you keep things on pixel boundaries, caching them is cheap, and so you just render each glyph once at a given size, unless severely memory constrained.

For proportional text, or if you add support for ligatures etc., it can get harder, but I think for most scenarios your rendering would have a really high cache hit ratio unless you're very memory constrained.
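A sketch of what I mean, assuming glyphs only ever land on whole-pixel boundaries so one rasterization per (glyph id, pixel size) suffices; the structure and names are illustrative, and the rasterizer stub stands in for the real one (skrift, FreeType, ...):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct CachedGlyph {
        uint32_t glyph_id;          /* glyph index in the font, not a code point */
        uint16_t px_size;           /* whole-pixel size: no sub-pixel variants */
        int w, h;
        uint8_t *coverage;          /* w*h 8-bit alpha, rasterized once */
        struct CachedGlyph *next;   /* hash-bucket chain */
    } CachedGlyph;

    #define BUCKETS 1024
    static CachedGlyph *cache[BUCKETS];

    /* Stand-in for the real rasterizer: allocates an empty coverage buffer so
       the sketch is self-contained. */
    static CachedGlyph *rasterize_glyph(uint32_t glyph_id, uint16_t px_size)
    {
        CachedGlyph *g = calloc(1, sizeof *g);
        g->glyph_id = glyph_id;
        g->px_size = px_size;
        g->w = g->h = px_size;
        g->coverage = calloc((size_t)g->w * g->h, 1);
        return g;
    }

    CachedGlyph *get_glyph(uint32_t glyph_id, uint16_t px_size)
    {
        unsigned h = (glyph_id * 31u + px_size) % BUCKETS;
        for (CachedGlyph *g = cache[h]; g; g = g->next)
            if (g->glyph_id == glyph_id && g->px_size == px_size)
                return g;                              /* hot path: cache hit */
        CachedGlyph *g = rasterize_glyph(glyph_id, px_size);  /* slow, rare */
        g->next = cache[h];
        cache[h] = g;
        return g;
    }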

My terminal is written in Ruby, and uses a TTF engine in Ruby, and while it's not super-fast, the font rendering isn't in the hot path in normal use. So while speeding up my terminal rendering is somewhere on my todo list (far down), the font rendering isn't where I'll be spending time...

Even the worst case of rendering a full screen of text in 4k at a tiny font size after changing font size (and so throwing away the glyph cache) is pretty much fast enough.

I think this is pretty much the worst case scenario you'll run into on a modern system - Ruby isn't fast (though much faster than it was) - and running a pure Ruby terminal with a pure Ruby font renderer with a pure Ruby X11 client library would only get "worse" if I go crazy enough to write a pure Ruby X11 server as well (the thought has crossed my mind).

If I were to replace any of the Ruby with a C extension, the inner rendering loop that constructs spans of text that reuse the same attributes (colors, boldness, etc.) and issues the appropriate X calls would be where I'd focus, but I think that too can be made substantially faster than it currently is just by improving the algorithm used instead.


I think it's okay for glyph generation to be slow as long as it doesn't block redraw and immediate user feedback such as scrolling. While you can make the problem easier by throwing more horsepower at it, I think that to actually solve it you need to design the software so that redraw doesn't wait for glyph generation. It's a case where late answers are worse than wrong answers.

I had forgotten or didn't know that you'd also written a pure Ruby replacement for Xlib! That's pretty exciting! I'm inclined to regard X-Windows as a mistake, though. I think display servers and clients should communicate through the filesystem, by writing window images and input events to files where the other can find them. Inotify is also a botch of an API, but on Linux it provides deep-submillisecond latency for filesystem change notification.
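As an illustration of the inotify point (these are the standard Linux inotify calls, but the shared "window.rgba" file is a made-up example of the filesystem protocol I have in mind):

    #include <sys/inotify.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Block until the hypothetical shared framebuffer file "window.rgba" is
       rewritten by the client, then recomposite. */
    int main(void)
    {
        int fd = inotify_init1(0);
        inotify_add_watch(fd, "window.rgba", IN_CLOSE_WRITE | IN_MODIFY);

        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);   /* wakes well under 1 ms */
            if (n <= 0) break;
            printf("window image changed; recomposite\n");
        }
        close(fd);
        return 0;
    }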


For the glyph regeneration, individual characters are more than fast enough - TrueType is actually quite simple to rasterize [1] (if you ignore things like hinting, which you increasingly might as well on 4k displays etc.; also if you ignore emojis, which involve an embedded subset of SVG in the font file... eww). It's really only if you have a screenful of previously unseen glyphs that you'd get a very brief slowdown. You could warm the cache if you wanted, but in practice I can increase/decrease the font size with a screenful of text in my terminal without it being slow enough to be worth optimizing more.

> I had forgotten or didn't know that you'd also written a pure Ruby replacement for Xlib!

That one is not all me. I've just filled in a bunch of blanks[2], mostly by specifying more packets after the original maintainer disappeared. I keep meaning to simplify it, as while it works well, I find it unnecessarily verbose. I'm also tempted to bite the bullet and write the code to auto-generate the packet handling from the XML files used for XCB.

I think there are large parts of X11 that are broken, but the more I look at my stack, and how little modern X clients use of X, the more tempted I am to try to write an X server as well, and see how much cruft I could strip away if I just implement what is needed to run the clients I care about (you could always run Xvnc or Xephyr or similar if you want to run some other app).

That would make it plausible to then separate the rendering backend and the X protocol implementation, and toy with simpler/cleaner protocols...

[1] https://github.com/vidarh/skrift

[2] https://github.com/vidarh/ruby-x11


> I think it's okay for glyph generation to be slow as long as it doesn't block redraw and immediate user feedback such as scrolling

Incidentally, last night I loaded a page containing https://news.ycombinator.com/item?id=44061550 in Fennec on my phone, and at some point when I scrolled to where some superscripts were in view, they were briefly displayed as gray boxes. My inference is that Fennec had loaded the font metrics so it could do layout but didn't do glyph rasterization until the glyphs were in view or nearly so.


Yeah, the difficulty with glyph caching IMO is handling things like combining diacritics. Really, you'd need to do proper Unicode grapheme cluster segmentation [1] to even decide what counts as a valid cache key in the first place, at least if you intend to support all major languages. But if you only want to support most languages, you could get by without it, or just with Unicode normalization [2].

[1]: https://unicode.org/reports/tr29/

[2]: https://unicode.org/reports/tr15/
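A sketch of the idea, with the segmentation and normalization helpers left as hypothetical placeholders (a real implementation would use a Unicode library such as ICU or libgrapheme for UAX #29 and UAX #15; these names are not real APIs):

    #include <stddef.h>

    /* Hypothetical helpers, declared here only to show the shape of the loop. */
    size_t next_grapheme_cluster(const char *utf8, size_t len);   /* bytes in cluster */
    void   normalize_nfc(const char *in, size_t len, char *out, size_t *out_len);

    /* Walk a UTF-8 string and hand the renderer one cache key per cluster, so
       "e" + COMBINING ACUTE and the precomposed "é" share a cache entry. */
    void draw_string(const char *utf8, size_t len,
                     void (*draw_cluster)(const char *key, size_t key_len))
    {
        size_t pos = 0;
        while (pos < len) {
            size_t n = next_grapheme_cluster(utf8 + pos, len - pos);
            char key[64];
            size_t key_len;
            normalize_nfc(utf8 + pos, n, key, &key_len);
            draw_cluster(key, key_len);   /* the normalized cluster is the cache key */
            pos += n;
        }
    }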


If you were short on CPU, you could handle "normal" combining diacritics like 0̩́ in a variety of ways, including just alpha-compositing several glyphs into the same pixels every time you redraw, and (except for emoji!) you could compute each scan line of a text layer as 8-bit-deep pixelwise coverage first, opening up the possibility of compositing each pixel with bytewise max() rather than alpha-compositing, before mapping those coverages onto pixel colors. But I think the high nibble of the above discussion is that there's quite a bit of performance headroom.
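Here's a sketch of the max()-compositing idea (illustrative code, assuming 8-bit coverage and XRGB pixels): accumulate per-pixel coverage for the text layer with a bytewise max, then map coverage to color once per scan line.

    #include <stdint.h>

    /* Accumulate one glyph scan line into an 8-bit coverage buffer with max(),
       so a base glyph and a combining diacritic overlapping the same pixels
       don't need alpha compositing against each other. */
    static void accumulate_coverage(uint8_t *line, const uint8_t *glyph_row,
                                    int x, int w)
    {
        for (int i = 0; i < w; i++)
            if (glyph_row[i] > line[x + i])
                line[x + i] = glyph_row[i];
    }

    /* Afterwards, map coverage to pixels once per scan line: a single linear
       blend of the text color over the background. */
    static void shade_line(uint32_t *dst, const uint8_t *cov, int w,
                           uint32_t fg, uint32_t bg)
    {
        for (int i = 0; i < w; i++) {
            uint32_t a = cov[i];
            uint32_t r = (((fg >> 16) & 0xff) * a + ((bg >> 16) & 0xff) * (255 - a)) / 255;
            uint32_t g = (((fg >> 8)  & 0xff) * a + ((bg >> 8)  & 0xff) * (255 - a)) / 255;
            uint32_t b = ((fg & 0xff) * a + (bg & 0xff) * (255 - a)) / 255;
            dst[i] = (r << 16) | (g << 8) | b;
        }
    }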


Thanks for checking me on that!

Yeah, text rendering can get arbitrarily difficult—if you let it. Rotated and nonuniformly scaled text, Gaussian filters for drop shadows, TrueType rasterization and hinting, overlapping glyphs, glyph selection in cases where there are multiple candidate glyphs for a code point, word wrap, paragraph-filling optimization, hyphenation, etc. But I think that most of those are computations you can do less often than once per frame, still in nearly linear time, and computing over kilobytes of data rather than megabytes.


The point is well taken. I don't know much about interacting with GPUs. I don't particularly care so far about getting more performance, given the wildly fast computers I have and my use cases (I don't make or play games). I _do_ care about power efficiency; do GPUs help there? Modern GPU-based terminal implementations aren't particularly power efficient in my experience.


There are so many factors affecting power efficiency that it's hard to give a categorical answer. A lot of it depends on factors that vary widely, from the hardware in use, to the display setup (resolution and refresh rate), to the quality of the drivers, to the window system (composited or not), to the size (cols x rows) of the terminal window, to the feature set involved, etc.

The problem with a lot of GPU-accelerated terminals, if I had to wager a guess, is that they draw as fast as possible. Turning off GPU acceleration likely forces things to happen much slower thanks to various bottlenecks like memory bandwidth and sharing CPU time with other processes. GPU acceleration puts most GUI apps in a similar position to video games. It doesn't have to happen as fast as possible, and can be throttled through e.g. v-sync or lower-than-max FPS targets or turning on and off specific display features that might tax the GPU more (e.g. if shaders get involved, alpha blending is used, etc.).

The sibling comment makes a good point about compatibility and simplicity, though those don't always translate into lower power usage.


> It doesn't have to happen as fast as possible, and can be throttled through e.g. v-sync or lower-than-max FPS targets or turning on and off specific display features that might tax the GPU more (e.g. if shaders get involved, alpha blending is used, etc.).

Exactly.

E.g., if you want to render as fast as possible, the logical way to do it is to keep track of how many lines have been output (the easiest, but not necessarily most efficient, way is to render to a scrollback buffer) and then, separately and synced to v-sync if you prefer, start rendering from whatever is at the top of the virtual text version of the screen when the renderer starts a new frame.

Do this in two threads, and you can render to the bitmap at whatever FPS you can handle, while letting the app running in the terminal output text as fast as it can produce it:

If the text-output thread manages to add more than one line to the end of the buffer per frame rendered to the bitmap, your output will just scroll more than one line per frame.

You've then decoupled the choice of FPS from how fast the app you're running can output text, and frankly, your FPS needs to dip fairly low before that looks janky.
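A minimal sketch of that structure, assuming a mutex-protected scrollback shared between the output thread and the render thread (the names and the elided storage are illustrative):

    #include <pthread.h>

    /* The app-output thread appends lines as fast as they arrive; the render
       thread reads total_lines once per frame and draws the visible window.
       Neither thread waits on the other's pace. */
    typedef struct {
        pthread_mutex_t lock;
        long total_lines;        /* lines ever written to the scrollback */
        /* ... scrollback storage elided ... */
    } Scrollback;

    static Scrollback sb = { PTHREAD_MUTEX_INITIALIZER, 0 };

    static void append_line(void /* const char *text */)
    {
        pthread_mutex_lock(&sb.lock);
        /* store the line in the scrollback ... */
        sb.total_lines++;
        pthread_mutex_unlock(&sb.lock);
    }

    /* Called once per v-synced frame by the render thread. */
    static void render_frame(int rows_on_screen)
    {
        pthread_mutex_lock(&sb.lock);
        long top = sb.total_lines - rows_on_screen;   /* may jump >1 line/frame */
        if (top < 0) top = 0;
        pthread_mutex_unlock(&sb.lock);
        /* rasterize rows [top, top + rows_on_screen) and present; if the writer
           outran us, we simply scroll several lines in this one frame */
        (void)top;
    }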


The reason to bother is that a lot of us prefer terminals and want to evolve them. The reason they're not evolving faster isn't that compatibility is really a problem - we see new terminal capabilities gain support fairly quickly - but that there often isn't a major perceived need for the features that people who don't use terminals much think are missing.

People don't add capabilities to try to attract people like you who don't want terminals in the first place.

Wrapping and scrolling can be turned off on any terminal newer than the vt100, or constrained to regions. I never wonder how a different computer reacts to writing off the bottom right of the screen, because that works just fine on every terminal that matters. The actual differences are relatively minor if you don't do anything esoteric.

A "simple" flat array of pixels means you have to reimplement much of a terminal, such as attribute rendering, font rendering etc. It's not a huge amount of work, but not having to is nice.

So is the network transparency, and vnc etc. is not a viable replacement.


"A simple flat array of pixels means you have to reimplement much of a terminal, such as attribute rendering, font rendering etc. It's not a huge amount of work, but not having to is nice."

For me the debate isn't about implementing a terminal vs something else. I assume one uses tools others build in either case. The question is how much the tools hinder your use case. I find a canvas or a graphical game engine (which implement fonts and so on) hinders me less in building the sorts of tools I care about building. A terminal feels like more of a hindrance.


> I assume one uses tools others build in either case

And there is the disconnect. For a terminal app, you often don't need to.

> I find a canvas or a graphical game engine (which implement fonts and so on) hinders me less in building the sorts of tools I care about building. A terminal feels like more of a hindrance.

And for me it's the opposite. The tools I build mostly work on text. A terminal provides enough that I usually don't need any extra dependencies.


The disconnect might be that I'm counting the weight of the terminal program itself. I think this makes for a fairer comparison. The terminal program is built by others, and it often uses many of the same libraries for pixel graphics and font rendering.


I find this thinking bizarre. What matters is the simplicity of the API my code has to talk to.

You'd have a better argument if most people built their own terminals, like I have (mine is only ~2k lines of code, however), as then you could reasonably say you're writing that code anyway. But most people don't.

Even then I'd consider it fairly specious, because the terminal code is a one-off cost that gives every TUI application a simple, minimal API and lets me display its UI on every computer system built in at least the last half-century.

I write plenty of code that requires more complex UIs too, and don't try to force those into a terminal, but I also would never consider building a more complex UI for an app that can be easily accommodated in a terminal.


It's unfortunate to see that this thread ended up in ego defense and head-butting rather than the thoughtful exploration it started in.


I guess I'll stop here then.


Availability over ssh is indeed a good point. I've reduced my reliance on the network at the same time as I've grown disenchanted with terminals; thanks for pointing out that connection.

The rest are mutually incommensurable worldviews, and we have to agree to disagree.



