The master branch of xterm.js (which will become version 5.4) has new experimental support for grapheme clusters, combining characters, and partial support for variation selectors, based on Unicode 15. (Contributed by Yours Truly.) For now it needs to be explicitly enabled (see https://github.com/xtermjs/xterm.js/tree/master/addons/addon...) but in a later release we hope to make it the default. Most of the work is handled by the browser and the font, but xterm.js does need to detect cluster boundaries - which is what the addon does.
The key observation to me here is that the character cell grid abstraction that terminals are built upon is a poor fit for modern general-purpose multi-lingual text rendering. It works well enough with plain ASCII, but anything beyond that gets increasingly messy.
I am working on a re-write (https://github.com/xtermjs/xterm.js/issues/4800) of the xterm.js "BufferLine" data structures that (among other benefits) could potentially support variable-width fonts in the terminal. Instead of a 2-dimensional grid of character cells, the primary data structure will be a list of logical lines, each consisting of characters and attributes.
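To make the "list of logical lines" idea concrete, here is a rough sketch. It is not the xterm.js design from the linked issue; `Run`, `LogicalLine`, and `rows` are hypothetical names, and every character is assumed to be one cell wide. The point is only that screen rows are derived at display time instead of being stored.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A stretch of text that shares one set of attributes (hypothetical)."""
    text: str
    attrs: frozenset = frozenset()


@dataclass
class LogicalLine:
    """One logical line: a list of attributed runs, with no fixed width."""
    runs: list = field(default_factory=list)

    def text(self):
        return "".join(run.text for run in self.runs)

    def rows(self, columns):
        # Break into screen rows only when asked. Assumes 1 cell per
        # character, which a real implementation could not assume.
        s = self.text()
        return [s[i:i + columns] for i in range(0, len(s), columns)] or [""]


line = LogicalLine([Run("hello ", frozenset({"bold"})), Run("world")])
print(line.rows(4))   # ['hell', 'o wo', 'rld']
print(line.rows(80))  # ['hello world']
```

Resizing the window only changes the argument to `rows`; the stored line is untouched, which is what would leave room for a different measuring function later.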
A rather modest enhancement to the vt100/xterm escape sequences would be a mode where screen addressing would be in terms of logical characters and logical lines. The application should not need to know the width of characters or how lines are broken into screen rows. This makes use of variable-width fonts practical - which is necessary for decent rendering of many scripts.
DomTerm's display engine (https://domterm.org) does support lines containing a mix of monospace and variable width characters, and knows how to wrap such lines. I'm working on long-term replacing DomTerm's native display engine with xterm.js (for speed and other reasons) which requires making xterm.js more versatile.
That’s an interesting thought, IMO. These flexible width fonts are often thought of as localization-assistance features, to help fit other languages in the terminal world. But English also had to be modified to fit into fixed-width fonts. I wonder how the discussion would change if this technology had been popularized originally in another country.
In UTF-8, a single byte (uint8_t) may not represent a whole "code point". A code point is an individually meaningful element in Unicode, like a space, an 'e', or a modifier code point such as an accent or a ZWJ. Most UTF-8 libraries will let you address individual code points, but it might still garble the text if you split between an 'e' and a '`'. To prevent this, splitting should be done between graphemes (sequences of code points that render like a single unit*). And even graphemes have their problems.
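A small sketch of the three levels, using only the Python standard library. `naive_clusters` is a hypothetical helper that just glues combining marks onto the preceding code point; it is an approximation and not the full grapheme cluster algorithm of UAX #29 (no ZWJ sequences, Hangul jamo, or regional indicators).

```python
import unicodedata

text = "e\u0300x"  # 'e', U+0300 COMBINING GRAVE ACCENT, 'x'

print(len(text.encode("utf-8")))  # 4 bytes: 0x65, 0xCC 0x80, 0x78
print(len(text))                  # 3 code points


def naive_clusters(s):
    """Attach each combining mark (category M*) to the code point before it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate(s, n):
    """Keep the first n approximate clusters instead of n code points."""
    return "".join(naive_clusters(s)[:n])


print(text[:1] == "e")                # True: the accent was cut off
print(truncate(text, 1) == "e\u0300")  # True: the accent stays with its base
```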
Yes, I understand a little about Unicode in this kind of problem, but a code point is an individual logical item even if it is composed of multiple bytes; being a kind of 'string' in itself. I should have asked more carefully, what would be a better system in your view?
Thanks for the link, will check it out after Christmas.
I personally believe that Swift's strings, where graphemes are the smallest indexable unit, are the gold standard for writing logic that might truncate multilingual text. They're still not perfect, though: they add overhead, and updates to Unicode might change behaviour, but they should handle most cases gracefully.
I know this feature is kinda frowned upon but I really hope to see a standard way to really switch between multiple fonts in a terminal, not just playing with variations. I'd really want to try the Monaspace font family so I can make a distinction between my input and a program's output and its error output by using different fonts. It could also allow the use of these fonts inside neovim, for instance for richer code hints.
Kitty does allow you to choose 4 different fonts for regular/bold/italic/bold+italic. I guess that with the right configuration it should be quite straightforward to set your shell to use bold+italic for your input, and set that as a different font. Sure, that font will be used for other bold+italic text, but that's quite uncommon.
Does anyone know how to get Korean text to display consistently on macOS in the terminal?
If you create a file in Finder with a hangul name, by default ls in iTerm2 will display the name incorrectly, with the composite syllables expanded out into separate characters. There is an option in iTerm2 to 'normalise unicode' which corrects this issue if set to HFS or NFC (if I remember right). This mimics the behaviour of Terminal.app.
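For anyone wondering what the normalisation setting changes, here is a minimal sketch with Python's `unicodedata`, assuming the file name arrives in a decomposed (NFD-like) form as described above. The strings are just an illustration, not output from Finder or iTerm2.

```python
import unicodedata

composed = "\ud55c\uae00"  # "한글" as two precomposed syllables
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))    # 2 code points
print(len(decomposed))  # 6 code points: three conjoining jamo per syllable

# NFC puts the jamo back together into the precomposed syllables.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)  # True
```

The two forms are canonically equivalent, so a program that counts code points or cells without normalising first will disagree with one that does.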
However, different interactive terminal apps seem to have different expectations of terminal behaviour for this form of unicode hangul. With the 'normalise' setting off, fzf will corrupt the display when showing hangul. But with it on, vim/neovim will corrupt the display when navigating lines with hangul text.
It seems there is no way to get consistent and user-friendly hangul handling in the terminal on macOS, but I'm not sure who's at fault (except Apple of course).
I just want a word wrapping tool like `fmt` or `fold` that can (1) handle colours, and (2) handle emojis.
There is literally nothing out there that can reliably do this. Par (https://manpages.debian.org/par) seems to be promising, but still gets lost on some example texts.
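A rough sketch of what such a tool has to do, in Python with only the standard library. All names here are made up for illustration. It assumes the only escape sequences are SGR colour codes, treats U+FE0F as widening a narrow character to two cells, counts a ZWJ-joined emoji as part of the previous one, and otherwise trusts `unicodedata.east_asian_width`, so it will still get plenty of real-world text wrong.

```python
import re
import unicodedata

# SGR sequences such as "\x1b[31m"; other escape sequences are not handled.
SGR = re.compile(r"\x1b\[[0-9;]*m")


def char_width(ch):
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # non-spacing marks and format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def visible_width(s):
    total, prev, after_zwj = 0, 0, False
    for ch in SGR.sub("", s):
        if ch == "\u200d":       # ZWJ: the next character joins this cluster
            after_zwj = True
        elif ch == "\ufe0f":     # VS16: emoji presentation, 2 cells
            if prev == 1:
                total, prev = total + 1, 2
        elif after_zwj:
            after_zwj = False    # joined character adds no extra cells
        else:
            w = char_width(ch)
            total += w
            prev = w or prev
    return total


def wrap(text, columns):
    """Greedy word wrap; over-long words get a line of their own."""
    lines, current, used = [], [], 0
    for word in text.split():
        w = visible_width(word)
        needed = w + (1 if current else 0)  # +1 for the separating space
        if current and used + needed > columns:
            lines.append(" ".join(current))
            current, used = [word], w
        else:
            current.append(word)
            used += needed
    if current:
        lines.append(" ".join(current))
    return lines


print(wrap("\x1b[31mred\x1b[0m green \U0001f600 blue", 9))
```

Colour state is not reset or re-applied at line breaks here; a colour that is still active simply carries over to the next line, which is what the terminal would do anyway.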
> Depicted here in iTerm2 is a single U+23F1 "Stopwatch" character partially occluded by any next character. Surprisingly, this is the correct behavior of a terminal when U+FE0F "Variation Selector-16" is not in sequence.
Well yeah, if your font doesn’t have the right glyphs, you’re going to get a mess. Indic text has a habit of being just straight illegible even in terminal emulators that handle this wcwidth, and I don’t read Arabic but I expect it tends to be particularly commonly illegible, not just overflowing all over the place but also being rendered left-to-right instead of right-to-left.
Fun fact, Indic text can accidentally end up more legible in terminal emulators that don’t support this wcwidth stuff, so long as they still do Indic script rendering. Take my name in the Telugu script: క్రిస్ is six code points long (letter ka, sign virama (which suppresses the -a inherent vowel), letter ra, vowel sign i, letter sa, sign virama) and normally rendered as two clusters (kri, s), but it’s actually perfectly valid (though uncommon) to write it as క్ రిస్ (I inserted a ZERO WIDTH NON-JOINER between the k and the ri, but HN turned it into a regular space which is very wrong :-( ), and separating the conjunct makes the character fit into a cell more reliably, because it won’t go so far up or down or occasionally sideways. I use Alacritty; it doesn’t do the full wcwidth thing in this case specifically because (if I recall correctly) it doesn’t want to do Indic script rendering for performance reasons. Now because of this unfortunate combination what I actually get there is the three cells, but with the inherent vowel still rendered on every consonant, even if replaced with another vowel sign or virama—so I basically get క్ రిస్ and కరస rendered on top of each other.
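To see the six code points for yourself, a quick check with Python's `unicodedata` (the escapes spell out the same name as above; `with_zwnj` is the variant with the ZERO WIDTH NON-JOINER that HN swallowed):

```python
import unicodedata

name = "\u0c15\u0c4d\u0c30\u0c3f\u0c38\u0c4d"  # క్రిస్

for ch in name:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0C15 TELUGU LETTER KA
# U+0C4D TELUGU SIGN VIRAMA
# U+0C30 TELUGU LETTER RA
# U+0C3F TELUGU VOWEL SIGN I
# U+0C38 TELUGU LETTER SA
# U+0C4D TELUGU SIGN VIRAMA

# Same name with U+200C ZERO WIDTH NON-JOINER after the first virama,
# which asks the renderer not to form the conjunct.
with_zwnj = name[:2] + "\u200c" + name[2:]
print(len(name), len(with_zwnj))  # 6 7
```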
But returning to the original topic: yeah, all of this stuff falls over extremely often, because no one has fonts that handle everything properly, and the whole thing is just an exhibition of how the column/cell model is completely unsuited for a Unicode world, and we just keep layering patches upon patches to mitigate the harm, but it’s impossible to actually fix it: the whole thing needs burning down and replacing.
What I wish for is a control sequence that enables a non-cellular mode: which allows non-monospaced rendering if desired, starts doing BiDi text rendering, introduces a couple of varieties of flexible tab stop (since visual columnar alignment is still highly desirable and two of the main forms are still practical to achieve, loosely corresponding to CSS flex-wrap and grid), tweaks the behaviours of most cursor-affecting control sequences to use a grapheme cluster basis (and you’d need to do something about logical versus wrapped lines, programs might need to address both sometimes, though most should go logical), and ruins the notion of $COLUMNS. Honestly, with minor tweaks to programs that do not-at-start-of-line visual alignment, there’s not all that much that would break: things like rustc output would suffer a little (you could only align the start of span underlines, not the end), and vertical splits in things like Vim and tmux are not practical, but that’s about all I can immediately think of. And so many more things would start working properly. I think this is achievable and even practical, and I think the end goal is well worthwhile, but it would take quite a lot of effort, in specifying and in implementing. I’m curious if this catches anyone’s fancy. What I have in mind sounds similar to what Per_Bothner describes elsewhere in this thread, perhaps just more featureful (for good and ill!).