There are tons of articles about plain text editor data structures, but what about rich text editor data structures? Let's say i want to implement a text editor that can have bold, italic, underline, etc text but also be able to do automatic word breaking, align paragraph text to left/middle/right, insert images and/or other objects, have floating images and/or other objects around which the other (non floating) text/images/objects wrap, etc.
What would be the data structures for that? I can only think of trying to replicate something like the HTML DOM but i have a feeling something like Write for Windows 3.1 used a simpler data structure.
It's not actually that different or complicated if you're already doing proportional plaintext rendering -- you need to store style attributes for a range of text, and you need to support different heights per line.
The real complexity is rendering all of unicode properly, and supporting international fonts, bidi layout, vertical text, etc.
You don't need any fancy data structures. 95% of the performance goes into glyph rendering. And with Unicode the performance gain from monospace fonts goes out the window as some Unicode characters are very large, and not only that, they also take up many bytes. So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters. You also need to read ahead as there are combination characters, for example a smiley combined with the color brown becomes a brown smiley. I've blogged about implementing support for Unicode in an editor here: https://xn--zta-qla.com//en/blog/editor10.htm
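For illustration, here is roughly what that read-ahead amounts to, using JavaScript's built-in grapheme segmenter (the specific emoji is just an example):

    // One user-perceived character ("woman astronaut") is several code points
    // joined with ZWJ, and even more bytes once encoded as UTF-8.
    const astronaut = "\u{1F469}\u200D\u{1F680}"; // 👩‍🚀

    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    const graphemes = [...seg.segment(astronaut)].map(s => s.segment);

    console.log(graphemes.length);                            // 1 grapheme cluster
    console.log([...astronaut].length);                       // 3 code points (2 emoji + ZWJ)
    console.log(new TextEncoder().encode(astronaut).length);  // 11 bytes in UTF-8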
Some "single character" emoji easily exceed 5 bytes in all encodings. You may think ZWJ sequences are cheating, but emoji isn't the only language encoded in Unicode with complex ZWJ sequences.
My question is about the phrase "up to 5". What in Unicode is up to 5? Codepoints are at most 4 bytes in all the encodings I know. ZWJ sequences may as well be arbitrarily long. What is "up to 5"?
> So one Unicode character can be up to 5 bytes long and take up the same canvas space as 3 characters.
FWIW, I didn't read that as suggesting an upper bound of 5 bytes, but rather as an example using arbitrary numbers: N bytes of code units could, depending on the font providing the glyph(s) for the respective grapheme(s), be rendered at M times the size of, say, the letter A, where N != M -- despite the font otherwise being monospaced. Which is just another way of saying that you must consult the font for the character widths involved.
I think you're reading that quote as an assertion that:
For any grapheme G, G can be encoded in at most 5 bytes.
While what I think was being said was:
There exists a grapheme G, where G is encoded in 5 bytes, and the respective glyph happens to be displayed at 3 times a single character (e.g. the letter A), despite the font otherwise being monospaced. Therefore you *must* consult the font for each glyph to correctly determine character widths.
> You also need to read ahead as there are *combination characters*, for example a smiley combined with the color brown becomes a brown smiley.
Emphasis mine. Clearly combination characters are being treated separately.
Frankly I think it's crazy to read "up to 5 bytes" and not think that it suggests an upper bound. I think you're reaching for a highly questionable interpretation of a totally unambiguous clause. If the author meant to express what you're saying, they would certainly have written: "Some Unicode characters are 5 bytes long and take up the same canvas space as 3 characters". Which would still look incorrect if they followed it with the sentence "You also need to read ahead as there are combination characters...".
It is far more likely that the author is simply mistaken and should have said 4 bytes, and perhaps used the word "codepoint" instead of "character" in the original sentence. That's a perfectly understandable technical error, while the reinterpretation you're putting together would imply an error of colloquial language.
Codepoints themselves could technically all fit into 3 bytes (or 21 bits to be precise), but there is no standard 3-byte encoding. The highest Unicode codepoint is 0x10FFFF.
An idea for a variable-width encoding of 1 to 3 bytes: read the MSB of each byte. If it's 0, don't read any more bytes. If it's 1, read the next byte. Do the same (up to 3 times). The non-MSB bits of each byte then make up the codepoint.
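A sketch of a decoder for that scheme (I'm assuming the most significant 7-bit group comes first; the idea above doesn't pin down the bit order):

    // Decode one codepoint starting at `pos`. The MSB of each byte says
    // "another byte follows"; the low 7 bits of each byte are payload.
    function decode(bytes: Uint8Array, pos: number): { codepoint: number; next: number } {
      let cp = 0;
      for (let i = 0; i < 3; i++) {
        const b = bytes[pos + i];
        cp = (cp << 7) | (b & 0x7f);      // accumulate the 7 payload bits
        if ((b & 0x80) === 0) {           // MSB clear: this was the last byte
          return { codepoint: cp, next: pos + i + 1 };
        }
      }
      throw new Error("malformed sequence: more than 3 bytes");
    }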
The obvious drawback to this approach is that it is inherently serial. You need to read each byte before considering the next, so it would perform worse than UTF-8 in most cases.
Another drawback is that it is not self-synchronizing, which is one of the benefits of UTF-8.
It also has the issue that you can represent some codepoints with more than one encoding: e.g., put ASCII characters into 2 or 3 bytes. So you would need rules to use the minimal encoding for each codepoint.
As a space-saving technique, it may offer better density than UTF-8 or UTF-16 on some texts.
You could also use a fixed-width encoding of 24 bits to avoid the problem of reading it serially, but as computers typically work in powers of 2, you would align 24-bit values to 32-bit boundaries and load them into 32-bit+ registers, so there's nothing to really gain in terms of performance here over UTF-32, but you could save a bit of space.
It's more compact than UTF-8 because fewer bits are used for the encoding itself. There are 24 bits, and only 3 bits are used as part of the encoding, with the other 21 bits representing the code-point. (Precisely the number we need to represent all Unicode).
In UTF-8, a 3-byte encoding uses 8 bits as part of the encoding, a full byte's worth of bits for the encoding itself, leaving only 16 bits for the codepoint. If you need higher codepoints you need to use 4 bytes, where 11 bits are the encoding and 21 bits are the codepoint.
So UTF-8 is space efficient for ASCII, but ~1/3 of the bits are used for the encoding for non-ASCII, versus a fixed 1/8 of the bits used for the encoding above for all 1-3 byte values. The above has a fixed 12.5% space overhead over raw codepoints. UTF-8 has 12.5% only for ASCII, and ~33% overhead for everything else.
Although it is not self-synchronizing like UTF-8, you can synchronize a reliable stream by holding a buffer of the previous byte. If the previous byte's value is >=0x80, the current byte is part of the same character. If it's <0x80, the current byte is the start of a new character, so it's still possible to do substring matching etc, fairly efficiently but slightly less efficiently than UTF-8. It makes it suitable for file storage, but not ideal for transmission.
That said, most sane protocols will prefix a string with a length (in bytes), so self-synchronization is not always an issue.
But you also read it as referring to ZWJ sequences? So the author has picked a number that is actually below average and they've worded it as up to...?
Saying a ZWJ sequence can be "up to 5 bytes" is like saying "the current generation of Intel processors run at clock speeds of up to 2 GHz".
If they were referring to ZWJ sequences (I don't think they were; I think they were just misremembering the maximum encoded length of a codepoint) and they had said "up to 35 bytes", then I might agree with you. It's still not technically accurate, but it's a reasonable colloquial usage, like saying "human males can grow up to seven feet tall".
I think you are trying to read something that wasn't meant to be technical documentation as if it was trying to be exact technical specifications. I'm not the original author, so I don't have reason to litigate this any further, and I'm not sure what you are arguing about at this point.
Sorry, I meant codepoints/characters, but it would not surprise me if there existed an encoding or language where my wording would be technically correct, though I do not know of any such encoding. I also did not know that there exist more than 5 combinations in Unicode, but I'm not surprised and my implementation is probably buggy. But I do challenge you to test how well your favourite editor (terminal emulator cough) handles Unicode emojis.
UTF-8's original specification included 5-byte and 6-byte encodings to cover the full 31-bit code space, but later specifications marked those invalid due to the current 21-bit limit imposed by UTF-16, aligning both specifications for now rather than fixing the limitations of UTF-16 (or scrapping UTF-16 altogether). In theory, UTF-8 could even extend beyond 6-byte encodings (and UTF-32 into 8-byte encodings and beyond) if a still larger code space (say, 63-bit code points) ever needed to open up. (No one expects that any time soon, of course. Today Unicode is nowhere close to filling 21 bits, much less 31. It would be a massive shock, and the compatibility headache would be terrible, with UTF-16 breaking and lots of today's software breaking wherever it hard-codes the assumption that UTF-8 never goes past 4-byte encodings.)
If it wouldn't surprise you, then I think you should recalibrate your feelings about how surprising Unicode encodings are. There aren't very many of them, they haven't changed in a very long time, and they don't deal with any of the stuff that makes Unicode very complicated (collation, combination characters, etc). They just encode 21-bit integers, albeit sometimes in a highly convoluted way for backwards-compatibility reasons (UTF-16). It's not the kind of thing that needs to be estimated, or where a layer of FUD is warranted (as it kind of is with combination characters). When talking of codepoints, it's just "up to 4 bytes", high confidence, nothing more to it.
Is there a maximum number of "zero-width joiner" (ZWJ) sequences that can be combined to create a single-width emoji? (Yes, I know the term "single-width" is a loaded term.) I cannot find a precise answer.
Yes, UTF-8 was up to 6 bytes per character early on. Some broken implementations like MySQL's limit it to up to 3 bytes per character. The actual number is 4.
I think with decomposed Hangul, you can end up with 6 or more bytes per character, due to each part of it being two bytes, and 2-4(?) parts per character.
I think Unicode, fonts, etc. would be an issue even with a plain text editor anyway.
The "styling a range of text" is something i thought but you still need to somehow associate the text with the range - and vice versa - and this doesn't handle things like inserting images and other types of objects since these aren't text.
You could have a document be a series of "paragraphs", each being a series of "elements" with each "element" being something like "text" (with a style), "image", etc. But then once tables enter the picture, you need to expand paragraphs to be of "table" type and each table cell is itself a self-contained "series of paragraphs" - and then start thinking about nested tables or images in tables!
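As a rough sketch of the kind of structure i mean (names are made up, not from any real editor):

    // A document is a list of blocks; a block is either a paragraph of inline
    // elements or a table whose cells recursively hold blocks again.
    interface TextStyle { bold?: boolean; italic?: boolean; underline?: boolean; }

    type Inline =
      | { kind: "text"; text: string; style: TextStyle }
      | { kind: "image"; src: string; floating?: boolean };

    type TableCell = Block[];   // a cell is itself a self-contained series of blocks

    type Block =
      | { kind: "paragraph"; align: "left" | "center" | "right"; children: Inline[] }
      | { kind: "table"; rows: TableCell[][] };

    type Doc = Block[];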
Generalize that enough to avoid special cases inside special cases and you end up with more of a tree-like structure representing a DOM and less with a linear structure with range-based styling.
(of course, then again, i don't remember Write for Windows 3.1 having tables in the first place :-P but i'm interested if there are alternative approaches anyway)
EDIT: one thing i forgot to mention - and why i am curious about non-DOM-based approaches - is that one problem with the DOM approach is the selection: with a linear/range-based structure the selection is just one or two indices inside the range, but with the DOM the selection can start from a node with node-specific subrange (e.g. character in a text node) and end with another node and both being very unrelated to each other (i.e. only having some distant common ancestor and not necessarily at the same level).
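For comparison, in the DOM approach each selection endpoint ends up being a (node, offset) pair rather than a plain index - something like:

    interface DocNode { children: DocNode[]; text?: string; }     // stand-in node type
    interface SelectionPoint { node: DocNode; offset: number; }   // char index in a text node, child index otherwise
    interface DocSelection { anchor: SelectionPoint; focus: SelectionPoint; }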
A plaintext document is an array of chars; a richtext document is a tree, which may or may not be well-formed.
Think about someone trying to bold half of _a sentence_ where the bold and the italics overlap, and remember that MS FrontPage was made by smart people; it's just really hard.
The most interesting thing lately is the HTML attribute `contenteditable`, and how it almost just kinda works! You still have to be full-stack to make something good, but that was an amazing improvement to the browser.
Yeah, that trying to bold half of a sentence - or even better, the middle of a sentence - is why i was wondering about simpler alternatives to DOM. Some time ago i toyed around with an HTML editor[0] (that one had to use a DOM anyway, but my question is for rich text editing in general - BTW the rectangles in the shot show a selection that goes across nodes) and doing something like that involved traversing all the nodes (going both down and up the node tree, starting from the cursor's starting position), finding the closest common ancestors under the selection, creating "B" siblings to them and then reparenting them under these new "B" nodes.
You can move a lot of that stuff to reusable methods but personally i find the whole "editing" aspect to be more involved than the "drawing" side - and also the one more likely to differ from a plain text editor - when dealing with DOM-like structures. Hence why i am interested to see what alternatives there are.
Text is usually stored as a tree either way in an editor, so a DOM-like approach might work well on top of the usual data structures.
> with the DOM the selection can start from a node with node-specific subrange (e.g. character in a text node) and end with another node and both being very unrelated to each other
I'd just store the range as character indices; using those, the right nodes in the tree can be accessed pretty quickly as needed.
I don't think character indices are enough, what if your selection begins at the middle of a table cell and ends on an image that is the only child of a cell in a completely different table (no text involved at all, except some text in the cells in between)? If you want to, e.g., delete those how do you find which nodes are to be deleted and updated (e.g. for merging the two tables if there are cells after the one that contains the image)?
Images and table cells are just nodes within the tree holding the text, assuming all styling etc is represented in a plain text syntax similar to Markdown, of course. Looking up the nodes from char indices is quick if each node stores how many chars it contains.
Other approaches would probably require the selection to be a tree of its own, I can't really say whether that's simpler overall or not.
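A sketch of that lookup, assuming each node caches the character count of its subtree:

    interface TreeNode {
      charCount: number;     // total characters in this subtree (cached)
      children: TreeNode[];  // empty for leaves; images and cells are just nodes too
      text?: string;         // leaf text, if any
    }

    // Walk down from the root to the leaf containing character `index`, returning
    // the leaf and the offset inside it. Assumes 0 <= index < root.charCount.
    function locate(root: TreeNode, index: number): { leaf: TreeNode; offset: number } {
      let node = root;
      outer: while (node.children.length > 0) {
        for (const child of node.children) {
          if (index < child.charCount) { node = child; continue outer; }
          index -= child.charCount;
        }
        throw new Error("index out of range");
      }
      return { leaf: node, offset: index };
    }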
The syntax shouldn't matter (you may not even be using a plain text syntax - or any syntax - anyway), you could treat an image or whatever as a single "special" character. Or just assign a linearly increasing ID (increasing in the order the text, images, etc flows) to each node.
Though that is basically another way to represent what i wrote above with having a pair of node pointers and a subrange (well, an index actually, the other end of the subrange is implicit if the node pointers are different). This is basically what the old HTML editing control Microsoft had back in the 90s used and that worked with the DOM tree (also what i used in a test editor i wrote some time ago). And yeah it isn't simple.
Start and end of a selection would be nodes in the tree. The algorithm to find all the selected nodes is already part of the renderer, as the renderer has to implement traversal of nodes in text order as well.
Indeed, you could even use a piece tree just like in the blog but store additional information in each piece which tells the renderer how to lay out the associated text. My understanding is that most rich text editors represent the text as a node-based tree anyway.
Some early rich text editors never used a tree representation for formatting; they represented the formatting as "control character sequences". This was part of why some people loved WordPerfect so much: you could toggle a view of all the control characters and just delete/copy/paste them like any other text in the same document.
It's basically how the classic RTF format [1] works, and things like VT100/ANSI escape codes in terminals. It's kind of like the difference between imperative code and declarative code: "this character sequence means toggle the state of bold" versus "this node of characters is bold".
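For instance, a bare-bones RTF fragment toggles styles with control words inline in the character stream rather than wrapping nodes (simplified; real files also carry font and color tables):

    {\rtf1\ansi This is plain, {\b this is bold}, and {\i this is italic} text.\par}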
I believe most do use a DOM-like structure for block level elements. For inline styling, however, they tend to not have separate nested nodes for each style (bold, italic, underline); instead they're closer to the old HTML4 <font> tag, with multiple properties for each style and no nesting. It makes for much simpler code.
The docs for ProseMirror are a brilliant insight into how many of these editors are designed. ProseMirror maintains its own document model in parallel to the HTML DOM and reconciles one to the other as either changes.
For realtime collaborative rich text editors take a look at Peritext; they have a brilliant article explaining the CRDT structures used.
The more I read about this control, the more I learn about its insane feature set! Microsoft continues to make significant improvements to a version that is only shipped with Microsoft Office -- not available from a barebones Win7/10/11 install. Read more here: https://devblogs.microsoft.com/math-in-office/using-richedit...
Write is actually quite barebones: paragraph and text styling, with paragraphs optionally being "images" (Win3.0) or OLE objects (Win3.1, where images were OLE objects). Tables, etc. were done manually with tabstops.
There is a text file describing the reverse engineered file format here:
Many older word processors (the only ones I'm familiar with) used style "runs", which are simply an array of structs that contain startPosition, endPosition (or length) and then a bunch of style info in whatever way makes sense to them.
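In sketch form (field names are illustrative, from memory rather than any actual product):

    // One styling "run": a range of character positions plus the attributes that
    // apply to it. The document keeps a flat array of these, sorted by `start`.
    interface StyleRun {
      start: number;       // first character covered
      length: number;      // or store an end position instead
      fontId: number;
      pointSize: number;
      bold: boolean;
      italic: boolean;
      underline: boolean;
    }

    type StyleRuns = StyleRun[];  // kept sorted, ranges non-overlapping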
I worked on Ready,Set,Go! back in the day and also wrote my own styled editor for the Mac in the 90's.
One interesting thing about this is that you could use a lazy "adjustor" to remove duplicate or overlapping styles. No need to worry about it during typing or selecting. A low priority task could analyze the runs and fix them up as needed and no one was the wiser.
IMO, the hardest part about writing a word processor back then was font management. You had to construct width tables to be able to compose lines properly. This generally involved a bunch of poorly documented APIs and peering into opaque data structures.
RichTextFX[1] has these abilities[2] along with demo code[3] that provides an entry point for sleuthing. My text editor, KeenWrite[4], uses RichTextFX, but takes a different approach due to its usage of inline variable references. Rather than edit the document in a WYSIWYG-like fashion, users edit the Markdown document as plain text then export as PDF by selecting a theme to apply for typesetting. This cleanly separates content from presentation. The "themes" video tutorial demonstrates how theme selection works.[5]
Yes, I think your instinct to look at win32 APIs is a good one. Many structures in riched are public in headers, and of course you could reference Wine's implementation[1].
That is neat, thanks. I didn't check RICHED itself but when i was toying around with an HTML editor a few years ago (see my other post) i did check out an HTML editor API that Microsoft had (i don't remember its name but it was used by, e.g., Joel Spolsky's CityDesk - IIRC the control itself was documented in the MSDN CDs that came with Visual Studio 6), from where i got the idea of using two node pointers and a subrange index to represent a selection.
it's probably instructive to look at elisp text properties. not so much that it's a great system - but it raises some messy questions about how to operate on substrings that have different sets of properties
This seems revisionist/ignorant. The article attributes the "piece tree" data structure to the VS Code developers; however, the must-read 1998 paper by Charles Crowley, Data Structures for Text Sequences, already mentions search trees being used to enhance the naive piece table in some text editors.
Thank you for the pointer! I was actually not aware of this paper at all, which is why it was not included here (not sure how I missed it). I'll be sure to push a revision to the blog which mentions this paper.
You may also want to take a look at “Deletion: The curse of the red-black tree” (Germane and Might 2014), which presents functional deletion for red-black trees.
I might have a useful suggestion. I've implemented a Piece Tree (VS Code style) and a Rope in a functional language (found the Rope much faster at everything but I made a few adjustments to the structure so it's not a straightforward Rope).
The major failing point of the Piece Tree from my benchmarks is the substring/querying time. An idea I want to try out to speed up my Piece Tree implementation is to distinguish between Concat (metadata) and Leaf (string) nodes, just as a Rope does, storing metadata in the internal nodes and Pieces at the leaves.
The reason (I hope) that will improve substring times is that, in a Piece Tree, the original string can be reconstructed through an in-order traversal of the tree's internal nodes.
So, if you specify a substring range that starts from one character before the root node and ends one character after the root node, you end up traversing to the rightmost node in the left subtree and the leftmost node in the right subtree (two O(log n) operations).
I'm hopeful the tree depth that would need to be traversed if the pieces were at the leaves (like in a Rope) would be shorter (especially since adjacent pieces won't be O(log n) distance away) and want to try it out myself, but my intuition might be wrong. You can have a go trying that out yourself if the idea interests you.
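To make the idea concrete, the node layout I have in mind would look something like this (just the shape of the idea, not a tested design):

    // Internal nodes hold only cumulative metadata for their left subtree; pieces
    // (spans into the original/added buffers) live only at the leaves, as in a rope.
    type PieceNode =
      | { kind: "internal"; left: PieceNode; right: PieceNode;
          leftChars: number; leftLineBreaks: number }
      | { kind: "leaf"; buffer: "original" | "added"; start: number; length: number };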
Thanks to this article, I learned that the core data structure for VSCode's text is written in TypeScript[0]. I, uh, I knew VS Code was written in TypeScript but I didn't realise _all_ of it was. It's crazy to think that the editor works as well as it does!
I used a red-black tree written in JS over 15 years ago at a job where I needed to use lower bound and range operations for nodes in a pivot-table type dashboard for telecoms metrics. I was actually blown away at the time at how it actually performed rather decently, and this is back in the pre-Chrome Firefox & IE6&IE7 days.
Now in the world of V8 and JavaScriptCore, I don't think it's crazy at all. So much work has gone into JS runtime optimization. For heavily concurrent and memory intensive workloads, I can imagine problems though.
> if the underlying buffer needs to reallocate after some edits the entire process slows to a crawl as you need to allocate a larger buffer and copy each byte into the new buffer and destroy the old one. It can turn an O(n) operation into O(n^2) randomly.
This would still only be an O(n) operation. The constant factor might be higher, but the complexity is the same.
I don’t buy the argument that gap buffers “are bad for multiple cursors”. I get the argument on a theoretical level, but real hardware is not theoretical. There are several operations that are theoretically faster with a hashmap or b-tree than a vector, like insert and delete. But in reality the vector is usually faster in the real world except for very large inputs[1]. Gap buffers are basically vectors.
Another point with multiple cursors and gap buffers is that Chris Wellons' animations show the gap moving back to the first cursor every time it needed to add a new character. But in reality you would just walk the cursors in reverse order each time, saving a trip.
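Something like this, with a plain vector-backed gap buffer (a rough sketch, not the benchmarked implementation):

    // Minimal vector-backed gap buffer.
    class GapBuffer {
      private buf: string[];
      private gapStart: number;
      private gapEnd: number;

      constructor(text = "", gap = 64) {
        this.buf = [...text, ...new Array<string>(gap).fill("")];
        this.gapStart = text.length;
        this.gapEnd = this.buf.length;
      }

      private moveGap(pos: number) {
        // gapStart is both the gap's buffer index and the count of chars before it.
        while (this.gapStart > pos) this.buf[--this.gapEnd] = this.buf[--this.gapStart];
        while (this.gapStart < pos) this.buf[this.gapStart++] = this.buf[this.gapEnd++];
      }

      insert(pos: number, ch: string) {
        if (this.gapStart === this.gapEnd) this.grow(64);
        this.moveGap(pos);
        this.buf[this.gapStart++] = ch;
      }

      private grow(extra: number) {
        const tail = this.buf.slice(this.gapEnd);
        this.buf.length = this.gapStart;
        this.buf.push(...new Array<string>(extra).fill(""), ...tail);
        this.gapEnd = this.gapStart + extra;
      }

      toString(): string {
        return this.buf.slice(0, this.gapStart).join("") + this.buf.slice(this.gapEnd).join("");
      }
    }

    // Multiple cursors: insert at the highest offset first, so lower cursors stay
    // valid and the gap sweeps left once per keystroke instead of bouncing back
    // to the first cursor for every insertion.
    function typeAtCursors(gb: GapBuffer, cursors: number[], ch: string): number[] {
      for (const pos of [...cursors].sort((a, b) => b - a)) gb.insert(pos, ch);
      // Each cursor moves right by one for itself plus one per cursor before it.
      return cursors.map(pos => pos + 1 + cursors.filter(c => c < pos).length);
    }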
I have actually written and benchmarked a very naive and unoptimized gap buffer, and the results showed that it was faster than highly optimized rope implementations on all real world benchmarks[2], including multiple cursors.
That being said, a gap buffer is still probably not the best data structure for a text editor because it has worse "worst case" performance than something like a rope or piece tree. Even though it is faster overall, it's the tail latencies that really matter for interactive programs.
Overall I enjoyed reading the post, I find the topics fascinating and this was well presented.
These data structures are not mutually exclusive, see the paper by Crowley linked elsewhere in the discussion. You could use, e.g., a piece table with a gap buffer for descriptors, or a piece table with a tree for descriptors.
Can someone point me to structures/algorithms to use for text editor which could support files of unlimited lengths (including lines of unlimited length) without loading those fully in the buffer?
I miss that editor so much that I'm considering writing one some day, but I have no idea how to do so. I can invent things myself, but I guess those things were invented already back in the days when computers were different.
See the author instantly opening a ~1GB text file with async loading, paging through, copying/pasting, and undoing/redoing in their prototype “ewig” text editor about 27 minutes into their talk here:
That is essentially what VLF[1] does in Emacs. It reads in discrete chunks of the file at a time and doesn't load the next one till you try to display it. Doesn't require any fancy data structures, just some extra bookkeeping and mechanics.
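The mechanical part is tiny - something along these lines (Node-flavoured TypeScript; the chunk size is arbitrary):

    import { openSync, readSync, closeSync } from "node:fs";

    // Read only the requested chunk of a (possibly huge) file; nothing else is
    // pulled into memory until some view actually asks for it.
    function readChunk(path: string, chunkIndex: number, chunkSize = 64 * 1024): Buffer {
      const fd = openSync(path, "r");
      try {
        const buf = Buffer.alloc(chunkSize);
        const n = readSync(fd, buf, 0, chunkSize, chunkIndex * chunkSize);
        return buf.subarray(0, n);
      } finally {
        closeSync(fd);
      }
    }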
> Create debugging utilities for data structures very early on.
Yes, this isn't encouraged enough. I often serialize my data structures as either JSON or Graphviz DOT files for visualization. It helps save an immense amount of time. You can also use the generated files for regression testing, i.e. diff the actual output with the serialized output and if they're different, then a bug was introduced.
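For example, dumping a tree to DOT is only a few lines and pays for itself the first time you have to stare at a broken rebalance (the node shape here is just illustrative):

    // Dump a binary tree as Graphviz DOT for eyeballing and regression diffs.
    // (Labels are assumed not to need escaping.)
    interface DebugNode { id: string; label: string; left?: DebugNode; right?: DebugNode; }

    function toDot(root: DebugNode): string {
      const lines = ["digraph tree {"];
      const walk = (n: DebugNode) => {
        lines.push(`  ${n.id} [label="${n.label}"];`);
        for (const child of [n.left, n.right]) {
          if (child) {
            lines.push(`  ${n.id} -> ${child.id};`);
            walk(child);
          }
        }
      };
      walk(root);
      lines.push("}");
      return lines.join("\n");
    }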
MPW was Apple's programming environment for a long time.
It's been a while, but I believe that it represented everything in a tree of n-character chunks (n = 6?). It was probably the first editor that I used that could open files of pretty much any length.
Elsewhere on HN today is a short article on BBEdit, which has a really nice shell worksheet interface similar to the old MPW. Select the text you want to execute, hit control-return, and the shell output will appear below the selected text.
If I'm not mistaken, I believe this feature originates with Smalltalk (you might think that Smalltalk is a language and the feature in question is an editor feature, but most Smalltalk variants are tightly integrated with an editing environment).
Another potential approach is to use a traditional rope as-is, and generate a series of reverse operational transforms every time there is an edit. Run a "compacting operation" every n edits to keep the stack of ops from growing out of hand.
I've been wondering if there's a good algorithm to handle highlights in a piece of text. If I highlight some text, let's say the part between the => <= here:
"Idioteque" is a song => by the English rock band Radiohead <=, released on their fourth album, Kid A (2000).
If I want to then edit this text, is there an efficient algorithm for figuring out the start and end index of the highlight for the edited text?
There's way too little detail given in this question for you to get a useful answer. How is the text stored? How are you highlighting if you don't have the indices?
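If you do have character offsets for both the highlight and the edit, the usual answer is to transform the offsets on every edit rather than re-finding the text afterwards - roughly:

    // Keep a highlight's [start, end) character offsets valid across one edit
    // that replaces `deleted` characters at `pos` with `inserted` characters.
    interface Highlight { start: number; end: number; }
    interface Edit { pos: number; deleted: number; inserted: number; }

    function adjust(h: Highlight, e: Edit): Highlight {
      const delta = e.inserted - e.deleted;
      const shift = (offset: number): number => {
        if (offset <= e.pos) return offset;                      // before the edit
        if (offset >= e.pos + e.deleted) return offset + delta;  // after the edit
        return e.pos;                                            // inside the deleted span: clamp
      };
      const start = shift(h.start);
      return { start, end: Math.max(start, shift(h.end)) };
    }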
I feel like the rope was too quickly discarded here. The author stopped exploring ropes because they don't fit criterion 2 -- efficient undo/redo. There are a few problems with claiming that ropes aren't efficient for this usage:
1. The claim that the rope is inefficient for undo/redo is based on a singular example of a small edit on short strings. This isn't where ropes shine, admittedly, but they don't need to shine there, because when dealing with such small pieces of data, pretty much anything you do will be faster than the user can see. If using larger strings, the space allocated for nodes becomes background noise as the strings themselves dominate the size.
2. The chosen solution, the piece table, looks more memory-efficient than the rope at first glance, but that's a surface-level efficiency. The eventually-chosen solution, a piece tree, is far less memory-efficient: the apparent savings come at the expense of tree traversals, which in the VSCode article are addressed with a cache, which... uses more memory. In the author's implementation there's even more memory used because of a requirement he didn't include in his list: he wants it all to be immutable. Never mind that ropes were immutable from the start...
3. If you have a document which uses a lot (read, thousands) of very small edits, then the size of small strings might start to matter. So if you're going to optimize for this, optimize it. There are some fairly small optimizations that make the inefficiency concerns completely irrelevant. One is pointer packing: in a 64-bit system, pointers are 64 bits, but in practice, the vast majority of systems use 48 or fewer of those bits: as it turns out, there aren't many systems with more than 2^48 bytes = 256 terabytes of RAM. This means the leading 16 bits are 0s. Trivially, this means you can store strings of 7 8-bit characters in the pointer itself, using the first 8 bits to signal if it's a string or pointer (if they're all 0s, it's a pointer) and the length of the string. All the strings in the inefficiency example can fit in a 64 bit integer: "Hello", " ", and "world" are all fewer than 7 bytes, which means you're passing around 64-bit integers, by value, with no allocations necessary. In fact, this means you can either append the " " to "Hello" like "Hello ", or prepend it to "world" like " world", and still stay under 7 bytes in either case. Remember, this is now 64 bit integers being passed around by value: this is far faster than piece trees.
4. The author treats undo/redo as a stack, but all the cool editors treat it as a tree. If you make a change, then undo it, then make another change, then undo the second change, is your first change lost? In vim/emacs, the answer is no: you can go back into a tree and find it to reapply. This means that all text is not only immutable but immortal: it has to be kept for the duration of the editing session. This enables a few more optimizations: we no longer need reference counts or garbage collection since we aren't reclaiming the memory, and now we can point into existing strings since they'll never change. Consider the following string: "The quick brown box jumps over the lazy dog." You may have noticed a typo: "box" should be "fox". This change requires 0 buffer allocations: we have a pointer to "The quick brown box jumps over the lazy dog." for the original string, a pointer to the same spot for "The quick brown " (with a length), a second pointer to "ox jumps over the lazy dog.", and a packed pointer (integer) for the string "f". This is pretty key because if you're not freeing any of this memory, you need to make sure you don't allocate more than necessary!
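Concretely, that "fox" fix can be represented as three pieces over the existing (immortal) buffers - the offsets below are specific to this example string:

    // The "box" -> "fox" fix as three pieces over immortal buffers; no text is copied.
    interface Piece { buffer: string; start: number; length: number; }

    const original = "The quick brown box jumps over the lazy dog.";
    const fixed: Piece[] = [
      { buffer: original, start: 0,  length: 16 },                    // "The quick brown "
      { buffer: "f",      start: 0,  length: 1  },                    // the one-character replacement
      { buffer: original, start: 17, length: original.length - 17 },  // "ox jumps over the lazy dog."
    ];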
NOTE: I'm not saying that the rope is the better structure here. There may be more requirements which weren't captured in the article which mean that piece buffers really are the right answer. All I'm saying is that the article doesn't really explore ropes deeply enough to write them off so quickly.