Inside font-rs, a font renderer written in Rust (medium.com/raphlinus)
430 points by beefsack on Aug 2, 2016 | 115 comments



Neat. I can't help but wonder if this will fall victim to the "last 10% is the hardest" rule. Will going from tech demo to production-ready remove the performance gain?

It sounds like you're claiming that FreeType is slower because the parsing/accumulation implementations are slower. It's far from my area of expertise, but wouldn't 20 y/o open source software as prevalent as FreeType have optimized those code paths?

Edit: Author is a heavyweight in font rendering circles. Excuse my ignorance. Just wary of "10x faster with 90% of functionality!" benchmarks.


Author:

> The current state of the code is quite rough. The code isn't well organized, and it's basically not ready for prime time.

From what I can see in https://github.com/google/font-rs/blob/master/src/font.rs, this is what is missing:

* support for CFF-based fonts (that is, OTF files with "PostScript-flavored", i.e. cubic, outlines)

* "Advanced Typographic Tables" (see Opentype spec: https://www.microsoft.com/typography/otspec/otff.htm), this is what is needed to render more complex non-latin languages like Arabic etc, because it defines context-specific replacements and positioning, but also opentype-level kerning

* support for the kerning table

* support for slightly more exotic TTF variations like EOT and WOFF for webfonts

* hinting support for smaller rendering sizes

Most of these things are supported by FreeType, and are probably a considerable amount of work to add. Once you add them into the rendering calculations, the abstractions in the code would have to be refactored, and the code would become more complex and probably slower.

Having said that, it's still a nice implementation and easy to read, something I wouldn't necessarily say about FreeType ;)


> Once you add them into the rendering calculations, the abstractions in the code would have to be refactored and the code would become more complex and probably slower.

These features don't have much to do with the core rasterization algorithm, which is where the vast majority of the time is typically spent. So I wouldn't expect things to go slower.


The hinting is definitely rasterization-related, isn't it?


Well, sure, but it's increasingly common to just not do hinting these days. For example, Mac, iOS, and Android (from what I can gather) don't.

In any case, hinting just changes point positions. It doesn't affect the way the rasterizer works at a fundamental level, I don't believe.


In the case of TrueType hinting with the interpreter, it can. The SCANTYPE instruction [1] allows the hint program to request different scan-conversion settings. It's mainly used to change the dropout-control mode. Without dropout control, the rule is strictly that pixels whose centers are inside the glyph are drawn. At small scales, this can lead to features smaller than a pixel disappearing, so dropout control adds additional rasterization rules to draw some pixels even if they're slightly outside of the glyph. I've seen that FreeType supports these modes, and I'm sure they have some complexity and performance impact. Granted, dropout control is less useful with antialiasing, but it is still a case where hinting affects the rasterizer's operation.

[1] https://developer.apple.com/fonts/TrueType-Reference-Manual/...
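
To make the dropout rule concrete, here's a toy sketch in Rust (not FreeType's or the hint interpreter's actual logic; the function and pixel-center convention are illustrative only):

    // Pixel i has its center at i + 0.5; a span [x0, x1) normally lights
    // only the pixels whose centers it covers. If it covers none, dropout
    // control lights the nearest pixel so the feature stays visible.
    fn fill_span(row: &mut [u8], x0: f32, x1: f32) {
        if row.is_empty() {
            return;
        }
        let first = (x0 - 0.5).ceil().max(0.0) as i32;
        let last = ((x1 - 0.5).floor() as i32).min(row.len() as i32 - 1);
        if first > last {
            // Dropout: feature narrower than a pixel; draw the pixel
            // nearest the middle of the span anyway.
            let mid = (((x0 + x1) * 0.5) as i32).clamp(0, row.len() as i32 - 1);
            row[mid as usize] = 255;
            return;
        }
        for px in &mut row[first as usize..=last as usize] {
            *px = 255;
        }
    }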


For CFF at least, hinting instructions are integrated tightly with the rendering instructions. I'm not sure about TrueType outlines though.


FreeType does not support EOT.

FreeType also doesn't handle Advanced Typographic Tables. For that you need HarfBuzz. Note that rendering and shaping are usually considered to be separate processes - one doesn't expect a font rendering library to handle shaping.

Other than hinting, nothing you've mentioned should affect rasterization performance.


> wouldn't 20 y/o open source software as prevalent as FreeType have optimized those code paths

You'd be surprised how often SIMD goes unused. libpng, for example, doesn't use it on x86…

Outside of games, scientific computing, and a few other fields, it's notable how little of the hardware in our devices actually gets put to use.


> libpng, for example, doesn't use it on x86…

Do you mean doesn't use it explicitly? If so, GCC is still happy to find quite a few spots to automatically inject it. On my Arch system:

    $ objdump -d /usr/lib/libpng.so | grep -c xmm
    316
Then again, gif doesn't seem to get even that, so maybe there is something in libpng?

    $ objdump -d /usr/lib/libgif.so | grep -c xmm
    0


gcc and clang generate SSE2 instructions for scalar (non-SIMD) arithmetic, depending on the target architecture and compiler flags. For x86_64 targets, the calling convention puts floating point arguments in xmm registers, so the compilers must use SSE2 (and it's faster than x87 anyway). That's what you're seeing. If you look at the instructions in libpng.so that mention xmm registers, you'll see that they aren't SIMD instructions.
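
The same distinction is easy to see in any compiled language; for example, in Rust (an illustrative snippet - the point is the generated instructions, not the source):

    // Plain scalar f32 math like this compiles to SSE *scalar* instructions
    // (mulss/addss, one lane of an xmm register) on x86_64, because the ABI
    // passes floats in xmm registers. No SIMD is involved.
    pub fn lerp(a: f32, b: f32, t: f32) -> f32 {
        a + (b - a) * t
    }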


Explicitly, for filtering.


I wonder whose fault that is... if you build a feature that almost no one will use, is the 'API' too arcane, is everyone too lazy, or are there no tools available to ease the pain?


No tools available, or at least, none that are easy to use. It's the classic parallelism problem: tools are either too low level (e.g. CUDA/OpenCL for the GPU space) for day-to-day developers to leverage effectively, or too high level, costing you performance in the rest of the codebase and wiping out your parallelism benefits. In this space, the low-level alternative is manually writing SIMD x86 instructions, or using a similarly low-level C library to do the same. The high-level alternative is switching to something like Haskell and using one of their high-level array manipulation DSLs. In the first case, you have to explicitly manage the parallelism, and there _will_ be bugs there; in the second case, you probably lose more performance in the other 90% of your application by switching to (say) Haskell than you gain in parallelism.

In terms of automation, SIMD is hard to implement automatically (i.e. in the compiler) for a lot of traditional programming languages (e.g. C), and hard to add to dynamic languages, as you end up adding extra code paths/JIT passes for each new type of SIMD/parallelism hardware construct available.


I believe that C is in part to blame, because the ANSI C abstract machine describes a single-threaded processor.


SIMD is not about threading. Of course you can multithread it too, just like any other code.


My comment was responding to the concept that "it's notable how little of the hardware in our devices actually gets put to use".


It's funny, anecdotally I've seen developers say that it should be the compiler's job to insert SIMD instructions and do optimizations. But many don't understand SIMD and think it's something as simple as replacing a few instructions.


Interesting, especially seeing how other architectures do it. I don't know if libpng participates in Google Summer of Code, but that would be a great suggestion for a GSoC project.


> wouldn't 20 y/o open source software as prevalent as FreeType have optimized those code paths?

The tradeoffs that made the most sense 20 years ago are not those that would lead to the fastest implementation on current hardware. It's not so much that FreeType is unoptimized; it's that old, possibly wrong optimizations are pretty much baked in now...

Modern CPUs have massively more cache and have vectorization instructions, which makes the optimal solution very different from the one that was optimal for the Pentium II that was top of the line when FreeType was first conceived... It's also acceptable to use vastly more memory; FreeType dates from a time when a beefy desktop machine had maybe 32 MB of RAM...


Raph really, really knows his stuff. His interview on the New Rustacean podcast starts with a summary of his background: Gimp, Ghostscript, Android font rendering.


Your 10x speed, 90% functionality caution is definitely warranted. I've lost count of how many times I've seen "we run Python/Ruby/Perl/whatever 3x faster, just haven't implemented exceptions and monkey-patching yet". :-)


Yup, I still remember being blown away by his dissertation. That he is using Rust in earnest is enough reason for me to take another look at the language.


Any chance we can get a link?

Edit: Whoops, see below. And for the lazy: http://www.newrustacean.com/show_notes/interview/_2/index.ht... (credit to @wscott)


As a font neophyte I can only say that he sounds like he breathes fonts. His interview on Android dev Backstage was very interesting: http://androidbackstage.blogspot.fr/2015/01/episode-20-fonts...

(And the whole podcast is an easy recommendation: Googlers talking about Android dev stuff from an insider perspective.)


For what it's worth: Gimp, Ghostscript, and Android all use FreeType for text rendering. At best, they would require him to know how to use FreeType bitmaps correctly (i.e. gamma-correct blending, when to multiply colours in). Though he could easily go beyond that; it is very interesting stuff, and I find it incredibly hard to stop myself from reading into FreeType.


"It sounds like you're claiming that FreeType is slower because the parsing/accumulation implementations are slower."

This is one of those performance drains across many languages that we often don't notice, because our weak languages give us no alternative. Manifesting an array for a consumer that only wants to iterate over it once, in order, with no backtracking, is a common antipattern. Just as I'd look askance at any putative "next big language" that doesn't have any sort of closure support, I look askance at "next big languages" that don't have iterators.
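
A made-up Rust illustration of the antipattern and the iterator version:

    // Materializing a Vec for a single in-order pass (the antipattern):
    fn sum_deltas_vec(points: &[i16]) -> i32 {
        let deltas: Vec<i32> = points
            .windows(2)
            .map(|w| w[1] as i32 - w[0] as i32)
            .collect();
        deltas.iter().sum() // the Vec existed only to be walked once
    }

    // The same computation as a lazy iterator chain: no allocation at all.
    fn sum_deltas_iter(points: &[i16]) -> i32 {
        points.windows(2).map(|w| w[1] as i32 - w[0] as i32).sum()
    }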


> It's far from my area of expertise, but wouldn't 20 y/o open source software as prevalent as FreeType have optimized those code paths?

No, 20 y/o software had Heartbleed. Just because software is old doesn't mean it's battle-tested.


And just because software is mature, open-source, and actively developed does not imply it is optimal for today's hardware. See WebRender versus GDI/GTK2, for instance.


Being open source is no guarantee of being fast, or good, or anything. The only thing you really know you're getting is the ability to inspect the code to figure out if it suits your needs instead of having to assume.


The author is in the thread on /r/rust if anyone has any questions: https://www.reddit.com/r/rust/comments/4vqpxx/inside_the_fas...

(Please ignore today's deliberately garish background image, we just turned MIR on and we're celebrating. :P )


That warning about the web design shows that you know the HN crowd very well. Complaints about fonts, annoying Javascript and other things that make the site hard to read always get voted to the top :)


This is actually an important use for Rust. Many font renderers were initially written for speed on single-user systems running trusted programs.

Vector rendering creates a lot of edge cases that C tends to ignore.

The initial Xbox softmod hack was done by loading a font with negative values in key fields. Microsoft had brought over the Windows font rendering code, and that wasn't written with hostile fonts in mind.


My belief is that Rust today, in production, is best used in networking code, such as protocol parsers, but the interest in font rendering in Rust will help push up the priority of stabilizing SIMD, which should help quite a bit with the applications where Rust lags in performance.


I haven't observed that the lack of stable SIMD makes Rust only suitable for networking code. You can easily use SIMD by writing in assembly (inline assembly, even), with unstable intrinsics, or even with autovectorization.

In Servo, for example, we have large speedups over existing C++ codebases that have nothing to do with networking.
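
As an illustrative sketch, a simple, dependency-free loop over slices is the kind of code LLVM's autovectorizer handles well, with no unstable features needed:

    // Typically compiled to packed mulps/addps over 4+ lanes at once.
    pub fn saxpy(a: f32, xs: &[f32], ys: &mut [f32]) {
        for (y, &x) in ys.iter_mut().zip(xs.iter()) {
            *y += a * x;
        }
    }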


Inline assembly is still unstable though, yes?

Also autovectorization doesn't support many uses and can be quite brittle AFAIK.

I think Rust is absolutely amazing, but stabilized SIMD support is right up at the top of my wishlist.


Inline asm is still unstable, yes.


Oh I'm a huge Rust fan and am using it in nearly everything, but I'd really like to use it in data analytics/linear algebra applications (it would be nice to be able to use a single language for this instead of the mash of Python, R, Octave, and C), and lack of stable SIMD has been an issue there.


It would be useful to have an option for a 16 (SSE)/32 (AVX)/64 (AVX-512) byte aligned stack in Rust, to help with SIMD alignment requirements. That would help with SIMD-related stack alignment code and reduce the cases where dropping down to assembler is needed. Of course, the external ABI might need to have normal alignment.

Ability to selectively align functions too would be nice. Sometimes it's nice to get them to 64-byte boundary, to reduce icache latency and waste. You usually only have 512 64-byte lines of L1 instruction cache, sometimes it pays to be able to choose where it's spent.
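
One workaround, sketched below, is to over-align the storage type rather than the stack itself (an illustration using `#[repr(align)]`; names are made up):

    // Over-align a buffer type so SIMD loads/stores can use the aligned
    // instruction forms wherever the value lives, stack included.
    #[repr(align(32))] // 32 bytes covers AVX; use 64 for AVX-512
    pub struct AlignedBuf(pub [f32; 1024]);

    pub fn sum(buf: &AlignedBuf) -> f32 {
        // With guaranteed alignment the compiler can emit vmovaps here.
        buf.0.iter().sum()
    }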


There's also rusttype [1], which was written as an alternative to FreeType in Rust. I know it's being used at this point in Redox [2], which, to the best of my knowledge, is an interesting project to build a Unix-like operating system entirely in Rust. I'm curious to see how the two compare at this time.

[1] https://github.com/dylanede/rusttype

[2] https://www.redox-os.org/


I did some measurements of rusttype and found it to be even slower than FreeType. That said, this is all open source and so I have confidence that the improvements will flow all the way, either through rusttype using the font-rs rasterizer, or font-rs becoming robust enough for downstream like Redox and Piston to adopt.


Integrating parts of this into RustType would be awesome!


Very interesting post.

Note that the New Rustacean podcast did an interview with the author of this post, Raph Levien. That was very interesting and he did touch on this program. http://www.newrustacean.com/show_notes/interview/_2/index.ht...

I believe it should be possible to have rust compile to a library that could be called from a normal C program.


Indeed, C <-> Rust interop is one of Rust's design goals. The Rust Book goes over this a bit: https://doc.rust-lang.org/book/ffi.html#calling-rust-code-fr...
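
A minimal sketch of what that looks like (the function name is made up; the crate is built as a C-callable library, e.g. with crate-type = ["cdylib"] in Cargo.toml):

    // C linkage plus an unmangled symbol, so a C program can declare it
    // as: int32_t add_i32(int32_t, int32_t);
    #[no_mangle]
    pub extern "C" fn add_i32(a: i32, b: i32) -> i32 {
        a + b
    }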


  > I believe it should be possible to have rust compile to a library
  > that could be called from a normal C program. 
Not only that, but languages where you can use C as a way to extend them. Rust has been in production for years as "a Ruby gem written in Rust", for example.


Are fonts always rendered 'completely' these days? I thought that the fonts would be rendered once to a cache of bitmaps/textures, and then those bitmaps can be copied to the screen/buffer pretty much instantaneously.

After all, there's no point re-doing all the calculations to draw all the curves in a letter 'g' when the output is going to look just the same as the last time you drew it...


> Are fonts always rendered 'completely' these days? I thought that the fonts would be rendered once to a cache of bitmaps/textures, and then those bitmaps can be copied to the screen/buffer pretty much instantaneously.

They are. But (a) non-Latin languages often miss in the cache; (b) subpixel positioning makes cache misses happen more often; (c) sometimes people animate font size, negating the optimization; (d) we care about initial load time.


> After all, there's no point re-doing all the calculations to draw all the curves in a letter 'g' when the output is going to look just the same as the last time you drew it...

Beyond the sub-pixel aliasing and CJK questions other people asked, a fair number of people use languages like Arabic, Devanagari, etc., which have complex rules for how adjacent characters affect rendering (you can see something like this in English with a font like Zapfino, which has ligatures: http://download.linotype.com/free/howtouse/ZapfinoTips_e.pdf). I would imagine all of that would conspire against cache hit rates more than we might guess.

That's not to say this isn't great work but just that any time something involves text rendering it seems to inevitably sprout special cases on the special cases.


Devanagari is not a language; it's a writing script.


You're right, I should have been less colloquial in that reference since I was using “Arabic” in reference to the script, which is used by multiple languages.


Hard to imagine the code used to render the bitmaps would be smaller, and hence lead to cache misses less often, than the bitmaps themselves.


acdha is referring to the cache of rendered bitmaps, not the CPU cache.

Font renderers that cache keep a cache of bitmaps for glyphs they have previously rendered. Since you need a separate bitmap for each Unicode codepoint or ligature × font size × subpixel offset, the cache could potentially get huge.

So, they cap the number of bitmaps that get cached and evict some. But when you're animating a font's size or dealing with non-Latin languages, you're churning through so many unique bitmaps that you end up not getting much value from the cache.
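
So the cache key ends up looking something like this (a hypothetical sketch; field names and types are illustrative):

    use std::collections::HashMap;

    // Every field multiplies the number of distinct bitmaps that may
    // need to be kept around.
    #[derive(PartialEq, Eq, Hash)]
    struct GlyphKey {
        glyph_id: u32,    // codepoint or ligature, resolved to a glyph index
        size_px: u16,     // font size in integer pixels
        subpixel_bin: u8, // quantized fractional offset, e.g. 0..4
    }

    type GlyphCache = HashMap<GlyphKey, Vec<u8>>; // key -> rendered alpha bitmap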


There is a point. Subpixel positioning is important, and it's hard to know how much memory will be taken up by the full set of characters. Because you cannot know the advances (and hence the exact positions) before you render the characters, you cannot render them ahead of time without quantizing the position and storing multiple bitmaps.

This quickly becomes a non-optimization.


What I do is round the glyph position to 1/4 of a pixel; in my tests it has little visual impact. That still means that in the worst case the glyph cache could contain 16 copies of the same glyph, and that's before hinting enters the equation.
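
In code, that quantization might look like this (my reconstruction of the idea, not the commenter's actual code):

    // Snap a coordinate to quarter-pixel steps: the whole-pixel part
    // positions the cached bitmap, and the 0..=3 bin picks one of four
    // subpixel renderings per axis, hence up to 4 x 4 = 16 variants in 2D.
    fn quantize_quarter(pos: f32) -> (i32, u8) {
        let q = (pos * 4.0).round() as i32; // position in quarter-pixel units
        (q.div_euclid(4), q.rem_euclid(4) as u8)
    }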


You can get good subpixel rendering from a single bitmap using a larger prefiltered glyph: https://github.com/nothings/stb/tree/master/tests/oversample


Sure you could. Just resample the bitmap.


On the parallel Reddit discussion, someone mentioned that the gain is less when rendering CJK scripts, since there are so many different glyphs.


Surely the biggest win here is not so much the faster rendering, which is nice, but rather that font rendering is so often a source of remote code execution exploits ...


I think Rust is at its strongest when it achieves both improved performance and better security. Not everyone cares about performance, and not everyone cares about security, but most people care about one or the other.


Looking at the source code, it does not look like this implementation includes the virtual machine for hinting, which seems the most likely source of remote-code-execution vulnerabilities in a text rendering system.


I'm curious: is font rendering speed something that a regular user would notice in day-to-day use? Maybe in battery life? Or is this important just for some specialized (designers'?) use?

I do hope this project gains traction, though the advantage I see is in replacing another piece of legacy code with a safe(r) one (Rust).


Right now you can basically DoS gnome-terminal on Ubuntu by sending it a screenful of Thai text (or any of many similar scripts where most adjacent letters have ligatures to each other, but Thai is by far the most common). It slows down to rendering like 40 characters per second.

I work with multilingual corpora and I live in fear of accidentally scrolling into the Thai parts. I would report this bug but I don't know whose bug it is.

So maybe I'm not a regular user, but font rendering speed is something I notice.


gnome-terminal is slow. Try st, or rxvt.


You must mean something else besides rxvt, which doesn't support Unicode.


Probably meant rxvt-unicode and not plain rxvt, which doesn't appear to be under development anymore.


As a follow-up: I tried urxvt and st, despite the unfriendliness of the fact that you configure urxvt by looking up .Xresources incantations and you configure st by actually editing the code and recompiling it.

It seems what we have here is a tradeoff: the font rendering in your terminal can be good, or it can be fast, but perhaps not both given the current options. urxvt and st are doing some kind of very low-level font rendering that doesn't match the fonts I see most of the time on Ubuntu. Whatever antialiasing algorithm they use, if you let them antialias, is a smudgy mess, and not something I would want to look at all day.

The result only makes me appreciate more the idea that good, fast, text rendering is something that a new library could help us achieve.


Edit ~/.fonts.conf

https://wiki.archlinux.org/index.php/Font_configuration#Hint...

I have the hinting style set to slight.


A reduction in font rendering time would make it easier to meet higher framerates, especially on mobile when rendering CJK characters, as mentioned on the Reddit thread.


For a while now, I've been wondering if font rendering could be improved by using a Lanczos filter instead of a box filter for anti-aliasing. (Conceptually, when you compute pixel coverage you are convolving the image with a box filter and then sampling at each pixel's center.) The performance impact might be too high, but I haven't seen anyone try the experiment.
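
For reference, the kernel in question, as a sketch:

    use std::f32::consts::PI;

    // Lanczos-2 kernel: sinc(x) * sinc(x/2) for |x| < 2, else 0. Swapping
    // this in for the box filter means weighting coverage samples by this
    // kernel instead of uniformly, which is where the extra cost comes in.
    fn lanczos2(x: f32) -> f32 {
        if x == 0.0 {
            1.0
        } else if x.abs() < 2.0 {
            let px = PI * x;
            2.0 * px.sin() * (px / 2.0).sin() / (px * px)
        } else {
            0.0
        }
    }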


> if font rendering could be improved by using a Lanczos filter instead of a box filter

You don't want an ideal filter for rendering fonts -- you want sharp edges instead. A lot of work [1][2] has been done to achieve this.

In font rendering, alignment to pixel boundaries is intentional; it's called hinting. So you could say aliasing is used to advantage.

[1]: https://en.wikipedia.org/wiki/Font_hinting

[2]: https://en.wikipedia.org/wiki/Subpixel_rendering


> you want sharp edges instead

It depends on your philosophy: https://blog.codinghorror.com/font-rendering-respecting-the-... http://www.joelonsoftware.com/items/2007/06/12.html

In general antialiasing and font hinting are at odds; you want one but not the other. With recent displays, the pixels are small enough that you can use anti-aliasing all the time, and no hinting: http://lh5.ggpht.com/-tsgwX-9fsRc/U6armqjI6QI/AAAAAAAAB9E/Bw...

Subpixel rendering is just another anti-aliasing technique; it can be used on any bitmap.

Of course, it's also true that the sinc filter is not the best possible. There's interesting research on shearlet transformations http://colorlab.no/content/download/46570/721686/file/2014_C... and other weird filters http://www.ansatt.hig.no/mariusp/publications/Pedersen2015_I.... Or just avoid the filtering stuff entirely and optimize a whole-eye model: http://michaelfrankdeering.com/blog/projects/eye_work/eye_mo...


I think on today's high density mobile phone screens, these techniques are no longer needed.


If those were the only screens in the world, I would agree wholeheartedly, but there are millions, if not more, non-mobile screens in use today. This problem is unlikely to go away with high-DPI technology in the next 20, maybe even 50, years.


Well, a different renderer can be written for those screens.

I assume 4K will come to predominate, but we'll see.


They can still provide some benefit, but it's true that modern high-DPI mobile screens don't need the heavy-handed hinting of Windows, and doing subpixel rendering when the subpixel orientation keeps changing as the device is rotated can be more trouble than it's worth.


Impressive!

> dense representations have a huge advantage when data-parallelism is an option

Sparse or dense isn't a binary choice; it's possible to combine the two and get the best of both.

When I was working on a similar problem, in 3D space and with a much larger dataset, I represented my voxels as a sparse collection of small dense blocks. The blocks are small enough to fit in a single cache line and to save a lot of RAM space and bandwidth (because many are empty), but inside they are dense, and large enough to benefit from SIMD parallelism.
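
A rough sketch of that layout (brick size and names are illustrative, not the original code): a sparse map of dense 4x4x4 bricks, where one brick of 64 one-byte voxels is exactly one 64-byte cache line.

    use std::collections::HashMap;

    struct SparseVoxels {
        bricks: HashMap<(i32, i32, i32), Box<[u8; 64]>>,
    }

    impl SparseVoxels {
        fn get(&self, x: i32, y: i32, z: i32) -> u8 {
            // High bits pick the brick, low 2 bits index inside it.
            match self.bricks.get(&(x >> 2, y >> 2, z >> 2)) {
                Some(b) => b[((x & 3) | ((y & 3) << 2) | ((z & 3) << 4)) as usize],
                None => 0, // empty space stores nothing
            }
        }
    }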

However, for 2D images that only take a few hundred KB of RAM, a dense buffer is probably better because it fits in L1, or at least L2, cache.


I always wonder why a distinction is made between font rendering and rendering arbitrary SVG shapes. In other words, couldn't this renderer be much more useful when generalized?


I would imagine that arbitrary SVG rendering is harder to get right?

With fonts you can probably cache a lot of the rendering (even considering things like ligatures), and considering how much text your machine is showing, the performance tricks could be extra useful.


The SVG standard includes JavaScript.

So SVG renderers are very non-trivial to implement.


hilarious


In a "laugh because the only other option is to break down and cry" sort of way, yes.


> and considering how much text your machine is showing, the performance tricks could be extra useful

Well, an SVG-based drawing program has to re-render a lot of the same SVG every time the user makes a little change. Also, when animating an SVG scene, a lot of the shapes are redrawn continuously.


I think there is value in applying these ideas to SVG rendering as well. There are of course a bunch of things that are different. For SVG, buffers get large enough you'd probably want to do _some_ banded rendering because the accumulation buffer might otherwise get too big.

Another serious potential win in SVG is that you might be able to interleave the area integration with alpha compositing / masking. Could end up pretty darned fast.


I was thinking about using your prefix-sum approach for exactly that almost all of last night. I don't have an attack on the problem that I'm satisfied with yet:

1. Render each path to a fresh pixel buffer, alpha-compositing it down onto the final canvas. Advantage: straightforward, works for sure. Disadvantage: you need a lot of multiplies per pixel.

2. Partition paths into "layers" of nonoverlapping paths, render each layer as a unit, and composite each new layer down onto the final canvas. Overlapping opaque paths can be incorporated into the same layer as whatever is below them by cutting an overlap-shaped hole in what's below before adding in the new path; although that involves some intersection tests to know where to stop, my intuition is that it will be a big win. Advantages: straightforward, probably faster than the previous one. Disadvantages: the partitioning is a potentially costly extra step (one of those things that makes me wonder if it's NP-complete to do it optimally), and there are still potentially many multiplies per pixel.

3. Separately accumulate a numerator (total premultiplied color) and denominator (total alpha) for each pixel, then divide in the end. Advantages: You avoid doing lots of work per pixel. Disadvantages: This is a weighted sum, not alpha blending. Alpha blending is a different thing. So the result is wrong. Also, an honest division per pixel is more expensive than quite a number of multiplications, although maybe you could cheat on the final division with a table of approximate multiplicative inverses or something. So this would probably be super slow.

4. Find a different group other than ℤ/256ℤ in which to do prefix-sum that somehow gives you the right results. Then you can just render all the edges into the same buffer and do a single vectorizable prefix-sum operation over it.

Advantages: This sounds super fast.

Disadvantages: It seems clear that this group is going to have to be able to represent the entire Z-ordered stack of colors at every pixel, because if I'm looking at some translucent green on top of translucent red on top of opaque black on top of pale blue, and I reach the right (negative) edge of the opaque black path, somehow I have to have remembered the blue thing underneath in order for it to peek through, which suggests to me that I need an unbounded number of bits per pixel to implement this scheme, which probably is not going to admit an actually fast implementation. In effect it has to reduce to the second approach, except that the software has to deal with the stack of layers once for every pixel. Or is there some magical way around this, at least for a fast-path case?

This part is probably obvious to you, Raph, but you can do SVG linear gradients with two prefix-sum passes instead of one, where the first pass just runs over signed gradient stops and gradient clipping boundaries, and then you draw the signed path boundaries into the buffer before the second prefix-sum pass. (Is that clear? I suspect it may be too abbreviated.)

I suspect that with three prefix-sum passes you could do a decent quadratic-spline† approximation of arbitrary gradients, including the weird skew cone gradients SVG calls "radial gradients". But I haven't worked out the details.

I know you don't have a lot of time to hack on this stuff right now, but would you have time to provide feedback if I were to hack on it a bit? I imagine that I'd run into any number of places where talking to you about it for half an hour could save me days of wasted effort.

† here I'm talking about what Carl de Boor calls "splines", which I know disagrees with your usage of "splines". I think you called them "B-splines" in your dissertation.


Thanks for the detailed comment.

I think you're going for something much more complicated than what I had in mind. My idea is simply to have two modes other than accum buffer -> 8 bit alpha mask. One would be accum buffer + constant RGBA color -> update RGBA buffer. (By update I mean read an RGBA pixel, do the compositing, and write the composited pixel in place). The other would be accum buffer + source RGBA buffer -> update RGBA buffer. Of course, it's possible to imagine interleaving even more operations in the generation of the source RGBA buffer, but at some point the register pressure overcomes the r/w bandwidth.
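
In rough sketch form, the first mode might look like this (illustrative Rust, assuming straight-alpha `color` and premultiplied float RGBA in `dst`; not the actual font-rs code):

    // Prefix-sum the signed area deltas, turn |sum| into coverage, and
    // source-over composite a constant color in the same pass.
    fn composite_solid(accum: &[f32], color: [f32; 4], dst: &mut [[f32; 4]]) {
        let mut sum = 0.0f32;
        for (&d, px) in accum.iter().zip(dst.iter_mut()) {
            sum += d;
            let a = sum.abs().min(1.0) * color[3]; // coverage times fill alpha
            for c in 0..3 {
                px[c] = color[c] * a + px[c] * (1.0 - a);
            }
            px[3] = a + px[3] * (1.0 - a);
        }
    }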

Prefix-sum is not a particularly fast SIMD operation, due to the horizontal data dependency. I chose it for font-rs because I don't know of any faster ones that can get the job done. For just computing gradients, it's almost certainly going to be faster to compute it directly (it's simple multiply-add in the case of linear gradients) than to try to strength reduce. The same is no doubt true for SVG radial (cone) gradients.

When the gradient doesn't have any sharp creases or singularities, a very reasonable strategy is to compute it in lower resolution and then up-res, say 2x or 4x to keep the math super-simple. When it does have creases, it might make sense to decompose into regions and use different approaches in different regions.

In any case, it sounds like you may be re-inventing Cairo or Skia here. Might make sense to take a closer look at what they do and whether there's truly any low-hanging fruit left. I know that Skia has a bunch of SIMD optimizations already.

But this is fun stuff to think about and experiment with. I certainly don't want to discourage you.


TrueType includes a virtual machine [1] for hinting. Text rendering also needs to alter shapes of glyphs for other reasons, like font weight and pseudo-italic.

[1]: https://en.wikipedia.org/wiki/TrueType#Hinting_language


There are specific clarity optimizations that are important for font rendering but don't generalize to all SVG rendering.


You could generalize that to image rendering too, same reason why Photoshop and other tools have 'optimize for text' and 'optimize for art' as options.


I work on a project that uses FreeType for text and an improved version of AntiGrain for SVG rendering. The thought of using AGG for the font rendering has already crossed my mind, but mainly for reducing the size of the codebase. Speed isn't really a concern to me because the glyphs are already being cached to internal bitmaps, after which the bottleneck is the drawing algorithm (which is straightforward anyway). Overall, it's not worth my time to change the font engine.

The really nice thing about using an SVG engine to draw text is the potential for applying runtime effects, transformations and so forth. However that's an entirely different use case and can be done through SVG as it stands. Most of the time you want your text to be on the horizontal and easy to read, and vanilla FT is fine for that.


Fonts are not "collections of pictures", though, and haven't been since the mid-1980s. Details in http://pomax.github.io/1449777175633/opentype-let-s-learn-ho... but the gist is that a font is a full, complex typesetting program with a 700-page spec that has to be run on a render engine, similar to how a game ROM is run on an emulator or hardware setup. Sure, you need the engine, but there is nothing to generalize: that engine needs to implement a really large spec, most of which is not "drawing a vector image".


Text is usually made up of lots of small solid shapes whereas SVG will usually have larger shapes, sometimes using effects such as gradients - so it's likely that different optimisations will work best on each one.


The linked article [1] about stb_truetype seems to rasterize arbitrary polygons, not just fonts, so it could be used for polygons or any vector shape.

[1] http://nothings.org/gamedev/rasterize/


Font rendering makes some optimizations that aren't general purpose. Rendered glyphs are (or may be) cached. A single color is assumed. No self-intersecting outlines. These allow font renderers to take shortcuts for performance.


Extremely impressive and exciting, kudos to the author!

It would be interesting to compare the memory footprint and code size of font-rs vs freetype; comparing just speed doesn't give the full picture.


Those manual SIMD optimizations are done in C anyway.

What exactly is so Rust-y in the code, compared with C/C++? What is the main benefit of using Rust? Or is it just a test, like "you can do that in Rust too"?


> Those manual use/optimization of SIMD are done in C anyway.

C seems to be used only for SSE intrinsics support. ~16 LoC of "C".

This is all the C in the whole project, if you really think you can call it that:

https://github.com/google/font-rs/blob/master/src/accumulate...

  // Prefix-sums the signed area deltas in `in`, clamps |sum| to [0, 1],
  // and packs the result into `out` as 8-bit alpha, four pixels at a time.
  void accumulate_sse(const float *in, uint8_t *out, uint32_t n) {
    __m128 offset = _mm_setzero_ps();          // running sum carried across chunks
    __m128i mask = _mm_set1_epi32(0x0c080400); // gathers byte 0 of each 32-bit lane
    __m128 sign_mask = _mm_set1_ps(-0.f);
    for (int i = 0; i < n; i += 4) {
      __m128 x = _mm_load_ps(&in[i]);
      // In-register prefix sum: shift-and-add by one lane, then by two.
      x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
      x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
      x = _mm_add_ps(x, offset);               // add the carry from previous chunks
      __m128 y = _mm_andnot_ps(sign_mask, x);  // fabs(x)
      y = _mm_min_ps(y, _mm_set1_ps(1.0f));    // clamp coverage to 1.0
      y = _mm_mul_ps(y, _mm_set1_ps(255.0f));  // scale to 8-bit alpha
      __m128i z = _mm_cvtps_epi32(y);
      z = _mm_shuffle_epi8(z, mask);           // pack 4 x i32 into 4 bytes
      _mm_store_ss((float *)&out[i], (__m128)z);
      offset = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast lane 3
    }
  }
Also consider top level: https://github.com/google/font-rs/tree/master/src

> What is the main benefit of using Rust?

Safety and performance.

From the article:

"With the SIMD speedup, font-rs is approximately 7.6x faster than FreeType in larger sizes (keep in mind that 42 pixels/em is the default for xxhdpi Android devices)."


Any reason why the SSE code wouldn't be written in Rust itself?


The SIMD crate is still unstable so not usable with the stable compiler, that's the only reason. It should take very little time to convert it to Rust.


No SIMD intrinsics yet


Or rather, they're there, but only on nightly builds of Rust, though there's talk of stabilizing them with only minor changes.


Firstly, the graph clearly shows huge gains across point sizes without SIMD optimizations enabled.

One of the big gains mentioned in the article for Rust was using iterators instead of a one-time-use vector, which is very hard in C, would currently be best done with Boost in C++, but is a core idiom in Rust.

For optimized real-world code, the innermost, hottest, loop is almost always machine-specific, and a superset of C is used if the language of implementation doesn't have intrinsics for the particular machine feature one wants to access.

I have only dabbled in Rust, but it is common to see high-level languages where, given X amount of effort, the high-level implementation is faster than a C implementation that could be written with a similar amount of effort. It's almost certainly possible to create an iterator version with a custom stack allocator in C that matches the performance of the Rust version, but you will be creating an ad-hoc, informally-specified, buggy version of two of Rust's core features [1].

C++ is a different beast entirely, as the language is on track to completely reinvent itself every 6 years, so it's likely that in a few years you could port the Rust version nearly unchanged to C++17 or C++20, and there's probably a Boost library that allows it now.

1: https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule


"iterators instead of a one-time-use vector, which is very hard in C, would currently be best done with Boost in C++, but is a core idiom in Rust."

Maybe I misunderstood something, but hard, or requiring Boost? Isn't this basically what you meant?

    class enumerating_parser {
        const uint8_t* pos;
        const uint8_t* end;
    public:
        enumerating_parser(const uint8_t* bytes, int len)
            : pos(bytes), end(bytes + len) {}
        bool next(ParserResult& out) {
            if (pos == end) return false;
            // do parsing here and move pos forward
            return true;
        }
    };


This implementation uses an underlying, materialized array, right? And that was the very thing that the author is trying to avoid.


No. There is the original byte storage for the data that is to be parsed - which is the only thing necessarily taking a chunk of memory - and the sink for the parsed data somewhere else. The state held by the enumerator itself is tiny: a pointer into the source buffer, held in pos. As the user extracts entities from the parser, pos is just moved forward. There is of course the parsing code inside next() itself, which was not included, but which knows what entities to expect and advances pos by the entity size while checking that it does not go past the end.

i.e.:

    enumerating_parser p(source, source_len);
    ParserResult r;
    while (p.next(r)) { /* forward the contents of r onward */ }


    enumerating_parser look_ma_no_memory_safety(something, -1); // pos never == end

So use unsigned len or add more code to check your inputs or just use Rust where this sort of unsafe pointer arithmetic doesn't even compile.


Yes, actual memory safety is a feature that Rust has, and C++ has lots of footguns. It still does not make using an enumerator correctly any harder.


Hard in C, which your example is not, and which FreeType is.


No, it's pretty much the same in C [0]. Sorry, my point was not laid out well. It was not meant to be obnoxious, but rather that one can implement various patterns fairly succinctly in C or C++, and thus "is difficult to do in language X" is not a very compelling argument there.

Rust has algebraic datatypes, memory safety, pattern matching, etc. - those are actual things that are, if not hard, then at least tiringly verbose to do in C and C++.

0: enumerating_parser now becomes struct enumerating_parser {}, the constructor becomes a factory function, and next() takes in the parser as a state parameter.


The specific claim I'm making is that Rust has language (in the form of making it the default in `for` loops) and library (in the form of the `Iterator` trait) support for this pattern.

Further, Rust's monomorphizing approach to generics means that you can count on efficient code. Iterators over a slice typically elide the bounds checking, reliably.

Of course you can use these patterns in other languages, but you lose something, specifically, clarity and safety. I believe that's the reason you see them used all the time in Rust and rarely in C and C++.


Yes, I agree with your original analysis completely - well thought out languages with a richer basic type zoo are generally much more pleasant to deal with than C++.

My answer was to the specific claim that an iterator using the "next" formulation would be hard to use in C++ - it's not, and it makes several pieces of code much nicer than using the default iteration scheme of raw index or pointer-hopping.

So, the intent was not to make a statement "we can do that in C++, Rust has no merit in this regard" but rather, "you should not shy away from this nice pattern if you are forced to use C++ in your daily work". This intent was not that obvious in the message's context, though.


And I misspoke in my initial comment; you could do it and be memory safe with Boost. Obviously iterators can be used in basically any language.


Total agreement, well said.


This thread (https://www.reddit.com/r/rust/comments/4vqpxx/inside_the_fas...) on the Rust subreddit includes a comment from the author regarding using the SIMD functionality in nightly Rust:

"Oooh, it's made a lot of progress recently! Funny thing, I actually started experimenting with this stuff pre-Rust 1.0. Last I looked at the SIMD crate, it was missing a whole bunch of stuff, and my style is to use fairly exotic SIMD instructions when they can help (for example, _mm_shuffle_epi8 is available in tmmintrin.h but not pmmintrin.h.

This is a technology demo, so using a not-yet stabilized feature seems in scope. Perhaps I'll try the SIMD crate for writing the ARM version."


It's the iterators that make the big difference. Stack-based iteration is a huge performance gain that Rust programs have.

You might have missed that in his post.


To add to that, it's likely a huge performance gain because doing it on the stack instead of allocating on the heap improves cache locality quite a bit.



