Fun story: When I implemented the syntax highlighting feature in Sequel Pro, I needed a way to determine the string length of UTF-8 encoded strings. I didn't find a function available on macOS that worked directly on a char*, so I googled and found a really simple UTF-8 strlen function. It was easy to understand and very fast. I think it was this: http://canonical.org/~kragen/strlen-utf8.html
I committed my syntax highlighting code, and a few days later someone had replaced the simple UTF8 strlen function with the really long vectorised version from this page: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...
But the funny thing is that the supposedly fast vectorised strlen was optimised for very long strings. The benchmarks show results for megabytes of text. But we were measuring the length of tokens, usually only a few characters long, so in most cases the new function was actually slower!
I was a very junior dev, didn't want to piss off anybody, and the strlen wasn't in the hot path anyway, so I didn't say anything. But I was a bit sad that my easy-to-read code was replaced by such a monstrosity.
What's my point? Before you go and use these functions in your code, profile your code to see if it would actually affect performance.
Somehow the article fails to mention that the speed will depend heavily on the input string: both its size and the distribution of whitespace characters.
Even assuming the question is for the limit of very long strings, the distribution makes a huge difference. Natural English has spaces on average every 5.1 characters [1], so using multi-character tests to speed up the case of runs of 8 or more characters without a space will probably slow it down, not speed it up!
> using multi-character tests to speed up the case of runs of 8 or more characters without a space will probably slow it down, not speed it up!
Yes, although there is the option of combining the vectorized comparisons with a branchless approach. I modified his code to do this, and got what seemed to be a flat .40 cycles per character independent of input, which is about twice the speed Daniel illustrated on the input with 3% spaces. I'm sure it can be made even faster (what would the limiting factor be?), but I think this shows that a multi-character approach is faster on all inputs than any single-character approach could be.
I'm working with the author, and it should be in his GitHub repository soon. But if you want to reproduce before then, you should be able to just change the line "if (mask16 == 0)" to "if (0)" in sse4_despace(), which is in include/despace.h.
The idea is that rather than testing whether any spaces were found and taking a "shortcut", we always do a lookup on mask16, shuffle the bytes according to the result to remove the spaces, and advance the output pointer by 16 minus the number of spaces found. This costs 2-3 cycles for completely spaceless input, but saves the ~15 cycle branch misprediction penalty each time an unexpected space is found.
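For readers who want the shape of that change without digging through the repository, here is a minimal sketch of the idea in C intrinsics. It is not the exact code from despace.h: the table name, its layout, and the runtime table construction are my own assumptions for illustration, and it needs SSSE3 plus POPCNT (e.g. gcc -mssse3 -mpopcnt).

    #include <stdint.h>
    #include <stddef.h>
    #include <x86intrin.h>

    /* 1 MiB table: for each 16-bit whitespace mask, a shuffle that packs the
       kept bytes to the front (0x80 entries zero the unused tail). */
    static uint8_t shuffle_for_mask[1 << 16][16];

    static void build_shuffle_table(void) {
        for (int mask = 0; mask < (1 << 16); mask++) {
            int out = 0;
            for (int i = 0; i < 16; i++)
                if (!(mask & (1 << i)))              /* bit clear => keep byte i */
                    shuffle_for_mask[mask][out++] = (uint8_t)i;
            for (; out < 16; out++)
                shuffle_for_mask[mask][out] = 0x80;
        }
    }

    static size_t despace_branchless_sse(char *bytes, size_t howmany) {
        const __m128i spaces   = _mm_set1_epi8(' ');
        const __m128i newline  = _mm_set1_epi8('\n');
        const __m128i carriage = _mm_set1_epi8('\r');
        size_t i = 0, pos = 0;
        for (; i + 15 < howmany; i += 16) {
            __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
            __m128i ws = _mm_or_si128(
                _mm_or_si128(_mm_cmpeq_epi8(x, spaces), _mm_cmpeq_epi8(x, newline)),
                _mm_cmpeq_epi8(x, carriage));
            int mask16 = _mm_movemask_epi8(ws);      /* bit set => whitespace byte */
            /* no "if (mask16 == 0)" shortcut: always shuffle and store */
            __m128i shuf = _mm_loadu_si128((const __m128i *)shuffle_for_mask[mask16]);
            _mm_storeu_si128((__m128i *)(bytes + pos), _mm_shuffle_epi8(x, shuf));
            pos += 16 - _mm_popcnt_u32(mask16);      /* advance by bytes kept */
        }
        for (; i < howmany; i++) {                   /* scalar tail */
            char c = bytes[i];
            if (c != ' ' && c != '\n' && c != '\r')
                bytes[pos++] = c;
        }
        return pos;
    }

Call build_shuffle_table() once before use. The always-shuffle structure is exactly the trade described above: a couple of extra cycles per 16-byte block on spaceless input, in exchange for never paying the misprediction penalty when a space does appear.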
What strikes me is how many sites won't accept a login name, credit card number, phone number, or other field with leading or trailing spaces. This for something where the speed of the code doesn't matter at all. I can't think of any explanation other than incredible laziness or incompetence for not stripping off spaces.
You might not notice this if you use a password manager or browser autofill, but it's a lot of sites, including companies like major airlines for example.
Never mind that -- it's just a miracle when you can enter a credit card number formatted as 4123 4567 8901 2345 rather than squished together.
I worked at a small technical training company in early 2001. We had an in house application which was used for scheduling classes. If you entered a trailing space the program would beep 3 times and display a dialog box warning you in capital letters that trailing spaces would corrupt the database. The button to close the dialog box was labeled "I understand and will obey". It probably took the developer more time to create the dialog box than to sanitize all inputs.
The very premise of the article 'how quickly can we remove whitespace' is rooted in the intellectual foundations of CS. As a culture, we are obsessive about 'performance'.
This is because back in the day it really was always important - and even today, 'under the hood', it still is. And of course there are situations in which it matters (complex algorithms, the limitations of mobile devices).
But in reality - these things are never the issue.
The 'issue' is the pragmatic application of basic algorithms to do a number of basic things elegantly, which together form the foundation of a good user experience.
Yes - the issue of 'no spaces' in card numbers etc. is a clumsy thing, and it's laziness by developers.
Also - things like 'performance' are objectively measurable, you can get cool data for it, etc.
A 'bad experience' is sometimes difficult to define.
> As a culture, we are obsessive about 'performance'.
My purely anecdotal impression is quite the opposite. Speed of delivery and convenience for the developer (not the end user) seem to be the norm.
Frameworks, scripting languages, browser-based desktop and web apps: none of these have the characteristics of being small, nimble, lightweight, or performant. They certainly make life easier for the developer. Whether users get a 'good experience' out of the end result is open to debate.
This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.
Note that the first piece of code (and possibly the rest) only works because the space, newline, and carriage return characters are in the 7-bit ASCII set, which is included verbatim in UTF-8. The extended 8-bit "ASCII" sets are not, even though they are often included when people speak of ASCII. So, for example, if the request were to remove all "Copyright Sign" symbols (U+00A9), it would not work correctly. The UTF-8 encoding of that symbol is 0xC2 0xA9, but the code only works on individual bytes, so it would remove the 0xA9 byte, leaving a dangling 0xC2 byte followed by whatever byte came next. It would also mangle other UTF-8 characters, like "Greek Capital Letter Omega" (Ω, encoded in UTF-8 as 0xCE 0xA9).
tl;dr
Only works for the 7-bit ASCII set, but not the common extended 8-bit ASCII sets
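To make that failure mode concrete, here is a tiny standalone C demonstration (mine, not from the article or the comment above) of what byte-wise removal does to the two-byte encoding of the copyright sign:

    #include <stdio.h>
    #include <stddef.h>

    int main(void) {
        const unsigned char input[] = "a\xC2\xA9" "b";   /* "a(c)b", i.e. a©b in UTF-8 */
        unsigned char out[sizeof input];
        size_t pos = 0;
        for (size_t i = 0; i + 1 < sizeof input; i++)
            if (input[i] != 0xA9)                        /* naive byte-wise removal */
                out[pos++] = input[i];
        out[pos] = '\0';
        /* out is now 'a', 0xC2, 'b': the orphaned 0xC2 lead byte makes the
           result invalid UTF-8 instead of the intended "ab". */
        printf("%zu bytes left\n", pos);                 /* prints: 3 bytes left */
        return 0;
    }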
It's sort of a semantic argument, but I've never once heard anyone refer to "ASCII" to refer to anything other than the 7-bit standard. There is no "extended 8-bit ASCII set". That's a blanket term (though in casual use, "code page" is more typical) for any of the literally dozens of 8-bit character encodings that overlap with ASCII in the bottom half of the encoding space.
Basically the blog post was correct and precise: it removes ASCII characters from UTF-8. There is no elaboration needed.
Again, though, the post was about using SIMD primitives to optimize what looks like a scalar problem.
A number of the 8-bit sets that embedded the ASCII 7-bit set were informally referred to as ASCII especially during the 1980s; this may have been encouraged by microcomputer BASIC interpreters using the ASC() function (named for ASCII) to return the character code of a character in the system code set, which was often an 8-bit set that embedded ASCII.
You would be surprised. Referring to any of the most common 8-bit encodings as "extended ASCII" or even just "ASCII" has been very common in my experience.
Yeah. Most people (well, back in the DOS days anyway) who don't know what a "7-bit encoding" is would refer to code pages as "ASCII". I'm sure I did that hundreds of times myself (I didn't know better until I was what, 20?) and I've seen it thousands of times, not exaggerating. Just look at how many "[extended] ASCII games" there are.
I know a lot of DOS people used the term "extended ASCII" to mean specific 8-bit encodings whose first 7 bits were ASCII, but what "extended ASCII" meant depended quite a lot on the context. I came from a different background and it was very clear to me that ASCII was a 7 bit encoding that was almost always stored in an 8 bit byte (I can only think of a few instances in my whole life where I saw bitstreams of ASCII). So that last bit you could do whatever you wanted with.
Having said that, I clearly remember arguing with people that encodings with the eighth bit set were not part of ASCII. As you say, there were many people who didn't understand. My guess is that the GP never interacted with those people. Especially if you were around pre-DOS I think it would be easy to do.
If you used Macs in the 80s and 90s and shared files with PC users, you heard about it all the time, because high-ascii characters would map incorrectly across the platforms.
One neat thing about UTF-8 is that all bytes that compose non-ASCII characters will have the high bit set. ASCII values range from 0-127, so no ASCII values will ever have the high bit set. Therefore, a byte that corresponds to an ASCII value is guaranteed to be the same character as the corresponding ASCII encoding.
No, the article states the method will correctly remove the mentioned whitespace characters even from UTF-8 text, because they are part of the 7-bit subset this method works on.
> ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script.
Now there are a few code points in ISO 8859-1 that are undefined, whereas they are defined in Unicode Latin-1, and Windows-1252, but they're mostly the same. The major difference is € and the TM symbol.
Exactly as it sounds. The characters encoded in Latin-1 are specific for Western Europe and thus may not appear in other ISO-8859 character sets.
> I don't think that is right. Here is the list of Latin-1 characters (as part of Unicode)...There's nothing region specific there.
Unicode is a different set of character sets (note: Unicode isn't even 1 specific character set!) yet again. Latin-1 is not Unicode. In fact the point of Unicode was to address the problems that arose with region-specific character sets like Latin-1. Hence there are Latin-1 characters included in Unicode as well as characters from other locales. What you're referencing is the Latin-1 block within the UTF-8 character set.
> Also ISO-8859-1 is Latin-1
It is. But I was referencing ISO-8859 (without the -1) which covers Latin-1 as well as a bunch of other locales.
> Now there are a few code points in ISO 8859-1 that are undefined, whereas they are defined in Unicode Latin-1, and Windows-1252, but they're mostly the same. The major difference is € and the TM symbol.
You're drifting all over the place there:
1. there's no such thing as "Unicode Latin-1". They're different character sets albeit Unicode will have a Latin-1 block (much like Latin-1 has an ASCII block).
2. With regards to your point about the € and TM differences: that is precisely the reason I suggested using ISO-8859 (without the -1) as a reference rather than a region specific character set.
My pet peeve: using table lookups that make a benchmark show a speedup when in reality your L1 cache is going to be stomped on. Not only will you be waiting on the L1 cache to populate, but you will also evict all the useful data.
I've seen this elsewhere too. For a relatively mundane task (I think it was Morton code conversion), a giant lookup table was constructed and it gave a nice 2x or 4x performance improvement.
But no application will ever be doing only string despacing or Morton codes, so the "fast" lookup table algorithm will make everything else slower by evicting good cache lines. And once something else runs and evicts the lookup tables, the next run will be slow again.
This reminded me of spray-json's JsonParser. It has a bit of Scala code to seek past whitespace:
@tailrec private def ws(): Unit =
// fast test whether cursorChar is one of " \n\r\t"
if (((1L << cursorChar) & ((cursorChar - 64) >> 31) & 0x100002600L) != 0L) { advance(); ws() }
The compiler needs to be extremely clever on @tailrec for not blowing up the L1 cache on this code.
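For anyone puzzling over the constant, here is the same trick sketched in C (my sketch, not the library's code): the 64-bit value 0x100002600 has bits 9 ('\t'), 10 ('\n'), 13 ('\r') and 32 (' ') set, so one shift and one AND test all four characters at once. The c < 64 guard plays the role of the (cursorChar - 64) >> 31 sign trick and keeps the shift well defined; compilers typically turn it into branch-free code.

    #include <stdbool.h>
    #include <stdint.h>

    static bool is_json_ws(uint32_t c) {
        /* all-ones when c < 64, zero otherwise (masks out characters whose
           shift would wrap around) */
        uint64_t in_range = (c < 64) ? ~(uint64_t)0 : 0;
        return ((UINT64_C(1) << (c & 63)) & in_range & UINT64_C(0x100002600)) != 0;
    }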
To me, this looks like a perfect example of a developer thinking they're very smart while actually writing code that's worse than its naive counterpart (a good summary of Scala, in my experience).
You don't even have to go to non-English text; UTF-8 alone has things like mathematical spaces and spaces of different em widths. http://perldoc.perl.org/perlrecharclass.html has a reasonably good list of non-ASCII spaces, though I'm not sure where to dig for locale-specific lists.
That said, while it may not be a solution for every case, it's a solution for the common case and a starting point for other cases, and thus pretty nifty and potentially useful.
This is a good instinct, but doesn't work well in practice unless the prevalence of spaces is very low. And if it's very low, you only need the largest check followed by a single fallback. The problem is that mispredicted branches (for example, 'if statements' that require different code paths without a clear pattern) are relatively expensive.
The scalar operations of reading a character, checking whether it's a space (0x20), and writing it to an output can often be done in a single cycle (the processor is 'superscalar'). A mispredicted branch costs about 15 cycles. Thus for simple tasks like this, if the average distance between spaces is 16 or less, you are likely better off with the simpler straightforward approach.
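As a point of reference, here is a minimal sketch of that "simpler straightforward approach", written so that the store is unconditional and the combined comparison typically compiles branch-free (my illustration, not code from the article):

    #include <stddef.h>

    static size_t despace_scalar(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            bytes[pos] = c;                               /* always store */
            pos += (c != ' ' && c != '\n' && c != '\r');  /* advance only if kept */
        }
        return pos;                                       /* new length */
    }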
If you use the naive algorithm at the top of the page but iterate over UTF-8 code points rather than bytes -- which is straightforward, BTW -- you will get some cleverness automatically in the compiler's implementation of the switch statement. I wrote a switch to handle the Unicode "is it space?" function, with the values from the table you linked, and compiled with "clang -Os". You know what it did?
It generated a friggin' binary search tree! Some of the leaves were a sequence of straight-line comparisons, because that's more compact, but the higher levels were all a bunch of "if (c > 0x167F) { ... }" sort of code. At one point it subtracts 8192 from something and then compares with 12, and I think this is because the x86 instruction encoding is shorter and the compiler knows that the register won't be needed again along either of the code paths from that point.
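For reference, the kind of switch being described looks something like this; the case list is an abridged version of Unicode's white-space set, so treat it as illustrative rather than authoritative:

    #include <stdbool.h>
    #include <stdint.h>

    /* Dispatch on the decoded code point; the compiler chooses the strategy
       (jump table, search tree, or straight-line comparisons). */
    static bool is_unicode_space(uint32_t cp) {
        switch (cp) {
        case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D:
        case 0x0020: case 0x0085: case 0x00A0: case 0x1680:
        case 0x2000: case 0x2001: case 0x2002: case 0x2003: case 0x2004:
        case 0x2005: case 0x2006: case 0x2007: case 0x2008: case 0x2009:
        case 0x200A: case 0x2028: case 0x2029: case 0x202F: case 0x205F:
        case 0x3000:
            return true;
        default:
            return false;
        }
    }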
Compiling switches into binary (and even sometimes trinary) trees is an old technique that's been around since at least the days of 16-bit DOS... they will usually choose between a single-level lookup, double-level lookup, or tree depending on sparsity and options used.
Subtraction and addition implicitly set the flags, so you can generate very small code with inc/dec (1-byte instructions), like this:
I've actually joked with colleagues about this, and often it doesn't work. Try using U+FEFF in JavaScript: in Chrome it's taken as a space (but the console will mark it with a red circle rather than rendering it as an invisible zero-width character), and Firefox refuses to even allow it to be entered.
You can also use a character and combining character to make a symbol that looks exactly like a semicolon, but technically isn't. Browsers treat it as a semicolon anyways. I wanted to be able to do `var ; = "foobar";`
Maybe it will sneak by other languages that allow unicode in the code? Though I don't know of many.
I've had problems with code having non-ASCII spaces appended, and the terminal rendering them as zero-width, so I had to use hexdump to find the problem.
I wonder how much of a difference using a separate destination buffer would make. Apart from some memory/cache/invalidation/magic thing that I'm not certain might occur, it would allow one to use the extended instructions used for scanning.
Stripping out LF isn't necessary because AWK does it on every record automatically. For even more speed, the code can be translated into ANSI C with awka[1] and compiled using an optimizing C compiler.
I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
I wonder if there's a difference when measured in wall clock time. It's still somewhat beneficial to have the CPU work efficiently to give an opportunity for hyperthreading to take place when waiting for memory.
If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetching instructions. x86 has pretty bad prefetching instructions and pretty good speculative fetching, so don't expect massive speedups but on ARM or Aarch64, you have a finer grained control over cache prefetching (L1 and L2 separately) and you could see much bigger differences.
As for benchmarking this kind of problem: you obviously want to measure real-world performance, so you need wall-clock time as well as the time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
The figure you should be staring at is the total throughput of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).
edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte-sized buffer) and there is indeed no significant difference; here are my results:
Thanks for looking into this! I know that Daniel appreciates feedback, but rarely reads HN, so an email to him or blog comment with your results might be helpful.
> Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
On modern Intel processors, the "time stamp counter" is monotonically increasing at a constant rate, so it does take memory latency into account. On many Linux systems (including the one used here) clock_gettime() uses the same underlying clock source, so there should be no difference in accuracy for long measurements. The CPUID-RDTSC/RDTSCP-CPUID pattern used here has the advantage of a somewhat lower and significantly more predictable overhead, which helps when measuring shorter events.
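For the curious, that measurement pattern has roughly this shape (a sketch, not the exact macro from the benchmark repository): CPUID acts as a serializing fence so the work being timed cannot leak across the counter reads.

    #include <stdint.h>
    #include <cpuid.h>       /* __get_cpuid */
    #include <x86intrin.h>   /* __rdtsc, __rdtscp */

    static inline uint64_t tsc_begin(void) {
        unsigned a, b, c, d;
        __get_cpuid(0, &a, &b, &c, &d);   /* serialize before reading the counter */
        return __rdtsc();
    }

    static inline uint64_t tsc_end(void) {
        unsigned aux, a, b, c, d;
        uint64_t t = __rdtscp(&aux);      /* waits for prior instructions to finish */
        __get_cpuid(0, &a, &b, &c, &d);   /* keep later work from moving above */
        return t;
    }

    /* usage: uint64_t t0 = tsc_begin(); work(); uint64_t ticks = tsc_end() - t0; */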
> I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
If I remember the numbers right, the Skylake processor this is running on can read about 64B per cycle if the source is L1, about 24B per cycle if the source is L3, and about 6B per cycle from main memory. Using the existing RDTSC framework, I get equal speeds at L3 size, and still get sub .4B/cycle coming from main memory.
> I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference
I agree that testing on larger buffers would be informative (as would testing different ratios of whitespace) but I don't think your approach is capturing what you think it is. The macro being used for time measurement uses a different random input for each iteration (look at 'pre'), and the unoptimized time of initializing this dominates the clock time. So while I think your test is worthwhile, I think you need a better way to perform the measurement. I think you'll see that RDTSC maps exactly to wall time in this case, but surprises are definitely possible.
> I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
Yes, although as with wall time, one needs to be sure to measure only on the section of code that one is optimizing. Perf makes this difficult, so something like "likwid" with an API that allows profiling fragments of code would be required.
I haven't done that yet for this, but at the faster speeds (sse4_despace_branchless) I don't think there are any cache misses or branch mispredictions in the code of interest.
> surprisingly, Clang is about 10% slower than GCC in my experiments
My surprise was the opposite. Clang was 25% faster than GCC and ICC on the L1- and L3-sized inputs with the vectorized branchless version (.25 cycles/byte versus .33 cycles/byte, although the code I'm running is not quite what's on Github). Most of this benefit seems to be because clang unrolls 2x to reduce loop overhead.
I dropped a comment on his blog and a link to this discussion...
And how sloppy of me! I didn't notice the `pre;` in the code consuming most of the time (to be fair, it's just 4 characters and I didn't put much time into it). When I moved that outside of the loop, before the timer, I got results that show improvement with the optimized version.
And indeed it looks like rdtsc is giving similar figures to clock_gettime now. I falsely presumed that it counts retired instructions.
I'm still surprised to see a speedup, and how badly the original version is performing.
My guess is that the conditional stores are poison to the CPU pipelines. The 64 bit version gets most of the performance out of it already, I presume it's because of the more efficient memory usage pattern.
I can't edit my earlier post any more, I'd correct it if I could.
A variant of this problem yields further interesting possibilities: suppose you're trying to remove control characters like CR and LF, but replacing them with a space would be acceptable. That way you can work on the string in place, without needing to copy memory or allocate anything.
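A minimal sketch of that variant (mine, not from the article): overwrite CR and LF with a plain space so the buffer length, and every other byte's offset, stays unchanged.

    #include <stddef.h>

    static void blank_crlf(char *bytes, size_t howmany) {
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            /* replace CR and LF with a space, keep everything else */
            bytes[i] = (c == '\r' || c == '\n') ? ' ' : c;
        }
    }

Because the output position never depends on earlier bytes, this version is also trivially vectorizable.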
Someone file a compiler bug? A 14x difference between the readable code and the optimized code is a lot. The first version is extremely straightforward; you shouldn't have to deal with that SIMD mess manually.
Failure to optimize is never a bug, by definition. Automatic vectorization of sequential code is a very difficult problem in general. Especially when your code still depends on testing single bytes in the input and vectorizing that is only possible using some very clever bit twiddling.
Can you come up with a vectorization optimizer pass that could do what the OP did in the article?
This isn't a compiler bug or a shortcoming, it's a non-trivial optimization. Especially given the aliasing between the input and output (ie. it's an in-place algorithm) this is going to be a very difficult optimization.
You have to be careful with lookup tables. They can be impressive each at a time when microbenchmarking a single loop, but in the aggregate hurt performance by crowding other useful data out of limited cache space.
I don't think this particular problem would be well suited for a GPU. GPUs are not vector processors, they're matrix processors, and the problem that OP is describing is a vector problem. Basically this problem would be limited by the memory bandwidth of a GPU, which isn't a good thing: https://crd.lbl.gov/departments/computer-science/PAR/researc...
The 'why' has multiple answers - one is human curiosity about the possibility of a better result (which is the sole cause of all the technology we see around us).
Optimizing by hand is rarely the fastest thing to do nowadays. I wonder how the original naive approach fares with all optimizations turned on for gcc and with realistic input?
This is in no way embarrassingly parallel at all! In fact, it's entirely non-trivial to parallelise: writing the output is an inherently serial operation.
Could you not do something like chunk the input and operate on the pieces in parallel? I'd think you could use something like a rope (à la text editors) to support the concurrent writes.
(I may or may not have forgotten the differences between concurrency and parallelism.)
Yeah, you could split the input into chunks, but then you're left with the problem of recombining the outputs from each portion into one string - that operation must be serial and would likely outweigh the benefits of the initial parallelism in most cases.
Even if the regex library compiled the expression down to the very simplest state machine possible, it would still be at most as efficient as the first handwritten code in the article, handling one byte at a time. I'm fairly sure no regex engine in existence is able to detect "simple enough" regexes and optimize them into code that handles 64 or 128 bit blocks at a time.
They exist. In fact, if you hand Rust's regex engine the pattern ` |\r|\n`, then it will compile it down to a very simple alternation of three literal bytes. It will never touch the actual regex engine at all. Once the regex engine knows it's a simple alternation of three bytes, it will actually use a variant of memchr called memchr3[1], which looks almost exactly like the OP's "portable" approach.
So it's not quite the fastest possible because it doesn't use SIMD, but this is mostly because we don't have good explicit SIMD support on stable Rust yet. Once we do, this immediately becomes a strong candidate to optimize to something like the OP's fastest version (and also probably generalizing to AVX2 instructions as well).
N.B. My sibling comment linked to my blog post on ripgrep, which uses Rust's regex engine. ;-)
You might check out the literal optimizations that ripgrep does. IIRC, the underlying regex lib is optimizing a few classes of "simple" regexes with SIMD to achieve better perf than the first handwritten code in that article.
Simple regexes can be reasonably quick, but you're almost certainly never going to do better than linear time, and if you get fancy, it's easy to go exponential with a large exponent.
If you want to guarantee that this will never happen, RE2 and similar regular expression engines are guaranteed to run in O(n) time, where n is the length of the target string.
Regular Expressions are efficient in that one line of code can save you writing hundreds of lines. But they're normally slower (even pre-compiled) than thoughtful hand written code simply due to the overhead.
Generally the simpler the objective the worse Regular Expressions are. They're better for complex operations. Plus people write regular expressions REALLY poorly, doubly so for UNICODE.
Ideally you should use the standard library for this. For example C# has Char.IsWhiteSpace() which supports tons of UNICODE whitespace and can be updated with whitespace which doesn't even exist today.
This isn't true. Regular expressions can be fast even when supporting Unicode by building finite state machines that recognize UTF-8 directly. This particular benchmark explains a bit: http://blog.burntsushi.net/ripgrep/#linux-unicode-word
What isn't true? I never said that regular expressions cannot support UNICODE fast. I said that regular expressions are slower than code due to the overhead in all scenarios.
I am responding to your claim. I'm saying that not all regex implementations are created equal. Some can be just as fast as what you might write by hand.
Regular expressions can be Unicode aware, right? You should be able to use a shortcut specifier that's equivalent of calling something like IsWhitespace.
You wouldn't be using a finite state recognizer (a regular expression) but a finite state transducer (a regular expression that emits characters instead of just recognizing them).
Regular expressions were created to do this, but your general string replace function with your general regular expression library might not be the fastest way, which is what this post is about.
Not claiming to be the fastest, but here are two Python solutions suggested by Dave Beazley [0], plus one I extended using compiled regular expressions:
# string replacement
s = ' hello world \n'
s.replace(' ', '')
# 'helloworld\n' (note: str.replace only removes the spaces, not the newline)
# regular expression
import re
re.sub(r'\s+', '', s)
# 'helloworld'
# compiled regular expression
pat = re.compile(r'\s+')
pat.sub('', s)
# 'helloworld'