Fun story: When I implemented the syntax highlighting feature in Sequel Pro, I needed a way to determine the string length of UTF-8 encoded strings. I didn't find a function available on macOS that worked directly on a char*, so I googled and found a really simple UTF-8 strlen function. It was easy to understand and very fast. I think it was this: http://canonical.org/~kragen/strlen-utf8.html
I committed my syntax highlighting code, and a few days later someone had replaced the simple UTF8 strlen function with the really long vectorised version from this page: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...
But the funny thing is that the supposedly fast vectorised strlen was optimised for very long strings. The benchmarks show results for megabytes of text. But we were measuring the length of tokens, usually only a few characters long, so in most cases the new function was actually slower!
I was a very junior dev, didn't want to piss off anybody, and the strlen wasn't in the hot path anyway, so I didn't say anything. But I was a bit sad that my easy-to-read code was replaced by such a monstrosity.
What's my point? Before you go and use these functions in your code, profile your code to see if it would actually affect performance.
Somehow the article fails to mention that the speed will depend heavily on the input string: both its size and the distribution of whitespace characters.
Even assuming the question is for the limit of very long strings, the distribution makes a huge difference. Natural English has spaces on average every 5.1 characters [1], so using multi-character tests to speed up the case of runs of 8 or more characters without a space will probably slow it down, not speed it up!
> using multi-character tests to speed up the case of runs of 8 or more characters without a space will probably slow it down, not speed it up!
Yes, although there is the option of combining the vectorized comparisons with a branchless approach. I modified his code to do this, and got what seemed to be a flat .40 cycles per character independent of input, which is about twice the speed Daniel illustrated on the input with 3% spaces. I'm sure it can be made even faster (what would the limiting factor be?), but I think this shows that a multi-character approach is faster on all inputs than any single-character approach could be.
I'm working with the author, and it should be in his GitHub repository soon. But if you want to reproduce before then, you should be able to just change the line "if (mask16 == 0)" to "if (0)" in sse4_despace(), which is in include/despace.h.
The idea is that rather than testing whether any spaces were found and taking a "shortcut", we always do a lookup on mask16, shuffle the bytes according to the result to remove the spaces, and advance the output pointer by 16 minus the number of spaces found. This costs 2-3 cycles for completely spaceless input, but saves the ~15 cycle branch misprediction penalty each time an unexpected space is found.
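For readers who want the shape of that change without digging through the repository, here is a minimal sketch of the idea in C intrinsics. It is not the exact code from despace.h: the table name, its layout, and the runtime table construction are my own assumptions for illustration, and it needs SSSE3 plus POPCNT (e.g. gcc -mssse3 -mpopcnt).

    #include <stdint.h>
    #include <stddef.h>
    #include <x86intrin.h>

    /* 1 MiB table: for each 16-bit whitespace mask, a shuffle that packs the
       kept bytes to the front (0x80 entries zero the unused tail). */
    static uint8_t shuffle_for_mask[1 << 16][16];

    static void build_shuffle_table(void) {
        for (int mask = 0; mask < (1 << 16); mask++) {
            int out = 0;
            for (int i = 0; i < 16; i++)
                if (!(mask & (1 << i)))              /* bit clear => keep byte i */
                    shuffle_for_mask[mask][out++] = (uint8_t)i;
            for (; out < 16; out++)
                shuffle_for_mask[mask][out] = 0x80;
        }
    }

    static size_t despace_branchless_sse(char *bytes, size_t howmany) {
        const __m128i spaces   = _mm_set1_epi8(' ');
        const __m128i newline  = _mm_set1_epi8('\n');
        const __m128i carriage = _mm_set1_epi8('\r');
        size_t i = 0, pos = 0;
        for (; i + 15 < howmany; i += 16) {
            __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
            __m128i ws = _mm_or_si128(
                _mm_or_si128(_mm_cmpeq_epi8(x, spaces), _mm_cmpeq_epi8(x, newline)),
                _mm_cmpeq_epi8(x, carriage));
            int mask16 = _mm_movemask_epi8(ws);      /* bit set => whitespace byte */
            /* no "if (mask16 == 0)" shortcut: always shuffle and store */
            __m128i shuf = _mm_loadu_si128((const __m128i *)shuffle_for_mask[mask16]);
            _mm_storeu_si128((__m128i *)(bytes + pos), _mm_shuffle_epi8(x, shuf));
            pos += 16 - _mm_popcnt_u32(mask16);      /* advance by bytes kept */
        }
        for (; i < howmany; i++) {                   /* scalar tail */
            char c = bytes[i];
            if (c != ' ' && c != '\n' && c != '\r')
                bytes[pos++] = c;
        }
        return pos;
    }

Call build_shuffle_table() once before use. The always-shuffle structure is exactly the trade described above: a couple of extra cycles per 16-byte block on spaceless input, in exchange for never paying the misprediction penalty when a space does appear.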
What strikes me is how many sites won't accept a login name, credit card number, phone number, or other field with leading or trailing spaces. This for something where the speed of the code doesn't matter at all. I can't think of any explanation other than incredible laziness or incompetence for not stripping off spaces.
You might not notice this if you use a password manager or browser autofill, but it's a lot of sites, including companies like major airlines for example.
Never mind that -- it's just a miracle when you can enter a credit card number formatted as 4123 4567 8901 2345 rather than squished together.
I worked at a small technical training company in early 2001. We had an in house application which was used for scheduling classes. If you entered a trailing space the program would beep 3 times and display a dialog box warning you in capital letters that trailing spaces would corrupt the database. The button to close the dialog box was labeled "I understand and will obey". It probably took the developer more time to create the dialog box than to sanitize all inputs.
The very premise of the article 'how quickly can we remove whitespace' is rooted in the intellectual foundations of CS. As a culture, we are obsessive about 'performance'.
This is because back in the day it really was always important - and even today, 'under the hood', it still is. And of course there are situations in which it matters (complex algorithms, the limitations of mobile devices).
But in reality - these things are never the issue.
The 'issue' is the pragmatic application of basic algorithms to do a number of basic things elegantly, which together form the foundation of a good user experience.
Yes - the issue of 'no spaces' in card numbers etc. is a clumsy thing, and it's laziness by developers.
Also - things like 'performance' are objectively measurable, you can get cool data for it, etc.
A 'bad experience' is sometimes difficult to define.
> As a culture, we are obsessive about 'performance'.
My purely anecdotal impression is quite the opposite. Speed of delivery and convenience for the developer (not the end user) seem to be the norm.
Frameworks, scripting languages, browser-based desktop and web apps: none of these have the characteristics of being small, nimble, lightweight, or performant. They certainly make life easier for the developer. Whether users get a 'good experience' out of the end result is open to debate.
This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.
Note that the first piece of code (and possibly the rest) only works because the space, newline, and carriage return characters are in the 7-bit ASCII set, which is included verbatim in UTF-8. The extended 8-bit "ASCII" sets are not, even though they are often included when people speak of ASCII. So, for example, if the request were to remove all "Copyright Sign" symbols (U+00A9), it would not work correctly. The UTF-8 encoding of that symbol is 0xC2 0xA9, but the code only works on individual bytes, so it would remove the 0xA9 byte, leaving a dangling 0xC2 byte followed by whatever byte came next. It would also mangle other UTF-8 characters, like "Greek Capital Letter Omega" (Ω, encoded in UTF-8 as 0xCE 0xA9).
tl;dr
Only works for the 7-bit ASCII set, but not the common extended 8-bit ASCII sets
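To make that failure mode concrete, here is a tiny standalone C demonstration (mine, not from the article or the comment above) of what byte-wise removal does to the two-byte encoding of the copyright sign:

    #include <stdio.h>
    #include <stddef.h>

    int main(void) {
        const unsigned char input[] = "a\xC2\xA9" "b";   /* "a(c)b", i.e. a©b in UTF-8 */
        unsigned char out[sizeof input];
        size_t pos = 0;
        for (size_t i = 0; i + 1 < sizeof input; i++)
            if (input[i] != 0xA9)                        /* naive byte-wise removal */
                out[pos++] = input[i];
        out[pos] = '\0';
        /* out is now 'a', 0xC2, 'b': the orphaned 0xC2 lead byte makes the
           result invalid UTF-8 instead of the intended "ab". */
        printf("%zu bytes left\n", pos);                 /* prints: 3 bytes left */
        return 0;
    }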
It's sort of a semantic argument, but I've never once heard anyone refer to "ASCII" to refer to anything other than the 7-bit standard. There is no "extended 8-bit ASCII set". That's a blanket term (though in casual use, "code page" is more typical) for any of the literally dozens of 8-bit character encodings that overlap with ASCII in the bottom half of the encoding space.
Basically the blog post was correct and precise: it removes ASCII characters from UTF-8. There is no elaboration needed.
Again, though, the post was about using SIMD primitives to optimize what looks like a scalar problem.
A number of the 8-bit sets that embedded the ASCII 7-bit set were informally referred to as ASCII especially during the 1980s; this may have been encouraged by microcomputer BASIC interpreters using the ASC() function (named for ASCII) to return the character code of a character in the system code set, which was often an 8-bit set that embedded ASCII.
You would be surprised. Referring to any of the most common 8-bit encodings as "extended ASCII" or even just "ASCII" has been very common in my experience.
Yeah. Most people (well, back in the DOS days anyway) who don't know what a "7-bit encoding" is would refer to code pages as "ASCII". I'm sure I did that hundreds of times myself (I didn't know better until I was what, 20?) and I've seen it thousands of times, not exaggerating. Just look at how many "[extended] ASCII games" there are.
I know a lot of DOS people used the term "extended ASCII" to mean specific 8-bit encodings whose first 7 bits were ASCII, but what "extended ASCII" meant depended quite a lot on the context. I came from a different background and it was very clear to me that ASCII was a 7 bit encoding that was almost always stored in an 8 bit byte (I can only think of a few instances in my whole life where I saw bitstreams of ASCII). So that last bit you could do whatever you wanted with.
Having said that, I clearly remember arguing with people that encodings with the eighth bit set were not part of ASCII. As you say, there were many people who didn't understand. My guess is that the GP never interacted with those people. Especially if you were around pre-DOS I think it would be easy to do.
If you used Macs in the 80s and 90s and shared files with PC users, you heard about it all the time, because high-ascii characters would map incorrectly across the platforms.
One neat thing about UTF-8 is that all bytes that compose non-ASCII characters will have the high bit set. ASCII values range from 0-127, so no ASCII values will ever have the high bit set. Therefore, a byte that corresponds to an ASCII value is guaranteed to be the same character as the corresponding ASCII encoding.
No, the article states the method will correctly remove the mentioned whitespace characters even from UTF-8 text, because they are part of the 7-bit subset this method works on.
> ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script.
Now there are a few code points in ISO 8859-1 that are undefined, whereas they are defined in Unicode Latin-1, and Windows-1252, but they're mostly the same. The major difference is € and the TM symbol.
Exactly as it sounds. The characters encoded in Latin-1 are specific for Western Europe and thus may not appear in other ISO-8859 character sets.
> I don't think that is right. Here is the list of Latin-1 characters (as part of Unicode)...There's nothing region specific there.
Unicode is a different set of character sets (note: Unicode isn't even 1 specific character set!) yet again. Latin-1 is not Unicode. In fact the point of Unicode was to address the problems that arose with region-specific character sets like Latin-1. Hence there are Latin-1 characters included in Unicode as well as characters from other locales. What you're referencing is the Latin-1 block within the UTF-8 character set.
> Also ISO-8859-1 is Latin-1
It is. But I was referencing ISO-8859 (without the -1) which covers Latin-1 as well as a bunch of other locales.
> Now there are a few code points in ISO 8859-1 that are undefined, whereas they are defined in Unicode Latin-1, and Windows-1252, but they're mostly the same. The major difference is € and the TM symbol.
You're drifting all over the place there:
1. there's no such thing as "Unicode Latin-1". They're different character sets albeit Unicode will have a Latin-1 block (much like Latin-1 has an ASCII block).
2. With regards to your point about the € and TM differences: that is precisely the reason I suggested using ISO-8859 (without the -1) as a reference rather than a region specific character set.
My pet peeve: using table lookups that make a benchmark show a speedup when in reality your L1 cache is going to be stomped on. Not only will you be waiting on the L1 cache to populate, but you will also evict all the useful data.
I've seen this elsewhere too. For a relatively mundane task (I think it was Morton code conversion), a giant lookup table was constructed and it gave a nice 2x or 4x performance improvement.
But no application will ever be doing only string despacing or Morton codes, so the "fast" lookup table algorithm will make everything else slower by evicting good cache lines. And once something else runs and evicts the lookup tables, the next run will be slow again.
This reminded me of spray-json's JsonParser. It has a bit of Scala code to seek past whitespace:
@tailrec private def ws(): Unit =
// fast test whether cursorChar is one of " \n\r\t"
if (((1L << cursorChar) & ((cursorChar - 64) >> 31) & 0x100002600L) != 0L) { advance(); ws() }
The compiler needs to be extremely clever on @tailrec for not blowing up the L1 cache on this code.
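For anyone puzzling over the constant, here is the same trick sketched in C (my sketch, not the library's code): the 64-bit value 0x100002600 has bits 9 ('\t'), 10 ('\n'), 13 ('\r') and 32 (' ') set, so one shift and one AND test all four characters at once. The c < 64 guard plays the role of the (cursorChar - 64) >> 31 sign trick and keeps the shift well defined; compilers typically turn it into branch-free code.

    #include <stdbool.h>
    #include <stdint.h>

    static bool is_json_ws(uint32_t c) {
        /* all-ones when c < 64, zero otherwise (masks out characters whose
           shift would wrap around) */
        uint64_t in_range = (c < 64) ? ~(uint64_t)0 : 0;
        return ((UINT64_C(1) << (c & 63)) & in_range & UINT64_C(0x100002600)) != 0;
    }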
To me, this looks like a perfect example of a developer thinking they're very smart while actually writing code that's worse than its naive counterpart (a good summary of Scala, in my experience).
You don't even have to go to non-English text; UTF-8 alone has things like mathematical spaces and spaces of different em widths. http://perldoc.perl.org/perlrecharclass.html has a reasonably good list of non-ASCII spaces, though I'm not sure where to dig for locale-specific lists.
That said, while it may not be a solution for every case, it's a solution for the common case and a starting point for other cases, and thus pretty nifty and potentially useful.
This is a good instinct, but doesn't work well in practice unless the prevalence of spaces is very low. And if it's very low, you only need the largest check followed by a single fallback. The problem is that mispredicted branches (for example, 'if statements' that require different code paths without a clear pattern) are relatively expensive.
The scalar operations of reading a character, checking whether it's a space (0x20), and writing it to an output can often be done in a single cycle (the processor is 'superscalar'). A mispredicted branch costs about 15 cycles. Thus for simple tasks like this, if the average distance between spaces is 16 or less, you are likely better off with the simpler straightforward approach.
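As a point of reference, here is a minimal sketch of that "simpler straightforward approach", written so that the store is unconditional and the combined comparison typically compiles branch-free (my illustration, not code from the article):

    #include <stddef.h>

    static size_t despace_scalar(char *bytes, size_t howmany) {
        size_t pos = 0;
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            bytes[pos] = c;                               /* always store */
            pos += (c != ' ' && c != '\n' && c != '\r');  /* advance only if kept */
        }
        return pos;                                       /* new length */
    }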
If you use the naive algorithm at the top of the page but iterate over UTF-8 code points rather than bytes -- which is straightforward, BTW -- you will get some cleverness automatically in the compiler's implementation of the switch statement. I wrote a switch to handle the Unicode "is it space?" function, with the values from the table you linked, and compiled with "clang -Os". You know what it did?
It generated a friggin' binary search tree! Some of the leaves were a sequence of straight-line comparisons, because that's more compact, but the higher levels were all a bunch of "if (c > 0x167F) { ... }" sort of code. At one point it subtracts 8192 from something and then compares with 12, and I think this is because the x86 instruction encoding is shorter and the compiler knows that the register won't be needed again along either of the code paths from that point.
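For reference, the kind of switch being described looks something like this; the case list is an abridged version of Unicode's white-space set, so treat it as illustrative rather than authoritative:

    #include <stdbool.h>
    #include <stdint.h>

    /* Dispatch on the decoded code point; the compiler chooses the strategy
       (jump table, search tree, or straight-line comparisons). */
    static bool is_unicode_space(uint32_t cp) {
        switch (cp) {
        case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D:
        case 0x0020: case 0x0085: case 0x00A0: case 0x1680:
        case 0x2000: case 0x2001: case 0x2002: case 0x2003: case 0x2004:
        case 0x2005: case 0x2006: case 0x2007: case 0x2008: case 0x2009:
        case 0x200A: case 0x2028: case 0x2029: case 0x202F: case 0x205F:
        case 0x3000:
            return true;
        default:
            return false;
        }
    }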
Compiling switches into binary (and even sometimes trinary) trees is an old technique that's been around since at least the days of 16-bit DOS... they will usually choose between a single-level lookup, double-level lookup, or tree depending on sparsity and options used.
Subtraction and addition implicitly set the flags, so you can generate very small code with inc/dec (1-byte instructions), like this:
I've actually joked with colleagues about this, and often it doesn't work. Try using U+FEFF in JavaScript: in Chrome it's taken as a space (but the console will mark it with a red circle rather than rendering it as an invisible zero-width character), and Firefox refuses to even allow it to be entered.
You can also use a character and combining character to make a symbol that looks exactly like a semicolon, but technically isn't. Browsers treat it as a semicolon anyways. I wanted to be able to do `var ; = "foobar";`
Maybe it will sneak by other languages that allow unicode in the code? Though I don't know of many.
I've had problems with code having non-ASCII spaces appended, and the terminal rendering them as zero-width, so I had to use hexdump to find the problem.
I wonder how much of a difference using a separate destination buffer would make. Apart from some memory/cache/invalidation/magic thing that I'm not certain might occur, it would allow one to use the extended instructions used for scanning.
Stripping out LF isn't necessary because AWK does it on every record automatically. For even more speed, the code can be translated into ANSI C with awka[1] and compiled using an optimizing C compiler.
I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
I wonder if there's a difference when measured in wall clock time. It's still somewhat beneficial to have the CPU work efficiently to give an opportunity for hyperthreading to take place when waiting for memory.
If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetching instructions. x86 has pretty bad prefetching instructions and pretty good speculative fetching, so don't expect massive speedups but on ARM or Aarch64, you have a finer grained control over cache prefetching (L1 and L2 separately) and you could see much bigger differences.
As for benchmarking this kind of problem: you obviously want to measure real-world performance, so you need wall-clock time as well as the time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
The figure you should be staring at is the total throughput of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).
edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte-sized buffer) and there is indeed no significant difference; here are my results:
Thanks for looking into this! I know that Daniel appreciates feedback, but rarely reads HN, so an email to him or blog comment with your results might be helpful.
> Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?
On modern Intel processors, the "time stamp counter" is monotonically increasing at a constant rate, so it does take memory latency into account. On many Linux systems (including the one used here) clock_gettime() uses the same underlying clock source, so there should be no difference in accuracy for long measurements. The CPUID-RDTSC/RDTSCP-CPUID pattern used here has the advantage of a somewhat lower and significantly more predictable overhead, which helps when measuring shorter events.
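For the curious, that measurement pattern has roughly this shape (a sketch, not the exact macro from the benchmark repository): CPUID acts as a serializing fence so the work being timed cannot leak across the counter reads.

    #include <stdint.h>
    #include <cpuid.h>       /* __get_cpuid */
    #include <x86intrin.h>   /* __rdtsc, __rdtscp */

    static inline uint64_t tsc_begin(void) {
        unsigned a, b, c, d;
        __get_cpuid(0, &a, &b, &c, &d);   /* serialize before reading the counter */
        return __rdtsc();
    }

    static inline uint64_t tsc_end(void) {
        unsigned aux, a, b, c, d;
        uint64_t t = __rdtscp(&aux);      /* waits for prior instructions to finish */
        __get_cpuid(0, &a, &b, &c, &d);   /* keep later work from moving above */
        return t;
    }

    /* usage: uint64_t t0 = tsc_begin(); work(); uint64_t ticks = tsc_end() - t0; */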
> I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.
If I remember the numbers right, the Skylake processor this is running on can read about 64B per cycle if the source is L1, about 24B per cycle if the source is L3, and about 6B per cycle from main memory. Using the existing RDTSC framework, I get equal speeds at L3 size, and still get sub .4B/cycle coming from main memory.
> I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference
I agree that testing on larger buffers would be informative (as would testing different ratios of whitespace) but I don't think your approach is capturing what you think it is. The macro being used for time measurement uses a different random input for each iteration (look at 'pre'), and the unoptimized time of initializing this dominates the clock time. So while I think your test is worthwhile, I think you need a better way to perform the measurement. I think you'll see that RDTSC maps exactly to wall time in this case, but surprises are definitely possible.
> I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.
Yes, although as with wall time, one needs to be sure to measure only on the section of code that one is optimizing. Perf makes this difficult, so something like "likwid" with an API that allows profiling fragments of code would be required.
I haven't done that yet for this, but at the faster speeds (sse4_despace_branchless) I don't think there are any cache misses or branch mispredictions in the code of interest.
> surprisingly, Clang is about 10% slower than GCC in my experiments
My surprise was the opposite. Clang was 25% faster than GCC and ICC on the L1- and L3-sized inputs with the vectorized branchless version (.25 cycles/byte versus .33 cycles/byte, although the code I'm running is not quite what's on Github). Most of this benefit seems to be because clang unrolls 2x to reduce loop overhead.
I dropped a comment on his blog and a link to this discussion...
And how sloppy of me! I didn't notice the `pre;` in the code consuming most of the time (to be fair, it's just 4 characters and I didn't put much time into it). When I moved that outside of the loop, before the timer, I got results that show improvement with the optimized version.
And indeed it looks like rdtsc is giving similar figures to clock_gettime now. I falsely presumed that it counts retired instructions.
I'm still surprised to see a speedup, and how badly the original version is performing.
My guess is that the conditional stores are poison to the CPU pipelines. The 64 bit version gets most of the performance out of it already, I presume it's because of the more efficient memory usage pattern.
I can't edit my earlier post any more, I'd correct it if I could.
A variant of this problem yields further interesting possibilities: suppose you're trying to remove control characters like CR and LF, but replacing them with a space would be acceptable. That way you can work on the string in place, without needing to copy memory or allocate anything.
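A minimal sketch of that variant (mine, not from the article): overwrite CR and LF with a plain space so the buffer length, and every other byte's offset, stays unchanged.

    #include <stddef.h>

    static void blank_crlf(char *bytes, size_t howmany) {
        for (size_t i = 0; i < howmany; i++) {
            char c = bytes[i];
            /* replace CR and LF with a space, keep everything else */
            bytes[i] = (c == '\r' || c == '\n') ? ' ' : c;
        }
    }

Because the output position never depends on earlier bytes, this version is also trivially vectorizable.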
Someone file a compiler bug? A 14x difference between the readable code and the optimized code is a lot. The first version is extremely straightforward; you shouldn't have to deal with that SIMD mess manually.
Failure to optimize is never a bug, by definition. Automatic vectorization of sequential code is a very difficult problem in general. Especially when your code still depends on testing single bytes in the input and vectorizing that is only possible using some very clever bit twiddling.
Can you come up with a vectorization optimizer pass that could do what the OP did in the article?
This isn't a compiler bug or a shortcoming, it's a non-trivial optimization. Especially given the aliasing between the input and output (ie. it's an in-place algorithm) this is going to be a very difficult optimization.
You have to be careful with lookup tables. They can be impressive each at a time when microbenchmarking a single loop, but in the aggregate hurt performance by crowding other useful data out of limited cache space.
I don't think this particular problem would be well suited for a GPU. GPUs are not vector processors, they're matrix processors, and the problem that OP is describing is a vector problem. Basically this problem would be limited by the memory bandwidth of a GPU, which isn't a good thing: https://crd.lbl.gov/departments/computer-science/PAR/researc...
The 'why' has multiple answers - one is human curiosity about the possibility of a better result (which is the sole cause of all the technology we see around us).
Optimizing by hand is rarely the fastest thing to do nowadays. I wonder how the original naive approach fares with all optimizations turned on for gcc and with realistic input?
This is in no way embarrassingly parallel at all! In fact, it's entirely non-trivial to parallelise: writing the output is an inherently serial operation.
Could you not do something like chunk the input and operate on the pieces in parallel? I'd think you could use something like a rope (à la text editors) to support the concurrent writes.
(I may or may not have forgotten the differences between concurrency and parallelism.)
Yeah, you could split the input into chunks, but then you're left with the problem of recombining the outputs from each portion into one string - that operation must be serial and would likely outweigh the benefits of the initial parallelism in most cases.
Even if the regex library compiled the expression down to the very simplest state machine possible, it would still be at most as efficient as the first handwritten code in the article, handling one byte at a time. I'm fairly sure no regex engine in existence is able to detect "simple enough" regexes and optimize them into code that handles 64 or 128 bit blocks at a time.
They exist. In fact, if you hand Rust's regex engine the pattern ` |\r|\n`, then it will compile it down to a very simple alternation of three literal bytes. It will never touch the actual regex engine at all. Once the regex engine knows it's a simple alternation of three bytes, it will actually use a variant of memchr called memchr3[1], which looks almost exactly like the OP's "portable" approach.
So it's not quite the fastest possible because it doesn't use SIMD, but this is mostly because we don't have good explicit SIMD support on stable Rust yet. Once we do, this immediately becomes a strong candidate to optimize to something like the OP's fastest version (and also probably generalizing to AVX2 instructions as well).
N.B. My sibling comment linked to my blog post on ripgrep, which uses Rust's regex engine. ;-)
You might check out the literal optimizations that ripgrep does. IIRC, the underlying regex lib is optimizing a few classes of "simple" regexes with SIMD to achieve better perf than the first handwritten code in that article.
Simple regexes can be reasonably quick, but you're almost certainly never going to do better than linear time, and if you get fancy, it's easy to go exponential with a large exponent.
If you want to guarantee that this will never happen, RE2 and similar regular expression engines are guaranteed to run in O(n) time, where n is the length of the target string.
Regular Expressions are efficient in that one line of code can save you writing hundreds of lines. But they're normally slower (even pre-compiled) than thoughtful hand written code simply due to the overhead.
Generally the simpler the objective the worse Regular Expressions are. They're better for complex operations. Plus people write regular expressions REALLY poorly, doubly so for UNICODE.
Ideally you should use the standard library for this. For example C# has Char.IsWhiteSpace() which supports tons of UNICODE whitespace and can be updated with whitespace which doesn't even exist today.
This isn't true. Regular expressions can be fast even when supporting Unicode by building finite state machines that recognize UTF-8 directly. This particular benchmark explains a bit: http://blog.burntsushi.net/ripgrep/#linux-unicode-word
What isn't true? I never said that regular expressions cannot support UNICODE fast. I said that regular expressions are slower than code due to the overhead in all scenarios.
I am responding to your claim. I'm saying that not all regex implementations are created equal. Some can be just as fast as what you might write by hand.
Regular expressions can be Unicode aware, right? You should be able to use a shortcut specifier that's equivalent of calling something like IsWhitespace.
You wouldn't be using a finite state recognizer (a regular expression) but a finite state transducer (a regular expression that emits characters instead of just recognizing them).
Regular expressions were created to do this, but your general string replace function with your general regular expression library might not be the fastest way, which is what this post is about.
Not claiming to be the fastest, but here are two Python solutions suggested by Dave Beazley [0], plus one I extended using compiled regular expressions:
# string replacement
s = ' hello world \n'
s.replace(' ', '')
# 'helloworld\n' (note: str.replace only removes the spaces, not the newline)
# regular expression
import re
re.sub(r'\s+', '', s)
# 'helloworld'
# compiled regular expression
pat = re.compile(r'\s+')
pat.sub('', s)
# 'helloworld'