
Another article in the series beating C by moving the goal posts. My comment on the original article about the Haskell version on lobste.rs:

Keep in mind in all comparisons to GNU wc that it does extra work to correctly count the number of words: detecting multi-byte characters and decoding them if present. perf shows a significant amount of time being spent in multi-byte character handling. If you trigger a code path that does no decoding beyond the byte level, it's much faster:

    $ time wc wiki-large.txt 
    854100  17794000 105322200 wiki-large.txt
    wc wiki-large.txt  0.42s user 0.02s system 99% cpu 0.438 total
    $ time wc -l wiki-large.txt
    854100 wiki-large.txt
    wc -l wiki-large.txt  0.02s user 0.02s system 98% cpu 0.034 total
(wc -l looks at every byte, but does no decoding.)

From a quick glance, this is also where the 'Haskell beats C' article fails. It's comparing apples to oranges: the ByteString implementation does not do the same work as GNU/macOS wc and returns incorrect results in the presence of non-ASCII whitespace. The article incorrectly states that wc will handle input as ASCII. Unless you use a single-byte locale, macOS wc uses the combination of mbrtowc and iswspace.



Author here. This is not true - I included a link to the manpage (https://ss64.com/osx/wc.html) in the article to avoid this confusion. I did not use GNU wc; I used the OS X one, which, by default, counts bytes rather than multi-byte characters. From the manpage:

> The default action is equivalent to specifying the -c, -l and -w options.

> -c The number of bytes in each input file is written to the standard output.

> -m The number of characters in each input file is written to the standard output. If the current locale does not support multi-byte characters, this is equivalent to the -c option.

Moreover, I also mentioned in the article that I was using us-ascii encoded text, which means that even -m would have been treated as ASCII text.

Hope that clarifies your issue.


It is not about the character count, but the word count. wc decodes characters to find non-ASCII whitespace as word separators. If you read further in the same man page:

> White space characters are the set of characters for which the iswspace(3) function returns true.

That your text is ASCII-encoded does not matter, since ASCII is a subset of UTF-8. So at the very least, you need an extra branch to check that a byte's value is smaller than 128 (in UTF-8, any byte whose most significant bit is set is part of a multi-byte character).

However, if you look at the implementation at

https://opensource.apple.com/source/text_cmds/text_cmds-68/w...

you can see that this code path actually uses mbrtowc, so there is also the function call overhead.


It only calls mbrtowc if domulti is set (and MB_CUR_MAX > 1), i.e. only when given the option -m.


You are right! So that's a Darwin oddity. Still, a wide-char function, iswspace, is called in that code path, which adds function call overhead in a tight loop.


If domulti is not set, the wide-char function is not called, as far as I can tell. Why would it be? It's explicitly meant not to do wide-char stuff in that case.

FWIW, when this was going around for the first time I took this Darwin version of wc and experimented with setting domulti to const 0, statically removing all paths where it might do wide character stuff. I didn't measure any performance difference to just running it unmodified.


It's about iswspace as I mentioned in the parent comment. Replace the line

    if (iswspace(wch))
by

    if (wch == L' ' || wch == L'\n' || wch == L'\t' || wch == L'\v' || wch == L'\f')
And I get a ~1.7x speedup:

    $ time ./wc ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wc ../wiki-large.txt  0.47s user 0.02s system 99% cpu 0.490 total
    $ time ./wc2 ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wc2 ../wiki-large.txt  0.28s user 0.01s system 99% cpu 0.293 total
Remove the unnecessary branching introduced by multi-byte character handling [1]. This actually resembles the Go code pretty closely. We get a speedup of 1.8x:

    $ time ./wc3 ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wc3 ../wiki-large.txt  0.25s user 0.01s system 99% cpu 0.267 total
If we take the second table from the article and divide the C result (5.56) by 1.8, the C performance would be ~3.09, which is faster than the Go version (3.72).

Edit: for comparison, the Go version from the article:

    $ time ./wcgo ../wiki-large.txt
      854100 17794000 105322200 ../wiki-large.txt
    ./wcgo ../wiki-large.txt  0.32s user 0.02s system 100% cpu 0.333 total
So, when removing the multi-byte whitespace handling, the C version is indeed faster than the (non-parallelized) Go version.

[1] https://gist.github.com/danieldk/f8cdaed4ba255fb2954ded50dd2...


Thanks, I finally understood what you are saying. Indeed, the code uses iswspace to test all characters, wide or normal. Strange design choice. For whatever it's worth, even just changing

    if (iswspace(wch))
to something like

    if (domulti && iswspace(wch))
        ...
    else if (!domulti && isspace(wch))
        ...
got something like a 10% speedup on my machine. And replacing isspace with an explicit condition like yours is much faster still. I checked: isspace macro-expands to a table lookup and a mask, but apparently that's still slower than your explicit check. I'm a bit surprised by this, but won't investigate further at the moment.


> Thanks, I finally understood what you are saying.

I am sorry for the unclear comments. I'll stop commenting on a phone ;).

> Indeed, the code uses iswspace to test all characters, wide or normal. Strange design choice.

I agree, it's really strange. This seems to be inherited from the FreeBSD version, which still does this as well:

https://github.com/freebsd/freebsd/blob/8f9d69492c3da3a8c1ea...

It has the worst of both worlds: it incorrectly counts the number of words when there is non-ASCII whitespace (since mbrtowc is not used), but it still pays the penalty of calling iswspace. It also does not conform to POSIX, which states:

> The wc utility shall consider a word to be a non-zero-length string of characters delimited by white space.

> [...]

> LC_CTYPE

> Determine the locale for the interpretation of sequences of bytes of text data as characters (for example, single-byte as opposed to multi-byte characters in arguments and input files) and which characters are defined as white space characters.


For the people downvoting: please read the source code of GNU and Apple wc before downvoting.




