Unicorn: C++ Unicode string library (github.com/captaincrowbar)
66 points by captaincrowbar on Feb 12, 2016 | 30 comments



Looks like a nice project. I'm currently searching for a Unicode library and it appears to me that ICU is the de-facto standard here, which has the benefit of coming pre-installed on pretty much any Linux distribution. Any reason why I should use Unicorn instead? I couldn't find information on how it compares to ICU in the documentation (well, except for the most welcome usage of modern C++).


It looks like Unicorn can apply operations (such as regexes) to text that is natively in UTF-8, giving it a distinct advantage over ICU, which was written back when UTF-16 seemed like a good idea and has to convert everything into UTF-16.
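For illustration, the round trip with ICU's C++ API looks roughly like this (a sketch only, not taken from the Unicorn docs):

    #include <unicode/unistr.h>
    #include <string>

    std::string process(const std::string& utf8_in) {
        // ICU works on UTF-16, so UTF-8 input is transcoded (allocated and copied) first.
        icu::UnicodeString u16 = icu::UnicodeString::fromUTF8(utf8_in);
        // ... regexes, case mapping, collation, etc. operate on u16 ...
        std::string utf8_out;
        u16.toUTF8String(utf8_out);   // and transcoded back on the way out
        return utf8_out;
    }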


It's hard but necessary to distinguish between a UTF-16 string and a raw UChar array: a UChar array is not necessarily a well-formed UTF-16 string. Besides, why not just use UnicodeString? It's fairly easy to use and hides those details from you.

It's indeed super cool to see a modern Unicode C++ library. But is it really ready for production use? Probably not yet. ICU, by contrast, is old, compact and battle-tested.


I'm talking about using UTF-8 as the string representation, not UChars. UChars are an artifact of UTF-16, and thus require converting all text on input and output, unless you work in a Windows API world where I/O is UTF-16.

Modern programming languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing, which is a bad idea in most cases anyway.
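To make the trade-off concrete, here's a sketch (a hypothetical helper, not from Rust or Unicorn) of why finding the n-th code point in UTF-8 is linear rather than constant time:

    #include <cstddef>
    #include <string>

    // Returns the byte offset of the n-th code point by walking the string:
    // each code point is one lead byte plus zero or more continuation bytes (10xxxxxx).
    std::size_t byte_offset_of_code_point(const std::string& utf8, std::size_t n) {
        std::size_t i = 0;
        while (n > 0 && i < utf8.size()) {
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
                ++i;
            --n;
        }
        return i;
    }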


It's not super convenient, but one can operate on UTF-8 buffers with ICU via UText (see e.g. http://userguide.icu-project.org/strings/utext#TOC-Example:-...)

Not everything is doable this way, but quite a lot actually is.
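Roughly like this, if I remember the API right (not compiled, and utf8_buffer/utf8_length are placeholder names):

    #include <unicode/utext.h>
    #include <unicode/regex.h>

    void search_utf8(const char* utf8_buffer, int64_t utf8_length) {
        UErrorCode status = U_ZERO_ERROR;
        UText ut = UTEXT_INITIALIZER;
        utext_openUTF8(&ut, utf8_buffer, utf8_length, &status);   // wraps the buffer, no UTF-16 copy
        icu::RegexMatcher matcher(icu::UnicodeString("\\w+"), 0, status);
        matcher.reset(&ut);                                       // the regex runs over the UTF-8 text
        while (matcher.find()) {
            // process the match ...
        }
        utext_close(&ut);
    }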


Why is it a bad idea? Because Unicode has too complicated semantics to split a Unicode string at arbitrary points?


Both UTF-8 and UTF-16 encode some code points as sequences of more than one code unit. If you split a string at an arbitrary point you risk splitting inside one of those sequences.

This is very common with UTF-8 text containing non-ASCII characters, and very rare with UTF-16 (it only happens with characters outside the BMP).

Neither is something you want in your code, unless you think it's a good idea to corrupt your users' data.

Edit: It's not too difficult to handle these cases and make sure you only split at valid positions, but you do need to be careful and there are a number of edge cases you might not think through or even encounter unless you have the right sort of data to test with - which leads to lots of faulty implementations. e.g. for years MySQL couldn't handle utf8 characters outside the BMP.
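For the code-point part of the problem the fix is small; something like this hypothetical helper (sketch only), which backs a split point up off any continuation byte:

    #include <cstddef>
    #include <string>

    // Move pos backwards until it no longer lands on a UTF-8 continuation
    // byte (10xxxxxx). Note this only protects code points, not grapheme
    // clusters -- a combining mark can still be cut off from its base.
    std::size_t safe_split_point(const std::string& utf8, std::size_t pos) {
        while (pos > 0 && pos < utf8.size() &&
               (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80)
            --pos;
        return pos;
    }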


My parent was speaking about indexing at the code point level, not at the encoding (byte / code unit) level.

I do know that Unicode has combining code points (confusingly called combining characters) and nasty things like rtl switching code points. I guess it's turtles all the way down.


> My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

You need UTF-32 for (random) indexing of code points. UTF-16 has 16-bit code units, and some code points take two of them (a surrogate pair).

So it's the same trade-off as with UTF-8. Thus no reason not to just simply use UTF-8 in the first place and take advantage of the memory savings.
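Sketch of the same problem on the UTF-16 side (a hypothetical helper): you can't map code-unit indices to code-point indices without scanning, because surrogate pairs take two units.

    #include <cstddef>
    #include <string>

    std::size_t count_code_points_utf16(const std::u16string& s) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < s.size(); ++i) {
            // A high surrogate (0xD800-0xDBFF) pairs with the following low
            // surrogate to encode one code point outside the BMP.
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                ++i;
            ++n;
        }
        return n;
    }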


Again, my original parent's statement was not about encoding or memory savings. The statement was that it was a bad idea to index into an (abstract) unicode string (of unicode code points -- not compositions thereof whatsoever).

I didn't question that, but hoped to get some inspiration for sane Unicode handling (which I'm not sure is humanly possible, except by treating it as a rather black box and making no promises).


Your original parent was all about encodings, and mentioned it was a bad idea to arbitrarily index into UTF-8 strings (no mention of abstract strings of Unicode code points).

> languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing

So it's saying Rust mostly benefits from using utf8, but in doing so, it loses the ability to arbitrarily index a character in a string (in constant time).

If it was abstract strings of Unicode code points then there is no problem - except you'd then be using 32 bits per code point.


Actually, they are not combining code points. Take for example the character 𪚥 (4 dragons).

The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

The codepoint however is still exactly the same (U+2A6A5).
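The arithmetic, for the curious (a sketch; it ignores error checking for code points below U+10000):

    #include <utility>

    std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
        cp -= 0x10000;                                                  // 0x2A6A5 -> 0x1A6A5
        char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));    // 0xD869
        char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // 0xDEA5
        return {high, low};
    }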


> The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

No, you mean two UTF-16 code units. A character is one or more code points.


Looks to me like Unicorn is just using PCRE for its regexes.


Comparison with ICU would be interesting but probably unfair given size and age of ICU. Personally I'd like to see it compared to utf8rewind (previously discussed on HN [1]).

[1] https://news.ycombinator.com/item?id=10029979


The Unicode portion looks reasonable, but why is it necessary for it to include its own flags, file I/O, file management, and environment classes?

Why is it that so many C++ libraries fall into the habit of trying to build one big framework? I'm perfectly happy with gflags -- a Unicode library would be nice for my project, but now I won't consider this one.


Because the whole point is to handle anything that needs Unicode support. A library that only manipulated Unicode strings would be incomplete if you still couldn't use Unicode in command line options, file names, etc.


I would recommend breaking them off into separate additional libraries. I don't need unicode for flags, so paying for it at compile and link time seems unwise. Or provide adapter classes that can be used over other frameworks. Just a suggestion.


That's what will happen until there's a de facto / standard library for this stuff. Languages like Python and Go have a wider base in the standard library. C++14 still only gives you platform-dependent 'wide' strings, UTF-8 string literals, and UTF-8 conversion... which makes things awkward.
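For example, this is about all the portable conversion machinery the standard gives you (a sketch; the <codecvt> facilities have their own well-known problems):

    #include <codecvt>
    #include <locale>
    #include <string>

    std::u16string utf8_to_utf16(const std::string& utf8) {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        return conv.from_bytes(utf8);
    }

    const char* hello = u8"a UTF-8 string literal";   // u8 literals are still just char arrays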


Seems like 'Unicorn' is the tech buzzword of 2016!


Your GitHub pages break the back button.


No idea what you mean, sorry. I'm just using Github's automatically generated web pages, so if there's a problem there it's probably a Github issue.


Probably referring to the Documentation link you provide on the GitHub page; it breaks the back button for me too.


I just tried that on several browsers; Safari and Chrome are fine, it seems to be only Firefox that has a problem with that. I have no idea whether that's a bug in Firefox or Github, and either way there's nothing I can do about it, sorry.


Hmm... weird. I guess this should either be reported to the GitHub people and/or the Firefox people?


Yes, you can: publish your docs as real web pages instead of a link to the htmlpreview of a file inside your repo. That should fix the problem.


I guess he should have said that there's nothing reasonable he can do about it. Creating an entirely separate set of HTML pages would require a new publishing flow, add a new step every time docs update, and generally encourage the docs to fall out of sync with the repo. He could do all of this, or he could do the sensible thing and leave the docs exactly like they are.


I think it's the htmlpreview: the back button throws you into a redirect to the current page.


That's not fair. It's pretty well known that Github uses JS to hijack page navigation and make it "smoother" for people. And of course that's going to be faulty, and I emailed them years ago when they made the switch, and asked them to make it an optional behavior because I hate it. But that has nothing to do with OP or OP's link or content. It's like judging a book by the book store.


Can anyone compare this to Boost.Nowide?



