Hacker News | new | past | comments | ask | show | jobs | submit | johnboyer's comments | login

It was my understanding that the asymmetric key pair was generated locally, and only the public key was exchanged. I am unsure whether this is a requirement of the Signal protocol, but Signal itself only stores the private key locally, meaning they would need to alter their software in order to store said keys in a centralized database.


You are correct. However, that doesn't make the cipher insecure.

The reason I make this distinction is that it changes the attack vectors. If the cipher were made insecure, the whole system couldn't be trusted, because anyone could then attack the cipher directly.

However, if the keys are being stored in a database, the cipher itself remains intact; an attacker has to either get the keys off the local device or compromise the central database.

Those are two radically different attack vectors with entirely different consequences for the encryption scheme.

Edit: Thinking about it more, it also changes the defense considerably. Say I'm in a country that only allows WhatsApp for this reason (WhatsApp allows key sharing). If I wanted to, I could crack the software and just stub out the part that sends the key (or send a dummy key instead). You'd still get the protections of a secure cipher, and no one else would have the key. If the cipher itself were weakened, you couldn't do this.


Android/iOS will always have a chokehold on their users. The Librem 5 is a potential alternative: a Matrix-integrated and fully open source phone, although it still has some rough edges to be ironed out.


Most of the STL is header-only, which is what the library is compared to in the benchmarks. Both would be just as likely to get inlined.


As it turns out, I actually have the Boyer-Moore string searching algorithm as one of the planned features in the Project section on GitHub.


Sorry, I meant no relation as in I assumed you had no relation to Robert Boyer, co-author of the Boyer-Moore algorithm.


Boyer-Moore is OK, but there are better algorithms available. The simpler Horspool variant is usually a fair bit faster and also easier to implement. For short strings, e.g. fewer than 12 characters, the ShiftOr algorithm is hard to beat. I've spent quite a bit of time writing and profiling different search algorithms as part of the byteseek project on GitHub. Let me know if you'd like details on other possible options.


I would most definitely be interested. I always assumed that certain algorithms are better suited for certain string sizes, but I was never sure which ones. The ideal implementation is probably combining multiple algorithms with ranges of the string size.


Absolutely true that there isn't a single algorithm that beats all the others. Of the general string search algorithms which don't use specialized CPU instructions to obtain speedups, I would recommend:

1. ShiftOr for short strings. Easy to implement.

This algorithm is not sub-linear like Boyer-Moore - it examines every position - but it uses bit-parallelism to validate a match, which lets it outperform, on short strings, algorithms that rely on jumping followed by a separate verification stage.

2. Variants of Wu-Manber for longer strings. Hard to find a good description, but not too hard to implement.

Wu-Manber is a search algorithm designed for multi-pattern searching, based on Boyer-Moore-Horspool and hashes. However, it also performs really well for single-pattern searching. I have encountered variants of this in various places, e.g. Lecroq's in "Fast Exact String Matching Algorithms".

These algorithms use a hash of a q-gram to look up how far it's safe to shift. Q-grams tend to appear less frequently in search patterns vs. single characters, so you get longer jumps, at the cost of reading more characters and producing a hash.

3. Horspool (or Boyer-Moore-Horspool).

This algorithm performs quite well - not as well as ShiftOr for shorter patterns or Wu-Manber variants for longer ones, but still respectable. It's essentially Boyer-Moore but only using one of the shift tables, which makes it much easier to implement.

4. Qgram-Filtering by Branislav Durian, Hannu Peltola, Leena Salmela and Jorma Tarhio

For longer patterns, this algorithm mostly outperforms the others. However, it can be a bit complicated to implement well, and it has some nasty (rarely encountered) worst cases where the performance becomes absolutely dreadful. For that reason I tend not to use it.

There are hundreds of possible search algorithms available now (see https://github.com/smart-tool/smart for implementations of many of them with a benchmark tool). However, it's hard to figure out exactly which algorithm is the best given your data and pattern. For that reason, I tend to keep things simple.

I would just use ShiftOr for short patterns, and another one for longer patterns. I would tend to use a Wu-Manber variant there, but Horspool would probably give acceptable performance.

The only other consideration is the time it takes to build the pattern indexes. For short searches, or if you have to re-build the pattern index on each repeated search, it can actually be quicker to just do a naive "check the string at each position" search, since it doesn't require building anything before searching.


In my experience, it is difficult to beat a very simple heuristic in most settings: when provided a string, pick the most infrequently occurring byte in that string, and use that as your skip loop in a vector optimized routine such as memchr. If you guess the right byte, you'll spend most of your time in a highly optimized routine that makes the most out of your CPU.

Picking the right byte can be tricky, but a static common sense ranking actually works quite well. At least, the users of ripgrep seem to think so. :-)

For some reason, I've never seen this algorithm described in the literature. It doesn't have nice theoretical properties, but it's quite practical.


I'd be remiss not to note that there are better and still simple variants of Horspool that outperform the vanilla algorithm.

In any case, if you'd like to discuss search algorithms when you want to implement them, I'd be happy to help. I'm mattpalms, gmail.com.


Unicode is supported via UTF-8; it's only different character types that aren't supported, because generics are messy in C. I thought of accepting void* to support wchar_t and others, but it came with some performance penalties, so I decided against it.


No, having different character types (I believe you are referring to C11's `char16_t` and `char32_t`?) is not a requirement for Unicode support. At the very least you need to have a single function or two that...

* Receives a string expected to be encoded in UTF-8, and an offset to it expected to be a UTF-8 sequence boundary.

* Scans forward or backward for the next or previous UTF-8 sequence boundary.

* Optionally returns the code point for the scanned UTF-8 sequence.

* Has proper error handling for every imaginable case: out of bounds, not a boundary, not a valid UTF-8 sequence. (The OOB case needs to be handled because it will be the end condition of the iteration. Preferably it should be distinct from the other error conditions.)

Every other functionality can build upon this little function, in particular the iteration and UTF-8 validation will be trivial. The full Unicode support including case mapping, folding, normalization and property lookup will of course require a not-so-small table but is not strictly necessary anyway.

Björn Höhrmann's Flexible and Economical UTF-8 Decoder [1] will be handy for a concise implementation.

[1] https://bjoern.hoehrmann.de/utf-8/decoder/dfa/


I don't see any functions in the OP's library that would require dedicated UTF-8 handling. The string length is given in bytes, not characters or codepoints. There's no functionality to give you the character at the n-th location, etc. You can easily implement all Unicode-specific functionality in a separate library and use it together with the OP's library. IMHO that's even preferable.


Yes, but don't call it a string library then. Strings should handle strings, and strings are Unicode now. Unicode needs normalization and case-insensitive support.

And it's not easy. I implemented the third of its kind. First there was ICU, which is overly bloated; you don't need 30 MB for a simple string libc. Then there is libunistring, which has overly slow iterators, so it's not usable for coreutils. And then there's my safelibc, which is small and fast, but only for wide chars, not UTF-8.

I fixed and updated the musl case mapping, making it 2x faster, but this hasn't landed yet. And there isn't even a properly spec'ed wcscmp/wcsicmp to find strings. glibc is an overall mess; I won't touch that. wcsicmp/wcsfc/wcsnorm are not even in POSIX.


How does the utf8proc[1] library that Julia uses compare to these?

[1] http://juliastrings.github.io/utf8proc/doc/


Why try to redefine the word "string"?

In computer jargon I believe CISC and the PDP-11 have seniority. That's why all multi-word functions like memcpy are in C's string.h header.


Hey, even C contains a locale-dependent string comparison, namely `strcoll` (since 1990!).

I admit the two words "string" and "text" are now interchangeable. But that doesn't mean strings have fewer requirements; people are just expecting more out of strings.


Hey there, creator of the library here. Strangely enough my initial post about this on HN received no attention, but better late than never!

If any of you have some questions or feedback, I would be delighted to hear it.


Hey man don't worry about that, that's just the way of life. People notice things when they do, that part's not up to us. I've seen articles here posted years after they were written.

Very nice and thoroughly documented code! Well done! What gave you the initial idea to write this maybe-fastest-ever string library? Did you just have an idea one day of how it could be done and you just went for it? Or did you have a performance issue and come up with this to speed something up at work, or what?


I just really needed a string library written in C, and there didn't seem to be many options. The first thing I found was the Simple Dynamic Strings library, but it wasn't maintained and depended on GCC extensions, so I decided to write my own. After getting basic concatenation working, I benchmarked it against std::string to see how slow it was, and to my great surprise, it was actually faster. From that point, I realized I could actually write some pretty fast code, and I decided to center the library around that.


I'm surprised to hear it was faster; I just assumed the C++ standard library was optimized for speed, since I thought that's what C++ is known for compared to languages like JavaScript.


I like it. That's a good set of principles and a very clean API.

I had a bit of trouble finding stuff in the documentation at first. You direct people towards Modules on the very first line of the Main Page, but it's too subtle. When scanning the page looking for how to get to the list of all the classes and functions in the project, my eyes are drawn to the headings, which are basically all irrelevant. It would help to add some headings beyond the autogenerated ones, and to provide a link from the struct documentation to the set of functions that manipulate that struct (if possible). Additionally, the top of every page says "rapidstring 0.1.0", and I'm not sure what that number means; the version is listed on the main page as 1.0.0.

The documentation is well-written, which I think is a really good sign, and the problems I mentioned are only really an issue for the very first time somebody tries to use the documentation.


Thanks for the feedback! I did forget to change the version in the documentation, updated that now. As for the usability of the docs, Doxygen isn't exactly known for its intuitive user interface, but I will try my best to make it easier to navigate.


I have seen some projects write their own documentation-generating scripts, like RxJS. It may be worth writing your own homegrown documentation generator if you can't customize Doxygen to your liking.


Although this is fairly nice, I would recommend Syncthing over this. It has the benefit of not relying on any third party to store your data, it's all exclusively on your devices, along with some very solid security.


Syncthing is pretty much a completely different thing.

I use Syncthing a bunch, but I don't really see any overlap between it and this tool.


From my understanding, the primary purpose of this is for backup and syncing your Google Drive files between multiple devices, which is very similar in nature to Syncthing. Is there something else that I am missing?


Although it does help, I would say it's only a bit faster than just setting a breakpoint where you want the code to stop and running until then, unlike Microsoft's example, which shows 85 step-intos, as if people actually do that.


It depends on the purpose of your debugging. If you know exactly where you want to go then, as you say, this has diminished value. But debugging can often be exploratory, in which case this feature could be a real timesaver.


We, for example, have lots of classes and structures that are designed to the interfaces of the standard containers. Mix those with actual standard containers and you can spend several hours trying to find the problem. Just My Code eliminates much of that, since the standard library is almost never the cause of the problem.


Doesn't load in Canada either, at least for me.


>Vim IMproved

Oh, so you are using an improved version of Vi IMproved? ;)


Aha yeah, it's a recursive improvement on itself, like the GNU acronym!


That's neovim, right?


or Spacemacs


Emacs + evil-mode, presumably. :)


I need to use Emacs rather than vim for some things (Proof General, mostly), and I was surprised how well evil-mode works. It's only undo that I don't get. In vim I'm used to being able to undo as many editing steps as I want. In evil-mode, I find that if I type "u" too many times, it runs out of undos to do and starts undoing the undos (i.e., it starts redoing things I want to go away).

Also, search uses different word separators from vim: In vim "foo_bar" is a single word that will be skipped by pressing "w". But in evil-mode the "_" is a word separator, so the above is three words. That breaks my flow every time.

Some of this might not be due to vanilla evil but due to the fact that I'm using it through Spacemacs.

That's my two cents. If by any chance anyone knows off the top of their head how to fix these, I'd be grateful.


For linear undo and undo branches, undo-tree.el is required:

https://www.emacswiki.org/emacs/UndoTree

Regarding "_", you can set up Emacs to treat _ as word constituent by modifying the syntax table. For example:

    (modify-syntax-entry ?_ "w")
For more information, please see:

https://www.gnu.org/software/emacs/manual/html_node/elisp/Sy...


Awesome, thanks! I'll try these.


Not necessarily. It could be NeoVIM. (-:

