Hacker News | moonshadow565's comments

Just because it's old doesn't mean it's more portable. If anything, it makes me think it's even less portable.


What about encoding it in such a way that we don't need huge tables to figure out the category for each code point?


It means that you are encoding those categories into the code point itself, which is a waste for every single use of the character encoding.


It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8.
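
As an illustration of the byte-wise idea, one small range of today's UTF-8 already behaves this way: the Latin-1 Supplement letters U+00E0..U+00FE encode as C3 A0..C3 BE and their uppercase forms U+00C0..U+00DE as C3 80..C3 9E, so the case differs only in the trailing byte. A minimal C++ sketch under that assumption; the `Utf8Pair` type and the `upper_latin1_utf8` name are made up for illustration, and this is not a general case-mapping routine:

```cpp
#include <cstdint>

// Hypothetical helper type: one two-byte UTF-8 sequence.
struct Utf8Pair {
    std::uint8_t lead;
    std::uint8_t trail;
};

// Uppercases U+00E0..U+00FE (except U+00F7, the division sign) by
// subtracting 0x20 from the trailing byte. Anything else is returned as is.
constexpr Utf8Pair upper_latin1_utf8(Utf8Pair p) {
    return (p.lead == 0xC3 && p.trail >= 0xA0 && p.trail <= 0xBE && p.trail != 0xB7)
               ? Utf8Pair{p.lead, static_cast<std::uint8_t>(p.trail - 0x20)}
               : p;
}

// e-acute (C3 A9) maps to E-acute (C3 89); the division sign (C3 B7) is untouched.
static_assert(upper_latin1_utf8(Utf8Pair{0xC3, 0xA9}).trail == 0x89, "e-acute uppercases");
static_assert(upper_latin1_utf8(Utf8Pair{0xC3, 0xB7}).trail == 0xB7, "division sign unchanged");
```

The exceptions inside the range (÷, and ÿ whose uppercase lives in a different block) show why this only works where the codespace happens to be laid out for it.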

It’s also not clear to me that the code point is a good abstraction in the design of UTF-8. Usually, what you want is either the byte or the grapheme cluster.


> Usually, what you want is either the byte or the grapheme cluster.

Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/

"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."

I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emoji. Of course I read the section saying it's used in actual languages, but the few examples described could have been made with a dedicated 32-bit code point...)


Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.


But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composing rules?


You still get the combinatoric explosion, but you have more bits to work with. Imagine if you could combine any 9 jamo into a single hangul syllable block. (The real combinatorics is more complicated, and I don't know if it's this bad.) Encoding just the 24 jamo and a control character requires 25 codepoints. Giving each syllable block its own codepoint would require 24^9 > 2^32 codepoints.
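
The arithmetic in that thought experiment can be checked at compile time. A small C++ sketch; the `ipow` helper and the constant names are made up here, and the 24-jamo, 9-slot model is the hypothetical one from the comment, not how Hangul actually composes:

```cpp
#include <cstdint>

// Hypothetical helper: integer power, usable in constant expressions.
constexpr std::uint64_t ipow(std::uint64_t base, unsigned exp) {
    return exp == 0 ? 1 : base * ipow(base, exp - 1);
}

constexpr std::uint64_t kSyllableBlocks = ipow(24, 9);           // one codepoint per block
constexpr std::uint64_t kCodespace32    = std::uint64_t{1} << 32;  // all 32-bit values

static_assert(kSyllableBlocks == 2641807540224ULL, "24^9");
static_assert(kSyllableBlocks > kCodespace32, "does not fit in 32 bits");
static_assert(24 + 1 == 25, "jamo plus one control character");
```

So the precomposed scheme would need roughly 2.6 trillion codepoints, against 25 for the composing scheme.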


> Giving each syllable block its own codepoint

That's the thing - you wouldn't do that! Only a small subset of frequently used combos would get its own ID; the rest would only be composable.


Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.

For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, these are considered two separate characters with their own uppercase and lowercase versions: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i").
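
To make the locale dependence concrete, here is a minimal C++ sketch of uppercasing just these code points. The `Locale` enum and `to_upper_i` function are hypothetical names for illustration; a real program would rely on a Unicode library rather than a hand-written table:

```cpp
// Hypothetical locale tag: Root stands for the untailored default.
enum class Locale { Root, Turkish };

constexpr char32_t kCapitalIWithDot = 0x0130;  // U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
constexpr char32_t kSmallDotlessI   = 0x0131;  // U+0131, LATIN SMALL LETTER DOTLESS I

// Uppercases 'i' and dotless i; every other code point is returned unchanged.
constexpr char32_t to_upper_i(char32_t cp, Locale loc) {
    return cp == U'i'            ? (loc == Locale::Turkish ? kCapitalIWithDot : char32_t{U'I'})
         : cp == kSmallDotlessI  ? char32_t{U'I'}
                                 : cp;
}

static_assert(to_upper_i(U'i', Locale::Root) == U'I', "default: i -> I");
static_assert(to_upper_i(U'i', Locale::Turkish) == kCapitalIWithDot, "Turkish: i -> U+0130");
```

The same input code point yields two different results, so no single arithmetic relation between the code point values can be right for both locales.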


Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailored casing would make tailored casing harder.


Holy order


It's not that complex if you remove all the stuff you don't use: https://godbolt.org/z/rM9ejojv4 .

The main things you need to understand are specialization (think pattern matching, but at compile time) and pack expansion (the three dots).
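
For readers who have not met these two features, a tiny self-contained C++17 sketch, unrelated to the code behind the godbolt link; `PointerDepth` and `total_depth` are names invented for this example:

```cpp
// Primary template: a non-pointer type has depth 0.
template <typename T>
struct PointerDepth {
    static constexpr int value = 0;
};

// Partial specialization: "matches" any T* at compile time and recurses on T.
template <typename T>
struct PointerDepth<T*> {
    static constexpr int value = 1 + PointerDepth<T>::value;
};

// Pack expansion: PointerDepth<Ts>::value is expanded once per type in Ts,
// and the fold expression sums the results.
template <typename... Ts>
constexpr int total_depth() {
    return (0 + ... + PointerDepth<Ts>::value);
}

static_assert(PointerDepth<int**>::value == 2, "two levels of pointer");
static_assert(total_depth<int, char*, double**>() == 3, "0 + 1 + 2");
```

The compiler picks the most specialized template that matches, which is where the pattern-matching analogy comes from.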


> League of Legends runs on a custom game engine developed in 2009.

Developed by Sergey Titov (same engine that powers Big Rigs).


Big Rigs: Over the Road Racing?


Yes, Angry Video Game Nerd made a very funny video about it. The other game I know of that runs on the same engine is WarZ.


P1061 is in C++26 so you can instead do:

  const auto [...I] = std::make_index_sequence<INPUT_COUNT>{};
  ((SetupInput<I>(options, transport_manager, subscriber_queues[I], thread_pool, templated_topic_to_runtime_topic)),...);
yay!
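
For compilers without C++26 support, the same expansion is usually written by deducing the index pack in a helper function. A minimal C++17 sketch; `setup_input`, `setup_all_impl` and `setup_all` are hypothetical stand-ins and not the code under discussion:

```cpp
#include <cstddef>
#include <utility>

// Hypothetical stand-in for SetupInput<I>(...): marks input I as set up.
template <std::size_t I>
constexpr void setup_input(unsigned& mask) {
    mask |= 1u << I;
}

// The index pack I... is deduced from the index_sequence argument.
template <std::size_t... I>
constexpr unsigned setup_all_impl(std::index_sequence<I...>) {
    unsigned mask = 0;
    (setup_input<I>(mask), ...);  // comma fold: one call per index, in order
    return mask;
}

template <std::size_t N>
constexpr unsigned setup_all() {
    return setup_all_impl(std::make_index_sequence<N>{});
}

static_assert(setup_all<3>() == 7u, "inputs 0, 1 and 2 were set up");
```

The extra helper function exists only to give the pack a name, which is the boilerplate the structured-binding pack removes.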


Oh, that is nice!


That's better, yeah. I still prefer plain ole for loops, but that's much better.


I don't think you can copyright lists of publicly available information (iirc there was some case with phone numbers before). That being said, they also stole code...


ProCD, Inc. v. Zeidenberg was sort of about this:

> For Zeidenberg's argument, the circuit court assumed that a database collecting the contents of one or more telephone directories was equally a collection of facts that could not be copyrighted. Thus, Zeidenberg's copyright argument was valid. However, this did not lead to a victory for Zeidenberg, because the circuit court held that copyright law does not preempt contract law. Since ProCD had made the investments in its business and its specific SelectPhone product, it could require customers to agree to its terms on how to use the product, including a prohibition on copying the information therein regardless of copyright protections.

https://en.wikipedia.org/wiki/ProCD,_Inc._v._Zeidenberg


Moreover, it doesn't seem like static linking to me.

A similar example would be using a GPLv3 licensed JavaScript library in a website. What it implies to other HTML/JS/CSS code is controversial [0]. The FSF actually believed that they should not be "infected" [1], and the legal implications may need to be tested in court.

[0]: https://opensource.stackexchange.com/q/4360/15873

[1]: https://www.gnu.org/licenses/gpl-faq.en.html#WMS


The FSF question is about templates, but the chrome extension in question also seems to have copied nontrivial JS.

I don't think chrome extensions can be modified by the user; there's probably some integrity check. So to be GPL compliant they need to publish source files to rebuild the extension?



Thanks for the list! It seems that unfortunately copyright applies to databases in the EU.


Right, or: maybe. Depends on where you are (or maybe better: where they are), and whether data collections fall under copyright or some other protection that is translatable enough for the GPL to apply. But if they really also used code, that point is moot.



Look up the FICLONERANGE ioctl.


this is either a very big coincidence, or you are in the datamining discord as well. the original archive i base my project on uses RMAN to store everything :D --- thanks for the hint about the FICLONERANGE ioctl... it seems to be fine-grained enough to allow me to deduplicate on arbitrary offsets, not just whole blocks. will give it a go.


Btrfs is something I originally wanted to use, but other people were not fans of Linux, so custom ad-hoc tooling (RMAN) it was.


i tried FICLONERANGE via a python wrapper btw - it turns out that i can only clone ranges aligned to block boundaries :(
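
Given that restriction, one option is to share only the block-aligned middle of each duplicate range and copy the unaligned head and tail normally. A rough C++ sketch of just the offset arithmetic; `ByteRange`, `round_down`, `round_up` and `cloneable_part` are hypothetical names, the block size of 4096 in the check is an assumed value (real code would query the filesystem), and the sketch assumes the range sits at the same alignment in source and destination:

```cpp
#include <cstdint>

// Hypothetical helper type: a byte range inside a file.
struct ByteRange {
    std::uint64_t offset;
    std::uint64_t length;
};

constexpr std::uint64_t round_down(std::uint64_t x, std::uint64_t block) {
    return x / block * block;
}

constexpr std::uint64_t round_up(std::uint64_t x, std::uint64_t block) {
    return round_down(x + block - 1, block);
}

// Largest sub-range of r whose start and end are both multiples of block.
// The length is 0 when r does not contain a whole aligned block.
constexpr ByteRange cloneable_part(ByteRange r, std::uint64_t block) {
    return round_down(r.offset + r.length, block) > round_up(r.offset, block)
               ? ByteRange{round_up(r.offset, block),
                           round_down(r.offset + r.length, block) - round_up(r.offset, block)}
               : ByteRange{round_up(r.offset, block), 0};
}

// Bytes 5000..15000 contain exactly one whole 4096-byte block: 8192..12288.
static_assert(cloneable_part(ByteRange{5000, 10000}, 4096).offset == 8192, "aligned start");
static_assert(cloneable_part(ByteRange{5000, 10000}, 4096).length == 4096, "aligned length");
```

How much this saves depends on how large the duplicate ranges are relative to the block size; for many small ranges almost nothing is shareable.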

BTRFS is very neat per se, but documentation and help (most of all in very niche cases like this one here lol) is not that easy to come by. my plan would be to properly process the data set, and then make it available as a BTRFS snapshot... you can export btrfs send as a file as well for storage etc.

if all my tries to use BTRFS fail, i might have to write my own tooling and virtual filesystem as well, but optimized for my use case (MPQ files and such). thanks for your input so far.


How exactly does this make their search engine better?


You forget that their real business is ads.


League of Legends is one such game as well.

