It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with USC2 or UTF32 in mind, not UTF8.
It’s also not clear to me that the code point is a good abstraction in the design of UTF8. Usually, what you want is either the byte or the grapheme cluster.
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use-cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been made with a dedicated 32 bits codepoint...)
Can you fit everything into 32 bits? I have no idea, but Hangul and indict scripts seem like they might have a combinatoric explosion of infrequently used characters.
You still get the combinatoric explosion, but you have more bits to work with. Imagine if you could combine any 9 jamo into a single hangul syllable block. (The real combinatorics is more complicated, and I don't know if it's this bad.) Encoding just the 24 jamo and a a control character requires 25 codepoints. Giving each syllable block its own codepoint would require 24^9>2^32 codepoints.
Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.
For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, these are considered two separate characters with their own uppercase and lowercase versions: U+0049/U+0130 ("I" / "ı") and U+0131/U+0069 ("İ" / "i").
Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailered casing would make tailored casing harder.
I don't think you can copyright lists of publicly available information (iirc there was some case with phone numbers before).
That being said, they also stole code...
> For Zeidenberg's argument, the circuit court assumed that a database collecting the contents of one or more telephone directories was equally a collection of facts that could not be copyrighted. Thus, Zeidenberg's copyright argument was valid.[1] However, this did not lead to a victory for Zeidenberg, because the circuit court held that copyright law does not preempt contract law. Since ProCD had made the investments in its business and its specific SelectPhone product, it could require customers to agree to its terms on how to use the product, including a prohibition on copying the information therein regardless of copyright protections.
Moreover, it doesn't seem like static linking to me.
A similar example would be using a GPLv3 licensed JavaScript library in a website. What it implies to other HTML/JS/CSS code is controversial [0]. The FSF actually believed that they should not be "infected" [1], and the legal implications may need to be tested in court.
The FSF question is about templates, but the chrome extension in question also seems to have copied nontrivial JS.
I don't think chrome extensions can be modified by the user; there's probably some integrity check. So to be GPL compliant they need to publish source files to rebuild the extension?
Right, or: maybe. Depends on where you are (or maybe better: where they are), and whether data collections fall under copyright or some other protection that is translateable enough for the gpl to apply. But if they really also used code that point is moot.
this is either a very big coincidence, or you are in the datamining discord as well.
the original archive i base my project on uses RMAN to store everything :D
---
thanks for the hint about the FICLONERANGE ioctl... it seems to be fine grained enough to allow me deduplicate on arbitrary offsets, not just whole blocks.
will give it a go.
i tried FICLONERANGE via a python wrapper btw - it turns out, that i can only clone ranges aligned to block boundaries :(
BTRFS is very neat per se, but documentation and help (most of all in very niche cases like this one here lol) is not that easy to come by.
my plan would be to properly process the data set, and then make it available as a BTRFS snapshot... you can export btrfs send as a file as well for storage etc.
if all my tries to use BTRFS fail, i might to write my own tooling and virtual filesytem as well, but optimized for my use case (MPQ files and such).
thanks for your input so far.