This is a great start, but as you get more serious about NLP, keep in mind that a Web crawl from 2006 is not the same as ground truth about language. The top 100,000 words in this data include a lot of ephemeral spam. If you're looking for slang, it'll mostly be outdated. And words like "sitemap" and "Shockwave" are not as common throughout language as the Web alone would indicate.
The later Google Books Ngrams [1] data sets are cleaner, but that comes at a cost. You lose spam, but also a lot of other uses of language, when you only consider text that has been published in print. It's all in the formal register. And then you also get correlated OCR errors, as in [2], and you still get data whose collection ended in 2008.
So what am I suggesting you should use, if data from the Web is bad for some things and data from books is bad for others? Well, both of them, and a lot of other things too. An analysis that depends on word frequencies should use the consensus of many sources.
I've been working on compiling together many of those sources, in many languages, in the Python package wordfreq [3]. Now I'm working on an update to include excellent corpus data just released by the OPUS project, including OpenSubtitles 2016 [4].
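For anyone curious what that looks like in practice, here's a minimal sketch of querying wordfreq [3], assuming the package's word_frequency and zipf_frequency helpers; the exact numbers depend on the installed data:

  from wordfreq import word_frequency, zipf_frequency

  # Frequencies are blended from several corpora per language,
  # not taken from any single source.
  word_frequency('the', 'en')        # proportion of English tokens that are 'the'
  zipf_frequency('sitemap', 'en')    # log-scale (Zipf) frequency
  zipf_frequency('fenêtre', 'fr')    # other languages are looked up the same way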
Hi Rob - Thanks for posting this. I took a peek at the wordfreq project -- it looks interesting. Are you decomposing to a Unicode normal form such as NFKD [0], or storing words as UTF-32 (which is what Python uses internally [1]), combining multi-character code points into the 32-bit code units [2] represented by UTF-32?
I use NFKC form for scripts that seem to require it, such as Arabic, and NFC for others. If I used NFKC for English, for example, then encountering a brand name with a trademark sign on it would add the letters "tm" to the end of the word.
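To make that concrete, here's a quick demo with Python's standard unicodedata module (the brand name is just an example):

  import unicodedata

  word = "Frisbee\u2122"  # a word with U+2122 TRADE MARK SIGN attached
  unicodedata.normalize("NFC", word)   # 'Frisbee™'  -- the sign survives
  unicodedata.normalize("NFKC", word)  # 'FrisbeeTM' -- compatibility mapping appends 'TM'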
In general I use tokenization rules that follow the Unicode standards in [UAX 29], with language-specific external libraries for Chinese, Japanese, and Korean, and with some language-specific tweaks to handle cases the Unicode Consortium didn't go into. [0]
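If you want to see what those rules do to a piece of text, wordfreq exposes the tokenizer directly; a rough sketch, assuming its tokenize(text, lang) helper (exact output depends on the language and on optional CJK support):

  from wordfreq import tokenize

  tokenize("New Year's Eve", 'en')  # UAX 29 keeps word-internal apostrophes
  tokenize("l'heure", 'fr')         # language-specific tweaks may split the clitic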
I use Python 3 strings, and it's a peculiar bit of abstraction-busting to worry about what they look like inside the Python interpreter. It's only UTF-32 for strings that contain high codepoints. See [PEP 393], "Flexible String Representation".
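You can actually watch that representation switch from outside the interpreter; here's a small sketch using sys.getsizeof (the exact byte counts vary by CPython version):

  import sys

  # Under PEP 393, CPython stores each string with the narrowest fixed
  # width that fits its highest code point: 1, 2, or 4 bytes per character.
  for s in ["cafe", "caf\u00e9", "caf\u4e2d", "caf\U0001F600"]:
      print(repr(s), sys.getsizeof(s))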
I don't think there is such a thing as a "multi-character code point". At no point do I use UTF-16 (which encodes some code points as pairs of surrogate code units, which are not characters), if that's what you're asking about.
Thanks for the info. I'm looking at this from the perspective of designing a backend datastore and query engine for a knowledge system. The idea is to encode a spatial data structure (similar to Google's S2 Geometry Library [0]) that enables content-based addressing of non-spatial data types for data fusion.
One idea is to make a lattice of Unicode characters that builds up to combinations of words, a la Formal Concept Analysis [1]: on one level, the characters compose into words that represent properties (key/value pairs), and then the KV pairs compose into higher-level objects. Each property and higher-level object is encoded as an integer derived from its constituent objects/properties, and each object is encoded in such a way that its constituents can be determined algorithmically from the integer without having to traverse the structure [2]. ANS encoding [3] embedded into a space with a VI metric (https://en.wikipedia.org/wiki/Variation_of_information) [4] might make this work. Have you played with this type of design?
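To make the integer-encoding part concrete, here's a toy sketch of the "derive constituents from the integer" idea, using plain bit flags rather than ANS; all the names here are made up:

  # Each property gets one bit, an object is the bitwise OR of its
  # properties, and constituents can be read back from the integer
  # without traversing any structure.
  PROPERTIES = {"color=red": 1 << 0, "shape=round": 1 << 1, "size=small": 1 << 2}

  def encode(props):
      code = 0
      for p in props:
          code |= PROPERTIES[p]
      return code

  def decode(code):
      return [p for p, bit in PROPERTIES.items() if code & bit]

  apple = encode(["color=red", "shape=round", "size=small"])
  decode(apple)  # recovers the constituent key/value pairs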
Aw, I was hoping for an actual corpus. A couple of years ago I wanted to find a corpus of news stories or similar material in order to analyze relationships between the entities (people/places/events) mentioned. That's when I discovered that every corpus I could find was permitted to be used only for natural language analysis, with content analysis specifically forbidden. Very disappointing. Still hoping to find a dump of a few decades' worth of news or something some day to play with.
Take a look at the Google Books corpus. I used it several years ago by renting the largest disk/RAM server Hetzner offers and collecting the most common 1-grams, 2-grams, ..., 5-grams in the corpus. Many classification, summarization, etc. systems that rely on simple bag-of-words approaches can be improved by including ngrams. For a simple example, when labeling the sentiment of text, considering 2-grams lets you correctly handle 'not good', where considering one word at a time and ignoring word order is not so good.
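Roughly, the feature extraction looks like this (a bare-bones sketch, no particular classifier assumed):

  def ngrams(tokens, n):
      # contiguous n-token windows, joined into single feature strings
      return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  tokens = "this movie is not good".split()
  features = tokens + ngrams(tokens, 2)
  # A pure bag-of-words model only sees "good"; the 2-gram feature
  # "not good" lets a classifier pick up the negation.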
What's the best source/method for getting the frequency distribution to build an arithmetic coding model [0] for English (ideally multilingual), including emoticons?
One of the main points of arithmetic coding is that you don't have to do a frequency count first like you would for Huffman. It's efficient enough that you can build your model adaptively as you go, and only sacrifice a few bits of coding efficiency.
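A bare-bones sketch of what that adaptive (order-0) model looks like; the interval-narrowing coder itself is left out:

  from collections import Counter

  class AdaptiveModel:
      """Order-0 adaptive model: every symbol starts at count 1 and the
      counts are updated as symbols are coded, so no up-front frequency
      pass over the data is needed."""

      def __init__(self, alphabet):
          self.counts = Counter({s: 1 for s in alphabet})

      def probability(self, symbol):
          return self.counts[symbol] / sum(self.counts.values())

      def update(self, symbol):
          self.counts[symbol] += 1

  model = AdaptiveModel(set("hello world"))
  for ch in "hello world":
      p = model.probability(ch)  # the coder would narrow its interval by p here
      model.update(ch)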
You're not going to find many public datasets of SMS or other emoji-heavy text, regardless of language. Twitter or Instagram are your best bets, but a model built on those will work best only on that kind of input.
[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....
[2] http://drhagen.com/blog/the-missing-11th-of-the-month/
[3] https://github.com/LuminosoInsight/wordfreq
[4] http://opus.lingfil.uu.se/OpenSubtitles2016.php