Nutrimatic is a neat tool (based on the trie datastructure) that can help you find words or phrases things like this - the query i?t?s?a?p?l?e?a?s?u?r?e?t?o?s?e?r?v?e?y?o?u?&.{10,} finds subsequences of at least length 10: https://nutrimatic.org/?q=i%3Ft%3Fs%3Fa%3Fp%3Fl%3Fe%3Fa%3Fs%...
Since it uses Wikipedia as the text source and orders by frequency, the results tend to be more realistic phrases than you might find doing smashing words from a word list together.
Over christmas we had a look at the GCHQ christmas quiz [1]. The idea is that you solve a set of small puzzles, each producing a letter, which then become labels of the nodes of a directed graph, and you have to find a Eulerian path through the graph (visiting some nodes multiple times) which spells out a festive word or phrase.
We solved two or three of the small puzzles, but couldn't get the rest. I wondered if we could solve the overall puzzle without knowing the letters, by exploiting the structure of the graph: perhaps there are simply not many words or phases which could be produced by the graph.
Enumerating the possible paths through the graph was straightforward (there are a few hundred, i think). Matching those paths against individual words, if you have a word list, is easy. But where i was stumped was matching them against phrases containing multiple words. Trying every possible sequence of words, even lazily assembled, felt like it would be a combinatorial explosion that i would never be able to get through.
Using a corpus of phrases from wikipedia is absolute genius.
That's the first time I see a regular expression implementation with intersection, apart from my own from many, many moons ago. A pleasant surprise. I also had added a negation operator, which adds another level of unintuitive behavior. E.g., you might think that !a would block aa from matching, but it doesn't.
hfst (open source rewrite of the Xerox finite state toolkit) includes negation:
$ echo 'a b
c b
c d*' | hfst-regexp2fst > ab.fst
$ echo 'c ?+' | hfst-regexp2fst >cdotplus.fst
$ hfst-intersect ab.fst cdotplus.fst | hfst-expand -c3
cb
cd
cdd
cddd
cdddd
$ hfst-expand -c3 ab.fst
ab
cb
c
cd
cdd
cddd
> Not all of those statements would have been suited to the occasion of our rendezvous at the Maple Diner, but over the course of our years together—17 years, as it turned out—there came a moment for each of them.
Did make me sad, though. Considering the last message found in the text.
The mathematical term for dropping letters is taking a "subsequence", https://en.wikipedia.org/wiki/Subsequence . This is not the same as taking a substring, which is understood to be a contiguous subsequence.
A true nerd:
Gets a love letter - Writes a lengthy computational analysis what else could have been written =P. Joking aside, really interesting blog post
Since it uses Wikipedia as the text source and orders by frequency, the results tend to be more realistic phrases than you might find doing smashing words from a word list together.