Hacker News new | past | comments | ask | show | jobs | submit login
Foldable Words (bit-player.org)
206 points by Veen on Feb 12, 2021 | hide | past | favorite | 20 comments



Nutrimatic is a neat tool (based on the trie datastructure) that can help you find words or phrases things like this - the query i?t?s?a?p?l?e?a?s?u?r?e?t?o?s?e?r?v?e?y?o?u?&.{10,} finds subsequences of at least length 10: https://nutrimatic.org/?q=i%3Ft%3Fs%3Fa%3Fp%3Fl%3Fe%3Fa%3Fs%...

Since it uses Wikipedia as the text source and orders by frequency, the results tend to be more realistic phrases than you might find doing smashing words from a word list together.


Over christmas we had a look at the GCHQ christmas quiz [1]. The idea is that you solve a set of small puzzles, each producing a letter, which then become labels of the nodes of a directed graph, and you have to find a Eulerian path through the graph (visiting some nodes multiple times) which spells out a festive word or phrase.

We solved two or three of the small puzzles, but couldn't get the rest. I wondered if we could solve the overall puzzle without knowing the letters, by exploiting the structure of the graph: perhaps there are simply not many words or phases which could be produced by the graph.

Enumerating the possible paths through the graph was straightforward (there are a few hundred, i think). Matching those paths against individual words, if you have a word list, is easy. But where i was stumped was matching them against phrases containing multiple words. Trying every possible sequence of words, even lazily assembled, felt like it would be a combinatorial explosion that i would never be able to get through.

Using a corpus of phrases from wikipedia is absolute genius.

[1] https://www.gchq.gov.uk/news/christmas-card-2020 - note that the edge with a double arrow is really two edges, one in each direction


That's the first time I see a regular expression implementation with intersection, apart from my own from many, many moons ago. A pleasant surprise. I also had added a negation operator, which adds another level of unintuitive behavior. E.g., you might think that !a would block aa from matching, but it doesn't.

The anagram operator is a new one for me. Nice.


I've once implemented a regular expression engine based directly on https://en.wikipedia.org/wiki/Brzozowski_derivative – it's straightforward to implement both intersection and negation this way.


hfst (open source rewrite of the Xerox finite state toolkit) includes negation:

    $ echo 'a b
    c b
    c d*' | hfst-regexp2fst > ab.fst
    $ echo 'c ?+' | hfst-regexp2fst >cdotplus.fst
    $ hfst-intersect ab.fst cdotplus.fst | hfst-expand -c3
    cb
    cd
    cdd
    cddd
    cdddd
    $ hfst-expand -c3 ab.fst
    ab
    cb
    c
    cd
    cdd
    cddd
    
The regex syntax is a bit quirky due to backwards compatibility with lexicons written in XFST, see https://github.com/hfst/hfst/wiki/Regular-Expression-Operato...


I recommend 'o the pelican' - an entertaining subreddit of folded poetry found in tweets

https://www.reddit.com/r/othepelican/comments/86z3zk/o_the_p...


That's a delightful anecdote to kick off a computational challenge.


> Not all of those statements would have been suited to the occasion of our rendezvous at the Maple Diner, but over the course of our years together—17 years, as it turned out—there came a moment for each of them.

Did make me sad, though. Considering the last message found in the text.


Those 8 lines are a poetic sequence describing a life with a partner. I couldn't help myself though, I had to move one line:

I love you.

I tease you.

I pleasure you.

I please you.

I pester you.

I peeve you.

I salute you.

I leave you.


Yes, this is like a projecteuler question collided with a short story.


I really hope I’m this sharp and poetic when I’m 71.


I hope you are too, you can crack me jokes I won't get.


Wouldn't the regex be the phrase interleaved with question marks: i?t?s?a?p?l?e? etc?

Using that on my /usr/share/dict/words, I get 634 matches, which is about in line with what the author got.


Apart from the ^...$, it's exactly that.


The mathematical term for dropping letters is taking a "subsequence", https://en.wikipedia.org/wiki/Subsequence . This is not the same as taking a substring, which is understood to be a contiguous subsequence.


I want to know more about Lynn.


I really wish this didn't fold to the phrase "I wank now more ably".




A true nerd: Gets a love letter - Writes a lengthy computational analysis what else could have been written =P. Joking aside, really interesting blog post




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: