Unicode 9.0 contains 1085 emoji; that should be even easier to compare than rando...

JulianMorrison · on Aug 16, 2016

I think emoji would be hard to compare. A lot of very similar little faces. Is that a wink or a blink or a frown?

An example of Urbit's rendering of a 128-bit number into textual form is "racmus-mollen-fallyt-linpex--watres-sibbur-modlux-rinmex". While it might be gibberish, it's gibberish that even a screen-reader program could take a swing at, and that humans can easily read.

yarvin9 · on Aug 16, 2016

A similar design is Proquint, which IPFS uses:

https://www.npmjs.com/package/proquint

Proquint (5 letters per 16 bits) is tighter than Urbit's `@p` (6 letters per 16 bits). The Urbit form was designed for synthetic names and restricts itself to phonemes that sound comfortable and natural to English speakers. (Not to say that English should be the universal language; it's actually a terrible language to make everyone learn. It's just that it is.)

Word lists work reasonably well, but they're quite bulky and they don't take advantage of the human hardware accelerator for learning new words. When you have a GPU, use it. These kinds of synthetic strings also make great passwords, BTW.

(Disclaimer: Urbit guy here.)

jcranmer · on Aug 16, 2016

If you want to make a more universal phoneme-generator, the basic contours of a nearly-universal [1] phonotactics are as follows:

* Strict CV syllable scheme.

* Atonal

* Consonants distinguished only by voiced/voiceless (Chinese, e.g., doesn't do a voicing distinction, but switching to an aspiration distinction would suffice for them)

* 5 vowels: a, e, i, o, u (actual vowel quality may vary; every language that has at least 5 vowels has these 5 vowels) (some languages, particularly indigenous languages in North America, have 3 or 4 vowels, but the intersection yields too few vowels).

* Consonants are harder to inventory. /p/, /t/, /k/, /m/, /n/ are nearly universal, and /b/, /g/, /d/, /s/, /z/ are also quite common. The IPA /j/ (that's the 'y' in 'ya' for English speakers), /w/ (pronounced as you'd think in English) are pretty common semi-vowels. Maybe /l/, /ʃ/, /ʒ/ as well, should you need more consonants.

That gives you 25-75 plausible syllables, depending on how many consonants you go with.

[1] If you go by least common denominator, you end up with maybe 1 vowel and no consonants (there's no consonant phoneme present in every language IIRC).
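
To make the 25-75 figure concrete, here is a minimal Python sketch that enumerates strict CV syllables from the inventories listed above. The grouping into `CORE`, `COMMON`, and `EXTRA` tiers and the function name `cv_syllables` are illustrative assumptions, not part of any standard.

```python
from itertools import product
from math import log2

# Inventories taken from the list above; the tier names are illustrative.
VOWELS = ["a", "e", "i", "o", "u"]
CORE = ["p", "t", "k", "m", "n"]      # nearly universal
COMMON = ["b", "g", "d", "s", "z"]    # quite common
EXTRA = ["j", "w", "l", "ʃ", "ʒ"]     # semi-vowels and optional extras


def cv_syllables(consonants, vowels=VOWELS):
    """Return every consonant+vowel syllable (strict CV, no clusters)."""
    return [c + v for c, v in product(consonants, vowels)]


small = cv_syllables(CORE)
large = cv_syllables(CORE + COMMON + EXTRA)

print(len(small), len(large))  # 25 75
# Information carried per syllable if each one is equally likely.
print(round(log2(len(small)), 2), round(log2(len(large)), 2))  # 4.64 6.23
```

So even the largest inventory carries a little over 6 bits per syllable, which means a 128-bit value would need about 21 syllables.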

schoen · on Aug 17, 2016

Doesn't Lojban try to have pretty easy phonotactics or something? They do have consonant clusters, but I thought they did some kind of study and chose their phonemes and some rules on the basis of things that most languages wouldn't find too difficult.

Edit: not suggesting that Lojban's solution is somehow preferable to your advice, just trying to remember what they did about this issue.

jcranmer · on Aug 17, 2016

Lojban uses the consonants I gave (sans /j/ and /w/, although these are counted as diphthongs instead), plus /f/, /v/, /x/, /ʔ/, /h/, and /r/, as well as /ə/ for a sixth vowel. The syllable scheme seems to be largely C(C)VC(C), with largely only mixed voiced/unvoiced and geminate consonant clusters prohibited. That said, they do allow for "buffer" vowels in pronunciation to aid speakers who have trouble with consonants (and yet they have a /ə/?).

From what I can tell, CV(C) (with the second consonant usually having some restrictions) is fairly widespread. However, in my personal (purely anecdotal) experience, pronouncing foreign consonant clusters or unfamiliar final consonants is much harder than pronouncing unfamiliar initial consonants or vowels, so I'd be slightly wary of letting the final consonant go too unrestricted.

JulianMorrison · on Aug 16, 2016

Proquint example for comparison: "pokak-fijus-zavaz-posuf-bizar-luhuf-kulor-marak".
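
For anyone curious how strings like that are produced, here is a rough Python sketch of the 5-letters-per-16-bits packing. It assumes the alphabets from the published Proquint proposal as I recall them (16 consonants at 4 bits each, 4 vowels at 2 bits each). The function names `quint` and `to_proquint` are made up for illustration, and this has not been checked against the npm package linked upthread.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each (assumed alphabet)
VOWELS = "aiou"                  # 4 symbols, 2 bits each (assumed alphabet)


def quint(word16: int) -> str:
    """Encode one 16-bit value as consonant-vowel-consonant-vowel-consonant."""
    if not 0 <= word16 <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    return (
        CONSONANTS[(word16 >> 12) & 0xF]   # top 4 bits
        + VOWELS[(word16 >> 10) & 0x3]     # next 2 bits
        + CONSONANTS[(word16 >> 6) & 0xF]  # next 4 bits
        + VOWELS[(word16 >> 4) & 0x3]      # next 2 bits
        + CONSONANTS[word16 & 0xF]         # low 4 bits
    )


def to_proquint(data: bytes) -> str:
    """Encode an even-length byte string, one quint per big-endian 16-bit word."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "-".join(quint(w) for w in words)


print(to_proquint(bytes([127, 0, 0, 1])))  # lusab-babad
```

A 128-bit value is eight 16-bit words, which is why the example above has eight five-letter groups.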

jcranmer · on Aug 16, 2016

If you don't have a font that contains those emoji, you'll get either blocks of ??????? or little square blocks containing difficult-to-see numbers.

Do Debian or any of the *BSDs yet default to including a font that supports emoji?

masklinn · on Aug 16, 2016

Other possible alternatives (somewhat more limited in number so longer strings) would be box drawing characters[0] or game blocks (mahjong[1], tiles[2] or cards[3]).

[0] https://en.wikipedia.org/wiki/Box_Drawing

[1] https://en.wikipedia.org/wiki/Mahjong_Tiles_(Unicode_block)

[2] https://en.wikipedia.org/wiki/Domino_Tiles

[3] https://en.wikipedia.org/wiki/Playing_cards_in_Unicode#Playi...

sleepychu · on Aug 16, 2016

I'm not sure if you're sincere but I don't think emoji would be easier to compare. Plus, they might be less friendly to screen reader users?

masklinn · on Aug 16, 2016

> I'm not sure if you're sincere but I don't think emoji would be easier to compare.

The point is to leverage human pattern matching so you want short-ish figures with large differences.

Each hex digit is 4 bits, but each emoji is 10 bits; a 128-bit key is 13 emoji, which is significantly more eyeballable than 32 hex digits, and chances are you'll notice EGGPLANT being replaced by CAMERA in 13 pictures more easily than you'd notice B being replaced by 8 in 32 characters.
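
The arithmetic behind those numbers, as a small Python sketch. It assumes each symbol encodes a whole number of bits, so an alphabet of 1085 emoji is treated as 1024 usable symbols. The helper name `symbols_needed` is hypothetical.

```python
def symbols_needed(key_bits: int, alphabet_size: int) -> int:
    """Symbols required to write key_bits, using whole bits per symbol."""
    if alphabet_size < 2:
        raise ValueError("alphabet must contain at least two symbols")
    # Largest power of two that fits in the alphabet, expressed in bits.
    bits_per_symbol = alphabet_size.bit_length() - 1
    # Ceiling division without going through floating point.
    return -(-key_bits // bits_per_symbol)


print(symbols_needed(128, 16))    # 32 hex digits
print(symbols_needed(128, 1085))  # 13 emoji at 10 bits each
```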

lmm · on Aug 17, 2016

> Each hex digit is 4 bits, but each emoji is 10 bits; a 128-bit key is 13 emoji, which is significantly more eyeballable than 32 hex digits, and chances are you'll notice EGGPLANT being replaced by CAMERA in 13 pictures more easily than you'd notice B being replaced by 8 in 32 characters.

Compare like with like - what are the two most similar emoji from that set of 2^10 ?

avisser · on Aug 16, 2016

I hate to say it, but comparing two long hash values via a screen reader doesn't seem viable for humans, regardless of emoji.

Maybe an auralizer to turn the hash into a short piece of music?

JulianMorrison · on Aug 16, 2016

See my Urbit example above. Names like that are something a screen reader would clearly read differently for different values, if not comprehensibly.