Show HN: I modeled the Voynich Manuscript with SBERT to test for structure

patcon · 2025-05-18T18:49:05 1747594145

I see that you're looking for clusters within PCA projections -- You should look for deeper structure with hot new dimensional reduction algorithms, like PaCMAP or LocalMAP!

I've been working on a project related to a sensemaking tool called Pol.is [1], but reprojecting its wiki survey data with these new algorithms instead of PCA, and it's amazing what new insight it uncovers with these new algorithms!

https://patcon.github.io/polislike-opinion-map-painting/

Painted groups: https://t.co/734qNlMdeh

(Sorry, only really works on desktop)

[1]: https://www.technologyreview.com/2025/04/15/1115125/a-small-...

brig90 · 2025-05-18T19:01:41 1747594901

Thanks for pointing those out — I hadn’t seen PaCMAP or LocalMAP before, but that definitely looks like the kind of structure-preserving approach that would fit this data better than PCA. Appreciate the nudge — going to dig into those a bit more.

loxias · 2025-05-19T03:21:14 1747624874

Try TDA ("mapper", or really, anything based on kernel density computed connectivity), it's a whole new world.

This ain't your parents' "factor analysis".

patcon · 2025-05-19T16:24:09 1747671849

Ooooo I will definitely check it out! It's strangely hard to find any comparisons in youtube videos -- it seems TDA isn't actually a dimensional reduction algorithm, but something closely relayed, maybe?

khafra · 2025-05-19T11:04:00 1747652640

LLM model interpretability also uses Sparse Autoencoders to find concept representations (https://openai.com/index/extracting-concepts-from-gpt-4/), and, more recently, linear probes.

staticautomatic · 2025-05-18T20:03:51 1747598631

I’ve had much better luck with umap than PCA and t-sne for reducing embeddings.

patcon · 2025-05-18T22:43:43 1747608223

PaCMAP (and its descendant localmap) are comparable to t-sne at preserving both local and global structure (but without messing much with finicky hyperparameters)

https://youtu.be/sD-uDZ8zXkc

minimaxir · 2025-05-18T17:01:51 1747587711

A point of note is that the text embeddings model used here is paraphrase-multilingual-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/paraphrase-mult...), which is about 4 years old. In the NLP world, that's effectively ancient, particularly as the robustness of even small embeddings models due to global LLM improvements has increased dramatically both in information representation and distinctiveness in the embedding space. Even modern text embedding models not explicitly trained for multilingual support still do extremely well on that type of data, so they may work better for the Voynich Manuscript which is a relatively unknown language.

The traditional NLP techniques of stripping suffices and POS identification may actually harm embedding quality than improvement, since that removes relevant contextual data from the global embedding.

brig90 · 2025-05-18T17:04:58 1747587898

Totally fair — I defaulted to paraphrase-multilingual-MiniLM-L12-v2 mostly for speed and wide compatibility, but you’re right that it’s long in the tooth by today’s standards. I’d be really curious to see how something like all-mpnet-base-v2 or even text-embedding-ada-002 would behave, especially if we keep the suffixes in and lean into full contextual embeddings rather than reducing to root forms.

Appreciate you calling that out — that’s a great push toward iteration.

Ey7NFZ3P0nzAe · 2025-05-23T05:41:46 1747978906

Be careful: they have super short context length AND silently crop if the text is too long. To me there is really no reason to use them.

I recommend ollama to run the artic-embed-v2 model, it also is multimingual and you can use --quantize when loading the modelfile to get it even smaller.

thih9 · 2025-05-18T19:15:23 1747595723

(I know nothing about NLP)

Does it make sense to check the process with a control group?

E.g. if we ask a human to write something that resembles a language but isn’t, then conduct this process (remove suffixes, attempt grouping, etc), are we likely to get similar results?

flir · 2025-05-19T02:47:38 1747622858

I suppose if you've got a hypothesis about how it was written (eg the Cardan grille method) you could generate some texts via that method and see if they display the same characteristics?

awinter-py · 2025-05-18T23:15:17 1747610117

yes exactly, why did we not simply ask 100 people to write voynich manuscripts and then train on that dataset

cedws · 2025-05-19T13:36:04 1747661764

I had a look at the manuscript for a while and found it suspicious how tightly packed the writing was against the illustrations on some pages. In common language words and letters vary in width, so when you approach the end of the line when writing, you naturally insert a break to begin a new word and avoid overrun. The manuscript is missing these kinds of breaks - I saw many places where it looked like whatever letter might squeeze in had been written at the end of the line.

I wanted to do an analysis of what letters occur just before/after a line break to see if there is a difference from the rest of the text, but couldn't find a transcribed version.

My completely amateur take is that it's an elaborate piece of art or hoax.

IAmBroom · 2025-05-19T19:29:04 1747682944

Some languages do this by split-

ting words at the end of lines.

tetris11 · 2025-05-18T16:44:41 1747586681

UMAP or TSNE would be nice, even if PCA already shows nice separation.

Reference mapping each cluster to all the others would be a nice way to indicate that there's no variability left in your analysis

brig90 · 2025-05-18T16:46:56 1747586816

Great points — thank you. PCA gave me surprisingly clean separation early on, so I stuck with it for the initial run. But you’re right — throwing UMAP or t-SNE at it would definitely give a nonlinear perspective that could catch subtler patterns (or failure cases).

And yes to the cross-cluster reference idea — I didn’t build a similarity matrix between clusters, but now that you’ve said it, it feels like an obvious next step to test how much signal is really being captured.

Might spin those up as a follow-up. Appreciate the thoughtful nudge.

lukeinator42 · 2025-05-18T17:42:03 1747590123

Do you have examples of how this reference mapping is performed? I'm interested in this for embeddings in a different modality, but don't have as much experience on the NLP side of things

tetris11 · 2025-05-18T19:05:18 1747595118

Nothing concrete, but you essentially perform shared nearest neighbours using anchor points to each cluster you wish to map to. These form correction vectors you can then use to project from one dataset to another

jszymborski · 2025-05-18T16:50:58 1747587058

When I get nice separation with PCA, I personally tend to eschew UMAP, since the relative distance of all the points to one another is easier to interpret. I avoid t-SNE at all costs, because distance in those plots are pretty much meaningless.

(Before I get yelled out, this isn't prescriptive, it's a personal preference.)

minimaxir · 2025-05-18T19:34:45 1747596885

PCA having nice separation is extremely uncommon unless your data is unusually clean or has obvious patterns. Even for the comically-easy MNIST dataset, the PCA representation doesn't separate nicely: https://github.com/lmcinnes/umap_paper_notebooks/blob/master...

jszymborski · 2025-05-18T20:57:36 1747601856

"extremely uncommon" is very much not my experience when dealing with well-trained embeddings.

I'd add that just because you can achieve separability from a method, the resulting visualization may not be super informative. The distance between clusters that appear in t-SNE projected space often have nothing to do with their distance in latent space, for example. So while you get nice separate clusters, it comes at the cost of the projected space greatly distorting/hiding the relationship between points across clusters.

tomrod · 2025-05-18T17:54:55 1747590895

We are of a like mind.

DonaldFisk · 2025-05-19T12:46:36 1747658796

This is very interesting. You should post a link to https://www.voynich.ninja/index.php

I'm not familiar with SBERT, or with modern statistical NLP in general, but SBERT works on sentences, and there are no obvious sentence delimiters in the Voynich Manuscript (only word and paragraph delimiters). One concern I have is "Strips common suffixes from Voynich words". Words in the Voynich Manuscript appear to be prefix + suffix, so as prefixes are quite short, you've lost roughly half the information before commencing your analysis.

You might want to verify that your method works for meaningful text in a natural language, and also for meaningless gibberish (encrypted text is somewhere in between, with simpler encryption methods closer to natural language and more complex ones to meaningless gibberish). Gordon Rugg, Torsten Timm, and myself have produced text which closely resembles the Voynich Manuscript by different methods. Mine is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.h... and the equivalent EVA is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.t...

Avicebron · 2025-05-18T16:41:48 1747586508

Maybe I missed it in the README but how did you do the initial encoding for the "words"? so for example, if you have ""okeeodair" as a word, where do you map that back to original symbols?

brig90 · 2025-05-18T16:45:39 1747586739

Yep, that’s exactly right — the words like "okeeodair" come directly from the EVA transliteration files, which map the original Voynich glyphs to ASCII approximations. So I’m not working with the glyphs themselves, but rather the standardized transliterated words based on the EVA (European Voynich Alphabet) system. The transliterations I used can be found here: https://www.voynich.nu/

I didn’t re-map anything back to glyphs in this project — everything’s built off those EVA transliterations as a starting point. So if "okeeodair" exists in the dataset, that’s because someone much smarter than me saw a sequence of glyphs and agreed to call it that.

us-merul · 2025-05-18T16:55:49 1747587349

I’ve found this to be one of the most interesting hypotheses: http://voynichproject.org/

The author made an assumption that Voynichese is a Germanic language, and it looks like he was able to make some progress with it.

I’ve also come across accounts that it might be an Uralic or Finno-Ugric language. I think your approach is great, and I wonder if tweaking it for specific language families could go even further.

veqq · 2025-05-18T17:44:24 1747590264

This thread discusses the many purported "solutions": https://www.voynich.ninja/thread-4341.html While Bernholz' site is nice, Child's work doesn't shed much light on actually deciphering the MS.

us-merul · 2025-05-18T17:49:05 1747590545

Thanks for this! I had come across Child’s hypothesis after doing a search related to Old Prussian and Slavic languages, so I don’t have much context for this solution, and this is helpful to see.

philistine · 2025-05-19T02:21:35 1747621295

With how undecipherable the manuscript is, my personal theory is that it's the work of a naive artist and that there's no language behind it. Just someone aping language without knowing the rules about language: https://en.wikipedia.org/wiki/Naïve_art

It's not a mental issue, it's just a rare thing that happens. Voynich fits the whole bill for the work of a naive artist.

cronopios · 2025-05-19T07:30:23 1747639823

And that naïve artist somehow managed to create a work that follows Zipf's law, 4 centuries before it was discovered?

DonaldFisk · 2025-05-19T12:23:22 1747657402

Random Texts Exhibit Zipf’s-Law-Like Word Frequency Distribution: https://www.nslij-genetics.org/wp-content/uploads/2022/12/ie...

It also applies to a range of natural phenomena, e.g. lunar craters and earthquakes: https://www.cs.cornell.edu/courses/cs6241/2019sp/readings/Ne...

So the fact that word frequencies in the Voynich Manuscript follow Zipf's law doesn't prove it's written in a natural language.

poulpy123 · 2025-05-19T12:14:05 1747656845

Why would it not ?

riffraff · 2025-05-19T06:02:47 1747634567

You're not alone. Many have hypothesized this is just made up gibberish given the unusual distribution of glyphs.

Not a recent hoax/scam, but an ancient one.

It's not like there weren't a ton of fake documents in the middle age and renaissance, from the donation of Constantine to Preserve John's letter.

philistine · 2025-05-19T13:59:22 1747663162

The way you describe it is why it’s not readily accepted. It’s misunderstood. You called it a hoax/scam and a fake. It’s not!

Whoever made the document was sincere in making up something that doesn’t exist. They had no intention to mislead. You wouldn’t call a D&D campaign a hoax because it features nonexistent things?

riffraff · 2025-05-22T06:17:48 1747894668

So you're saying it's a prop or a rulebook? (an old xkcd comic mentioned this).

I doubt it's a rulebook cause it's not a real language.

If it's a prop, it would be extremely expensive.

Just to get the parchment you'd have to slaughter a herd of ovines, then you'd have to process it, then you'd have to pay a skilled professional or more for months of work to draw and write.

So I think the profit motive is more likely, and given we know of a ton of scams like this from this period, it seems the most plausible.

But I'll be happy to be proven wrong if someone finds more info in the future.

GolfPopper · 2025-05-18T23:33:36 1747611216

Edward Kelly[1] was in the right place at the right time, and I recall reading many years ago (though I cannot now find the source) some evidence that he was familiar with the Cardan grille[2], which was sufficient to convince me that he was mostly likely the author, and that the book was intended as a hoax or fraud.

1.https://en.wikipedia.org/wiki/Edward_Kelley

2.https://en.wikipedia.org/wiki/Cardan_grille

renhanxue · 2025-05-19T01:20:54 1747617654

These days the manuscript is quite conclusively dated to the first half of the 15th century; the parchment it's written on is definitely from that period, since it's been carbon dated to 1404–1438 with 95% confidence. The general style is also consistent with that dating. For example, medievalist Lisa Fagin Davis writes in a recent paper: "[t]he humanistic tendencies of the glyphset, the color palette, and style of the illustrations suggest an origin in the early fifteenth century" [0].

Edward Kelly was born over a hundred years later, so him "being at the right time" seems to be a bit of a stretch.

[0]: https://ceur-ws.org/Vol-3313/keynote2.pdf

emmelaich · 2025-05-19T01:36:07 1747618567

I think it's entirely possible the inks are much later. Possibly Kelly erased whatever was on the parchment previously. In fact the drawings might have made liberal use of the original, just to hide that fact.

Which is worse actually. Kelly may have semi-erased an existing valuable manuscript.

renhanxue · 2025-05-19T01:44:17 1747619057

The hypothesis that the manuscript is a palimpsest (that is, written on an old parchment that was scraped clean of a previous text; such recycling was common because parchment was expensive) has been thoroughly rejected. That sort of thing is detectable, in fact there's an entire field of research dedicated to recovering lost texts from palimpsests, but the Voynich manuscript shows absolutely no signs of that.

emmelaich · 2025-05-19T02:37:20 1747622240

You're right. I have just read a bit about this[0] and agree. I do still believe that it's possible for the expensive parchment to have been obtained by someone uneducated or "naive", or quack and used by them.

[0] https://manuscriptroadtrip.wordpress.com/2024/09/08/multispe...

quantadev · 2025-05-18T20:38:07 1747600687

Being from the 15th Century the obvious reason to encrypt text was to avoid religious persecution during "The Inquisition" (and other religion-motivated violence of that time). So it would be interesting to run the same NLP against the Gospels and look for correlations with that. You'd want to first do a 'word'-based comparison, and then a 'character'-based comparison. I mean compare the graphs from Bible to graphs from Voynich.

Also there might be some characters that are in there just to confuse. For example that bizarre capital "P"-like thing that has multiple variations seems to appear sometimes far too often to represent real language, so it might be just an obfuscator that's removed prior to decryption. There may be other characters that are abnormally "frequent" and they're maybe also unused dummy characters. But the "too many Ps" problem is also consistent with just pure fiction too, I realize.

codesnik · 2025-05-18T20:26:15 1747599975

what I'd expect from a handwritten book like that, if it is just a gibberish, and not a cypher of any sorts - the style, calligraphy, the words used, even letters themselves should evolve from page 1 to the last page. Pages could be reordered of course, but it still should be noticeable.

Unless author hadn't written tens of books exactly like that before, which didn't survive, of course.

I don't think it's a very novel idea, but I wonder if there's analysis for pattern like that. I haven't seen mentions of page to page consistency anywhere.

veqq · 2025-05-18T21:35:00 1747604100

> I haven't seen mentions of page to page consistency anywhere.

A lot of work's been done here. There are believed to have been 2 scribes (see Prescott Currier), although Lisa Fagin Davis posits 5. Here's a discussion of an experiment working off of Fagin Davis' position: https://www.voynich.ninja/thread-3783.html

empath75 · 2025-05-19T13:19:25 1747660765

My favorite part of this thread is like a dozen different people replying that it's already been deciphered and none of them posted the same one.

bunderbunder · 2025-05-19T15:49:01 1747669741

> Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork.

I'd argue that these are just the camps that non-traditional, amateur analysis efforts fall into. I've only briefly skimmed Voynich work, but my impression is that, traditionally, more academic analyses rely on a combination of linguistic and cryptological analysis. This does happen to be informed by some statistical analysis, but goes way beyond that.

For example, as I recall the strongest argument that Voynichese probably isn't just an alternative alphabet for a well-known language relies on comparing Voynichese to the general patterns for how writing systems map symbols to sounds. That permits the development of more specific hypotheses about how it could possibly function, including how likely it is to be an alphabet or abjad, and, hypotheses about which characters could plausibly represent more than one sound, possible digraphs, etc. All of that work casts severe doubt on the likelihood of it representing a language from the area because it just can't plausibly represent a language with the kinds of phonological inventories we see in the language families that existed in that place and time.

There's also been some pretty interesting work on identifying individual scribes based on a confluence of factors including, but not limited to, analysis of the text itself. Some of the inferred scribes exclusively wrote in the A language (oh yeah, Voynichese seems to contain two distinct "languages"), some exclusively wrote in the B language, I think they've even hypothesized that there's one who actually used both languages.

There isn't a lot of popular awareness of this work because it's not terribly sexy to anyone but a linguistics nerd. But I'd guess that any attempt to poke at the Voynich manuscript that isn't informed by it is operating at a severe disadvantage. You want to be standing on the shoulders of the tallest giants, not the ones with the best social media presence.

gwillen · 2025-05-19T13:59:16 1747663156

Confirm or deny my suspicion: your post and your comments in this thread are substantially written by ChatGPT?

brig90 · 2025-05-19T15:13:10 1747667590

I'm definitely not the most comfortable writing in public forums, so guilty as charged with throwing my comments through an LLM to make sure my point isn't being misconstrued.

tough · 2025-05-19T14:51:47 1747666307

em dashes innit

many english as second language speakers use LLMs as translators nowadays tho

pawanjswal · 2025-05-19T02:19:39 1747621179

This is hands-down the nerdiest and coolest deep-dive into the Voynich I’ve seen.

brig90 · 2025-05-19T03:07:44 1747624064

Honestly, I had never even heard of the manuscript before this weekend. I’ve been looking for interesting ways to strengthen my understanding of NLPs, and thought: 1) maybe this would be a good fit, and 2) maybe it hadn’t been approached in quite this way before?

That second part wasn’t super important though — this was more about learning and experimenting than trying to break new ground. Really appreciate the kind words, and hopefully it sparks someone to take it even further.

user32489318 · 2025-05-18T19:14:14 1747595654

Would analysis of a similar body of text in a known language yield similar patterns? Put it in another way, could you use this type of an analysis on different types of text help understand what this script describes?

frozenseven · 2025-05-19T03:59:24 1747627164

Really cool work here. Have you considered applying these same techniques to the Rohonc Codex? As far as I know, the only other book similar to the Voynich Manuscript.

brig90 · 2025-05-20T02:57:00 1747709820

Honestly I've never heard of Rohonc Codex. I'll have to check it out! Thanks!

ck2 · 2025-05-18T19:11:35 1747595495

> "New multispectral analysis of Voynich manuscript reveals hidden details"

https://arstechnica.com/science/2024/09/new-multispectral-an...

but imagine if it was just a (wealthy) child's coloring book or practice book for learning to write lol

Avicebron · 2025-05-18T19:24:43 1747596283

> but imagine if it was just a (wealthy) child's coloring book or practice book for learning to write lol

Even if it was "just" an (extraordinarily wealthy and precocious) child with a fondness for plants, cosmology, and female bodies carefully inscribing nonsense by repeatedly doodling the same few characters in blocks that look like the illuminated manuscripts this child would also need access to, that's still impressive and interesting.

bdbenton5255 · 2025-05-19T01:19:37 1747617577

Another great natural mystery that machine learning could tackle is earthquake prediction. Sure you could find some patterns modeling historical data.

brig90 · 2025-05-19T01:23:47 1747617827

Totally — I love that kind of sideways thinking. Earthquake prediction feels like one of those massive, noisy systems where patterns might exist, but they’re buried deep in complexity. I’ll admit, I know absolutely nothing about seismology, so I have no idea how realistic that kind of modelling would be — but yeah, it feels like one of those domains where structure might be hiding in what looks like chaos.

Appreciate the nudge — always fascinating to see where people take this kind of thinking.

PaulDavisThe1st · 2025-05-19T01:47:44 1747619264

What data would you feed it and why?

psychoslave · 2025-05-19T09:12:20 1747645940

Most likely seismograph logs, I guess? Not sure gas emissions are something we can expect to provide a more ahead of time alert. That is my guess is that by the time some out of charts in gas emissions are showing, it's too late to make a whole city population move? I'm nothing like a volcanologist though, this is just very wild guess.

On the other hand, it's a bit wild to build a whole city next to volcanos that are definitely going to wake up in less than a few centuries, to begin with.

poulpy123 · 2025-05-19T12:15:50 1747656950

There are already works using machine learning on thebtopic

andrewla · 2025-05-19T13:42:51 1747662171

Although I skimmed the methodology out of curiosity, what really drew my eye was the transcription in the repository of the manuscript. This led me down a rabbit hole leading here [1] about historic efforts to transcript or transliterate the manuscript.

[1] https://www.voynich.nu/transcr.html

marcodiego · 2025-05-18T20:22:48 1747599768

How expensive is a "brute force" approach to decode it? I mean, how about mapping each unknown word by a known word in a known language and improve this mapping until a 'high score' is reached?

munchler · 2025-05-18T21:57:21 1747605441

This seems to assume that a 1:1 mapping between words exists, but I don't think that's true for languages in general. Compound words, for example, won't map cleanly that way. Not to mention deeper semantic differences between languages due to differences in culture.

raverbashing · 2025-05-19T07:52:12 1747641132

Correct

Mapping words 1:1 is not going to lead you anywhere (especially for a text that has stood undecoded for so long time)

It kiiiinda works for very close languages (think Dutch<>German or French<>Spanish) and even then.

brig90 · 2025-05-18T20:27:01 1747600021

That’s a really interesting question — and one I’ve been circling in the back of my head, honestly. I’m not a cryptographer, so I can’t speak to how feasible a brute-force approach is at scale, but the idea of mapping each Voynich “word” to a real word in another language and optimizing for coherence definitely lines up with some of the more experimental approaches people have tried.

The challenge (as I understand it) is that the vocabulary size is pretty massive — thousands of unique words — and the structure might not be 1:1 with how real language maps. Like, is a “word” in Voynich really a word? Or is it a chunk, or a stem with affixes, or something else entirely? That makes brute-forcing a direct mapping tricky.

That said… using cluster IDs instead of individual word (tokens) and scoring the outputs with something like a language model seems like a pretty compelling idea. I hadn’t thought of doing it that way. Definitely some room there for optimization or even evolutionary techniques. If nothing else, it could tell us something about how “language-like” the structure really is.

Might be worth exploring — thanks for tossing that out, hopefully someone with more awareness or knowledge in the space see's it!

quantadev · 2025-05-18T20:47:20 1747601240

Like I said in another post (sorry for repeating) since this was during 1500s, the main thing people would've been encrypting back then was biblical text (or any other religion).

Maybe a version of scripture that had been "rejected" by some King, and was illegal to reproduce? Take the best radiocarbon dating, figure out who was King back then, and if they 'sanctioned' any biblical translations, and then go to the version of the bible before that translation, and this will be what was perhaps illegal and needed to be encrypted. That's just one plausible story. Who knows, we might find out the phrase "young girl" was simplified to "virgin", and that would potentially be a big secret.

edoceo · 2025-05-19T00:09:21 1747613361

Is this grey cause it talks about religion? That stuff was bigger in 1500 than 2000, from that lense as religious text seems a reasonable track to follow.

quantadev · 2025-05-19T00:30:10 1747614610

Other than war plans, religious text was pretty much the only thing in the 1500s that would have been encrypted. However war plans would be very unlikely to be disguised as a botany book, for all kinds of reasons. War plans are temporary, not something you'd dedicate that level of artistic effort and permanence to.

tough · 2025-05-19T13:30:23 1747661423

The art of war by Sun Tzu is pretty timeless tho

quantadev · 2025-05-19T15:25:03 1747668303

Right, because it's not a war plan. A war plan is about when, where, how, who, etc, for specific attack(s).

tough · 2025-05-19T16:00:51 1747670451

yes indeed, more like a war blueprint? like in general strategies applicable to many battles (so you can infer plans for any n wars of the future)

idk

quantadev · 2025-05-19T16:24:33 1747671873

I mean it's theoretically possible a 1500s King might have made that book illegal, because of it's general knowledge. That's a legit point.

Sadly the radio carbon dating disproved two of my far out theories, which was, 1) The book survived from some earlier 'iteration' of life on the planet, where all plants were simply different. or 2) All planets form the same 'kind' of carbon-based life, and this book was sent/delivered to us by another planet.

Sadly, it's probably just someone's form of "art", and not even "real".

marcodiego · 2025-05-18T20:29:59 1747600199

It might be a good idea for a SETI@home like project.

mellow_observer · 2025-05-19T07:30:29 1747639829

I don't think that's likely possible. How would you determine the score? Where would you get your corpus of medieval words? How would you deal with the insane computational complexity?

Pecularities in Voynich also suggest that one to one word mappings are very unlikely to result in well described languages. For instance there's cases of repeated word sequences you don't really see in regular text. There's a lack of extremely common words that you would expect would be neccessary for a word based structured grammar, there's signs that there's at least two 'languages', character distributions within words don't match any known language, etc.

If there still is a real unencoded language in here, it's likely to be entirely different from any known language.

gthompson512 · 2025-05-19T02:24:14 1747621454

Sorry if I missed it, but what about keeping the suffixes and trying to do some finetuning on the source then clustering sentences or at least pages which given the media should be consistent-ish

brig90 · 2025-05-19T03:04:18 1747623858

Great question — and something I've been thinking about. I stripped suffixes mostly to normalize some of the repeated endings (aiin, dy, etc.) that felt like filler, but you’re totally right that preserving them might preserve structure I lost.

Clustering by sentence or page would be interesting too — I haven't gone that far yet, but it’d be fascinating to see if there’s consistency across visual/media sections. Appreciate the insight!

bpiroman · 2025-05-18T23:30:08 1747611008

I thought it was old turkish?

https://www.youtube.com/watch?v=p6keMgLmFEk&t=1s

bpiroman · 2025-05-18T23:31:58 1747611118

English translation of the manuscript is timestamped below:

https://youtu.be/p6keMgLmFEk?feature=shared&t=559

Nursie · 2025-05-19T01:48:59 1747619339

Seems not - https://www.youtube.com/watch?v=UgVZZrZ1eqY

There's also a very long thread about it here - https://www.voynich.ninja/thread-2318.html - that seems to go from "that's really interesting, let's find out more about it" to "eh, seems about the same as other revelatory announcements about Romance, Hebrew etc"

thearn4 · 2025-05-19T16:08:28 1747670908

Voynich is one of my favorite unsolved puzzles. This approach looks fascinating, so thanks for sharing your work here!

theRealEros · 2025-05-28T16:35:19 1748450119

I feel like we are missing an important point...the transformer model that has been used here is trained on known languages. This means it cannot extract meaningful embeddings from a text in a unknown language...are the plots just noise then?

GTP · 2025-05-18T20:03:30 1747598610

The link to the write-up seems broken, can you write the correct one?

brig90 · 2025-05-18T20:13:14 1747599194

Apologies but its not letting me edit post any longer (I'm new to HN), here's the link though: https://brig90.substack.com/p/modeling-the-voynich-manuscrip...

rossant · 2025-05-18T19:46:40 1747597600

TIL about the Voynich manuscript. Fascinating. Thank you.

adzm · 2025-05-18T21:21:18 1747603278

It is a great coffee table book!

mach5 · 2025-05-19T11:58:41 1747655921

theres no need to do any of this, its fake, its a forgery

Tade0 · 2025-05-19T13:00:49 1747659649

That much is understood, but how did someone manage to produce such a huge body of text which isn't complete nonsense?

AStonesThrow · 2025-05-19T16:24:52 1747671892

https://m.xkcd.com/593/

glimshe · 2025-05-18T16:57:10 1747587430

I strongly believe the manuscript is undecipherable in the sense thats it's all gibberish. I can't prove it, but at this point I think it's more likely than not to be hoax.

lolinder · 2025-05-18T17:11:42 1747588302

Statistical analyses such as this one consistently find patterns that are consistent with a proper language and would be unlikely to have emerged from someone who was just putting gibberish on the page. To get the kinds of patterns these turn up someone would have had to go a large part of the way towards building a full constructed language, which is interesting in its own right.

ahmedfromtunis · 2025-05-18T18:24:42 1747592682

Personally, I have no preference to any theory about the book; whichever it turns out to be, I'll take it as is.

That said, I just watched a video about the practice of "speaking in tongues" that some christian congregations practice. From what I understand, it's a practice where believers speak in gibberish for certain rituals.

Studying these "speeches", researches found patterns and rhythms that the speakers followed without even being aware they exist.

I'm not saying that's what's happening here, but maybe if this was a hoax (or a prank), maybe these patterns emerged just because they were inscribed by a human brain? At best, these patterns can be thought of as shadows of the patterns found in the writers mother tongue?

InsideOutSanta · 2025-05-18T18:06:07 1747591567

> would be unlikely to have emerged from someone who was just putting gibberish on the page

People often assert this, but I'm unsure of any evidence. If I wrote a manuscript in a pretend language, I would expect it to end up with language-like patterns, some automatically and some intentionally.

Humans aren't random number generators, and they aren't stupid. Therefore, the implicit claim that a human could not create a manuscript containing gibberish that exhibits many language-like patterns seems unlikely to be true.

So we have two options:

1. This is either a real language or an encoded real language that we've never seen before and can't decrypt, even after many years of attempts

2. Or it is gibberish that exhibits features of a real language

I can't help but feel that option 2 is now the more likely choice.

neom · 2025-05-18T18:10:24 1747591824

For some reason your comment reminds me of this: https://en.wikipedia.org/wiki/Prisencolinensinainciusol - https://www.youtube.com/watch?v=fU-wH8SrFro

buildsjets · 2025-05-19T01:21:29 1747617689

Japanese psychedelic band Kikagaku Moyo writes lyrics that are basically Japanese baby talk / babble.

https://youtube.com/watch?v=idfOZFdTM-8&t=2m39s

https://genius.com/Kikagaku-moyo-dripping-sun-lyrics

tonymillion · 2025-05-18T22:53:31 1747608811

And let’s not forget “Ken Lee”

https://youtu.be/vUAaHkGpJy8

CamperBob2 · 2025-05-18T18:25:19 1747592719

Or Dead Can Dance, e.g. https://www.youtube.com/watch?v=VEVPYVpzMRA .

It's harder to generate good gibberish than it appears at first.

cubefox · 2025-05-18T18:19:33 1747592373

Creating gibberish with the statistical properties of a natural language is a very hard task if you do this hundreds of years before the discovery of said statistical properties.

vehemenz · 2025-05-18T20:19:36 1747599576

I'm not sure where this claim keeps coming from. Voynichese doesn't exhibit the statistical qualities of any known natural language. In a very limited sense, yes, but on balance, no. There is too much repetition for that.

userbinator · 2025-05-19T01:56:03 1747619763

Those statistical properties are inherent in how the human brain works.

InsideOutSanta · 2025-05-18T18:32:12 1747593132

veqq · 2025-05-18T18:02:53 1747591373

> consistent with a proper language

There's certainly a system to the madness, but it exhibits rather different statistical properties from "proper" languages. Look at section 2.4: https://www.voynich.nu/a2_char.html At the moment, any apparently linguistic patterns are happenstance; the cypher fundamentally obscures its actual distribution (if a "proper" language.)

andoando · 2025-05-18T17:38:36 1747589916

Could still be gibberish.

Shud less kee chicken souls do be gooby good? Mus hess to my rooby roo!

edoceo · 2025-05-19T00:11:53 1747613513

This reads like bayou-louisanian or possibly Ozark (source: family)

Loughla · 2025-05-18T21:16:07 1747602967

If you're going to make a hoax for fun or for profit, wouldn't it be the best first step to make it seem legitimate, by coming up with a fake language? Klingon is fake, but has standard conventions. This isn't really a difficult proposition compared to all of the illustrations and what-not, I would think.

int_19h · 2025-05-18T21:28:45 1747603725

If you come up with a fake language, then by definition the text has some meaning in said language.

vehemenz · 2025-05-18T20:17:28 1747599448

Even before we consider the cipher, there's a huge difference between a constructed language and a stochastic process to generate language-like text.

lolinder · 2025-05-18T20:43:15 1747600995

A stochastic pattern to generate language-like text in the early 1400s is a lot more interesting than gibberish.

tough · 2025-05-19T13:32:46 1747661566

I ching is similar conceptually to llms today (ancient chinese)

https://en.wikipedia.org/wiki/I_Ching

himinlomax · 2025-05-18T18:35:08 1747593308

There are many aspects that point to the text not being completely random or clumsily written. In particular it doesn't fall into many faults you'd expect from some non-expert trying to come up with a fake text.

The age of the document can be estimated through various methods that all point to it being ~500 year old. The vellum parchment, the ink, the pictures (particularly clothes and architecture) are perfectly congruent with that.

The weirdest part is that the script has a very low number of different signs, fewer than any known language. That's about the only clue that could point to a hoax afaik.

poulpy123 · 2025-05-19T12:32:24 1747657944

The book is obviously a hoax (either voluntary or not), the question is if the text is a cypher, a transliteration, a fake language, or just gibberish.

As far as I know it's just gibberish since it doesn't follow the statistics of the known languages or cyphers of the time.

andyjohnson0 · 2025-05-18T16:38:35 1747586315

This looks very interesting - nice work!

I have no background in NLP or linguistics, but I do have a question about this:

> I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants

This seems to imply stripping the right-hand edges of words, with the assumption that the text was written left to right? Or did you try both possibilities?

Once again, nice work.

brig90 · 2025-05-18T16:41:11 1747586471

Great question — and you’re right to catch the assumption there. I did assume left-to-right when stripping suffixes, mostly because that’s how the transliteration files were structured and how most Voynich analyses approach it. I didn’t test the reverse — though flipping the structure and checking clustering/syntax behavior would be a super interesting follow-up. Appreciate you calling it out!

fader · 2025-05-19T14:32:17 1747665137

It's generally agreed that the Voynich Manuscript was written (and should be read) left to right. For example, margins are aligned on the left and uneven on the right, indicating that the writer started from the left.

veqq · 2025-05-18T17:20:55 1747588855

The best work on Voynich has been done by Emma Smith, Coons and Patrick Feaster, about loops and QOKEDAR and CHOLDAIIN cycles. Here's a good presentation: https://www.youtube.com/watch?v=SCWJzTX6y9M Zattera and Roe have also done good work on the "slot alphabet". That so many are making progression in the same direction is quite encouraging!

https://www.voynich.ninja/thread-4327-post-60796.html#pid607... is the main forum discussing precisely this. I quite liked this explanation of the apparent structure: https://www.voynich.ninja/thread-4286.html

> RU SSUK UKIA UK SSIAKRAINE IARAIN RA AINE RUK UKRU KRIA UKUSSIA IARUK RUSSUK RUSSAINE RUAINERU RUKIA

That is, there may be 2 "word types" with different statistical properties (as Feaster's video above describes)(perhaps e.g. 2 different Cyphers used "randomly" next to each other). Figuring out how to imitate the MS' statistical properties would let us determine cypher system and make steps towards determining its language etc. so most credible work's gone in this direction over the last 10+ years.

This site is a great introduction/deep dive: https://www.voynich.nu/

brig90 · 2025-05-18T17:25:30 1747589130

I’m definitely not a Voynich expert or linguist — I stumbled into this more or less by accident and thought it would make for a fun NLP learning project. Really appreciate you pointing to those names and that forum — I wasn’t aware of the deeper work on QOKEDAR/CHOLDAIIN cycles or the slot alphabet stuff. It’s encouraging to hear that the kind of structure I modeled seems to resonate with where serious research is heading.

akomtu · 2025-05-18T17:59:28 1747591168

Ock ohem octei wies barsoom?

nine_k · 2025-05-18T16:37:22 1747586242

In short, the manuscript looks like a genuine text, not like a random bunch of characters pretending to be a text.

<quote>

Key Findings

* Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group

* Cluster 3 has high diversity and flexible positioning — likely a root content class

* Transition matrix shows strong internal structure, far from random

* Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)

Hypothesis

The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.

</quote>

brig90 · 2025-05-18T16:39:59 1747586399

Yep, that was my takeaway too — the structure feels too consistent to be random, and it echoes known linguistic patterns.

gchamonlive · 2025-05-18T16:52:47 1747587167

I'd be surprised if it was indeed random, but the consistency is really surprising. I say this because I imagine that anyone that would be able to produce such text is a master scribe that put countless hours writing other works, so he's supposed to be very familiar with such structure, therefore even if he was going for randomness, I would doubt he would achieve it.

InsideOutSanta · 2025-05-18T18:08:14 1747591694

> the structure feels too consistent to be random

I don't see how it could be random, regardless of whether it is an actual language. Humans are famously terrible at generating randomness.

nine_k · 2025-05-18T18:55:07 1747594507

The kind of "randomness" hardly compatible with language-like structure could arise from choosing the glyphs according to purely graphical concerns, "what would look nice here", lines being too long or too short, avoiding repeating sequences or, to the contrary, achieving interesting 2D structures in the text, etc. It's not cryptography-class randomness, but it would be enough to ruin the rather well-expressed structures in the text (see e.g. the transition matrix).

InsideOutSanta · 2025-05-18T19:07:58 1747595278

>choosing the glyphs according to purely graphical concerns, "what would look nice here", lines being too long or too short, avoiding repeating sequences or, to the contrary, achieving interesting 2D structures in the text

I wouldn't assume that the writer made decisions based on these goals, but rather that the writer attempted to create a simulacrum of a real language. However, even if they did not, I would expect an attempt at generating a "random" language to ultimately mirror many of the properties of the person's native language.

The arguments that this book is written in a real language rest on the assumption that a human being making up gibberish would not produce something that exhibits many of the properties of a real language; however, I don't see anyone offering any evidence to support this claim.

cookiengineer · 2025-05-18T21:53:58 1747605238

[flagged]

brig90 · 2025-05-18T22:03:12 1747605792

This doesn’t burst my bubble at all — if anything, it’s great to hear that others have been able to make meaningful progress using different methods. I wasn’t trying to crack the manuscript or stake a claim on the origin; this project was more about exploring how modern tools like NLP and clustering could model structure in unknown languages.

My main goal was to learn and see if the manuscript behaved like a real language, not necessarily to translate it. Appreciate the link — I’ll check it out (once I get my German up to speed!).

Nursie · 2025-05-19T02:27:09 1747621629

Their theories appear not to be incredibly well accepted amongs people who take an interest, and appears (like other 'translations') to grant the translator so many degrees of freedom as to be effecitvely unfalsifiable.

0points · 2025-05-19T06:07:05 1747634825

There's been a large amount of claims of decipherment, but so far none of them have stood up to scrutiny.

So, sorry but you are not busting any bubbles today.

cookiengineer · 2025-05-19T16:37:22 1747672642

Please link evidence to said scrutiny then?

ablanton · 2025-05-18T18:07:51 1747591671

Wasn't it already deciphered, though?

https://www.researchgate.net/publication/368991190_The_Voyni...

Reubend · 2025-05-18T18:37:02 1747593422

Most agree that this is not a real solution. Many of the pages translate to nonsense using that scheme, and some of the figures included in the paper don't actually come from the Voynich manuscript in the first place.

For more info, see https://www.voynich.ninja/thread-3940-post-53738.html#pid537...

krick · 2025-05-18T20:35:24 1747600524

I'm not really following the research, so it's rather a lazy question (assuming you do): does any of it follow the path Derek Vogt was suggesting in his (kinda famous) videos (that he deleted for some reason)? I remember when I was watching them, it felt so convincing I thought "Alright, it looks like there must be a short leap to the actual solution now."

Yet 10 years later I still hear that the consensus is that there's no agreeable translation. So, what, all this mandaic-gypsies was nothing? And all coincidences were… coincidences?

mellow_observer · 2025-05-19T07:49:50 1747640990

If you spend some time working on Voynich yourself you'll find that it's actually fairly doable to come up with some translation where you can find a few words that seem to agree with each other. And when you allow yourself some permissions like unorthodox spellings or characters that can mean different things in different places, then it's not so hard to even be able to 'translate' a few seemingly reasonable sentences. This gives a lot of hope to the translator and any who follow them

So far none of these ideas have been shown to be applicable to the full text though. What you would expect with a real translation is that the further you get with your translation, the easier it becomes to translate more. But with the attempts so far is that we keep seeing that it becomes more and more difficult to pretend that other pages are just as translatable using the same scheme you came up initially. It eventually just dies a quiet death

cookiengineer · 2025-05-18T21:56:16 1747605376

Check out Rainer Hannig's instructions:

https://www.rainer-hannig.com/voynich/