
Great story with many details, including technical ones.

As a summary:

* The migration involved 100 employees at 14 different offices in the state of Kerala.

* At the same time there was a migration from legacy text encodings to Unicode. They used a GUI program to convert Malayalam text from the legacy encodings to Unicode.

* It is a full migration to free software, including the operating system.

* They use a Linux distribution based on Kubuntu.

* The typesetting software is Scribus.

* Other free software used includes LibreOffice, GIMP, and Inkscape.


"legacy text encodings..." That sounds horrible !


They are particularly horrendous. I've had the misfortune of working with government-provided PDFs that use custom font glyphs in lieu of proper encodings. In some cases this was the only way to encode particular languages/scripts before Unicode (Jawi was my personal experience). There are now better ways, but poorly exposed operating-system support means that for most people with these needs, custom fonts remain the entrenched method of text entry.

Some of the encodings were so esoteric we resorted to OCR instead to extract the embedded text. It was quite frustrating to know that somebody - somewhere - knew what each octet represented, but it wasn't remotely Google-able (in English, at any rate).

(Tamil was also problematic, and still is, even with Unicode, as I understand it)


Back in the 90s I assembled a binder of all the (not yet) legacy encodings then in use, sourced from ECMA and elsewhere. It was four inches thick, printed double-sided. Unicode had just seen its initial release, and it wasn't clear whether it would become the universal text encoding or whether that would be ISO 10646, which attempted to maintain a semblance of backwards compatibility with the morass of non-Latin/extended-Latin text encodings then in use. There were five commonly used encodings covering different sets of Chinese characters alone (Japan, Korea, mainland China, Hong Kong, and Taiwan each had their own encodings and selections of characters). Kids today with their UTF-8/16/32 don't know how good they have it.
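To make the fragmentation concrete, here's a quick Python sketch encoding one Han character under several legacy codecs. The region-to-encoding pairing (Shift_JIS for Japan, EUC-KR for Korea, GB2312 for mainland China, Big5-HKSCS for Hong Kong, Big5 for Taiwan) is my representative guess at the five meant above, not a claim about the binder's contents:

    # Same character, different bytes under every regional legacy
    # encoding; you couldn't read a file without knowing (out of band)
    # which scheme produced it.
    char = "中"  # U+4E2D

    for enc in ["shift_jis", "euc_kr", "gb2312", "big5hkscs", "big5", "utf-8"]:
        try:
            print(f"{enc:>10}: {char.encode(enc).hex(' ')}")
        except UnicodeEncodeError:
            print(f"{enc:>10}: not representable")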


Isn't the Unicode codepoint repertoire pretty much identical to ISO 10646's? AIUI Unicode differs only in standardizing additional character properties and rulesets; the encodings are supposed to be identical.


They weren't using Unicode at all. Instead they were using 'prehistoric' fonts made by swapping the glyphs in ASCII fonts (hundreds of them), with no consistent conventions. The way to convert these to Unicode is to build character maps font by font.
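A minimal sketch of that font-by-font mapping, in Python. The font name and byte-to-character table below are hypothetical; real legacy Malayalam fonts are messier (conjuncts and glyph reordering make the mapping far from one-to-one), so this shows only the shape of the approach:

    # Each glyph-patched ASCII font needs its own table recording which
    # Malayalam character the glyph in each byte's slot actually drew.
    FONT_MAPS = {
        "SomeLegacyFont": {   # hypothetical font name
            0x61: "\u0d05",   # 'a' slot drawn as MALAYALAM LETTER A
            0x6B: "\u0d15",   # 'k' slot drawn as MALAYALAM LETTER KA
            0x20: " ",        # space passes through unchanged
        },
    }

    def to_unicode(data: bytes, font: str) -> str:
        """Convert text stored in a glyph-patched ASCII font to Unicode."""
        table = FONT_MAPS[font]
        # Unmapped bytes become U+FFFD so gaps in the table stay visible.
        return "".join(table.get(b, "\ufffd") for b in data)

    print(to_unicode(b"ka", "SomeLegacyFont"))  # -> U+0D15 U+0D05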


ISO 10646 does not work as you describe. I'm pretty sure most of the character encodings defined by creating a custom font were never standardized by ISO.


Back in the 90s, the leading Malayalam newspaper had us download an .exe font installer, maybe only for IE5.



