
Great story with many details, including technical ones.

As a summary:

* The migration involved 100 employees at 14 different offices in the state of Kerala.

* At the same time there was a migration from legacy text encodings to Unicode. They used a GUI program to convert Malayalam text from the legacy encodings to Unicode.

* It is a full migration to free software, including the operating system.

* They use a Linux distribution based on Kubuntu.

* The typesetting software is Scribus.

* Other free software used includes LibreOffice, GIMP, and Inkscape.


"legacy text encodings..." That sounds horrible !


They are particularly horrendous. I've had the misfortune of working with government-provided PDFs that use custom font glyphs in lieu of proper encodings. In some cases this was the only way to encode particular languages/scripts before Unicode (Jawi was my personal experience). There are now better ways, but poorly exposed operating-system support means that for most people with these needs, custom fonts remain the entrenched method of text entry.

Some of the encodings were so esoteric we resorted to OCR instead to extract the embedded text. It was quite frustrating to know that somebody - somewhere - knew what each octet represented, but it wasn't remotely Google-able (in English, at any rate).

(Tamil was also problematic, and still is, even with Unicode, as I understand it)


Back in the 90s I assembled a binder of all the (not yet) legacy encodings then in use, sourced from ECMA and elsewhere. It was four inches thick, printed double-sided. Unicode had just seen its initial release, and it wasn't clear whether it would become the universal text encoding or whether that would be ISO 10646, which attempted to maintain a semblance of backwards compatibility with the morass of non-Latin/extended-Latin text encodings then in use. There were five commonly used encodings covering different sets of Chinese characters alone (Japan, Korea, mainland China, Hong Kong, and Taiwan each had their own encodings and selections of characters). Kids today with their UTF-8/16/32 don't know how good they have it.
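To make the fragmentation concrete, here's a quick Python sketch encoding one Han character under several legacy codecs. The region-to-encoding pairing (Shift_JIS for Japan, EUC-KR for Korea, GB2312 for mainland China, Big5-HKSCS for Hong Kong, Big5 for Taiwan) is my representative guess at the five meant above, not a claim about the binder's contents:

    # Same character, different bytes under every regional legacy
    # encoding; you couldn't read a file without knowing (out of band)
    # which scheme produced it.
    char = "中"  # U+4E2D

    for enc in ["shift_jis", "euc_kr", "gb2312", "big5hkscs", "big5", "utf-8"]:
        try:
            print(f"{enc:>10}: {char.encode(enc).hex(' ')}")
        except UnicodeEncodeError:
            print(f"{enc:>10}: not representable")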


Isn't the Unicode codepoint repertoire pretty much identical to ISO 10646's? AIUI Unicode differs only in standardizing additional character properties and rulesets; the encodings are supposed to be identical.


They weren't using Unicode at all. Instead they were using 'prehistoric' fonts made by swapping the glyphs in ASCII fonts (hundreds of them), with no consistent conventions. The way to convert these to Unicode is to build character maps font by font.
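A minimal sketch of that font-by-font mapping, in Python. The font name and byte-to-character table below are hypothetical; real legacy Malayalam fonts are messier (conjuncts and glyph reordering make the mapping far from one-to-one), so this shows only the shape of the approach:

    # Each glyph-patched ASCII font needs its own table recording which
    # Malayalam character the glyph in each byte's slot actually drew.
    FONT_MAPS = {
        "SomeLegacyFont": {   # hypothetical font name
            0x61: "\u0d05",   # 'a' slot drawn as MALAYALAM LETTER A
            0x6B: "\u0d15",   # 'k' slot drawn as MALAYALAM LETTER KA
            0x20: " ",        # space passes through unchanged
        },
    }

    def to_unicode(data: bytes, font: str) -> str:
        """Convert text stored in a glyph-patched ASCII font to Unicode."""
        table = FONT_MAPS[font]
        # Unmapped bytes become U+FFFD so gaps in the table stay visible.
        return "".join(table.get(b, "\ufffd") for b in data)

    print(to_unicode(b"ka", "SomeLegacyFont"))  # -> U+0D15 U+0D05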


ISO 10646 does not work as you describe. I'm pretty sure most of the character encodings defined by creating a custom font were never standardized by ISO.


Back in the 90s, the leading Malayalam newspaper had us download an .exe font installer, maybe only for IE5.



