Coming up with actually good, creative things looks deceptively simple, but it's surprisingly difficult; in fact, it's quite a slog.
I've been working on a new data format [1] that I'd expected would take me two years to create. Five years later, I'm finally seeing the light at the end of the tunnel. The format that exists now is almost unrecognizable compared to the initial version, or even what I had after the first year of development (digging through the git repo would prove interesting).
Other than devouring every data format spec I could get my hands on, one big help was to _leave_ it for weeks, even a month, and then come back to it. That gave me a fresh perspective and let me see things I simply couldn't when I was working on it every day.
Case in point: Unicode escapes.
I wasn't happy with the current state of affairs for Unicode escapes (\uDDDD or \UDDDDDDDD), which are clunky. I wanted something better, without the zero stuffing.
My first incarnation was as follows:
* backslash
* a digit from 0-9 specifying how many hex digits follow
* that many hex digits
So for example \0 is NUL, \19 is TAB, \51f415 is dog, etc.
Sure, it works, but that length prefix really makes it hard to read.
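To make the scheme concrete, here's a rough decoder sketch in Python (my own illustration, not code from the spec):

    def decode_v1(s, i):
        # s[i] is the digit immediately after the backslash
        n = int(s[i])                            # length prefix: how many hex digits follow
        cp = int(s[i+1:i+1+n], 16) if n else 0   # zero digits means codepoint 0 (NUL)
        return chr(cp), i + 1 + n                # decoded character, index past the escape

    print(decode_v1("51f415", 0)[0])             # prints the dog emoji

Decoding it is trivial; reading it is what hurts.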
A year after that, I came up with the following:
* Initiated with \+
* The hex digits to represent the codepoint
* .
So for example \+0. is NUL, \+9. is TAB, \+1f415. is dog, etc.
That was much less confusing, since the numbers are properly isolated from the escape characters.
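The terminator turns decoding into a simple scan, as in this sketch (again mine, not from the spec):

    def decode_v2(s, i):
        # s[i] is the first hex digit after "\+"
        end = s.index('.', i)                    # the '.' terminates the escape
        return chr(int(s[i:end], 16)), end + 1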
It wasn't until months later that I finally spotted the obvious after a break:
* Initiated with \{
* The hex digits to represent the codepoint
* }
So for example \{0} is NUL, \{9} is TAB, \{1f415} is dog, etc.
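The decoder stays just as trivial, but now the escape is self-delimiting on both ends (sketch):

    def decode_v3(s, i):
        # s[i] is the '{' immediately after the backslash
        end = s.index('}', i)                    # the '}' closes the escape
        return chr(int(s[i+1:end], 16)), end + 1

    print(decode_v3("{1f415}", 0)[0])            # prints the dog emoji

And for a human, the braces visually bracket the codepoint the same way they do in Rust's \u{...} string escapes.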
And there have been so many of these moments that never would have happened if I'd pushed the format through after only 2 years.

[1] https://concise-encoding.org/
I think most innovative ideas are deceptively simple *in hindsight*, but I don't think they were obvious in the first place. I'd love to hear from a neuroscientist lurking on HN about the causality of that feeling.
I briefly looked at your specifications. They seem to provide some very useful features lacking in other formats, but I think you still have a long way to go to polish the specifications. The main issue I have with them is the lack of formal definitions for individual entities. I would recommend treating data normalization very precisely and defining every syntactical element in Backus–Naur form.[1] The specification of Unicode characters should be made independent of a specific serialization (such as UTF-8).[2] Formal definitions are extremely helpful to clarify edge cases and to implement a processor one piece after the other.
For example, instead of specifying uppercase/lowercase sensitivity in an extra section on "Letter Case", I would recommend defining the allowed character range for every item individually. In addition, it should be specified in detail how characters need to be handled by a conforming processor. Just as an example: What Unicode version must be supported? The whole code range or only a part of this version? Is a processor required to check for violations of certain rules, such as illegal case? Should or can a processor normalize case, or must it not? If it should or can, must it implement the "default algorithms for case conversion, case detection, and case-less matching" from the Unicode spec,[3] or may it, for example, deviate and support only a subset of Unicode characters?
To take up your own example, here is an edge case that is very difficult to specify verbally but can be made very clear in Backus–Naur form: Do you want to allow leading zeros for numbers? If so, infinitely many? A leading '+'? ...
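To illustrate (this is my sketch of one possible answer, not a proposal for your format), a grammar that forbids both a leading '+' and leading zeros, while still allowing 0 itself, takes only a few rules:

    <integer>       ::= "0" | <nonzero-digit> <digits>
    <digits>        ::= "" | <digit> <digits>
    <digit>         ::= "0" | <nonzero-digit>
    <nonzero-digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Here "007" is simply not derivable, and no prose about leading zeros is needed.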
It might be helpful to distinguish two phases for a conforming processor, namely checking the syntactic conformity of an entity to its Backus–Naur form (e.g., is this entity a number?) and validating a syntactically correct entity in its context according to some validation rules (e.g., is the number out of range?).
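In code, the two phases separate naturally. A Python sketch, where the port-range rule is just a made-up example of a contextual constraint:

    import re

    INTEGER = re.compile(r'0|[1-9][0-9]*')       # phase 1 grammar: no leading zeros

    def parse_port(text):
        if not INTEGER.fullmatch(text):          # phase 1: syntactic conformity
            raise SyntaxError("not an integer: %r" % text)
        value = int(text)
        if not 0 <= value <= 65535:              # phase 2: contextual validation
            raise ValueError("port out of range: %d" % value)
        return value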
[2] For example, you are defining: "A string-like array MUST contain only valid UTF-8 characters.", but when I look at "Escape Sequences", you are using Unicode code points, and not UTF-8 byte sequences. -- Your specifications may, of course, require that data exchange must happen in UTF-8 (and should say something about byte order marks). Beyond that, they should not be concerned with any particular serialization, but should only work with deserialized Unicode code points.
One thing I'm not sure about is how to define a BNF rule that says, for example: "An identifier is a series of characters from Unicode categories Cf, L, M, N, and these specific symbol characters". BNF feels very ASCII-centric...
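For what it's worth, one way to pin such a rule down is to pair the grammar with a character-class predicate defined over code points. A hypothetical Python sketch (the EXTRA_SYMBOLS set is a placeholder, since the actual symbol list isn't given above):

    import unicodedata

    EXTRA_SYMBOLS = {'_', '-'}                   # placeholder for "these specific symbol characters"

    def is_identifier_char(ch):
        cat = unicodedata.category(ch)           # e.g. "Lu", "Nd", "Cf"
        return cat[0] in "LMN" or cat == "Cf" or ch in EXTRA_SYMBOLS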