I briefly looked at your specifications. They seem to provide some very useful f...

kstenerud · on Nov 7, 2022

Hey thanks for taking the time to critique!

I actually do have an ANTLR file that is about 90% of the way there ( https://github.com/kstenerud/concise-encoding/tree/master/an... ), so I could use those as a basis...

One thing I'm not sure about is how to define a BNF rule that says for example: "An identifier is a series of characters from unicode categories Cf, L, M, N, and these specific symbol characters". BNF feels very ASCII-centric...

Archelaos · on Nov 7, 2022

Just base everything on Unicode code points.

I recommend the XML 1.0 (Fifth Edition) specification as an inspiration how to do it: https://www.w3.org/TR/xml

For example, it defines syntactically valid names (for XML tags, attribute names, etc.) as follows:

    [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

    [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
    
    [5] Name ::= NameStartChar (NameChar)*