Coming up with actually good, creative things looks deceptively simple, but it's surprisingly difficult; in fact, it's quite a slog.
I've been working on a new data format [1] that I'd expected would take me two years to create. Five years later, I'm finally seeing the light at the end of the tunnel. The format that exists now is almost unrecognizable compared to the initial version, or even what I had after the first year of development (digging through the git repo would prove interesting).
Other than devouring every data format spec I could get my hands on, one big help was to _leave_ it for weeks, even a month, and then come back to it. That gave me a fresh perspective and let me see things I simply couldn't when I was working on it every day.
Case in point: Unicode escapes.
I wasn't happy with the current state of affairs for Unicode escapes (\uDDDD or \UDDDDDDDD), which are clunky. I wanted something better, without the zero stuffing.
My first incarnation was as follows:
* backslash
* a digit from 0-9 specifying how many hex digits follow
* that many hex digits
So for example \0 is NUL, \19 is TAB, \51f415 is dog, etc.
Sure, it works, but that length prefix really makes it hard to read.
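To make the scheme concrete, here's a rough decoder sketch in Python (my own illustration, not code from the spec):

    def decode_v1(s, i):
        # s[i] is the digit immediately after the backslash
        n = int(s[i])                            # length prefix: how many hex digits follow
        cp = int(s[i+1:i+1+n], 16) if n else 0   # zero digits means codepoint 0 (NUL)
        return chr(cp), i + 1 + n                # decoded character, index past the escape

    print(decode_v1("51f415", 0)[0])             # prints the dog emoji

Decoding it is trivial; reading it is what hurts.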
A year after that, I came up with the following:
* Initiated with \+
* The hex digits to represent the codepoint
* .
So for example \+0. is NUL, \+9. is TAB, \+1f415. is dog, etc.
That was much less confusing, since the numbers are properly isolated from the escape characters.
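The terminator turns decoding into a simple scan, as in this sketch (again mine, not from the spec):

    def decode_v2(s, i):
        # s[i] is the first hex digit after "\+"
        end = s.index('.', i)                    # the '.' terminates the escape
        return chr(int(s[i:end], 16)), end + 1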
It wasn't until months later that I finally spotted the obvious after a break:
* Initiated with \{
* The hex digits to represent the codepoint
* }
So for example \{0} is NUL, \{9} is TAB, \{1f415} is dog, etc.
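The decoder stays just as trivial, but now the escape is self-delimiting on both ends (sketch):

    def decode_v3(s, i):
        # s[i] is the '{' immediately after the backslash
        end = s.index('}', i)                    # the '}' closes the escape
        return chr(int(s[i+1:end], 16)), end + 1

    print(decode_v3("{1f415}", 0)[0])            # prints the dog emoji

And for a human, the braces visually bracket the codepoint the same way they do in Rust's \u{...} string escapes.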
And there have been so many of these moments that never would have happened if I'd pushed the format through after only 2 years.

[1] https://concise-encoding.org/
I think most innovative ideas are deceptively simple *in hindsight*, but I don't think they were obvious in the first place. I'd love to hear from a neuroscientist lurking on HN about the causality of that feeling.
I briefly looked at your specifications. They seem to provide some very useful features lacking in other formats, but I think you still have a long way to go to polish the specifications. The main issue I have with them is the lack of formal definitions for individual entities. I would recommend treating data normalization very precisely and defining every syntactical element in Backus–Naur form.[1] The specification of Unicode characters should be made independent of a specific serialization (such as UTF-8).[2] Formal definitions are extremely helpful to clarify edge cases and to implement a processor one piece after the other.
For example, instead of specifying uppercase/lowercase sensitivity in an extra section on "Letter Case", I would recommend defining the allowed character range for every item individually. In addition, it should be specified in detail how characters need to be handled by a conforming processor. Just as an example: What Unicode version must be supported? The whole code range or only a part of this version? Is a processor required to check for violations of certain rules, such as illegal case? Should or can a processor normalize case, or must it not? If it should or can, must it implement the "default algorithms for case conversion, case detection, and case-less matching" from the Unicode spec,[3] or may it, for example, deviate and support only a subset of Unicode characters?
To take up your own example, here is an edge case that is very difficult to specify verbally but can be made very clear in Backus–Naur form: Do you want to allow leading zeros for numbers? If so, infinitely many? A leading '+'? ...
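To illustrate (this is my sketch of one possible answer, not a proposal for your format), a grammar that forbids both a leading '+' and leading zeros, while still allowing 0 itself, takes only a few rules:

    <integer>       ::= "0" | <nonzero-digit> <digits>
    <digits>        ::= "" | <digit> <digits>
    <digit>         ::= "0" | <nonzero-digit>
    <nonzero-digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Here "007" is simply not derivable, and no prose about leading zeros is needed.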
It might be helpful to distinguish two phases for a conforming processor, namely checking the syntactic conformity of an entity to its Backus–Naur form (e.g., is this entity a number?) and validating a syntactically correct entity in its context according to some validation rules (e.g., is the number out of range?).
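In code, the two phases separate naturally. A Python sketch, where the port-range rule is just a made-up example of a contextual constraint:

    import re

    INTEGER = re.compile(r'0|[1-9][0-9]*')       # phase 1 grammar: no leading zeros

    def parse_port(text):
        if not INTEGER.fullmatch(text):          # phase 1: syntactic conformity
            raise SyntaxError("not an integer: %r" % text)
        value = int(text)
        if not 0 <= value <= 65535:              # phase 2: contextual validation
            raise ValueError("port out of range: %d" % value)
        return value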
[2] For example, you are defining: "A string-like array MUST contain only valid UTF-8 characters.", but when I look at "Escape Sequences", you are using Unicode code points, and not UTF-8 byte sequences. -- Your specifications may, of course, require that data exchange must happen in UTF-8 (and should say something about byte order marks). Beyond that, they should not be concerned with any particular serialization, but should only work with deserialized Unicode code points.
One thing I'm not sure about is how to define a BNF rule that says, for example: "An identifier is a series of characters from Unicode categories Cf, L, M, N, and these specific symbol characters". BNF feels very ASCII-centric...
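For what it's worth, one way to pin such a rule down is to pair the grammar with a character-class predicate defined over code points. A hypothetical Python sketch (the EXTRA_SYMBOLS set is a placeholder, since the actual symbol list isn't given above):

    import unicodedata

    EXTRA_SYMBOLS = {'_', '-'}                   # placeholder for "these specific symbol characters"

    def is_identifier_char(ch):
        cat = unicodedata.category(ch)           # e.g. "Lu", "Nd", "Cf"
        return cat[0] in "LMN" or cat == "Cf" or ch in EXTRA_SYMBOLS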