NestedText, a nice alternative to JSON, YAML, TOML (nestedtext.org)
302 points by nestedtext on Oct 3, 2020 | 284 comments


> data type does not change based on seemingly insignificant details (2, 2.0, 2.0.0, “2”)

…only because NestedText does not support numeric types at all. That seems like throwing out the baby with the bathwater.


I think avoiding numeric types is a good decision. It tends to eventually cause problems when one implementation converts numbers to doubles, another to either doubles or longs depending on whether they have a . or e, and another which converts them to bignums (or passes them as strings to the caller).

One should remember that any sane application will be parsing the config file into internal data structures and validating it anyway so it gets little benefit from the numbers being already “parsed”.

There are also issues when something looks numeric but doesn’t parse (eg 1.2.3, 3/2, 12in, 4h30m2s, 2:30, 2020-02-29, etc). One way to deal with these is a tokenisation rule like in Common Lisp: if it is a valid number syntax then treat it as a number, otherwise it’s a symbol, but this can lead to issues (eg you would need to know that when your number needs more than float precision or otherwise doesn’t follow the rules, it should be in quotes. It seems crazy to pass that detail on to the poor sod who has to write the config file).
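A rough Python sketch of that kind of rule (the helper name is made up) shows how it bites:

    def loose_scalar(token):
        # Lisp-style rule: if it parses as a number, it's a number;
        # otherwise it stays a string.
        for cast in (int, float):
            try:
                return cast(token)
            except ValueError:
                pass
        return token

    loose_scalar("2020-02-29")  # stays a string
    loose_scalar("1.2.3")       # stays a string
    loose_scalar("1e3")         # silently becomes the float 1000.0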


> One should remember that any sane application will be parsing the config file into internal data structures and validating it anyway so it gets little benefit from the numbers being already “parsed”.

The benefit of standard numeric and boolean types is that different tools can exchange data in a well-understood way.

Getting rid of yaml's 30 ways to write "true" and "false" by making everything a string just means that you now have 30 tool-specific ways to write "true" and "false".

The "everything is a string" approach already exists in shell scripts and TCL and it's not really that great.


Except YAML is straight-up wrong at times, unless you know all the edge cases. I just learned unquoted NO is coerced to False. A classic case of leaky abstraction/"bad magic".
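For the record, the classic demonstration with PyYAML, which resolves scalars per YAML 1.1:

    >>> import yaml
    >>> yaml.safe_load("countries: [GB, IE, FR, DE, NO]")
    {'countries': ['GB', 'IE', 'FR', 'DE', False]}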

Unless you have actual type annotations/tags (eg xml, jsonld, graphql), everything IS a string. There are no assumptions otherwise.


YAML 1.2 removed most of the nonsense booleans and only recognizes true | True | TRUE | false | False | FALSE.

https://yaml.org/spec/1.2/spec.html#id2805071

Of course, even though YAML 1.2 is a decade old, there are still many parsers that accept YAML 1.1.


Huh, I had no idea, thanks for the heads up!


That seems like a pretty sensible approach - eliminates the "Norway problem" altogether. I'd guess you have to use the "%YAML 1.2" directive in all documents in order to get it though...


You don’t need this directive to get YAML 1.2 parsers to parse your document as YAML 1.2; that’s the default. You only need it to instruct YAML 1.1 parsers to raise a warning (and I guess for parsers of future versions of YAML to fall back to 1.2 behavior, if that ever happens).

https://yaml.org/spec/1.2/spec.html#id2781553

https://yaml.org/spec/1.1/#id895631


> Unless you have actual type annotations/tags (eg xml, jsonld, graphql),

YAML has actual type annotations (tags).


Wait, really? *scrambles to check* woah...

I've never seen typed yaml, this is wild.

    negative: !!int -12
    zero: !!int 0
    positive: !!int 34
Can't say I love the notation, but indeed those are type annotations. I guess neither "yaml type hints" nor "yaml type annotations" is the right query. Had to search explicitly for "yaml tags".


The neat thing is that you can use tags to represent custom data structures or functions. For example, AWS Cloudformation uses YAML tags as a shorthand for template functions [0].

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...


Depending on the parser you can also use them to call arbitrary code. This used to be the default behavior of `pyyaml.load`.
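A sketch of why that was dangerous (the payload is from memory of PyYAML's !!python tags; current PyYAML only does this via the explicitly named unsafe_load):

    import yaml

    # Resolving Python-specific tags means parsing untrusted input
    # can execute arbitrary code:
    yaml.unsafe_load("!!python/object/apply:os.system ['echo pwned']")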


Types are !!screaming at the reader. Don't want to start blameshifting, but is yaml supposed to be a simple human-readable format?


It is screaming with double negations!


It's not screaming without double negations!!


I would hazard to say that any magic is bad in engineering. (Sorry Perl fans!)


There isn’t really any magic in Perl, just lots of unfamiliar lexicon if you come from more traditional programming languages. Perl’s problem is that its history is rooted in command-line usage, so there’s a tonne of inherited reserved single-character variable names and suchlike that are optimised for keystrokes rather than readability. However, you can certainly write Perl programs that don’t follow those older conventions and look more like a modern language.


> There isn’t really any magic in Perl

http://p3rl.org/guts#Magic-Variables

    perl -MDevel::Peek=Dump -mTie::Scalar -e'
        //g; Dump $_; tie $c => "Tie::StdScalar"; Dump $c; Dump \%ENV
    ' 2>&1 | grep MAGIC


That kind of magic in Perl has a different kind of meaning than "magic number/bool/string parsing" in YAML.

Perl tied-variable magic just means there are (effectively) getter and setter properties attached to the variable. "Magic" is just the name that was chosen in the implementation, and it stuck.

It's used to implement variables with special, automatic meanings, like $$ for "current pid" and $! for "last error".

It's also used to implement variables with user-defined behaviours on access, which is quite handy for a lot of abstractions.

A lot of modern languages support both of these things, because they are useful, but it's not called magic in those languages, it's called something like "watchers", "proxies", "getters and setters" or "hooks".

No, the criticism of YAML-style "magic" is that it leads to entirely surprising behaviour from innocuous input. Perl magic is not that kind. If you're using a special variable, you already know why.


That’s just playing around with the same reserved variables I spoke of before, and there is nothing magical about them aside from their silly name. In fact the opposite is true: they’re actually predictable and well documented. They just happen to have terse names as a throwback to command-line usage (eg you wouldn’t call $1 a magic variable in Bash because it happens to work the same as ARGV[1]).

In fact in Perl, you can opt for a longer, readable lexicon over the terse single-character variables; and that’s literally how modern Perl should be written.

Whereas the problem described with YAML is that it can automatically alter your data based on what the parser “thinks” the data should represent. Which is generally what people mean when they talk about “magic” in IT: systems that don’t honour your input and instead automatically convert it into something else. Perl doesn’t do this, even in spite of it looking like executable line noise to many.


I think "magic" is often just abstraction. And while abstraction is certainly necessary to speed up work and declutter the brains of the end users, too much abstraction takes control away and bad abstraction takes the wrong decisions in your name.

If you take a look at how string handling works in most programming languages there is a lot of "magic" going on there. Which isn't necessarily a bad thing, because most programmers don't want to deal with the intricacies of strings unless they really have to. The key is that this magic doesn't get in your way and doesn't do too-magical things nobody ever asked of it.


technology without magic? nonsense


Insert Arthur C. Clarke quote here.


That was the GP's point: YAML contains a lot of insanity, but let’s not throw out what it got right along with everything it got wrong.


Encoding meaning in whitespace is an abomination. That’s all I have to say about YAML.


Encodingmeaninginwhitespaceisanabomination.That’sallIhavetosay.


Encoding meaning in whitespace is an abomination. That’s all I have to say about Python.


> The benefit of standard numeric and boolean types is that different tools can exchange data in a well-understood way.

In Python:

    >>> import json
    >>> json.loads('{"x": 9007199254740993}')
    {'x': 9007199254740993}
In my browser's JavaScript console:

    > JSON.parse('{"x": 9007199254740993}')
    {x: 9007199254740992}
(Consider what happens if you try to send a tweet's ID, a perfectly normal number like 205052027259195393, through JSON. Or if you try to serialize a stack trace on a 64-bit system, where addresses are also perfectly normal numbers.)
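Python's json module at least lets the caller intercept number parsing, e.g. to keep big ids as strings end to end:

    >>> import json
    >>> json.loads('{"id": 205052027259195393}', parse_int=str)
    {'id': '205052027259195393'}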


I see what you are getting at and it is worrisome indeed; however, why would one encode an id as a simple number?


Why wouldn't you? It's a number. It's also sorted (up to about one second of inconsistency between Twitter worker processes on different machines), and you want to sort them by numeric sort, not by lexicographic sort as you would with strings.

I mean, there's a pretty clear argument here: Twitter themselves used to return these numbers as numbers in the API until they realized they were about to hit this problem. https://developer.twitter.com/en/docs/twitter-ids


Using numeric types as ids probably leaks into other areas, where being numeric is then assumed. When you want to change it at some point, you suddenly have to be very careful. A string could simply contain a new kind of id, so less refactoring effort would seem to be required. On the other hand, switching from numeric to strings might give you compiler errors where types do not match, so perhaps it might even make refactoring simpler.


You can always add a second or third key. Stop messing with the primary key. Many systems have a numeric primary key and then only expose a hash or UUID to the public.


>why would one encode an id as a simple number?

That's an incredibly weird complaint when the real problem is that javascript's JSON.parse doesn't use BigInt for large numbers.


I have very little understanding of JSON, so why would this happen?


It's more a JS thing: The lone JS number type is double-precision floating point, meaning beyond a certain range there are integers that cannot be accurately represented as a JS number.

The JSON standard doesn't place restrictions on size or precision of numbers, instead just noting that implementations can vary their treatment of and limits on numbers. While JS uses doubles for all numbers, many other languages emit an integer type for a JSON integer. So, once you go beyond the range where a double can accurately represent all integers, you run the risk of a mismatch in how the number is interpreted by different languages parsing the same JSON.

Of course the spec also allows you to create way too big or too precise numbers that would be problematic in most languages as well; it's just that this is a somewhat common bugbear.

I wouldn't necessarily call it a flaw in JSON though, more an issue with JSON.parse or really just a fact of life when dealing with numbers in JS. Alternatives to the built in JSON.parse exist to read large integers as strings or bigints.


JavaScript uses double as its number type. There's no such thing as integer or floating-point numbers per se, only the so-called Number.

The described issue is not a problem with JSON but with the engine that parsed it and the language behind the parser. Any config format will eventually hit the same result and the same issue.

So forcing the programmer to parse every single piece of data for the sake of "it's his responsibility" doesn't apply here.

I also disagree that it is in any way the programmer's responsibility to create a standardized way of parsing everything. This format gives you nothing but indentation, so you are forced to write documentation for every field: what type it is and what kind of values it takes. Lots of extra work for nothing when you could have any other format.


It's a Javascript problem. Integer precision caps out at 9007199254740991 (2^53 - 1), and beyond that you need BigInt. As long as you're working with BigInt literals, it'll be accurate, but when you convert back and forth you can lose precision.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


Python promotes to bignum automatically.

A JavaScript number is floating point:

    Number.MAX_SAFE_INTEGER
    // 9007199254740991
    Number.MAX_SAFE_INTEGER + 1
    // 9007199254740992
    Number.MAX_SAFE_INTEGER + 2
    // 9007199254740992
no automatic promotion to BigInt

    9007199254740991n + 2n
    // 9007199254740993n


The example uses numbers outside the range of JS numbers (too many significant digits), so when the JSON is evaluated as a JS number some of the least significant bits are rounded. Personally, I think it is a mistake to use values in JSON that can’t be represented in JS, but the standard doesn’t explicitly forbid it.


JavaScript numbers are double-precision floating point numbers.


The solution to such problems is a better standard for the handling of numbers, not dropping support for numbers altogether.


> The benefit of standard numeric and boolean types is that different tools can exchange data in a well-understood way.

I don't think that's what this tool is for. This tool is for humans to read and edit data. That's a different use case from automated programs exchanging data, for which I agree you should be using standardized numeric and boolean data types and not making everything a string. But how that standardized data gets determined from data that humans enter should be up to the individual application.


Agreed. Serialization is not really possible with this tool, due to the following:

> A key that requires quoting must not contain both single and double quote characters.

You can't really serialize user data with that restriction.


It is still useful for humans to have a well-understood way to input typed data.

If you make everything a string then the interpretation of "no" as boolean true or false is left to each tool, and there are even tools which have different interpretations of "yes/no" for each field.


What is the use of this then? Humans don’t need structured data to exchange information and if this is insufficient to communicate it to the machine then it won’t be a great human to machine format either.


> Humans don’t need structured data to exchange information

Maybe we don't need it, but it often helps.

Also, this data format isn't necessarily for humans to exchange data with other humans, but for humans to give data to applications in a format that's much easier for humans to use.

> if this is insufficient to communicate it to the machine

Not at all. Each specific application can easily parse this data format according to its own needs. What this data format doesn't specify is a single translation into application data that is the same for all applications. But applications don't need or want that, because they have different use cases.


I see. That makes sense. Now that I think about it, if it's in front of some database or storage that only handles strings, it would work well too.


I think this is an “input format” vs “serialization format” issue. NestedText is meant for user input, so it’s focusing more on the UX of the syntax. In that case, having to parse strings is a reasonable task for developers to take on.


If NestedText is meant to be written by non-programmer humans, the details of null, undefined, 0, false, "NO", nil, or whatever else, would likely also be lost on them.

Most likely if their input is meant to be machine interpreted they would need to be trained to provide specific inputs anyway. I like that NestedText doesn't hide that problem. It lets the user organization decide how it wants to manage that problem, and what symbols or words are understood by the people authoring the files.


>you now have 30 tool-specific ways to write "true" and "false"

ISO8601 joins the chat.


> I think avoiding numeric types is a good decision.

Only if this format is intended for use-cases that never need to deal with numbers.

> One should remember that any sane application will be parsing the config file into internal data structures and validating it anyway so it gets little benefit from the numbers being already “parsed”.

That statement couldn't possibly be more wrong.

Number parsing (and encoding!) is a decidedly non-trivial problem. You need to concern yourself with -- at a minimum -- all of the following:

- Unsigned 64-bit numbers.

- A series of digits that would be bigger than a 64 bit whole number. Convert to float? Truncate in some way? Error?

- NaN

- Infinity

- Negative zero

- Denormal numbers.

- Differentiating between decimal/currency types and floating point numbers. Not all decimal values can be exactly represented as floats!

- Efficiently encoding floating point to use the minimum digits without losing precision.

- Parsing those minimal numbers with perfect "round-tripping".

- Doing the above efficiently.

- Securely too! Efficient parsers cut corners on sanity checks. I hope you fuzzed your parser...

The above can easily amount to many kilobytes of extremely complex code. Look up "ryu" as an example of what Google came up with to make JSON number parsing reasonably efficient.
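To make a couple of those bullets concrete in Python, whose float repr uses the same shortest-round-trip idea:

    >>> import json
    >>> json.dumps(0.1 + 0.2)     # minimal digits that still round-trip
    '0.30000000000000004'
    >>> json.loads(json.dumps(0.1 + 0.2)) == 0.1 + 0.2
    True
    >>> json.dumps(float("nan"))  # accepted by default, yet not legal JSON
    'NaN'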

Meanwhile, reading a fixed-length number from a binary format can be done in a single machine instruction. One. It might not even take an entire CPU clock cycle! Okay, two, if you need to bounds-check your buffer, but there are ways to avoid that.

Afterwards, the bounds check is again literally just two machine instructions in complexity. That's not the difficult bit!

The difficult bit is the parsing.


You’ve given lots of examples of things that make parsing numbers difficult but I don’t see why they are relevant to a config file written by humans. I think it makes sense to have the number parsing owned by the thing which cares about the number format.

One example you provide is decimals for currency values but I claim you would want such values to look like $1234 in config files so that when they are reviewed or written, the person reading the file knows they are looking at a dollar value and can be concerned if it is too large.

I’m not suggesting that applications write their own number parsing. Just do uint64::parse or parseInt or Double.of_string, or whatever else you need to access your language’s number parsing routines.


Is the format written and read by a human?

> Just do uint64::parse or parseInt or Double.of_string, or whatever else you need to access your language’s number parsing routines.

Okay, so the computer is doing the parsing.

Those functions are notoriously inconsistent in their behaviour, particularly across different programming languages. If you're not careful, you'll end up accidentally using the internationalised versions of those functions. Even if you're careful, other people won't be.
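Python's own built-ins make the point; whether any of these should count as a valid config value is exactly what a format spec has to pin down:

    >>> int("١٢٣")        # Arabic-Indic digits parse happily
    123
    >>> float("1_000.5")  # underscores allowed since Python 3.6
    1000.5
    >>> float("nan")      # also accepted
    nan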

Remember, data formats are for interchange. They have to be language agnostic. They have to be well-defined, and it should be possible to write a parser for them without having to guess at the precise details.

The harmful consequences of the Robustness Principle are now well-recognised in computer science: https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03....

Some things need to be done properly, or not at all.


If you go fully against the Robustness Principle, you lose the reason to use textual formats as well, since they are designed to be forgiving of human errors in input and to catch them in the syntax.

And - it is certainly OK in many instances to have fixed-width, fixed byte-order binary encoding as the format's basis. It comes with the twin downsides of wholly different categories of errors cropping up, and with the lack of a universally agreed upon tool for human entry.

Perhaps text was a fashion, though. I definitely have had thoughts in that vein lately. And in that case we shouldn't always be rushing to use it as the source of truth when we have many good, machine-level agreements about numeric formats.


I wrote some config for my application, which knows how to read it. Why do I want some other application in some other programming language to read it too?

I am far more worried about localisation issues than language issues. If you are storing something central to multiple applications I'd argue a text file is the wrong tool


But which of these are problems in configuration files written by a human? That is the aim of the format. Moreover, in applications where there could be issues, it would most certainly be tied to very specific fields, and you would want specific application logic to handle that field. Now if people misuse it as a data exchange format, then yes, I agree with you, but at that point just use a binary format instead.


> That is the aim of the format.

That doesn't matter at all. The author's aims will be ignored if this format is used for anything even vaguely important. Eventually it'll need tooling to both read and write it.

DevOps pipelines, applications with GUIs, or something will need to both parse and generate this format in a consistent way.

There is no such thing as a human-write-only format in widespread use.

Even programming languages are regularly generated by tools such as RPC API codegen tools, LINQ-to-SQL and the like.


In my opinion, the best solution to these issues is to:

1. Declare numbers as numbers in the configuration language. E.g. "decimal(1e1000)".

2. Parse declared numbers with a lossless format like Python's decimal.Decimal.

3. Let users decide at their own risk if they want to convert to a lossy format like float.
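A minimal sketch of steps 2 and 3 with Python's stdlib decimal module:

    from decimal import Decimal

    raw = "2.30"               # the literal text from the config file
    exact = Decimal(raw)       # step 2: lossless, keeps value and precision
    assert str(exact) == raw   # the text round-trips exactly
    lossy = float(exact)       # step 3: conversion is the user's choice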


Then you just roll your preferred serializer on top of the format in a properly composable form.


I am of the opposite view: as long as there are no edge cases, supporting more types in data languages is good; it leads to failing faster.

Too many types could get overwhelming, but I like where Amazon's Ion is [1]. It actually supports multiple number types, with decimal being the default for values with a dot.

> you would need to know that when your number needs more than float precision or otherwise doesn’t follow the rules, it should be in quotes

Not really. The configuration value should either be a number or not, which is determined by the application reading the config. As a config writer you only care to make the type match (so, in json, if the application uses number you make sure you use number, and if it expects a string you use that)

(disclaimer: I work for amazon, but have nothing to do with Ion other than having used it. Opinion is my own, not my employer's, yadda yadda)

[1] http://amzn.github.io/ion-docs/


I wonder if you could just include explicit type information?

  name(str): Dave
  age(int4): 22
  dob(date): 2020-02-01
  photo(base64): TWFuIGl....
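On the application side that could be consumed with a small dispatch table; a hypothetical Python sketch (parse_line and the type names are made up):

    import base64, datetime

    CASTS = {
        "str": str,
        "int4": int,
        "date": datetime.date.fromisoformat,
        "base64": base64.b64decode,
    }

    def parse_line(line):
        head, _, value = line.partition(":")
        key, _, kind = head.rstrip(")").partition("(")
        return key, CASTS[kind](value.strip())

    parse_line("age(int4): 22")  # -> ('age', 22)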


That's not the only way to deal with them; you can specify types, like BSON does.

The nice thing about that is it solves the problem, rather than hoping it doesn't matter or assuming each program's validator will think to note all of the possible data types not natively compatible with the program. If a u32 is defined in the file and you've only got doubles to work with, it's a given you'll have to deal with it in your tool-specific validation. For everyone else it's well defined.

The downside is it's a bit more verbose and if you have all of that info already it's pretty easy to jump to just using a binary format which will be more efficient anyways.


My ideal config language would have unambiguous syntax for at least 64-bit signed integers and doubles (so, for example, the spec for the language requires 56 to be parsed as an integer and 56.0 as a double). Additional types would be ok too, as long as the syntax is unambiguous and obvious.


Both of those are true for TOML.

The spec requires that a full 64 bits of signed integer be parsed and understood, that hexadecimal, octal, and binary values be integers, and that floating point values be parsed as doubles.

It doesn't support hexadecimal float, however, which is a pity: having a guaranteed bit-identical format is a nice affordance.


> I think avoiding numeric types is a good decision. It tends to eventually cause problems when one implementation (...)

It seems you're mixing up the language definition with implementations that try to follow the language definition.

If different implementations have different results then either they are buggy or the language has some important holes in the specification.

Either way, the solution to this problem is not less validation.

> One should remember that any sane application will be parsing the config file into internal data structures and validating it anyway so it gets little benefit from the numbers being already “parsed”.

That's the whole point, isn't it?

I mean, if you already acknowledge the fact that this parsing and validation is a basic requirement, why handle it as an afterthought and force developers to add their own hand-rolled, absurd and unnecessary type checks and type conversions?

Wouldn't it simply be easier to let the language and the parser do that already?

I mean, no one ever complained that JSON had string types. In fact, one of the main complaints about JSON is that it doesn't support enough types, such as timestamps.


In the design of YAML, Ingy made the case that we shouldn't have types for scalar values, that they should all just be strings. His argument was that each application knows what it needs and should have the ability to direct how those scalars are processed. If the format needs to be standardized across applications, one can use a schema system layered on top. In retrospect, I think he was right and I was wrong. For example, Fast Healthcare Interoperability Resources (FHIR) serialized as JSON treat numeric values as decimals, e.g. `2.3` is not a floating point value. Moreover, since parsing numbers is slow, it should be deferred till you actually need to interpret the numeric value.


Indeed.

It's less of a problem if you're using it for configuration files, where the program knows which keys' values need to be cast to an integer or float.

But it seems disastrous if you wanted to use it for storing or transmitting data, above all between applications. You're immediately throwing out the possibility of being able to serialize and then deserialize data in basically any programming language.

I shudder at the idea of an API that accepted NestedText, where I'd need to worry about whether my floating-point output was compatible with its floating-point string parser. Yikes. I want the serialization format to handle that. Isn't a major criticism of JSON that it doesn't have a built-in datetime representation?


To be fair, this seems to be designed specifically for config files. Just as JSON should be used for data transmission but not for config, I could imagine, this should be used for config but not for data transmission.


What if someone inputs 2.2.5 when the code is expecting an int? Or “abc”? It seems like it’s pushing all the config validation into user code which sucks.


The code already has that problem. Certainly, if you're using YAML, a user could type 'version: 2.2.5' where you're expecting a major version number (an int), and all of a sudden your code is passed a string instead (or a string-flavored variant). You can imagine the same sort of problem in JSON too, usually from someone leaving off a quote where you're expecting a string. NestedText's philosophy seems to be, you're going to need your code to handle this anyway, we'll just pass you a string in all cases and it's up to you to convert it to an int on your own and validate it.

Frankly, in most languages, this is better because you don't have the types of objects randomly change based on user input. (In a few languages, with a few libraries, you can specify the type of the document to the parser and have it fail to parse the entire document if it can't deserialize to the right type, in which case this is a little weaker. But you can still do that with NestedText, just one step after the parser - have your own function that takes a ComplexStructure<..., String, ...> and returns a ComplexStructure<..., int, ...> or throws an error.)


> The code already has that problem.

Only in untyped or dynamically-typed languages. But even in JavaScript one may write +obj.version instead of obj.version to make it numeric. Evading this is a straight way to hell.

In a statically typed one, the conversion is typically generated from a description, and type checking applies right at read time.

The problem with a vaguely specified format shows up in simpler cases. Shall we accept 45x as a number (and what will its value be, 45 or 0)? 045? 045x? What date is 1/2/3, or 1-2-3? And so on.


In a statically-typed language, this is an entirely reasonable thing to want, but also, existing languages like JSON and TOML don't give you that either:

- There's no way to say that you want an integer; you get a floating-point value.

- See https://news.ycombinator.com/item?id=24676484 , you can't reliably accept integers over 2^53 without taking them as strings.

- Someone can always specify something of the actual wrong type. (Imagine changing YAML "version: 1.9.1" to "version: 1.10". You can't just stringify 1.10, you'll get "1.1"!)

So, in a practical data format, the schema for your document needs to say something like "This is a number, which must be an integer between 0 and 2^16" or "This is a string, make sure to quote it" or whatever, and a generic statically-typed JSON- or YAML-parsing library isn't going to handle that for you. And telling your users "the input format is JSON" doesn't answer that question: you must make it explicit to users.

Fortunately, you can handle it just fine in a statically-typed language in one of two ways. One is to accept an object from your parser that consists of variant types and pass it through your own function that validates it against a schema, and then either returns a more-restrictively-typed object or throws an error. Such a function could easily do string conversion too if given NestedText input, as I mentioned. The other is to pass some information into your parser saying, don't act like a generic JSON/YAML parser, instead interpret these particular fields in this particular way and accept only things with this structure. If you're doing that, you can easily tell the parser to use this particular string-to-integer function on the strings in NestedText and then return an appropriately-typed object containing an integer to you.
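For a single field, such a validating function might look like this in Python (a hypothetical helper, using the range from the example above):

    def require_int(value, lo=0, hi=2**16):
        # the parser hands us a string; validate and convert explicitly
        try:
            n = int(value, 10)
        except (TypeError, ValueError):
            raise ValueError(f"expected an integer, got {value!r}")
        if not lo <= n <= hi:
            raise ValueError(f"{n} is outside [{lo}, {hi}]")
        return n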


> Shall we accept 45x as number (and what it value will be, 45 or 0)? 045? 045x? What date is 1/2/3, 1-2-3? And so on.

I think the point is that the answers to those questions may well be application-specific. In which case it is better to not bake them into the file format.


You wouldn't use this for arbitrary object serialization. Lists can only contain strings, not other lists or dictionaries.

As for stringification, JSON's data types are mismatched with pretty much everything that isn't JavaScript to some degree. If you need to serialize a 64-bit integer to JSON, you serialize it as a string because the parser on the other end is probably going to try to parse it as a double-precision floating point number. Once you've started serializing numbers as strings anyway, it's not too far to "serialize every scalar as a string".


Well, there's nothing about the JSON spec that says you can't serialize 64-bit integers to JSON. The grammar allows unlimited precision. Implementations may vary.


> Lists can only contain strings, not other lists or dictionaries.

I don’t think this is true. None of the examples have lists containing non-string objects, but the documentation doesn’t seem to draw a distinction between lists and dictionaries wrt what can be placed in them.

(Both lists and dictionaries are initially described as only containing strings, and later this description is expanded to include nesting; this counterintuitive arrangement may explain the confusion.)


Ah ha, got it. The page never really makes it clear that this is for configuration files only, not serialization.

Indeed, not being able to have lists of dictionaries, or lists of lists, is very restrictive. Seems to be for very simple configurations only. E.g. a set of preferences, but not a set of monitor calibrations. (Inventing arbitrary dictionary keys seems pretty hacky.)


So, this is basically YAML, "but better". I can repeat once more that "easily understood and used by both programmers and non-programmers" is an unapologetically stupid concept that can never succeed. So I see how all of this will sound all too familiar to anybody with a little experience, which makes them automatically dismiss this YAYAML.

But YAML is really quite complicated, and JSON (which shouldn't be used for config files at all) and TOML (which I love and wish it would gain more popularity) aren't exactly alternatives to YAML. So, I would be actually totally ok with "YAML, but better", as a way to deprecate YAML.

Now, it is clear from the start that this cannot deprecate YAML, because it doesn't even have booleans and numbers. But, surprisingly, I can accept this as well: ok, let's just assume that being good at dealing with strings may be enough.

The problem is, it isn't clear at all from the docs, if this is better than YAML at anything. It raises dozens of questions. I'll start with the most basic ones (using [] as a wrapper/delimiter): how do I represent values [ a], [a ], ["a"] and [""] in this file format?


>basically YAML, "but better"

That was my intention behind this, too:

https://github.com/crdoconnor/strictyaml/

The general structure of YAML is fine I think but its feature set grew a little bit out of control.

The "cleanliness" of the format leads to one of its inherent weaknesses - syntax can't be used to encode type information so you either need a schema to encode type information (strictyaml approach) or have magic conversions (yaml approach) or to assume strings (strictyaml w/o schema/nestedtext).

The interesting thing I discovered about schemas building this is that it kind of pays to make them extensible and build them in a turing complete language. Schema validation done using a non-turing complete language (e.g. jsonschema) allows for cross language usage but it ends up being a kind of blunt object.
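For a flavour of the schema approach, as far as I recall strictyaml's documented API:

    from strictyaml import load, Map, Str, Int

    schema = Map({"name": Str(), "age": Int()})
    load("name: Dave\nage: 22", schema).data
    # -> {'name': 'Dave', 'age': 22}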


What kind of values are [<whitespace>a] and [a<whitespace>] supposed to be? They look like typical YAML syntax traps to me.

Why should JSON never be used for configuration? It is sufficient for declaratively expressing anything I have encountered. Do we really need references or other stuff from YAML? For configuration this seems unnecessary, provided that the program which interprets the result of parsing the JSON is well written.


> What kind of values are [<whitespace>a] and [a<whitespace>] supposed to be?

I don't understand the question. They are supposed to be exactly that, [<whitespace>a] and [a<whitespace>]. I assure you, I've encountered many situations where whitespace at the beginning or the end of the value is actually meaningful, for reasons you (the creator of the app) have no control over.

> Why should JSON never be used for configuration

Many reasons, actually, but the most important (IMO) being that the original JSON specification doesn't support comments, nor do most actual parser implementations. A configuration file that doesn't support comments is trash and causes very real inconvenience for users. Using additional key values for comments (even if such an atrocity doesn't bother you conceptually) isn't a solution in many cases (for example, when your intention is to comment out a list item).


> original JSON specification doesn't support comments

Just for the record, JSON initially had comments and they were later removed according to Douglas Crockford.


It is only now that I realize that [<whitespace>a] and [a<whitespace>] are really supposed to be one string inside a list. However, I already see problems with this kind of syntax:

[<whitespace>a,<whitespace>b]

How will this be interpreted/parsed? Will there be a whitespace before "b" after parsing? That would mean that I am not able to visually separate list elements more clearly by adding a whitespace between them, which is widely considered good practice in programming languages, for readability.

The next thing is that whitespace on the same line is added to the string, but what about a list defined across multiple lines? There we do not add the line breaks and indentation to the string. It's not consistent in this way.

So I personally would never write a string like that. I would always make use of quotes in such situations, and probably in YAML in general, simply to make it clear that I do wish to have the leading whitespace in the string, and that it is not simply a typo resulting from removing a former first element from the list.

JSON is limiting, but for configuration I think its kind of limitations are often good. With comments in JSON I am still not sure, because sometimes I'd like to write them there, but would not like to include another dependency only to be able to parse away the comments from JSON. Then I'd better write good docs elsewhere.


Or you can do it in this style, but yeah, good point. I can see now why a lot of Apple config files are still XML. https://www.freecodecamp.org/news/json-comment-example-how-t...


JSON lacks comments and will fail for a missing or extra comma, so it's not great for configuration written by humans.

You can use HJSON which is the json with comments. It's fully compatible with json so easy to introduce into anything that does json. https://hjson.github.io/


Failing with an extra comma also makes it harder than necessary to write JSON by machines.


JSON5 also supports comments and multiline strings with `\`-escaped newlines: https://json5.org/

Triple-quoted multiline strings like HJSON would be great, too.


I'd rather take "extra comma" failure than "extra space" space failure. First one can be caught by any IDE, second one will take you a couple of minutes to find out (when building CI for example).


[<whitespace>a] could be a markdown string that begins with code.


From "The description of YAML in the README is inaccurate" https://github.com/KenKundert/nestedtext/issues/10 :

> I will mention something else. The section about the "Norway problem" is not quite accurate. Some YAML loaders do in fact load no as false. These are usually YAML 1.1 loaders. YAML 1.2's default schema is the same as JSON's (only true, false, null and numbers are non-strings).

> Any YAML loader is free to use any schema it wants. That is, no loader is required to load no as false. Good loaders should support multiple schemas and custom schemas. The Norway problem isn't technically a YAML problem but a schema problem.

> imho, YAML's biggest failing to date is not making things like this clear enough to the community.

> Note: PyYAML has a BaseLoader schema that loads all scalar values as strings.
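In other words, with PyYAML the two behaviours are one loader argument apart:

    >>> import yaml
    >>> yaml.load("country: NO", Loader=yaml.BaseLoader)
    {'country': 'NO'}
    >>> yaml.safe_load("country: NO")  # SafeLoader keeps the 1.1 resolution
    {'country': False}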


I have an even stupider question: How do I have a key with a colon at the end?

    this:: right
    this: left
Will my program be able to tell its right from its left?


> I'll start with the most basic ones (using [] as a wrapper/delimiter)

So far as I can tell, this doesn’t use [ or ] in its syntax at all, so all the values you give would be represented exactly as strings without any problem.


> So far as I can tell, this doesn’t use [ or ] in its syntax at all

Yes, that's why the GP chose those characters to delimit their example strings. I'll try again using ` characters to delimit the example strings.

> so all the values you give would be represented exactly as strings without any problem.

But what would those string representations hold? If I parse the below strings in a javascript context, what would I get?

    ` a`, `a `, `"a"` and `""`
Do I get this?

    " a", "a ", "\"a\"" and "\"\""
Or do I get this?

    "a", "a", "a" and ""
Or do I get something inbetween?


OK, I got it. I still don't see the problem. The examples include strings starting with whitespace (One of the multi-line strings), so that's not a problem. Single line strings start with the first character after the colon and a space. If that happens to be a space character, so be it. Strings are terminated by a newline, so trailing spaces aren't a problem. Quote characters are just characters with no special meaning in this format, so those aren't a problem either. Unless the library implementing this format has a bug, there shouldn't be any issues.


You know what I want? Schemas. And clear error messages.

I want to know beforehand what I can put in a config file and I want a fast and hard failure if what I put in there is not good.

And this should be implemented at the file format parser level, with hooks for apps to add on top of the default behavior, so that every app that implements this format gets these things almost for free.


Haven’t you described cap’n’proto, protobuf, thrift, flatbuffers etc?

I know cap’n’proto also has fantastic support for using the schema for config files. You can just compile any constant as a stand-alone serialized message that you mmap into your code in a safe way. It can’t do complex math and things (at least yet) but you can express lists, dictionaries, and reference other constants, so as a config file replacement I love it. I’ve also found the format to be far more regular and consistent than you get with things like text protobuf (you’re still using the schema language instead of another format)


> You can just compile any constant as a stand-alone serialized message that you mmap into your code in a safe way.

Are you suggesting using a binary format for your config files? I think most people would find that more trouble than a decent text format.

> ... than you get with things like text protobuf

You can just use protobuf's canonical JSON representation (though the lack of ability to use comments is annoying).


You store your configuration as plain text in your repository and whatnot. When it comes to deployment you just compile it to a binary file.

Cap’n’proto also has plain text and JSON serialization formats if you really want to have your deployed config file be directly human-editable and deserialize from that. I was just noting a very cool feature of having your config written in cap’n’proto and it’s what Cloudflare uses to maintain a bunch of config internally if I read Kenton’s allusions to it correctly.


Just to be clear, I'm saying you use Cap'n'Proto constants to store your schema: https://capnproto.org/language.html#constants

You can then compile it into whatever format (JSON, plain text, binary) that you want for actually reading it from disk.


I think the parent is trying to say that the data is stored in a map which is read into a proto, etc. Kinda like what gRPC does over HTTP. Which kinda makes sense. The schema gives you a great idea of what "should be", and the typing/errors/etc are understood by the host language.


We had that a decade ago. It was called XML and XML Schema. All IDEs support it.

JSON was a huge step backwards in the name of simplicity. And now when we are going to add similar functionality to JSON, something else is going to come out in the name of simplicity (like NestedText).


I think XML sort of failed simplicity.

In a minute I can read and write json from most languages I use.

In the same amount of time, I'm still wondering if I should use a tag or attribute in xml. cdata? expat?

It's not that xml isn't a good technology. It's that it's not appropriate for general use, especially in comparison to simpler alternatives.


If your child node has unique name among its siblings and does not contain nested nodes, then it's an attribute. Otherwise it's an element. Seems pretty obvious to me.

The fundamental issue with XML is its impedance mismatch with common data structures which forces using Object to XML mappers (whether explicitly or implicitly). It's more or less solved with XML Schemas or DTDs, but if you're looking at just XML, you can't tell whether some element is an array or a single node. Thus JSON is better suited for serialization.


> If your child node has unique name among its siblings and does not contain nested nodes, then it's an attribute. Otherwise it's an element. Seems pretty obvious to me.

That is really not what attributes are for. I feel a bit of a fraud posting that because I'm not an XML expert and so not really clear what they actually are for. (This reinforces the parent's point: you need to be an expert to know what such a fundamental feature is for.) I remember it's something like "something used to help interpret the actual value", e.g. units of measurement. But most of the time, even if it's non-repeating with no children, you're supposed to use elements rather than attributes.

One problem here is that attributes are so much more compact (and so often easier to read) than elements that it's tempting to use them in places where you ought to use an element (and many people over time have given in to that temptation). Another problem is that the distinction between attributes and elements is almost never useful. That was the parent comment's point by the looks of things.

> The fundamental issue with XML is its impedance mismatch with common data structures

That's probably part of it, but I think at least as problematic is that it has many features that most of the time you don't need and don't want to have to care about. Things like CDATA (also mentioned by the parent comment), custom entities, external entities, DTDs (which can be inline in XML files so you need to know all about DTDs to understand XML properly). That's why there are all sorts of weird XML vulnerabilities that JSON doesn't have. Did you know you can make an XML file that reads your /etc/passwd file when it's parsed? That is not an issue with JSON.


HTML tags and attributes are markup. Strip them and the document would still be legible to a human being. Markup is the non-human part: presentation, the semantic web.

Confusion arises once the human observer is lost.


Thanks, I found this explanation really helpful, and almost obvious in retrospect (as the best explanations often are!).

I had been thinking that all of these extra features that XML have are just a case of massive overengineering that no one would ever need. In fact it's a case of taking something fundamentally meant for text documents with extra markup, as the name implies, and misapplying it to config files and IPC messages which are just not the original domain at all.


Thank you.

I think we should draw on XML's strong points. People read articles in a browser, not plain text. "Add to cart" is just a POST request with an id

    curl -d id=foo
yet we have forms and interactivity. Like in literate programming, text and data live together, an interactive application like a Smalltalk image.

In XML we can separate data from presentation.

    <?xml-stylesheet type="text/css" href="foo.css"?>
    <?xml-stylesheet type="text/xsl" href="bar.xsl"?>
    <root>...
Machine receives data, human receives an application with documentation, a builder. That's exactly what we have today, except the UI can be plugged into any stored document. Too good to be true.

I think XML was killed by poor usability. Plain text XML, XHTML and XSLT authoring is not fun.

I am trying to uncover it from a DOM perspective [1]; so far I like it more than Markdown. XHTML and HTML are just serialization formats. HTML is not a good one [2], [3], [4]. XSLT may have a nice GUI or a compact syntax like RELAX NG.

[1] http://sergeykish.com/live-pages

[2] http://sergeykish.com/script-style-is-cdata-in-html

[3] http://sergeykish.com/pre-newline-ignored-in-html-test

[4] http://sergeykish.com/content-after-html-appended-to-body-in...


> Did you know you can make an XML file that reads your /etc/passwd file when it's parsed?

Not only can SGML (but not XML on its own) read /etc/passwd, it can format it into fully-tagged markup and then render it into eg an HTML table, demonstrating what SGML/XML is actually designed for: encoding and authoring semistructured text. This can't be overstated in discussions like these, where use cases for config formats, service payload formats, and actual text authoring are all thrown into the same basket when they shouldn't be.

Btw: you can parse and canonicalize this new config file format into markup using the same SGML mechanism you'd be using for CSVs like /etc/passwd, namely short references

Btw2: you can skip/ignore markup declarations in XML, including whole declaration sets (DTDs), since these can be recognized using plain greedy regexps, though you can't ignore entity declarations when actually used in your XML body text


> you need to be an expert to know what such a fundamental feature is for.

No you don't... the parent commenter explained to you what it's for in a simple and concise manner... you chose not to accept that even though you're not an expert in this, and then complain that you need to be an expert to do it?!?


The parent commenter gave an explanation that, yes, was simple and concise, and also good enough for you to believe it (or you already thought that way). But it's also wrong. That just reinforces my point.

(The true difference is explained in sibling comments to yours, by sergeykish and tannhaeuser, if you're interested.)


The parent commenter explained it in a somewhat obtuse way.

I don’t doubt they meant to be clear, but reading it they were not, and it raised more questions than it answered.

As an example:

Wouldn't attributes be better served as details about the current element?

Wouldn’t elements be better served as “I am a child of the parent”?

Why would I use an attribute as a "non-repeating child" when semantically that doesn't make sense when looking at the document? The attribute is inside the element's definition, and it seems to me attributes should be used to further describe the element being presented itself, and not be structural or describe a child in any way.


JSON Schema [1] is actually a mature standard now, with decent tooling support, mostly through OpenAPI (formerly Swagger), which extends it with support for endpoints.

It's much simpler to use than XML Schemas, and arguably results in cleaner data models, since it doesn't have anything analogous to XML namespaces that allow for arbitrary mixing of schemas.

[1] https://json-schema.org/


> We had that a decade ago. It was called XML and XML Schema.

It would be true if XML were not full of all this SGML debris like "entities" (really, uncontrolled macros), if real schema formats were flexible enough (I needed <c> inside <a> and <c> inside <b> when they are totally different), etc.

But when a config reader tool has to deal with the 40+-year legacy of enterprise guys wanting to embrace the universe, yet all this doesn't allow controlling contents without external measures like regexp checking... it simply falls apart facing the real world.


Magento is a popular codebase that made XML-based configuration a fundamental part of its architecture. The results were terrible and caused numerous headaches and countless hours lost to trying to troubleshoot inscrutable configuration issues. The Magento 2 codebase began a shift away from XML for configuration, although it still uses some.

There may be room for an argument that Magento did XML badly (it did many things badly), but I don't believe I've ever seen XML done well.


I love XML configuration in Spring.


I don't get it. The @Configuration and @Bean annotations are at least 100 times more readable and powerful than whatever garbage people used to write into their xml files to define beans. 20 lines of xml are often equivalent to like 8 lines of Java and each of those Java lines is shorter than the xml equivalent. Repeating closing tags is not very interesting.


We have it today for JSON, it's called JSON Schema and many IDEs support it.


Exactly. JSON Schema allows one to describe exactly how the JSON should look, including inter-field validation. And with tools like react-jsonschema-form you can generate a UI on top of it for free.
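For example, with the Python jsonschema package (a sketch; the schema itself is made up):

    import jsonschema  # pip install jsonschema

    schema = {
        "type": "object",
        "properties": {
            "port": {"type": "integer", "minimum": 1, "maximum": 65535},
        },
        "required": ["port"],
    }

    jsonschema.validate({"port": 8080}, schema)    # passes
    jsonschema.validate({"port": "8080"}, schema)  # raises ValidationError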


I spent years working with xml, xslt, xml schema. Frankly when I first saw json I thought it was terrific. Nothing has changed my mind since. Why do you feel like it is a huge step backwards?


XML is fatally flawed because you can't safely put one XML doc inside another one. Because of this rather fundamental problem, it never was any good for anything, and it never will be.


Sure you can. At work we talk to a system that requires that we do exactly this. The solution they chose is entirely trivial and safe: include the embedded doc as a base64 encoded string...

And yes, I'm being sarcastic.


I don't understand why you can't safely put one XML doc inside another one. Many XML formats are literally built using this feature, like SOAP.


SOAP was and is an epic disaster, so that hardly seems like a refutation. The known way to embed an entire XML doc into a SOAP message was to use CDATA, which isn't a general solution because it means the embedded doc can't have ]]> in it anywhere. You could also base64-encode the included doc.

Both of these solutions and all other known solutions to this problem are, as I'm sure you can see, just awful.

You can't just paste XML in XML because of the <?xml?> thing, because of entities, and because of half a dozen other misfeatures of XML.


You miss the point entirely.

You put XML fragments inside a parent XML document using namespaces.

This is very well supported, and used extensively.

Trying to "escape" XML to nest it in a parent XML document is Wrong with a capital W.


> You put XML fragments inside a parent XML document using namespaces.

could you post or link to an example? i'm not very familiar with advanced XML features

or for a simple example: what would it look like to put `child` into `parent` using namespaces?

  # parent-doc.xml
  <parent>
    <!-- embed here -->
  </parent>

  # child-doc.xml
  <child x="3" y="5"/>


Roughly speaking, you can do things like the following:

    <!-- The special XMLNS attribute binds a short alias to a long name -->
    <p:parent xmlns:p="urn:some:unique:string">
        <c:child xmlns:c="urn:some:other:child:name" x="3" y="5">
            <c:subchild> <!-- No need to repeat the fully qualified unique name -->
                <p:tada>You can even interleave!</p:tada>
            </c:subchild>
        </c:child>
     </p:parent>
Note that while this is possible to write by hand, typically namespaces are for documents generated and processed by tools. The XML Schema Definition (XSD) format has full support for namespaces, so you can define documents based on modular chunks. E.g.: you can "import" the SVG namespace into a diagramming XML document format namespace, but restrict its usage to only the child nodes of an "img" tag. Or MathML as the children of "graph" nodes. Both SVG and MathML can potentially import a shared "font" namespace. Or whatever.

In the XML Reader API, each element has a "fully qualified" name that includes the long namespace prefix. If you use the API correctly, your tool can handle nested documents, or gracefully ignore them if it's appropriate.

The fiddly part is making this efficient, i.e.: avoiding a full string comparison against a long URI or URN. You typically have to "register" the namespaces you're interested in, and the API gives you some sort of efficient token instead of a string to use from then on.

I'm not saying it's perfect. Nothing is in XML. It was designed by committee, it brought too much of the legacy SGML baggage with it, but its namespace capabilities are a lot better than nothing at all, in much the same way that C# or Java don't have perfect type systems, but they're superior to loosely typed languages.


You don't embed plain text XML in CDATA, right? You escape it

    function escapeXml(unsafe) {
        return unsafe.replace(/[<>&'"]/g, function (c) {
            switch (c) {
                case '<': return '&lt;';
                case '>': return '&gt;';
                case '&': return '&amp;';
                case '\'': return '&apos;';
                case '"': return '&quot;';
            }
        });
    }
Or you convert to the same encoding, strip the XML declaration, expand entities. In short, work with adequate tools.


Good for nothing?

Well, except for handling complex content documents like in all ebooks and, in sgml form, all webpages like this one.


XInclude works pretty well.



Came here to say the same: Cuelang is by far the best config system and paradigm I have tried. All else seems so last century, though Cuelang has its foundation in NLP systems from last century :]


Never seen this, it's awesome! Might be an improvement over jsonnet, which was my favorite approach


Slightly off-topic, but yes, having fail-fast deserialisation is great.

I wrote a JSON/Kotlin serialisation library once and purposely restricted some JSON features to achieve that:

1. Fields can arrive in any order - this is standard

2. Field names are matched case-insensitively - so keyA and keya are the same, because who would use two fields differing only by case? Serialization keeps the original casing of the name.

3. Missing fields throw an error. If they are nullable, they have to be explicitly set to null - so that you can be sure the serialization side upgraded to the latest version of a protocol if a field was added, and things don't just work by chance.

4. Nullable strings are not coerced to empty strings or anything like it. Kotlin is null-safe, so if a field is a non-nullable string, an actual string value (even just "") has to be supplied. If it's, for whatever reason, a nullable string, you can set it to null.

5. Enums are also serialized case-insensitively - so you can write "keyA": "eNumVaLuE" if you want - typos should not break the code here; no one would use two enum values differing only by case. IIRC booleans could also be TRUE, tRuE, truE etc. (but NOT t or f, or yes or no, or 0 or 1 or empty).

6. Superfluous properties are silently ignored.

These rules were a great tradeoff for quick development, mixing languages and having fail-fast behavior with a stable protocol.

(https://medium.com/@fabianzeindl/generated-json-serialisatio...)


JSON schemas are available for a number of JSON/YAML config formats from JSON Schema Store[0]

[0] https://www.schemastore.org/json/


> You know what I want? Schemas.

I can see this working perfectly fine in typed languages like C#: `NestedText.Deserialize<T>("nestedtext")`, where the deserialize method handles the actual mapping of NestedText objects to `T`, by providing the deserializer a class (or classes) that handles the string -> scalar(s) mapping for the given `T`. That would, sort of, function as a schema.
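
A rough sketch of the same idea in Python (all names here are hypothetical; NestedText itself ships nothing like this): the "schema" is just a table of per-field converters applied to the parser's all-strings output.

    # Hypothetical schema: a mapping from field name to converter.
    SCHEMA = {
        "port": int,
        "timeout": float,
        "enrolled": lambda s: s.strip().lower() in ("yes", "true", "y"),
    }

    def apply_schema(raw, schema):
        # Fields without a converter stay as strings.
        return {key: schema.get(key, str)(value) for key, value in raw.items()}

    apply_schema({"port": "8080", "timeout": "1.5", "enrolled": "NO"}, SCHEMA)
    # -> {'port': 8080, 'timeout': 1.5, 'enrolled': False}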

I think the only thing, from glancing over the project, that would need to be supported to make this really useful is nested lists/dictionaries. I don't see how this can be done but maybe I'm missing it.


You can always do that, defining the schema in the client to produce sensible checks, even with JSON. The problem is that wherever the spec is underspecified is another place where two different clients can deserialize differently, and both be correct.

And the problem with stringly typed systems is that everything is underspecified


Protobufs have a text representation.


Yes indeed - it's actually pretty nice. You just define a message for your configuration schema:

  message Config {
    repeated Server server = 1;
  }

  message Server {
    string address = 1;
    int32 port = 2;
    bool standby = 3;
  }
And then you use the text representation in a config file:

  # main instance
  server { address: "127.0.0.1" port: 4567 }
  # backup instance
  server { address: "127.0.0.1" port: 9876 standby: true }
And load it into a message instance:

  Config config;
  google::protobuf::TextFormat::ParseFromString(input, &config);


Unfortunately, it is undocumented and has no formal spec, and this appears to be intentional, with no plans for improvement: https://github.com/protocolbuffers/protobuf/issues/3755.


Wow, I use pb's a ton and didn't know this. I'd upvote this twice if I could!

It looks oddly like HCL. I wonder...


As of protobuf 3 they also have a canonical JSON representation, which you can access from all the supported languages.


You want XML from 15 years ago? Yes, me too. Schemas and includes.


I've used XML. I don't want namespaces, I don't want the verbosity, I don't want entities, I don't want the security vulnerabilities.

I should have mentioned that I want something simple and readable.


Like in Windows, where you configure by clicking check boxes that can get disabled if invalid, with tooltips explaining what they do, additional help if you press F1, etc.?

It would be nice if we had such tools.


There's JSONSchema, and there are GUIs for handling/inspecting them.


This seems on its face to be a significant improvement on the goals of YAML, but I think the tradeoffs it makes will likely move YAML’s problems into a different place, creating a whole different set of difficulties in understanding what a given piece of data is, means, or does.

The problem with human friendly formats is that the thing that typically makes them human friendly is removing things that make reading and editing difficult, but make disambiguation possible. If the format ever needs to be read by a machine, something has to do that disambiguation.

If it’s not provided by the format, you’ve turned every usage into a potential source of bugs that would otherwise be restricted to interchange/stack implementation incompatibilities. In other words, now your format can have a different set of expectations even on the same system.

The natural response to that problem will be to bolt on validation, types, and documentation that is provided arbitrarily (and with varying quality).

IMO, efforts in human friendly formats should focus less on stripping out funny characters, and more on which minimal set of funny characters provide:

- Good readability

- Good editability

- Clarity of structure

- Clarity of data types

- Reasonable tolerance and flexibility for variance in arbitrary formatting/style preference (particularly in delimiting long form/multiline text and annotations), because no one can agree what good readability or editability means

- A flexible type system that allows machines and humans to know what a given datum is without variation or surprises

- Maybe humans should just use a GUI?


I generally find that the biggest problem with human friendly formats like YAML, which I think this also has, is that they tend to decouple readability from writability, and this encourages all sorts of complexity and polymorphism that seem superficially expressive, but end up just being difficult to work with. I've seen so many cases where YAML schemas turn into a quasi-DSL, because the developer thought that it was more important to have a clean looking configuration than one that is easy to edit. The result is that things like indentation get really weird, because the developer didn't optimize for having a sane underlying model.

A great comparison for this is CircleCI's config syntax and that used by GitHub Actions. The Circle format is extremely error prone; about half the time when I'm modifying a Circle config, I'll end up pushing a broken config, even though the YAML syntax itself is valid. With the GitHub Actions format, I almost never screw it up. I don't think it's a coincidence that if you convert a Circle configuration to JSON, it looks twisted and bizarre, whereas if you do the same with a GHA config, it looks perfectly ordinary and sensible.

If you think of YAML as "a prettier version of JSON", and design as if your users will work primarily with JSON, you can do fine with it. If you think of it as a medium for building your own configuration language, you'll make something awful. The problem is that any human friendly format is going to inherently encourage the latter.


See also the travesty that is Ansible's YAML-based DSL, which includes fun stuff like an in-line replacement language with tokens enclosed in braces, which of course you have to quote in some cases so that pyyaml doesn't think they are dicts.


This is basically my goal with https://concise-encoding.org

Also one more goal is twin binary and text formats that are 1:1 compatible, so that you can write it in text and transmit in binary.

I'm still finishing up the reference implementation, and then will start on the schema.


I’m getting fed up with this constant reinvention of the serialization game. Back in the day I used to be skeptical of IT and informatics precisely for this reason: always arguing about slight variations of the same mundanities: XML, XSD, IDL, ASN.1, Avro, JSON... Emacs vs. Vim, Weakly vs. Strongly typed, and so on...

What can an ICT professional claim at the end of their career? “Hey, I’ve argued about shit all my life!”


I feel the same way. I've settled on JSON everywhere. The only thing I don't have in my toolkit is an easy binary ser/deser format.


Also comments, schemas, hashmaps with anything other than string keys, and sparse reads.


Your statement implies that arguing is something bad. Arguing is presenting arguments that attempt to show (prove) why something is true or false. It's the most important tool (convincing each other) in the progress of civilization. If you don't try to convince each other, the only alternative, in the end, is just shoot whoever you disagree with.

And it also diminishes your value as a team member. If you can't convince others, it means the reasons you present are weak, and nobody will be interested in listening to you; therefore there's not much reason to have you around.


> And it also diminishes your value as a team member. If you can't convince others, it means the reasons you present are weak, and nobody will be interested in listening to you; therefore there's not much reason to have you around

Jeez, you extrapolated a personal observation all the way to a character assassination and firing letter paragraph.

Standups and 1on1 with you must be a blessing... a joy


If you think what I said is wrong, you're welcome to explain why. Personal attacks are neither productive nor interesting.

My reasoning is the following:

People would only listen to you if you can prove what you say is right, because nobody is interested in hearing wrong things or unexplained things, they just aren't helpful.

"We should use nodejs!" "Why?" "I don't want to argue, we just should." Is that helpful?

If you don't have the reasoning skills to convince others, you can't present constructive ideas and back them up with an explanation.

If you can't do that, literally, what is your value to the team? Blindly and quietly execute the will of other team members? That would take too much energy from those people, to direct you on every step of the way.

When you hire engineers, you expect them to give more than they take, otherwise they're a drain on the team resources.

Collective problem solving is impossible without arguing. Arguing is trying to improve something, identify mistakes, logical contradictions, basically you're doing the work of a compiler that checks your program for correctness. Would you want a compiler that always agreed with you, whatever you fed into it? Don't think so. Same thing with engineers working together to reach a common goal. You're checking and improving each other's ideas.

Edit: moreover, if you lack the reasoning skills to convince others, it means you lack reasoning skills altogether. How are you going to solve problems in the first place?


Can you please stop attacking me.

Please.

Thankfully we’ll never work together so can we just continue our existences as we did before, blissfully unaware of each other?


Presenting arguments isn’t bad if it’s about something important, but I think by “Hey, I’ve argued about shit all my life!” the GP takes issue more with the “shit” than the arguing.


It's explicitly stated in their comment:

> arguing about slight variations of the same mundanities: XML, XSD, IDL, ASN.1, Avro, JSON... Emacs vs. Vim, Weakly vs. Strongly typed, and so on...

There could be a lot of valid and important arguments around these things. Except maybe vim and emacs, who gives a shit about that.


Yes, when coding always ask yourself how you would tell your grandchildren about what you accomplished.


I don't like any machine-readable format that doesn't have some indicator that it is a complete document (as JSON or XML does). I've had a production issue where a format like this one was used: the file was read before it had finished being written, and we ended up with a corrupt configuration, as half of it was silently dropped without any way to know.


This is the best criticism of this entire thread.

It'd be easy to employ ... as a document separator / end indicator that could be checked for.
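
A minimal sketch of such a check in Python (the '...' sentinel is just the convention proposed here, not part of NestedText):

    def read_complete(path, sentinel="...\n"):
        # Treat the file as complete only if it ends with the agreed marker.
        with open(path, encoding="utf-8") as f:
            text = f.read()
        if not text.endswith(sentinel):
            raise ValueError(f"{path} appears truncated: no end-of-document marker")
        return text[:-len(sentinel)]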


You could add that yourself - so an application specific marker, using this format.


> For example, in JSON 32 is an integer, 32.0 is the real version of 32, and “32” is the string version. These distinctions are not meaningful and can be confusing to non-programmers.

I'm really struggling with this assertion; IMHO one of the problems with JSON is the lack of more sophisticated scalar types.

That being said, this appears extremely readable, so my concerns could definitely be alleviated by a decent schema.


This looks fantastic. I was recently reading a config file from Rust and ended up going with JSON5 but this is simpler to read. In a language like Rust you don't need the types because you specify that in the struct anyways. Sure, this means that there are effectively more types than strings but the user doesn't need to differentiate.

In Rust you do something like this:

  #[derive(Clone, serde::Deserialize)]
  struct Config {
    an_int: u64,
    a_float: f64,
    ordered_map: linked_hash_map::LinkedHashMap<String, chrono::DateTime<chrono::Utc>>,
    unordered_map: std::collections::HashMap<i32, String>,
  }
So there is no need to worry about whether maps are ordered, or whether a value is an integer, real, or string, in the format itself. Ironically, for Python (which the reference implementation is in) it does seem much more annoying to have to manually call `int()` on each element.

I'm just a little sad that tabs are disallowed. I really think the best rule for indentation-sensitive languages is that each line must either have the same indentation as the previous line (same level), that indentation plus any extra amount (the next level), or the exact indentation of some previous level (a dedent). These "solutions" which just forbid tabs are half-assed, and ones that try to convert tabs to a set amount of spaces just lead to confusion.
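
That rule is easy to implement by tracking indentation as literal prefix strings, which lets tabs and spaces coexist without ever equating them. A rough Python sketch (my own, not NestedText's actual behavior):

    def indent_levels(lines):
        stack = [""]  # stack of indent strings; one entry per open level
        for line in lines:
            indent = line[:len(line) - len(line.lstrip(" \t"))]
            if indent == stack[-1]:
                pass                                 # same level
            elif indent.startswith(stack[-1]):
                stack.append(indent)                 # next level
            elif indent in stack:
                del stack[stack.index(indent) + 1:]  # dedent to a known level
            else:
                raise ValueError(f"inconsistent indentation: {indent!r}")
            yield len(stack) - 1, line.lstrip(" \t")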

Additionally it would be nice if there was an example of a dict inside a list. I think it would work like the following but can't confirm from reading the site.

  -
    key: value
  -
    key: value
    other-key: other-value


Correct. You'll get:

    [{'key': 'value'}, {'key': 'value', 'other-key': 'other-value'}]


Maybe what we need is a widely used UI for trees, and editors for them. The editor reads a schema and tells you what blanks to fill in. Export XML, JSON, S-expressions, whatever - any tree structure.

The trouble is, open source cannot do good GUIs. If a problem is best expressed with a GUI, open source consistently blows it. See Gimp, Blender, Inkscape, FreeCAD, all of which are notably worse than their commercial competitors.


Proposal for someone with more free time than I:

Improve the format of org-mode. Make it primarily easy for humans to interact with but also easy for cheap scripts to parse and manipulate. Create a super-fast CLI for it which ships with the ability to read keybindings from a file. Ship with emacs keybindings as a default but also a file with the spacemacs keybindings. Add the ability to run the CLI as a daemon that can be started from neovim.

Keep the format open so someone else can write some npm package for including an editor in VSCode or a webapp.


> The trouble is, open source cannot do good GUIs. If a problem is best expressed with a GUI, open source consistently blows it.

What about browsers? Firefox and Chromium are open source (and fit under any reasonable definition of "GUI")


> Indentation is used to indicate the hierarchy of the data

After dealing with 500+ line kubernetes configurations, this is a bad idea.


Indentation by tab is good for shallow data or code, e.g. Python. Space is bad. For deeper levels, I'd go for a visible, countable symbol: Level 1, .Level 2, ..Level 3


This format seems to support one-space indents, so you could use an editor which highlights spaces easily and set indent level to 1 for visibility. The alternative is unfriendly config or another method of doing the same thing, and both are antithetical to having a simple data format.


I’m not sure the world needs another sexpr/sgml isomorphism, though it does look pleasing to the eye for the given examples. What this doesn’t solve though: yaml+jinja use cases, code as text, schemas outside of the document, everything else that makes a language out of a syntax tree.


> With NestedText any decisions about how to interpret the leaf values are passed to the end application, which is the only place where they can be made knowledgeably. The assumption is that the end application knows that Enrolled should be a Boolean and knows how to convert ‘NO’ to False.

That assumption is... not applicable to most scenarios I come across and will likely lead to issues being pushed downstream and introduction of subtle bugs and defects.


The special interpretation of NO in YAML has led to bugs and defects:

> I once disabled our product for the entire country of Norway for a day because `NO` in YAML evaluates to `false`

https://twitter.com/aarondjents/status/1307692593493553160


It'll lead to incompatible NestedText codecs/serdes :(

There should be at least some support for standardized representations (basically JSON + ISO 8601 datetimes + some encoding for embedding arbitrary stringified serializations, e.g. the way HTTP uses chunks and unique boundary tokens).


I have had great experience using JSONNET (https://jsonnet.org/) as a configuration language. It supports variables, inheritance, operators, functions, substitutions, types, with just the right amount of power, expressiveness, and simplicity.

In my opinion, JSON is best used as a wire-protocol. It is awkward as a configuration language.

YAML works for short configs, but becomes unmaintainable for longer configs. I think the primary problem is that the indentation is significant. I also think the language spec is far too complex.

INI format works for short configs, but also becomes unmaintainable for longer configs. Ironically I think this is because INI is too primitive, the opposite problem of YAML, but has the same effect.

I am not familiar with TOML or DHALL, mostly because I stopped looking after I implemented the JSONNET system and liked it so much.

Addendum: I have used text-formatted protobufs in limited situations with good results. But I don't think that protobufs is a good general purpose configuration language.

Addendum2: The amazing thing about the simplicity of the INI file format is that I was able to write a "single line" sed program to parse it in a bash script. The following finds the value of the $key in the $section of the $config_file INI file (definitely works with GNU sed; I think it works with macOS sed too, not 100% sure though):

    sed -n -E -e \
        ":label_s;
        /^\[$section\]/ {
            n;
            :label_k;
            /^ *$key *=/ {
                s/[^=]*= *//; p; q;
            };
            /^\[.*\]/ b label_s;
            n;
            b label_k;
        }" \
        "$config_file"


> I think the primary problem is that the indentation is significant.

An editor problem, perhaps? We don't maintain office documents using vim; why edit structured configuration files using a plain text editor, if doing so is arduous?

My text editor[0] abstracts the underlying hierarchical data format behind a tree-based widget[1]. Whether YAML, JSONNET, NestedText, CSON, XML, or TOML backs the widget becomes an implementation detail.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://dave.autonoma.ca/blog/2019/07/06/typesetting-markdow...


This actually looks great. Many people are complaining about the lack of data types, but most of the time you

a) Do not have values of different types occupy the same fields

b) Have a schema defined (explicitly or implicitly as part of the parsing), especially because you're likely working with a type system

This isn't good if you want an `x = parseJSON(blob)` kind of API, but that's definitely not what you want for any kind of human-editable config.

It seems simpler than TOML, I'd give it a try.


JSON is a serialization protocol, not a configuration syntax. It's designed to be written/read by machines. Its convenience for humans is that it can relatively easily be read or written by humans as well.

Protobufs is similarly a serialization protocol combined with an RPC layer for server and client stubs. Its canonical serialization format is binary.

Neither are particularly well suited for human written configuration files. A "JSON without the quotes around keys and allowing Javascript comments and commas at the end of objects or lists and a mechanism to escape multi-line strings" would probably cover most of the required cases.

YAML is an attempt at that, but it also attempts to solve a bunch of other problems in a complicated and fault-inducing way (e.g., relying on indentation for hierarchy).


This is an interesting idea. I don't know whether it will take off.

One thought that occurred to me several times in the last year or so is that roughly the level of abstraction offered by NestedText might make sense as part of a hierarchy of abstractions that could be built on.

We already have that with text files. Because end-of-line character combinations are special, text files in a given encoding are already more structured than streams of characters.

So, assuming UTF-8 character encoding:

  CharacterStream
    Text (EOL character combinations)
      NestedText
        MyNestedTextFormat (domain specific semantics)
With non-text files, this has already happened more than once. For example, both Zip files and Sqlite files are used as base formats for specifying other formats.


ASCII has always had separator and message delineation characters [1][2] that can be used instead of the CR, LF, SP, TAB, "|", "," etc. They are not "human readable", but are very easy to parse.

They are even UTF-8 transparent and can be easily converted to/from CSV, TSV, PSV, and, with a definition of the equivalent of a "close brace" could allow for a multi-level hierarchy.

[1] https://en.wikipedia.org/wiki/Control_character#Data_structu...

[2] https://en.wikipedia.org/wiki/C0_and_C1_control_codes


Yep.

We were recently storing tokens in a database, and I chose to use SOH for the metadata and STX for the text.

One byte width, no collisions with printable text, and that's what they're there for.

I'd love to see a CSV replacement that used SOH ... STX for headers, RS as "commas", and GS as "newlines". You'd be able to cleanly concatenate multiple files, since the first line is no longer special, and you'd be able to have commas, newlines, and in fact any printable text whatsoever inside the data.

And the semantics are perfectly clean. Again, that's what they're there for. Some small challenges for hand-editing that a competent text editor can easily rise to.
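
A sketch of that convention in Python (just the RS/GS framing; the SOH ... STX header handling is left out for brevity):

    RS, GS = "\x1e", "\x1d"  # ASCII record separator and group separator

    def encode(rows):
        # Fields may freely contain commas and newlines,
        # just not the separator characters themselves.
        return GS.join(RS.join(fields) for fields in rows)

    def decode(blob):
        return [record.split(RS) for record in blob.split(GS)]

    table = [["name", "notes"], ["Ada", "likes, commas\nand newlines"]]
    assert decode(encode(table)) == table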


Me too. I think this could be achieved. But there would need to be good editor support to make it successful. And good keyboard support too.


There's a lot of value to the ecosystem here if it could be standardized - possibly it should be an RFC.

Combined with what rswail wrote about encoding hierarchies, with careful design, those CSV sections could be embedded as tables.

If that was used as a base format for other formats, then the objection that the encoding for booleans and numerics isn't standardized might go away.


Looking closer at it... I think US would be the commas, and RS the newlines? Leaves it to the imagination what GS could be used for...


Hmm I think it would be nice to reserve US for when you have multiple entries in a single column.

Like if the column was phone numbers and occasionally there's more than one, that sort of thing. Thinking of each cell as a "record" and allowing it to have more than one "unit" makes sense to me.

But anything would be better than ever having to whip up a script to fix a CSV with comma-separated dollar values in it, ever again.


That is an interesting thought. Perhaps it is possible to arrange a type of nested structuring when this is needed. Like a CSV inside a value. C for "control code separated" of course :-)

Very thought-provoking... I think the main impediment is that these characters are not visible and not so easy to type. If they were, we might not have gotten the number of CSV variants that have evolved.


Yes, the challenge is editor support.

What I'd want is an emacs special mode, that displays RS as a red* comma, GS as a simple newline, US as a red semicolon, and regular newline as a red "\n".

Comma, newline, and semicolon insert the control characters, while M-, etc insert the literal characters. Not sure exactly how to handle header lines but this is the general premise.

*red as in "whatever method of visually distinguishing them as special works for you"


I don't know whether that's an "aside" comment, but it's interesting and informative anyway!

Another variant of the idea that I was mulling over was to base a roughly NestedText level of abstraction on UTF-8 transparent characters, and then combine that with what Animats was talking about as a standardized GUI for trees, dictionaries etc.

A recent trend is that programming languages have more than one bijectively equivalent syntax. For example, ReasonML [0] and OCaml are two bijectively equivalent text-based syntaxes. That idea could be extended to a NestedText-like syntax being bijectively equivalent to a text-based syntax. Editors like Visual Studio Code infer that sort of information continually on the fly, but it sort of gets lost in the toolchain. Compilers could operate at a higher level of abstraction than lexing/scanning. Git merge might also work better if it could operate at a NestedText-like level of abstraction.

[0] https://en.wikipedia.org/wiki/Reason_(syntax_extension_for_O...


Every time we have a conversation that touches on any serialization or configuration file format, it doesn’t take long for someone to pull out the flail and start beating on XML again.

XML might not be the best format for everything, and I for one am glad to use other formats for simple structured data. But when it comes to representing complex content, there is no other format that even comes close to being as useful and usable.

* All digital publishing of ebooks uses XML inside ZIP files.

* All contemporary mainstream word processors (Word, LibreOffice) use XML inside of ZIP as the basic file format.

* Automatic customized conversion processes from Word to InDesign or from InDesign to EPUB use XML at the heart.

* Let’s not forget the web itself, which is still mostly SGML in the form of HTML. Not XML per se, but only different in the details.

Not only is XML the only practical serialization format for working with publication content, but the presence of mature schema tooling is intrinsic to making publication automation robust in a given context.

I’m very glad for JSON and JSON schema in the domain of APIs. But in the domain of content data, it’s all XML.

Every serialization format has a domain for which it is most appropriate (whether or not it is the best choice in that domain.)

I’m really liking the shape of nested text for the domains in which I would have used YAML.


Specifically, lines that begin with a word or words followed by a colon are dictionary items; a dash introduces list items, and a leading greater-than symbol signifies a line in a multiline string. Dictionaries and lists are used for nesting; the leaf values are always strings.

No doubt there are use cases for this, but calling something that casts everything to strings an alternative to the above formats seems like a bit of a stretch.


It's not a bad idea, as text file formats are all strings anyway, to let the application do the conversion; it's the one with the domain knowledge. Perhaps it could do with a companion library to specify the data format and emit parse errors as appropriate. But keeping that out of the syntax makes a lot more sense than the insanity that is YAML.


I've been using TOML for my latest project, and the one thing that has really bitten me is that [1,2,3] is a valid array and [1.1, 1.2, 1.3] is a valid array, but [1, 1.5, 2] isn't valid and throws an exception due to it having heterogeneous types.


This is something that was fixed in the forthcoming TOML 1.0 spec. Parsers that have been updated to support it will allow heterogeneous types in arrays, which your application can convert to a vector of floats or whatever collection type it uses internally.
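
For example, Python's built-in tomllib (3.11+) implements TOML 1.0 and accepts the mixed array from the parent comment:

    import tomllib  # stdlib TOML 1.0 parser, Python 3.11+

    data = tomllib.loads("mixed = [1, 1.5, 2]")
    assert data["mixed"] == [1, 1.5, 2]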


Fantastic. So far that has been literally my only complaint about TOML.


Why choose this over the other options? Just some thoughts:

1. I like that comments are part of the standard. I wrote my own C++ JSON parser that allows for comments too.

2. Is it strict about indentation? One thing you can never get programmers to do on significantly sized teams is consistent indentation. Is that tab+space? Or spacex5? Is it going to break if a tab sneaks into Git? (Setting up Git push rules just annoys and confuses people.)

3. "without the syntactic clutter of JSON" - I happen to like it. I can compact it quite far if I need to. I also like the fact that I can spit it out over a debug server and JS will just magically start reading it.

4. Something really cool would have been the introduction of typed data. One way we achieve this via JSON is to create a template file which would declare something like (in a file named 'template.json' or something):

    {
      "data" : { "type": "float", "default": 0.0, "min": -1.0, "max": 1.0 }
    }
Obviously this requires checking in the code, but it does build up some kind of format checking and sanity. It can also warn you that it's using a default rather than a config defined value.

It would be nice if there was then the ability to define type syntax... But I fear this might be going too far.

5. Another thing I do with JSON is inheritance. So you define a 'parent' property at the top of a file, the values are loaded from the parent and then the child loads theirs over the top. Why have this? We usually need some per-application configuration but mostly it stays the same. It saves having to write it multiple times. You can even break it down into sections to keep each configuration file smaller.

(NOTE: For inheritance, a top tip is to implement a "maximum depth", in case you get into a loop.)
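
A minimal sketch of that inheritance scheme in Python (the 'parent' key and file layout are whatever your application decides; nothing here is standard JSON):

    import json
    import os

    def load_config(path, max_depth=10):
        # A 'parent' property names another config whose values the child overrides.
        if max_depth == 0:
            raise RecursionError("config inheritance too deep (loop?)")
        with open(path) as f:
            config = json.load(f)
        parent = config.pop("parent", None)
        if parent is not None:
            base = load_config(os.path.join(os.path.dirname(path), parent),
                               max_depth - 1)
            base.update(config)  # child values win over parent values
            config = base
        return config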


For the second point: most languages have linters that can enforce unified indentation, and there is also the https://editorconfig.org/ standard


One cool advantage this format has that is not mentioned anywhere is its potential for localization.

If everything is a string, and you need to parse values yourself, you can treat "prawda" and "fałsz" as booleans, instead of "true" and "false".
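
For instance, the application-side mapping could be as small as this (hypothetical, of course):

    # The application owns the string -> bool mapping, so a Polish-language
    # config file can use Polish truth values directly.
    BOOLS = {"prawda": True, "fałsz": False}

    def parse_bool(raw):
        return BOOLS[raw.strip().lower()]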


How is that useful?


It is, when you're accepting data from users, instead of data from other programs.

An end user with no programming experience could reasonably understand a file written in this format. This can't be said about json or yaml, not when they don't know what "true" and "false" mean.


Hm, I don't know. I know "we technical people" sometimes have the feeling that users don't understand anything, but if you teach someone how to write YAML they'll pick it up quickly (I just did that with my designer colleagues). I don't like the idea of having so many very similar data formats; it just gets more confusing.


Just give me JSON with support for comments (and ideally trailing commas) and I'm happy. I don't think there would be a need to simplify that further, and there's barely a way to do so without making sacrifices in some other important aspect.


Check out JSON5


Interesting idea to remove number and boolean types. The downside being that you spend more time writing parsers for the data. But if you had some sort of separate schema that generated all of that for you, it would no longer be such a big deal.

They mention that their key/value pairs are ordered. The downside here is that not all languages (e.g. JavaScript) support them.

I also prefer them to be unordered. The downside with ordered dictionaries is that you need to always be asking "does sequence matter here?". So it adds an additional thing to think about, more tests need to be written, and certain optimisations can't always be made.


Both Objects and Maps in JS are in stable insertion order.


Apologies, it looks like pre-ES6 they were unordered objects. But now there are rules that guarantee insertion order.


I actually think this is a great idea. There are a few immediate questions I have though.

1) how does one define a key with a ": "

2) as others said, significant whitespace, specifically trailing, seems to lead to problems with keys that have nested key/val, array, or multiline string values.

3) the GitHub repo associated with the parser has an issue questioning how to verify whether the file was truncated.

I think Deco: https://github.com/Enhex/Deco better solves the problem the OP has. It is provably delimiter-collision free, unlike this (see my point 1), unless I'm mistaken.


> For example, in JSON 32 is an integer, 32.0 is the real version of 32, and “32” is the string version. These distinctions are not meaningful and can be confusing to non-programmers.

Au contraire, mon frere. What's the point of this data format if it's only intended for humans and not computers? For computers the data type is critically important. For example, I've more than once seen the extremely ill-advised idea of treating zip code as a numeric, which completely screws your data model once you want to support zip+4 or international postal codes that contain letters.


Isn’t that a perfect example of a situation where stringly-typed-data-by-default would help?


No, because you’re pushing the choice onto the applications, giving them more opportunities to screw up. If your data is already typed and something is a string, odds are the application will either just follow that lead or look up why the data format authors picked that type in case it’s important or relevant.


Is it missing "list of dicts"? All the examples show only lists of simple strings. If so, that seems like a major problem.

Taking the main example, what do you do if you have more than one vice president?


This is the biggest problem with this format! You probably won’t need it for configs, but those are written by engineers, who are fine with other formats. For business data, a missing list of dictionaries is a deal-breaker. This is yet another shallow project, similar to the Dribbble-design portfolios of some people: looks cool at first glance, but has problems so big that it won’t fly.


This supposed "biggest problem" doesn't actually exist. Lists of dictionaries work fine, and I don't see any reason they wouldn't be useful for configs either. Maybe you should actually try it before dismissing it so shallowly?

This:

  list of dicts: 
      - 
          key 1: Hi 
          key 2: there! 
      - 
          key 1: I'm a list 
          key 2: of dicts!
parses as

  {'list of dicts': [{'key 1': 'Hi', 'key 2': 'there!'}, {'key 1': "I'm a list", 'key 2': 'of dicts!'}]}


Did you just guess that syntax? It's not in the documentation.

Hidden features effectively do not exist. The parent may have gone overboard with judgement, but this really is the fault of the project maintainers. I'll admit I mentally filed it in the "useless toy" category without this feature.

I went looking for it specifically because it's one of the ugliest and most confusing parts of YAML. To be pathological, what if you have a list of lists?


Sure, the documentation could be improved, but the syntax is extremely simple, consistent, and predictable. A list of lists is also exactly what you would expect:

  list of lists: 
      - 
          - first sublist 
          - goes here 
      - 
          - second sublist 
          - is here 
produces

  {'list of lists': [['first sublist', 'goes here'], ['second sublist', 'is here']]}
I don't think either of these are hidden. Yes, explicit examples would be nice, but assuming things not shown are impossible seems strange to me. Would you assume Python can't do lists of dicts, or lists of dicts of dicts? A quick search fails to find any examples on python.org (I did find lists of lists though).


That syntax is not obvious or predictable, and several people in this comment thread made the same assumption. Fix your docs.


I have no connection to this project. I'm speaking purely from my own experience of spending a few minutes looking at the docs and trying a couple of things in Python. The sublists and subdicts were obvious to me as soon as I learned the three forms lines can take. The only thing I wasn't sure about was whether the sublists or subdicts had to start on the next line, so I tried it out: yes they do.


Is it just me, or is NestedText kinda a subset of YAML? I personally see very little advantage to using this over JSON or YAML, but having more flavors and variations is always a good thing.

Edit: forgot to mention. I personally don’t like most general data serialization formats for configuration; the one I can probably tolerate is XML, but even that I only use when it’s part of a requirement. The way I usually implement program behavior configuration is through run commands (runcom, .rc)


Very YAML. I checked the example, and the difference is that YAML places the > following the attribute name, and this format places it before the text.

BTW, I happen to like YAML as a configuration format. It's much more readable than JSON. It's not that suitable for serialization, and probably shouldn't be used to create huge config files, but for the rest, it's as good as it gets.


I see two problems with this for my use.

"The format holds dictionaries (ordered collections of name/value pairs), lists (ordered collections of values) and strings (text) organized hierarchically to any depth."

1. Ordered dictionaries are not conveniently supported in all languages I tend to use.

2. The only element type is string, which means that parsing of common types has to be done separately, and possibly differently, by each implementation that uses such a file, with potential for unspecified differences.


> Ordered dictionaries are not conveniently supported in all languages I tend to use.

Ordered dictionaries are a fundamental, extremely basic data type present in every language I've used in three decades.

a) What modern, widely used language doesn't have them?

b) Why?

c) Okay, so you've picked a bad language that has only hashtables. You can still implement ordered keys using an array of the keys & values in sorted order, and a hashtable of keys to the array indices.


In the C++ and Rust standard libraries you have a choice between unordered maps and maps sorted by key, but not maps ordered by insertion.


The value of these representations and libraries is that they work in many places, not "It works for me."


I think the idea of leaving interpretation of the values to the end application is a good idea. I forget the details, but there was something about this idea related to the TCP/IP stack, with error correction being done at the application level rather than at one or more intermediate levels.


You are thinking of the end-to-end principle http://web.mit.edu/Saltzer/www/publications/endtoend/endtoen...


Lack of data types and schema support aside, there is one significant flaw with this, and it’s the one thing I always hated about YAML: the lack of delimiters. In a code editor that matches start and end blocks, having delimiters is a huge advantage. Heck, even in minified JSON, I can easily identify start/end blocks simply by placing the cursor on the opening or closing brace. Also, how would you “minify” a format like this where whitespace is significant? I think this is nifty purely as an alternative for simple (shallow data) use cases like configuration files; however, using this for data exchange, serialization, portability, and deeply nested, complex data structures could be problematic, just as it always has been for YAML.


The Python community would like a word.


There's a pretty good usecase which I highly doubt will ever be implemented: copy/paste between applications.

I don't think this will ever happen, but having a human-readable plain text output whenever you select any data from a UI is a very powerful idea. EVE Online did this back in the early 2000s and let you copy virtually anything in the UI. This led to awesome third-party tools that could work with the game with a minimal learning curve for the user (everything was "paste what you copied from the game"). It did use a custom serialization format... but a plain text one nonetheless.

IO via copy paste between a wide variety of apps in a known, and very simple, structured format would be awesome in my books.


This cites as an advantage over JSON:

> Unicode characters without encoding them

But JSON doesn’t generally require special encoding for unicode characters.

Technically, JSON is UTF-8 encoded, but that’s true of NestedText too so that can’t be what they mean.

Also, of JSON it says:

> in JSON 32 is an integer, 32.0 is the real version of 32...

But JSON doesn’t distinguish integer from real. It only has number.

On the format itself, I wonder if it needs some testing and hardening. The definition seems ambiguous, but maybe it just needs a formal grammar. Just from reading it, tab handling seems like an issue. I think you could have documents that look right but have invisible issues due to tabs. E.g., it sounds like a tab can be a character in a dictionary key name, which looks like indentation.


What is the representation for objects in the “name and arguments form” (e.g. enums in Rust, variants in Haskell, other tagged/discriminated unions)? It seems like they are becoming more popular in various programming languages.

What is the way to represent a map/dictionary where the keys are not strings? In JSON I also don’t have a good way, except I guess a list of objects with a key and a value field, but it feels heavy. In Lisp I would probably use an alist (whereas a plist might be used for the “string” (or symbol, rather) keys). In fact, how does one even write a list of lists with this?


Standardization is much more important than perfection. This is a horrible idea.


Oh great, another alternative that seems to not really fix anything significant, while creating a slew of other things that will be criticized.

It just shifts problems to other areas, which some may find even worse.


I'm not a fan of rigid formats for human-entered metadata structures like config. I think it violates a kind of Postel's law, where there ought to be considerable flexibility for the human to specify the structure (and therefore in what the format reader accepts). If you want rigidity, just use an appropriate programming language or, say, s-expressions. The rigid formats ask for their own linters to be created and maintained.

In my Indian music notation publication tool Patantara, I use a text preamble format that's limited and far simpler for people to use than any of these "data structure capable" formats - including NestedText/YAML/JSON. Here is a quick description -

1. You specify key value pairs with lines of the form -

    key with allowed spaces = any textual value
2. The first " = " on a line separates the key from the value. The key is normalized to lower case and spaces in the key are normalized to single space, and the key is LR-trimmed. The textual value is mostly kept as is (except for LR-trimming), but can accumulate more content.

3. Lines which don't have a " = " are just strings that get appended to the value of the immediately preceding key. So

    key = value1
          value2
          value3
will result in the key "key" having the single string value "value1\nvalue2\nvalue3".

4. If the same key is given multiple times, the values are concatenated with line breaks. So

    key = value1
    key = value2
    key = value3
gets you, for the key "key", the value "value1\nvalue2\nvalue3".

5. The reader doesn't care about data types. However, the application can decide how to parse the string values. One convenience I use for lists of items is "comma or whitespace separated values", where a list of values is given as "value1, value2, value3" or "value1\nvalue2\nvalue3" or "value1 value2 value3" or any combination including "value1, value2,\nvalue3 value4".
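
Rules 1-4 are simple enough that a rough Python rendering (unofficial, not the actual Patantara code) fits in a dozen lines:

    import re

    def parse_preamble(text):
        data, last_key = {}, None
        for line in text.splitlines():
            if " = " in line:
                key, value = line.split(" = ", 1)  # the first " = " wins
                key = re.sub(r"\s+", " ", key).strip().lower()
                value = value.strip()
                # Repeated keys accumulate, joined by line breaks.
                data[key] = data[key] + "\n" + value if key in data else value
                last_key = key
            elif line.strip() and last_key is not None:
                data[last_key] += "\n" + line.strip()  # continuation line
        return data

    parse_preamble("key = value1\n      value2\nkey = value3")
    # -> {'key': 'value1\nvalue2\nvalue3'}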

Nested structure is not possible with this format (unless you impose more structure on the value in the application), but it serves very well as a metadata format for text files in my application.

Ref: https://blog.patantara.com/blog/2017/04/16/controlling-who-c...


I like it. For use cases that are for non technical users to manipulate and maintain it makes sense to me.

In Emacs, org files are often used for configuration similarly.

While I'm a fan of EDN most of the time, which doesn't suffer from the issues they mention JSON having, not having to wrap text in quotes, and using indentation for nesting, definitely would make it nicer for non-technical people. So I can see a use case for this as a simple text interface for forms that is targeted at non-technical users.


Interesting insights into the thought process of the author. There are very few positive statements about why you'd want to use this format; they do, however, spend a great deal of time discussing neutral features and edge cases of currently widely adopted alternatives.

One might call this 'argument by gotcha'. It's interesting to imagine the kind of lifeworld that makes one think this is a worthwhile form of persuasion.


This has two of my favorite features!

1. Comments aren't part of the parse tree, so I can quickly strip all comments with a quick parse-print cycle. This is really helpful if you want to automate changes to the files.

2. I can freely truncate the file at (almost) any point and still have a valid config file. It's not obvious up front, but if your file gets snipped at some point, it won't (usually) cause parse errors for consumers of the file.


The parent comment is ironic. I've been wrestling with some declarative configuration, and it's getting to that frustrating scale where things are almost the same but a little different.

Point 2 isn't too bad in YAML: since JSON is valid YAML, a broken file can be reliably detected by checking for the leading '{' and trailing '}'.

With regard to point 1: both comments and nonstandard formatting like the above {} hack are currently hard.

Anyway, got to spend some time outside, and away from screens. Apologies for the snark.


2. As someone says in another comment, isn't that a mistake? What if my file is accidentally cut? In which cases would I want to snip my file without editing it?


Could be missing the start of the file too, in rarer cases.

Wonder what the best start of file and EOF-equivalent sentinel would be to use for YAML-esque files.

Something like a double colon?

  ::
  item1:
    - etc..
  ::


Regarding #2, would you really be ok with loading half a configuration file? Or am I misunderstanding your point?


> NestedText is a file format for holding data

Then why not call it NestedData? NestedText is misleading because text and data are different. That was a distinction many users of XML failed to make when they used it to store data, and they ended up dissing XML for being obese when it was designed for large text documents (wherein XML tags are an insignificant portion of the document), not data.


I like this quite a bit. Keep stringly-typed text files as strings; let code sort it out.

Pydantic would work great in conjunction with this to coerce types.
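
For example (the model and field names here are made up; the point is the coercion):

    from pydantic import BaseModel

    class Server(BaseModel):
        port: int
        standby: bool

    # Everything arrives from NestedText as strings; pydantic coerces
    # them according to the model's annotations.
    server = Server(port="8080", standby="no")
    assert server.port == 8080 and server.standby is False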


My Easy Data Transform software already supports yaml, json, xml, csv, tsv and Excel. The last thing I need is yet another format!


Looks good. Another very worthy competitor for the #2 position of best config language.

These felicitations were brought to you by the .ini gang


But this is an interoperability nightmare. Number, date and boolean formats are going to be a pain in the ass to deal with.


I'm mostly using HOCON these days. Strange that it's not included in the alternatives section.


It's kinda suckless, and I love it.

I think they made some good decisions. All types are strings and left to the application to parse and document. Excellent.

And I think the decision for the angle-bracket multiline syntax was great, because I think one could create a parser with awk/grep. Simple.


How would you include "\n >" in some multi-line text in this format?


    >
    > >


There is such a thing as JSON Schemas if you really want to nail down types.

I don't see this thing getting any traction. JSON is pretty much the standard at this point and I don't see anything changing that anytime soon.


There was a time you could, with as much justification, say that about XML. Not that long ago, actually.


I like this idea very much; two comments though:

1. Doesn't org-mode already solve this problem nicely, in a more widely-standardized way? (Maybe not; I'm just learning org-mode ATM.)

2. There is no "i" in Topeka.


1. Does anything use org-mode for data files (other than Emacs and things trying to be compatible)?

2. The addresses are of course fake; the zip codes don't match either. Same with the phone numbers. "KateMcD@aol.com" may be real, but "margaret.hodge@uk.edu" isn't.


Interesting. Maybe the time is ripe for a next generation of config formats. It looks decent. But if you just open a file with this, you might think it is YAML if there is nothing else to tip you off?


I read a bit but I'd like to see a comparison of the same data structure in JSON, YAML and NT.

I would also like to see YAML converted to NT and back to YAML just to see if there is any information loss.


Looks similar to CSON, at least superficially. https://github.com/bevry/cson


Which in turn has more than a passing resemblance to Rebol...

  ; Comments!!!
 
  ; An Array with no commas!
  greatDocumentaries: [
      'earthlings.com
      'forksoverknives.com
      'cowspiracy.com
  ] 
 
  importantFacts: [

      ; Multi-Line Strings! Without Quote Escaping!

      emissions: {Livestock and their byproducts account for at least 32,000 million tons of carbon
  Goodland, R Anhang, J. “Livestock and Climate Change: What if the key actors in climate change we
  WorldWatch, November/December 2009. Worldwatch Institute, Washington, DC, USA. Pp. 10–19.
  http://www.worldwatch.org/node/6294}
 
      landuse: {Livestock covers 45% of the earth’s total land.
  Thornton, Phillip, Mario Herrero, and Polly Ericksen. “Livestock and Climate Change.” Livestock E
  https://cgspace.cgiar.org/bitstream/handle/10568/10601/IssueBrief3.pdf}
 
      burger: {One hamburger requires 660 gallons of water to produce – the equivalent of 2 months’
  Catanese, Christina. “Virtual Water, Real Impacts.” Greenversations: Official Blog of the U.S. EP
  http://blog.epa.gov/healthywaters/2012/03/virtual-water-real-impacts-world-water-day-2012/
  “50 Ways to Save Your River.” Friends of the River.
  http://www.friendsoftheriver.org/site/PageServer?pagename=50ways}
 
      milk: {1,000 gallons of water are required to produce 1 gallon of milk.
  “Water trivia facts.” United States Environmental Protection Agency.
  http://water.epa.gov/learn/kids/drinkingwater/water_trivia_facts.cfm#_edn11}
 
      more: http://cowspiracy.com/facts
  ]
Some references: https://en.wikipedia.org/wiki/Rebol | http://www.rebol.com/article/0108.html


Hjson is the best format in my opinion https://hjson.github.io/


This seems... worse?

I can't figure out what's happening here. Is everything just a string? They complain about how YAML handles booleans (fair criticism); however, I can't see how NestedText fixes this.

This feels far inferior to TOML (everything NestedText does, and much more) or JSON (very unambiguous, works natively in most languages).

I see this a lot in startups and open source projects. They exist solely as a criticism of their competition (which is fair!), but don't work to understand the real problems and fix them in any meaningful way. You can spot them when they talk a lot about what's wrong with the competition ("AWS is too expensive", "YAML is too ambiguous") but never explain their solution.


"It's a complex problem so we just cast everything to a sting and require applications to handle the typing" is what I interpret it as.

Honestly after a few years using YAML you remember to always quote anything you know you need to be a string, and the different multi line types.

An argument against this is "why not just design it to be simpler" — well, then it's less useful in as wide a variety of applications, and stands less chance of actually becoming a standard like JSON or YAML has. And you end up with the XKCD problem.

YAML came in and solved a few of JSON's most glaring problems (multi line and comments) with a usable approach.

This new format seems like it's "YAML, but a little different."

If I wanted to usurp YAML, I'd focus on the greatest pain points for most, whitespace and schema support.


> YAML came in and solved a few of JSON's most glaring problems

AFAIK they were both independently created in 2001 and YAML was not created in response to JSON.


YAML just introduced other problems.


Sadly my comment will not add value to the conversation, but I came to the comments looking to see if someone made the XKCD reference. To my delight, here you are. #927 always delivers - https://xkcd.com/927/


Don't quite get the use of > for multiline strings. If you don't use > does it mean the string folds lines? Or is it an error?


So you can't use ' and " in the same key. Does this mean NestedText can't represent all possible key values?


One might say that this ain’t a markup language


Lemme piggyback a bit '__') I love this format, but I don't know the name: https://stackoverflow.com/questions/57470117/name-for-junipe...


I'm always disappointed when one of these new languages comes up and there's no railroad diagram.


Needs to ship with an editor (or a VS Code extension).

One of the biggest issues is just validating the actual data.


The only thing that I can think of when I see it: https://xkcd.com/927/


WTH, this is the exact syntax I use for my notes.


Misspelling "Topeka" does not inspire confidence . . .


Lol, “Topika”


https://cuelang.org

- clean syntax

- logical & unified

- roots in LINGO

- forked from Golang

https://cuelang.org/play


Do people really think significant indentation is a good idea?

Edit: A lot of people seem to be accidentally clicking the downvote icon.


It works fine in Python. It's nice because the visual indentation lines up with how the interpreter actually reads the code. You don't end up in a situation like this contrived example:

  if (something == somethingElse)
    doSomething();
    --counter; // added afterwards
when whitespace has actual meaning.

Your editor adds the whitespace to match the current scope, so you don't have to type it manually.

What's your issue with significant whitespace?


I don't mind it that much for programming languages like Python. The one annoyance I have is related to editor tooling: In languages that use braces to denote blocks, I can paste a block of code wherever and have my editor autoformat it to the correct indentation level, while in Python I have to be more deliberate with increasing or decreasing the indentation of a pasted snippet, and sometimes I'll make a mistake in the process that isn't caught until later.

I dislike significant indentation for configuration because the nesting tends to get deeper than for programming languages and it can be harder to see what's going on.


> You don't end up in a situation like this contrived example:

Modern languages repair this with mandatory block braces: {C style} (Rust, Go, Swift...) or Modula/Ada style (if x then foo else bar end). C style is easier for editors.

> What's your issue with significant whitespace?

1. There are many cases where leading whitespace still gets spoiled, including editors (whose use can't be avoided due to corporate tooling restrictions), web formatters, etc. The last decade's spread of mobile-aware tools made this worse.

2. With grouping by indentation, it's hard to determine a block's start if multiple blocks end at the same line, like:

    a:
        b
        c
        d:
            e
            f
    p
and let's imagine the most nested block occupies 2-3 screens (not rare for configs or code, even if it is against good coding rules). You have no way to decide which block start should be matched when you ask the editor to find the match.

I have coded in Python continuously since 2004, and I deem grouping by indentation a bad idea. Not fatally bad, of course, but with explicit grouping it would be a bit better.


I love CoffeeScript, Sass, and Pug for exactly this reason. Just my opinion, though.


Why not? I'm not a Python developer, but bad indentation is considered a faux pas in most production codebases and is already enforced by linters, so I don't really see the issue with enforcing it at the language level.


Apparently. It’s my single biggest pain point with Python, but apparently a lot of people think it’s a good idea.


I'd rather cringe a bit at it than fight over editor style configuration in reviews.


I don’t have strong opinions about it


Yes.



