> I think avoiding numeric types is a good decision.
Only if this format is intended for use-cases that never need to deal with numbers.
> One should remember that any sane application will be parsing the config file into internal data structures and validating it anyway so it gets little benefit from the numbers being already “parsed”.
That statement couldn't possibly be more wrong.
Number parsing (and encoding!) is a decidedly non-trivial problem. You need to concern yourself with -- at a minimum -- all of the following:
- Unsigned 64-bit numbers.
- A series of digits that would be bigger than a 64-bit whole number. Convert to float? Truncate in some way? Error?
- NaN
- Infinity
- Negative zero
- Denormal numbers.
- Differentiating between decimal/currency types and floating point numbers. Not all decimal values can be exactly represented as floats!
- Efficiently encoding floating point to use the minimum digits without losing precision.
- Parsing those minimal numbers with perfect "round-tripping".
- Doing the above efficiently.
- Securely too! Efficient parsers cut corners on sanity checks. I hope you fuzzed your parser...
The above can easily amount to many kilobytes of extremely complex code. Look up "Ryu" as an example of what Ulf Adams (then at Google) came up with to make float-to-string conversion reasonably efficient, and that is only the encoding half of the problem.
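As a small illustration of the round-tripping and decimal points in the list above, here is a sketch in Python (picked only for illustration; nobody in the thread names a language). It leans on Python's `repr` producing the shortest string that parses back to the same float. The helper name `round_trips` and the sample values are made up for this example.

```python
from decimal import Decimal
import math

def round_trips(x: float) -> bool:
    """True if printing x and parsing the text back gives the same float."""
    y = float(repr(x))
    if math.isnan(x):
        return math.isnan(y)
    # Compare the sign as well, so that -0.0 and 0.0 are told apart.
    return x == y and math.copysign(1.0, x) == math.copysign(1.0, y)

# Includes a denormal, negative zero, infinity, NaN and the largest double.
samples = [0.1, 0.1 + 0.2, 5e-324, -0.0, float("inf"), float("nan"),
           1.7976931348623157e308]
all_ok = all(round_trips(x) for x in samples)

# 0.1 has no exact binary representation; Decimal(0.1) shows the stored value.
exact_tenth = Decimal(0.1)

# A decimal type keeps the value exactly as written; a float does not.
sum_decimal = Decimal("0.1") + Decimal("0.2")
sum_float = 0.1 + 0.2
```

Here `sum_decimal` equals `Decimal("0.3")` while `sum_float` prints as `0.30000000000000004`, which is the currency problem in miniature.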
Meanwhile, reading a fixed-length number from a binary format can be done in a single machine instruction. One. It might not even take an entire CPU clock cycle! Okay, two, if you need to bounds-check your buffer, but there are ways to avoid that.
Afterwards, the bounds check is again literally just two machine instructions in complexity. That's not the difficult bit!
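For comparison, the fixed-length read might be sketched like this with Python's standard `struct` module. The 8-byte little-endian layout and the names `read_u64_le` and `buf` are assumptions for illustration; a real format would have to specify width and byte order itself.

```python
import struct

def read_u64_le(buf: bytes, offset: int = 0) -> int:
    """Read an unsigned 64-bit little-endian integer at the given offset."""
    # The explicit length check plays the role of the buffer bounds check.
    if offset < 0 or offset + 8 > len(buf):
        raise ValueError("buffer too short for a 64-bit value")
    (value,) = struct.unpack_from("<Q", buf, offset)
    return value

# Encode the largest unsigned 64-bit value, followed by a negative-zero double.
buf = struct.pack("<Q", 2**64 - 1) + struct.pack("<d", -0.0)

largest = read_u64_le(buf, 0)
(neg_zero,) = struct.unpack_from("<d", buf, 8)
```

Python adds interpreter overhead, of course; the point of the sketch is only that there are no digits to interpret, so none of the edge cases from the list arise at read time.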
You’ve given lots of examples of things that make parsing numbers difficult but I don’t see why they are relevant to a config file written by humans. I think it makes sense to have the number parsing owned by the thing which cares about the number format.
One example you provide is decimals for currency values but I claim you would want such values to look like $1234 in config files so that when they are reviewed or written, the person reading the file knows they are looking at a dollar value and can be concerned if it is too large.
I’m not suggesting that applications write their own number parsing. Just do uint64::parse or parseInt or Double.of_string, or whatever else you need to access your language’s number parsing routines.
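A minimal sketch of that approach, assuming every config value arrives as a string and the application decides how to interpret each field. The field names (`max_refund`, `retries`) and the helper `parse_dollars` are hypothetical, and Python stands in for whatever language the application is written in.

```python
from decimal import Decimal, InvalidOperation

def parse_dollars(text: str) -> Decimal:
    """Parse a value written like "$1234" or "$12.50" into an exact Decimal."""
    if not text.startswith("$"):
        raise ValueError(f"expected a dollar amount, got {text!r}")
    try:
        amount = Decimal(text[1:])
    except InvalidOperation:
        raise ValueError(f"not a valid amount: {text!r}") from None
    # Reject NaN, infinities and negative amounts at the application level.
    if not amount.is_finite() or amount < 0:
        raise ValueError(f"amount out of range: {text!r}")
    return amount

# The format hands over plain strings; each field picks its own parser.
raw_config = {"max_refund": "$1234", "retries": "3"}
max_refund = parse_dollars(raw_config["max_refund"])
retries = int(raw_config["retries"])
```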
> Just do uint64::parse or parseInt or Double.of_string, or whatever else you need to access your language’s number parsing routines.
Okay, so the computer is doing the parsing.
Those functions are notoriously inconsistent in their behaviour, particularly across different programming languages. If you're not careful, you'll end up accidentally using the internationalised versions of those functions. Even if you're careful, other people won't be.
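To make the inconsistency concrete, here is what Python's built-in `int` and `float` happen to accept. These are behaviours of Python's own routines only; the equivalent functions in other languages draw the lines differently, which is the problem. The helper `accepts` is made up for this example.

```python
def accepts(parser, text: str) -> bool:
    """Report whether a parsing function accepts the given text."""
    try:
        parser(text)
        return True
    except ValueError:
        return False

# Inputs a config author might not expect to be valid integers.
surprises = {
    "underscores": int("1_000"),                   # digit separators are allowed
    "whitespace": int(" 42\n"),                    # surrounding whitespace is ignored
    "unicode digits": int("\u0661\u0662\u0663"),   # Arabic-Indic digits one, two, three
    "leading zeros": int("010"),                   # read as decimal, not octal
}

# float() accepts special values spelled out in text, and overflows to infinity.
special_ok = [accepts(float, s) for s in ("nan", "inf", "-Infinity", "1e400")]

# Inputs that some other languages' routines accept, but Python's int() rejects.
rejected = [accepts(int, s) for s in ("0x10", "1,000", "1.0", "")]
```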
Remember, data formats are for interchange. They have to be language agnostic. They have to be well-defined, and it should be possible to write a parser for them without having to guess at the precise details.
If you go fully against the robustness principle, you lose the reason to use textual formats as well, since they are designed to be forgiving of human input errors and to catch them at the syntax level.
And - it is certainly OK in many instances to have fixed-width, fixed byte-order binary encoding as the format's basis. It comes with the twin downsides of wholly different categories of errors cropping up, and with the lack of a universally agreed upon tool for human entry.
Perhaps text was a fashion, though. I definitely have had thoughts in that vein lately. And in that case we shouldn't always be rushing to use it as the source of truth when we have many good, machine-level agreements about numeric formats.
I wrote some config for my application, which knows how to read it. Why do I want some other application in some other programming language to read it too?
I am far more worried about localisation issues than language issues. If you are storing something central to multiple applications, I'd argue a text file is the wrong tool.
But which of these are problems in configuration files written by a human? That is the aim of the format. Moreover, in applications where there could be issues, they would almost certainly be tied to very specific fields, and you would want specific application logic to handle those fields. Now if people misuse it as a data exchange format, then yes, I agree with you, but at that point just use a binary format instead.
That doesn't matter at all. The author's aims will be ignored if this format is used for anything even vaguely important. Eventually it'll need tooling to both read and write it.
DevOps pipelines, applications with GUIs, or something will need to both parse and generate this format in a consistent way.
There is no such thing as a human-write-only format in widespread use.
Even programming languages are regularly generated by tools such as RPC API codegen tools, LINQ-to-SQL and the like.
The difficult bit is the parsing.