It’s a shame that ASCII unit/record/group boundary characters were practically never used, at least since microcomputers took over. Instead we’re stuck with broken replacements like CSV.
Interestingly, if you ever scan a FedEx barcode, they use tons of ASCII separator characters. That's about the only place I've seen them significantly used, though.
In that alternate universe nobody would have put them into some varchar column? The edge case still exists, which means that the standard would need to support it somehow, and we're back to exactly square one.
But if I was writing a similar standard today and someone was pressuring me that I really had to make some kind of mechanism for nesting, I'd use something that transforms the field to remove the need for escaping, in a big and obvious way. Probably base64. That prevents most of the implementation issues you see with CSV.
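A minimal sketch of what I mean, in C. (The field layout here is just my illustration, not any existing standard, and the base64 encoder is bare-bones.) Because every field is base64-encoded before being joined, the payload can never collide with the separators, so no escaping mechanism is needed at all:

#include <stdio.h>
#include <string.h>

/* Bare-bones base64 encoder: standard alphabet, '=' padding, no wrapping. */
static void b64_encode(const unsigned char *in, size_t len, FILE *out)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i;
    for (i = 0; i + 2 < len; i += 3) {           /* full 3-byte groups */
        fputc(tbl[in[i] >> 2], out);
        fputc(tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)], out);
        fputc(tbl[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)], out);
        fputc(tbl[in[i + 2] & 0x3f], out);
    }
    if (len - i == 1) {                          /* 1 leftover byte */
        fputc(tbl[in[i] >> 2], out);
        fputc(tbl[(in[i] & 0x03) << 4], out);
        fputs("==", out);
    } else if (len - i == 2) {                   /* 2 leftover bytes */
        fputc(tbl[in[i] >> 2], out);
        fputc(tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)], out);
        fputc(tbl[(in[i + 1] & 0x0f) << 2], out);
        fputc('=', out);
    }
}

int main(void)
{
    /* Fields may contain commas, newlines, even the separators themselves;
       after encoding, none of that matters. */
    const char *fields[] = { "plain", "has,comma", "has\x1f" "separator" };
    for (size_t i = 0; i < sizeof fields / sizeof fields[0]; i++) {
        if (i) fputc(0x1f, stdout);              /* ASCII unit separator */
        b64_encode((const unsigned char *)fields[i], strlen(fields[i]), stdout);
    }
    fputc(0x1e, stdout);                         /* ASCII record separator */
    fputc('\n', stdout);                         /* only so the demo prints cleanly */
    return 0;
}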
Neither does a trivial variation of CSV where quotes are not special and commas must be escaped in whatever way. The separator is completely immaterial; the resulting format is identical.
>> Instead we’re stuck with broken replacements like CSV.
> ASCII separators are equally as broken. How do you represent a string containing the record separator in a record-separated file?
ASCII separators may not be perfect, but they're certainly less broken than CSV. With CSV, you have the pervasive problem of how to represent a common, everyday character in the file. With ASCII separators, you only have the problem of representing an unusual, untypeable character, which is much less troublesome.
I'm happy with CSV values containing a comma being quoted and a quoted quote character being represented by doubling it (similar to SQL doubling apostrophe). I don't see what pervasive problem remains.
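For the record, the whole rule fits in a few lines. A sketch in C (the function name is mine, purely illustrative):

#include <stdio.h>
#include <string.h>

/* Write one CSV field the usual RFC 4180 way: quote it if it contains a
   comma, a quote, or a line break, and double any embedded quotes. */
static void write_csv_field(const char *s, FILE *out)
{
    if (strpbrk(s, ",\"\r\n") == NULL) {
        fputs(s, out);                  /* nothing special, write as-is */
        return;
    }
    fputc('"', out);
    for (; *s; s++) {
        if (*s == '"')
            fputc('"', out);            /* "" stands for one quote */
        fputc(*s, out);
    }
    fputc('"', out);
}

int main(void)
{
    write_csv_field("plain", stdout);
    fputc(',', stdout);
    write_csv_field("has,comma", stdout);
    fputc(',', stdout);
    write_csv_field("she said \"hi\"", stdout);
    /* prints: plain,"has,comma","she said ""hi""" */
    fputc('\n', stdout);
    return 0;
}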
So many people look at a simple spec where correct implementation is a handful of lines, and many libraries are available, and decide they'll do an incorrect one-line implementation instead.
It's the temptation to just concatenate or split based on commas, or to hack a broken implementation of quotes on top, that causes so many issues.
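For instance, the tempting one-liner in C looks fine right up until a quoted field contains a comma:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Naive split-on-comma "parsing": quoting is silently ignored. */
    char line[] = "name,\"Doe, John\",42";
    for (char *f = strtok(line, ","); f != NULL; f = strtok(NULL, ","))
        printf("[%s]\n", f);
    /* prints [name] ["Doe] [ John"] [42] -- four fields instead of three */
    return 0;
}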
So the problem isn't CSV the spec, it's the use of CSV without a correct library implementation. I can hardly blame that on the format itself, and it shouldn't be a reason to reach for something else that could also be poorly parsed with self-written code. The other extreme is YAML, where the complex spec makes using any implementation concerning.
So many people keep messing up that at a certain point you can't just say it's their fault; there's a responsibility to guide them into better behavior.
There are two main options for that: make it so that naive parsing with self-written code is still very likely to do the right thing, or make the parsing hard enough that people go get a library.
I think a big reason they never took off is also that there's no visual representation of them, or input method for them.
I'd be happy to use them instead of CSV if I could edit records in Notepad or TextEdit the way I can with commas and quoted strings. See the separators, type them.
But of course once you do that, somebody now wants to insert a list of five values separated by unit or record separators into a field. They'd just be common ASCII characters along with tab, CR, LF, etc, and need to be escaped the same.
Those characters were "control characters", used for streaming data over serial, parallel, and other devices. Control characters could cause some real surprises when you used them.
XML has gaps in its escaping: XML 1.0 can't represent most control characters at all, not even as numeric character references. (Oops, I wrote JSON at first, got the technicalities mixed up.) Null-terminated strings don't have escaping. Fixed-width formats don't have escaping. Type-length-value formats don't have escaping.
And in practice I do not expect CSV with embedded nulls to work properly, so there's already precedent to reject certain characters entirely in a CSV-like format.
Formally, yes, but in practice there's an enormous difference between the number and kind of records that might contain a comma and the number and kind that might contain a non-printable record separator.
This difference carries over to C, where NUL got the job done for string termination pretty darn well, even if there were strong critiques to be weighed against it.
I think the C example is meaningfully different. \0 in C is a string terminator. There is no more string after it, so you can just read the string "while (c != 0)", and there is never any need to escape it, because by definition strings cannot contain that character.
From a purely linguistic standpoint, the choice is sound. Obviously it's objectively a disaster but for unrelated reasons.
sizeof s (fun fact: you can omit the parentheses because it's not a function, it's an operator) reports the size of s itself: the whole array if s is an array variable with automatic storage duration, but only the pointer if s is a pointer.
Consider the following program:
#include <stdio.h>
#include <string.h>

int main()
{
    const char * const s = "gogo\x00gogo"; // pointer to a 10-byte literal
    const char t[] = "gogo\x00gogo";       // array of 10 chars (9 + the implicit '\0')
    printf("%d,%d,%d,%d\n",
        (int)sizeof s,  // size of the pointer itself (8 on a typical 64-bit system)
        (int)strlen(s), // DO NOT EVER USE THIS FUNCTION
                        // YOU WILL BE FIRED IF YOU DO
                        // (stops at the embedded '\0', so: 4)
        (int)sizeof t,  // size of the whole array: 10
        (int)strlen(t)  // DO NOT EVER USE THIS FUNCTION
                        // YOU WILL BE FIRED IF YOU DO
                        // (stops at the embedded '\0', so: 4)
    );
}
> 8,4,10,4
You always strip them out of input, because a record value can’t contain more records.
CSV is more broken than you imagine. Try opening an American CSV file with Excel in Germany or France. You’ll discover that it doesn’t parse, and the reason is that the ‘C’ in CSV apparently stands for “could be a semicolon too” because comma is the decimal separator in these countries and therefore must be available for use inside values.
I work in data engineering; I have seen more ways to fuck up data than any human has any business seeing, so that's not new to me. Here are some more ideas for maximum fun with CSVs:
- Make sure to write your own parser! The format is so simple!
- Is the first line a header or not? The RFC says you should look at the MIME type to tell. You can't make this up.
- Is the line terminator "\r\n", "\n" or "\r"? (There's a sketch after this list.)
- What happens if the number of fields in a line is incorrect? Well, that SHOULDN'T happen...
- Make sure to involve Excel, nothing ever went wrong feeding data to Excel.
- Are end-of-line spaces relevant or not? Should we trim them?
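To make the line-terminator point concrete, here's a minimal sketch of a tolerant reader in C (the function name and the bracketed output are mine, purely for illustration). It accepts "\r\n", "\n" and bare "\r" as record terminators and handles RFC 4180-style quoting, and it's still nowhere near production-grade, which is rather the point:

#include <stdio.h>

/* Read one CSV field from fp. Returns ',' after a field, '\n' at end of
   record (accepting "\r\n", "\n" or bare "\r"), EOF at end of input.
   The field text is printed in brackets just to make the parse visible. */
static int read_csv_field(FILE *fp)
{
    int c = fgetc(fp);
    if (c == EOF)
        return EOF;
    putchar('[');
    if (c == '"') {                      /* quoted: may contain , and newlines */
        while ((c = fgetc(fp)) != EOF) {
            if (c == '"') {
                c = fgetc(fp);           /* "" is an escaped quote... */
                if (c != '"')
                    break;               /* ...a lone one closes the field */
            }
            putchar(c);
        }
    } else {                             /* unquoted: runs to , or end of line */
        while (c != EOF && c != ',' && c != '\r' && c != '\n') {
            putchar(c);
            c = fgetc(fp);
        }
    }
    putchar(']');
    if (c == '\r') {                     /* swallow the \n of a \r\n pair */
        c = fgetc(fp);
        if (c != '\n' && c != EOF)
            ungetc(c, fp);
        return '\n';
    }
    return c == EOF ? EOF : c;
}

int main(void)
{
    int c;
    while ((c = read_csv_field(stdin)) != EOF)
        if (c == '\n')
            putchar('\n');
    return 0;
}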
There is no defending CSV: it's a broken format and it's broken beyond repair. The point is that the separator is immaterial. Replacing commas with something else gains us exactly nothing and results in an identically broken format.
I love your post, and agree with everything except the last two sentences. Yes, CSV is broken beyond repair. But this doesn't make the choice of separator immaterial. It's broken for other reasons AND for the separator: a character that's already profusely found in text and numbers... I would even argue that it would be difficult to find a WORSE separator. Probably only space and hyphen qualify.
Full stop would be even worse. Apostrophe would be runner up...
The seemingly simple solutions always have the most issues in the real world... At some level, still trying to edit everything by hand may be where we've gone wrong for a long time. Why don't we have any sensible, agreed-upon structured format with tools available on each and every device...
This is not actually the case. The problem was that you were often transporting CSV over mechanisms that used control characters at the hardware level. Using those characters would often have unpredictable results when storing, retrieving, and transmitting data.
Right? And they're a really obvious solution to a problem that we somehow still have even though they've been around forever. The only problem I've ever seen is that there's no good way to type or display them
If there were a good way to type or display them, then normal people would be typing them every day (say, pasting them into comment forms or whatever), and then we would need to have some standard for quoting them.
It's a bit of a chicken-and-egg problem. If those characters were used, then text editors would find a way to handle them, at least for CSV support. Since they aren't, nobody wants to try to use them in a CSV-like file.