
It’s a shame that ASCII unit/record/group boundary characters were practically never used, at least since microcomputers took over. Instead we’re stuck with broken replacements like CSV.


Interestingly, if you ever scan a FedEx barcode, you'll find it uses tons of ASCII separator characters. That's about the only place I've seen them significantly used, though.


ASCII separators are just as broken. How do you represent a string containing the record separator in a record-separated file?

Any such solution will always be broken; the proof is left as an exercise for the reader. Hint: use Ogden's lemma.


They are exactly the same... Except for the massive use case of storing human text, which contains tons of commas and zero ASCII separators.

So if you decide not to support "strings", just "text", then those separators Just Work.


In that alternate universe nobody would have put them into some varchar column? The edge case still exists, which means that the standard would need to support it somehow, and we're back to exactly square one.


The standard does not have to support nesting.

But if I was writing a similar standard today and someone was pressuring me that I really had to make some kind of mechanism for nesting, I'd use something that transforms the field to remove the need for escaping, in a big and obvious way. Probably base64. That prevents most of the implementation issues you see with CSV.


Neither does a trivial variation of CSV where quotes are not special and commas must be escaped in whatever way. The separator is completely immaterial, the resulting format is identical.


People screw up escaping. Don't escape.

CSV needs escaping, so don't do CSV.


>> Instead we’re stuck with broken replacements like CSV.

> ASCII separators are equally as broken. How do you represent a string containing the record separator in a record-separated file?

ASCII separators may not be perfect, but they're certainly less broken than CSV. With CSV, you have the pervasive problem of how to represent a common, everyday character in the file. With ASCII separators, you only have the problem of representing an unusual, untypeable character, which is much less troublesome.


I'm happy with CSV values containing a comma being quoted and a quoted quote character being represented by doubling it (similar to SQL doubling apostrophe). I don't see what pervasive problem remains.


So many people look at a simple spec where correct implementation is a handful of lines, and many libraries are available, and decide they'll do an incorrect one-line implementation instead.

It's the temptation to just concatenate or split based on commas, or to hack a broken implementation of quotes on top, that causes so many issues.


So the problem isn't CSV the spec, it's the use of CSV without correct library implementations. I can hardly blame that on the format itself, and it shouldn't be a reason to reach for something else that could also be poorly parsed with self-written code. The other extreme is YAML, where the complex spec makes using any implementation concerning.


So many people keep messing up that at a certain point you can't just say it's their fault; there's a responsibility to guide them into better behavior.

There are two main options for that. Make it so that poorly parsing with self-written code is still very likely to do the right thing, or make the parsing hard enough that people go get a library.


Yup, exactly.

I think a big reason they never took off is also that there's no visual representation of them, or input method for them.

I'd be happy to use them instead of CSV if I could edit records in Notepad or TextEdit the way I can with commas and quoted strings. See the separators, type them.

But of course once you do that, somebody now wants to insert a list of five values separated by unit or record separators into a field. They'd just be common ASCII characters along with tab, CR, LF, etc, and need to be escaped the same.

There's no escaping escaping...


Those characters were "control characters", used for streaming data over serial, parallel, and other interfaces. Control characters could cause some real surprises when you used them.

> There's no escaping escaping...

This is the truth.


XML has gaps in its escaping. (Oops I wrote JSON at first, got the technicalities mixed up.) Null-terminated strings don't have escaping. Fixed width formats don't have escaping. Type-length-value formats don't have escaping.

And in practice I do not expect CSV with embedded nulls to work properly, so there's already precedent to reject certain characters entirely in a CSV-like format.


Formally, yes, but in practice there's an enormous difference between the number and kind of records that might contain a comma and the number and kind that might contain a non-printable record separator.

This difference carries over to C, where NUL got the job done for string termination pretty darn well, even if there were strong critiques to be weighed against it.


I think the C example is meaningfully different. \0 in C is a string terminator. There is no more string after it, so you can just read the string "while (c != 0)", and there is never any need to escape it, because by definition strings cannot contain that character.

From a purely linguistic standpoint, the choice is sound. Obviously it's objectively a disaster but for unrelated reasons.


> the C example ... by definition strings cannot contain that character

They can, only some functions stop at the first zero:

    char s[] = "gogo\x00gogo";
    printf( "%d\n", (int)sizeof( s ) );


sizeof s (fun fact: you can omit the parentheses because it's not a function, it's an operator) reports the size of the array s, which is a variable of automatic storage duration.

Consider the following program:

  #include <stdio.h>
  #include <string.h>

  int main()
  {
      const char * const s = "gogo\x00gogo";
      const char t[] = "gogo\x00gogo";
      printf("%d,%d,%d,%d\n", 
          (int)sizeof s, 
          (int)strlen(s), // DO NOT EVER USE THIS FUNCTION
                          // YOU WILL BE FIRED IF YOU DO
          (int)sizeof t,
          (int)strlen(t)  // DO NOT EVER USE THIS FUNCTION
                          // YOU WILL BE FIRED IF YOU DO
      );
  }

  > 8,4,10,4


> > the C example ... by definition strings cannot contain that character

> They can, only some functions stop at the first zero:

> char s[] = "gogo\x00gogo";

And in what human language is that `\x00` a valid character? What rune/glyph/symbol represents it?

Nothing, you say? Well, looks like a good choice to me.


You always strip them out from input because a record value can’t contain more records.

CSV is more broken than you imagine. Try opening an American CSV file with Excel in Germany or France. You’ll discover that it doesn’t parse, and the reason is that the ‘C’ in CSV apparently stands for “could be a semicolon too” because comma is the decimal separator in these countries and therefore must be available for use inside values.


I work in data engineering, I have seen more ways to fuck up data than any human has any business seeing, that's not new to me. Here's some more ideas for maximum fun with CSVs:

- Make sure to write your own parser! The format is so simple!

- Is the first line a header or not? The RFC says you should look at the MIME type to tell. You can't make this up.

- Is the line terminator "\r\n", "\n" or "\r"?

- What happens if the number of fields in a line is incorrect? Well, that SHOULDN'T happen...

- Make sure to involve Excel, nothing ever went wrong feeding data to Excel.

- Are end-of-line spaces relevant or not? Should we trim them?

There is no defending CSV; it's a broken format and it's broken beyond repair. The point is that the separator is immaterial. Replacing commas with something else gains us exactly nothing and results in an identically broken format.


I don't think anyone's suggesting to take the CSV RFC and do a find/replace on the commas.

For one, I think you'd almost certainly use Record Separator, and \r and \n would be normal text characters inside of a field. And no trimming spaces.

And the behavior upon encountering corruption isn't really up to the spec, is it?

Headers might be more trouble than they're worth, but if they're in I'd say they need to be mandatory. One code path.

You'll find yourself with a much less broken format.


I love your post, and agree with everything except the last two sentences. Yes, CSV is broken beyond repair. But this doesn't make the choice of separator immaterial. It's broken for other reasons AND for the separator: a character that's already profusely found in text and numbers... I would even argue that it would be difficult to find a WORSE separator. Probably only space and hyphen qualify.


Full stop would be even worse. Apostrophe would be runner up...

Seemingly simple solutions always have the most issues in the real world... At some level, still trying to edit stuff by hand might be where we've gone wrong for a long time. Why don't we have a sensible, agreed-upon structured format with tools available on each and every device...


Instead of CSV we should use RFC 4180.


You don't... The ASCII separators are out of band. But the CSV separators are most certainly in band.


This is not actually the case. The problem was that you were often transporting CSV over a mechanism that used control characters at the hardware level. Using those characters would often have unpredictable results when storing, retrieving, and transmitting data.


Right? And they're a really obvious solution to a problem that we somehow still have even though they've been around forever. The only problem I've ever seen is that there's no good way to type or display them.


If there were a good way to type or display them, then normal people would be typing them every day (say, pasting them into comment forms or whatever), and then we would need some standard for quoting them.


It is a bit of a chicken-and-egg problem. If those characters were used, then text editors would find a way to handle them, at least for CSV-style files. Since they aren't, nobody wants to try to use them in a CSV-like file.


They are used, but they tend to be in places that nobody sees or thinks about. A few auto manufacturers use them in their serialized part codes.


Really old point-of-sale protocols sometimes still use them. That's how I learned about the field separator.



