
> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.

This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.




> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.

Yeah, and it's still a partially parseable shit show with guessed values. But we could have, and should have, done better by simply defining a format to use.


Meanwhile, Excel exports to CSV as “semicolon separated values” depending on your OS locale


Albeit for fairly justifiable reasons


Justifiable how?


Well, Excel has a lot of common use-cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, using a character that's frequently present as a piece of data as a delimiter is a bit silly; it would be hard to think of a _worse_ character to use.

So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.
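For a concrete picture, here's roughly what consuming such an export looks like (the file contents and layout below are invented):

  import csv, io

  # A hypothetical export from a German-locale Excel: semicolon
  # delimiter, comma as the decimal separator.
  data = io.StringIO('item;price\nwidget;3,14\n')
  for row in csv.DictReader(data, delimiter=';'):
      price = float(row['price'].replace(',', '.'))
      print(row['item'], price)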


Decimal separators being commas in some locales?


They could have just ignored the locale altogether, though: put dots in the numbers when writing CSV, and assume dots when importing.


This exactly. Numbers in XLS(X) are (hopefully) not locale-specific – why should they be in CSV?


CSV -> text/csv

Microsoft Excel -> application/vnd.ms-excel

CSV is registered as a text format; xls[x], JSON, and (mostly) XML are not (they live under application/*).


Commas are commonly used in text, too.


Clearly they should have gone with BEL as the delimiter.

  printf "alice\007london\007uk\nbob\007paris\007france\n" > data.bsv
I'm hoping no reasonable person would ever use BEL as punctuation or decimal separator.


If one was going to use a non-printable character as a delimiter, why wouldn't they use the literal record separator "\030"?


Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)

By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:

  $ printf '\036' | xxd -p
  1e
  $ printf '\x1e' | xxd -p
  1e
See also https://en.cppreference.com/w/cpp/language/ascii for confirmation.


On the off chance you're not being facetious, why not ASCII 0 as a delimiter? (This is a rhetorical question.)


ASCII has characters more or less designed for this

0x1C - File Separator

0x1D - Group Separator

0x1E - Record Separator

0x1F - Unit Separator

So I guess 1F would be the "comma" and 1E would be the "newline."


https://stackoverflow.com/questions/8695118/what-are-the-fil...

I am pretty sure you shifted the meaning: the decimal separator is part of the atomic data, so it does not need a control character.

You would use 1F instead of the comma/semicolon/tab, and 1E to split lines (a record is a line, just like a row in SQL).

You could then use 1D to store multiple CSV tables in a single file.
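A sketch of that scheme (with made-up data):

  US, RS, GS = '\x1f', '\x1e', '\x1d'  # unit / record / group separators

  # Two tiny tables in one string; no quoting or escaping is needed
  # as long as the data never contains these control characters.
  blob = RS.join([US.join(['name', 'city']),
                  US.join(['alice', 'london'])])
  blob += GS + RS.join([US.join(['id', 'total']),
                        US.join(['1', '3,14'])])

  for table in blob.split(GS):
      for record in table.split(RS):
          print(record.split(US))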


Yes, but then the text is not human-readable or editable in a plain text editor.

This would confuse most users of CSVs; they are not programmers, and at most they use text editors and Excel.


I am not proposing to do this, but if you were to use ASCII separators, this is how you would do it.


There are some decent arguments for BEL over NUL; however, given that you posed it as a rhetorical question, I feel I can say little other than

ding! ding! ding! winner winner, chicken dinner!

Although BEL would drive me up the wall if I broke out any of my old TTY hardware.


...and excel macros


Sure, let's put quotation marks around all number values.

Oh wait.

lol


Can you point me to a language with any significant number of users that does NOT have a JSON library?

I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).

The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but Scratch code itself is encoded as JSON), and Forth (I could find implementations, but it's very hard to pin down Forth dialects).


I made no claim about JSON libraries. I contested the claim that "CSV libraries and tools suck". They do not.


CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, and I would call the tooling that we’ve built around it “impressive”, but I would by no means call any of it “good”.

I do not miss dealing with csv files in the slightest.


> CSV tooling has had [...] to make a fragile, under-specified format half-useful

You have this backwards. Tabular structured data are ubiquitous. Text as a file format is also ubiquitous, because it is accessible. The only actual decisions are whether to encode your variables as rows or columns, what the delimiter is, and other rules such as escaping.

Variables as columns makes sense because it makes appending easier. There is a bunch of stuff that can be used for delimiters, commas being the most common; none is perfect. But from that point onwards, the decisions do not really matter, and "CSV" basically covers everything. "CSV" is what comes naturally when you have tabular datasets and want to store them as text.

CSV tooling is developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant; handling data is as complicated as the world itself. The alternatives are either not structuring/storing the data in a tabular manner, or non-text (e.g. binary) formats. These alternatives exist and are useful in their own right, but they don't solve the same problems.


I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.

There are at least 3 knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (for example, Python’s csv module exposes 8 dialect parameters covering exactly these details), but the fact that you have to worry about them at all is itself a problem.
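For instance, a minimal sketch (the dialect values below are per-file guesses, which is exactly the problem):

  import csv, io

  # Three of the knobs: delimiter, quote character, and whether a
  # header row is present at all.
  f = io.StringIO('name;"notes"\nalice;"likes; semicolons"\n')
  reader = csv.reader(f, delimiter=';', quotechar='"')
  header = next(reader)              # ...assuming there is a header
  for row in reader:
      print(dict(zip(header, row)))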

You said “handling data is complicated as much as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent on figuring out which CSV parsing option I have to change could be better spent actually thinking about the data.


I am kind of amazed at how people complain about having to parse what is practically a random file.

Whether there is a header or not should be specified up front; one should not parse some unknown file blind, because that will always end in failure.

If you control both the serialization and the parsing, then yes, this will simply work.

But trying to deal with everything instead of pushing some errors back to the user is going to be frustrating, because the number of edge cases is almost infinite.

Handling random data is hard; calling it a CSV and trying to support everything that comes with it is hard.


Microsoft Windows has had to invest enormous amounts...

Apple macOS has had to invest enormous amounts...

Pick your distro of Linux has had to invest enormous amounts...

None of them is perfect, and any number of valid complaints can be made about each of them. None of those complaints makes any of them useless. Everyone has workarounds.

Hell, JSON has had to invest enormous amounts of effort...


I guess the point is that I can take a generic json parser and point it at just about any JSON I get my hands on, and have close to no issues parsing it.

Want to do the same with CSV? Good luck. Delimiter? Configurable. Encoding? Configurable. A misplaced comma? In JSON, no parse; in CSV it might still parse, but it is now semantically incorrect, and depending on your parser you possibly won’t know until it’s too late. The list goes on.
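A contrived example of the difference:

  import csv, io, json

  # JSON: a structural error refuses to parse at all.
  try:
      json.loads('{"name": "alice", }')
  except json.JSONDecodeError as e:
      print('json rejected it:', e)

  # CSV: a stray comma parses "fine" and silently shifts columns.
  bad = io.StringIO('name,city\nalice,,london\n')
  for row in csv.DictReader(bad):
      print(row)  # city becomes '' and 'london' lands under the None restkey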


Here is a quick test

The table of contents points to a single JSON object that is 20-ish GB compressed.

https://www.anthem.com/machine-readable-file/search/

All stock libs will fail
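The stock json.load would try to materialise the whole thing in memory; you need a streaming parser instead. A sketch with the third-party ijson library (the key names below are invented, not the real layout of those files):

  import io
  import ijson  # third-party streaming JSON parser

  # Stream matching sub-objects one at a time instead of loading
  # the whole document at once the way json.load() does.
  sample = io.BytesIO(b'{"reporting_structure": [{"id": 1}, {"id": 2}]}')
  for entry in ijson.items(sample, 'reporting_structure.item'):
      print(entry)  # {'id': 1}, then {'id': 2}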


You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.


Can you import .jsonl files into Google Sheets or Excel natively?


Importing CSVs into Excel can be a huge pain due to how Excel handles localisation. It can silently alter your data if you are not mindful of that; I have seen it happen too many times.


Excel dropping leading zeros (as in ZIP codes) was a crazy design decision that has certainly cost many lifetimes of person-hours.


And forcing 16+ digits to be floats, destroying information.
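The mechanism is IEEE 754: a 64-bit double carries only 15-16 significant decimal digits, so anything longer gets rounded. Easy to reproduce outside Excel, too:

  # A 17-digit identifier does not survive a round-trip through a double.
  s = "12345678901234567"
  print(f"{float(s):.0f}")  # 12345678901234568 -- the last digit changed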


Yeah, I have had similar struggles with social security numbers.


For example:

Scientists rename genes because Microsoft Excel reads them as dates (2020)

https://www.reddit.com/r/programming/comments/i57czq/scienti...


I was so glad of that story. It gave me something to point to to get my boss off my back.


But it handles it better than JSON.


Depends on what you mean by "better". I would rather software not handle a piece of data at all than handle it erroneously, silently changing the data without me realising it and causing all sorts of issues afterwards.


In practice, web browsers accept the tag soup that is sometimes called HTML, while the strict XML-based formats failed.


A browser is not a database (unlike Excel). Modifying data before showing it is reversible; modifying it before storing it is not.


Excel.

Before you dismiss it as 'not a language', people have argued that it is. And you can definitely program stuff in it, which surely makes it a language.


Excel can import and parse JSON; it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.


I’ve found that the number of parsers that don’t handle multiline records is pretty high though.
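Quoted fields may legally contain line breaks, which is exactly what naive line-splitting trips over:

  import csv, io

  data = 'name,notes\nalice,"line one\nline two"\n'

  # Naive splitting sees three "rows"...
  print(data.strip().split('\n'))

  # ...while a conforming parser sees two, one with an embedded newline.
  for row in csv.reader(io.StringIO(data)):
      print(row)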


It's widely, but inconsistently, supported. The behavior of importers varies a lot, which is generally not the case for JSON.


> Despite the lack of any sort of specification

People keep saying this but RFC 4180 exists.


> it's easily the most widely supported data format in existence in terms of tools and language support.

Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.
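When you control both ends and know the fields can never contain a delimiter, quote, or newline, it really is this boring (a sketch, not a recipe for untrusted input):

  rows = [('alice', 'london'), ('bob', 'paris')]

  # Fine only when no field can contain a comma, quote, or newline.
  with open('out.csv', 'w') as f:
      for r in rows:
          f.write(','.join(r) + '\n')

  # Reading it back is just as boring.
  with open('out.csv') as f:
      parsed = [line.rstrip('\n').split(',') for line in f]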



