> It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.
> An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.
Yeah, and it's still a partially-parseable shit show with guessed values. But we could have, and should have, done better by simply defining a format to use.
Well, Excel has a lot of common use cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, picking a delimiter that frequently appears inside the data itself is a bit silly; it would be hard to think of a _worse_ character to use.
So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.
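For what it's worth, most parsers handle the semicolon variant with a single setting. A minimal sketch with Python's csv module (the sample data is made up, assuming a decimal-comma locale):

    import csv
    import io

    # Semicolon-delimited data, as Excel writes it in decimal-comma locales.
    data = io.StringIO('name;price\nwidget;1,50\ngadget;2,75\n')

    reader = csv.reader(data, delimiter=';')
    next(reader)  # skip the header row
    for name, price in reader:
        # The decimal comma still has to be converted by hand.
        print(name, float(price.replace(',', '.')))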
Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)
By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:
$ printf '\036' | xxd -p
1e
$ printf '\x1e' | xxd -p
1e
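If you genuinely wanted delimiters that never collide with ordinary data, the ASCII control characters already exist for exactly this. A toy sketch (my own helper names, nothing standard) using RS between records and US (unit separator, decimal 31) between fields:

    RS = '\x1e'  # record separator: decimal 30, octal 036
    US = '\x1f'  # unit separator:   decimal 31, octal 037

    def dump(rows):
        # No quoting or escaping needed, as long as the data itself
        # never contains these control characters.
        return RS.join(US.join(fields) for fields in rows)

    def load(text):
        return [record.split(US) for record in text.split(RS)]

    rows = [['name', 'price'], ['widget', '1.50']]
    assert load(dump(rows)) == rows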
Can you point me to a language with any significant number of users that does NOT have a JSON library?
I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, Tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).
The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but Scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down Forth dialects).
CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, and I would call the tooling we’ve built around it “impressive”, but I would by no means call any of it “good”.
I do not miss dealing with csv files in the slightest.
> CSV tooling has had [...] to make a fragile, under-specified format half-useful
You've got this backwards. Tabular data is ubiquitous, and text as a file format is ubiquitous because it is accessible. Once you decide to store tabular data as text, the only real decisions are whether to encode your variables as rows or columns, what the delimiter is, and a few rules such as escaping. Variables as columns makes sense because it makes appending easier. Plenty of characters can serve as delimiters, commas being the most common; none is perfect. But from that point onwards, the decisions don't really matter, and "CSV" basically covers everything.

"CSV" is simply what comes naturally when you have tabular datasets and want to store them as text. CSV tooling gets developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant; handling data is as complicated as the world itself is. The alternatives are either not structuring/storing the data in tabular form at all, or non-text (e.g. binary) formats. Those alternatives exist and are useful in their own right, but they don't solve the same problems.
I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.
There are at least three knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (Python’s csv module, for example, exposes eight dialect parameters covering exactly these details), but the fact that you have to worry about them at all is itself a problem.
You said “handling data is as complicated as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent figuring out which CSV parsing option to change could be better spent actually thinking about the data.
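To make that concrete, here's roughly what those knobs look like with Python's csv module (the file name and dialect settings are hypothetical, chosen for illustration):

    import csv

    with open('export.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(
            f,
            delimiter=';',   # knob 1: field separator
            quotechar='"',   # knob 2: quoting/escaping style
        )
        header = next(reader)  # knob 3: is row one names or data?
        for row in reader:
            print(dict(zip(header, row)))

Every one of those choices has to match whatever the producer of the file happened to do.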
I'm kind of amazed at how people complain about having to parse what is practically a random file.
Whether there's a header should be specified up front, and one shouldn't try to parse a completely unknown file, because that will always end in failure.
If you control both the serialization and the parsing, then yes, it will simply work.
But swallowing errors instead of pushing them back to the user, and trying to deal with everything, is going to be frustrating, because the number of edge cases is almost infinite.
Handling random data is hard; calling it a CSV and trying to support everything that comes with it is just as hard.
Microsoft Windows has had to invest enormous amounts...
Apple macOS has had to invest enormous amounts...
Pick your distro of Linux has had to invest enormous amounts...
None of them is perfect, and any number of valid complaints can be made about each of them. None of the complaints make any of these things useless. Everyone has workarounds.
Hell, JSON has had to invest enormous amounts of effort...
I guess the point is that I can take a generic JSON parser, point it at just about any JSON I get my hands on, and have close to no issues parsing it.
Want to do the same with CSV? Good luck. Delimiter? Configurable. Encoding? Configurable. Misplaced comma? In JSON that's a hard parse error; in CSV it might still parse, but the data is now semantically wrong, and depending on your parser you possibly won't know until it's too late. The list goes on.
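A quick sketch of the difference, using nothing but the Python standard library:

    import csv
    import io
    import json

    # A stray comma is a hard parse error in JSON...
    try:
        json.loads('{"name": "widget", "price": 1.50,}')
    except json.JSONDecodeError as err:
        print('JSON refused:', err)

    # ...while CSV happily yields a row with the fields shifted.
    broken = io.StringIO('name,price\nwidget,1,50\n')
    for row in csv.reader(broken):
        print('CSV accepted:', row)  # second row: ['widget', '1', '50']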
You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.
Importing CSVs in Excel can be a huge pain due to how Excel handles localisation. It can silently alter your data if you're not mindful of that, and I've seen it happen too many times.
Depends on what you mean by "better". I would rather software not handle a piece of data at all than handle it erroneously, changing the data without me realising and causing all sorts of issues afterwards.
Before you dismiss it as 'not a language', people have argued that it is. And you can definitely program stuff in it, so that surely makes it a language.
Excel can import and parse JSON; it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.
> it's easily the most widely supported data format in existence in terms of tools and language support.
Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.
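For the simple, fully-controlled case, "no library" really is just a join (a sketch; the values are made up, and this rightly breaks the moment a field can contain a comma, quote, or newline):

    rows = [
        ('timestamp', 'status', 'count'),
        ('2024-01-01T00:00:00Z', 'ok', '42'),
    ]

    # Fine only because we know the data never contains
    # commas, quotes, or newlines.
    with open('report.csv', 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(','.join(row) + '\n')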