> Most tabular data currently gets exchanged as: CSV, Tab separated, XML, JSON o...

dossy · on May 3, 2022

> That's CSV/TSV's real shortcoming: about the only generic validation they allow is to make sure the column count is the same for all rows.

Once upon a time, when I was doing a lot of data interchange between a wide variety of systems (OS'es, applications, etc.) I considered proposing an "enhanced CSV" (ECSV) where the values did not start on the second row in the file, but instead the second row would be regular expressions that could be used to validate the contents of the columns that followed, and data would start on row 3.

In other words, you might have:

``` ID,NAME,DATE

   "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"

   867,Alice,1984-01-09

   5309,Bob,1981-11-16

```

(Newlines added because HN doesn't speak Markdown, sigh.)

In the end, I think the solution was far simpler: we just exchanged a separate descriptor file that had column names and their corresponding regexp patterns for validation as a separate (versioned) file, in order to save a few bytes inside each file transmission, which was a real savings when you paid per-byte over an EDI network.

LordDragonfang · on May 3, 2022

HN actually does support markdown code blocks, you just have to preface lines with four (nope, two)[1] spaces instead of the (newer) triple-backtick codefences.

  ID,NAME,DATE
  "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"
  867,Alice,1984-01-09
  5309,Bob,1981-11-16

[1] apparently it's actually two spaces instead of the normal markdown standard of four, making HN even more non-standard than usual

https://news.ycombinator.com/formatdoc

dossy · on May 5, 2022

Yup, I noticed that and my Level of Care was insufficient to go and edit my comment a second time to fix it. But, thanks for confirming that it could have worked!

syntheweave · on May 3, 2022

I spent a while making a binary format for tabularish documents, and even started on an editor for it. What I decided on after some long months of gradual iteration was to give each cell its own header that could contain various forms of type info, flags, and modes, and to define a cell type that described forms of break (space, line, page, etc. - a 16-bit range of break types could be encoded). The document header also described a dictionary mapping for the data so that it could immediately be presented to the editor in a readable form.

But now I just use plain old spreadsheets to do things - I obsoleted my own tech, although I like certain things about it. The editing and storage encoding isn't really the problem so much as the integrity of the literals, which a solution like the regex idea could accommodate.

I do think that CSV would benefit by having a header area that described the encoding of breaks in cells and lines. Maybe that's the only thing that really needs fixing in it. And if it included arbitrary break levels like my thing and dropped the rectangular row-column shape, it would cover a huge number of documents.

zaidw · on May 3, 2022

We can blame CSV, or we can blame the way people use CSV. Either way CSV is so unreliable that I try to “fail-fast” as soon as possible in automated pipeline.

At work, we explicitly define data structuring process, converting CSV to Parquet with strict schema and technical/structural validation. We assign interns and new grad engineers for this, which is nicely within their capabilities too with minimal training.

sundarurfriend · on May 3, 2022

>> CSV is a mess. One quote in the wrong place and the file is invalid.

> That breaks the other formats too, why pick on CSV?

I think it's perhaps badly worded, but the implied (and more important) criticism seems to me to be that CSV makes this kind of error much more likely, with its handling of quotes. Having worked with CSV files that had commas in the data (and sometimes quotes too), I quickly learned that I should `set readonly` on my editor and only interact with the file through programmatic tools, and give up any notion of it being a plaintext hand-editable data format.

Someone · on May 4, 2022

Indeed, it’s because people use an editor that isn’t designed for editing csv to edit csv.

Every csv files is a text file, but not every text file is a csv file, but people use text editors to edit them.

A csv editor would forbid you from entering that quote, automatically add a matching one, ask you for your intentions, or whatever, but it shouldn’t silently corrupt your file.

A spreadsheet-like UI, but without formulas, automatic single header row, etc. would be a better UI for csv files.

If you use zip to edit a .xlsx or .docx file, you easily get an invalid file, too.

cylinder714 · on May 12, 2022

I think you just described VisiData:

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

https://www.visidata.org/

hermitcrab · on May 3, 2022

if you add an extra comma in a CSV (outside of quoting) then the rest of the cells in that row are off by 1. Which is not good obviously. But if you add an extra quote, then the entire rest of the file is garbage.