There is a decent file format for tabular data, and the author dismisses it: par...

emef · on May 3, 2022

parquet is great but it's not particularly easy to read or write. the libraries that exist to work with it are few and far between, and those that do exist either have a hundred dependencies or depend on native code (e.g. libarrow). certainly an important dimension in an ideal file format should be the ease of parsing/writing it, and parquet gets an extremely low score on that front imo

thayne · on May 4, 2022

Parquet is also column-major, which is great for many use cases but bad for others where row-major is more useful, for example if you want to get just the first x rows.
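
To make the row-major vs column-major point concrete, here is a toy sketch in plain Python. It is not how Parquet actually lays out bytes (real files add row groups, pages, encodings and compression), and the table and function names are made up for illustration. It only shows why "first x rows" is a slice in one layout and a stitch across every column in the other.

```python
# Toy illustration only: hypothetical data and helper names, not Parquet's real layout.
rows = [
    {"id": 1, "name": "ada", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.0},
    {"id": 3, "name": "cy", "score": 8.25},
]

# Row-major: each record's fields sit next to each other.
row_major = [(r["id"], r["name"], r["score"]) for r in rows]

# Column-major: each column's values sit next to each other.
column_major = {key: [r[key] for r in rows] for key in rows[0]}


def first_rows_row_major(data, n):
    # One contiguous slice gives complete records.
    return data[:n]


def first_rows_column_major(columns, n):
    # Every column has to be touched, then the pieces zipped back into records.
    return list(zip(*(values[:n] for values in columns.values())))


def column_sum_column_major(columns, name):
    # The flip side: an aggregate over one column reads only that column.
    return sum(columns[name])
```

Both layouts return the same first two records, but the column-major version has to visit all three columns to build them, while summing `score` only reads one list.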

adammarples · on May 4, 2022

Then you want avro

habitue · on May 4, 2022

Sure, but any new format is going to have the same problems. I think you're right that implementation complexity needs to be considered, but it's not like Word or Excel files or something where you need to replicate bug for bug a format accreted over decades.

Parquet isn't trivial to parse / write but that's probably good imo. CSV is really easy to write, and... that just means everybody does it slightly differently. Being somewhat difficult to interact with encourages people to use a library to centralize a bit, but it's not so complex that someone motivated couldn't write a new implementation in a reasonable amount of time.
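
A small sketch of the "everybody does it slightly differently" problem, using only Python's standard `csv` module. The sample row and the helper names `write_row` / `read_rows` are made up for illustration. The same record is written under two common conventions and then read back with the wrong assumption.

```python
import csv
import io

# Hypothetical sample record with an embedded quote and an embedded comma.
row = ["ada", 'said "hi"', "a,b"]


def write_row(fields, **fmt):
    # Serialize one record with the given csv formatting parameters.
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerow(fields)
    return buf.getvalue()


def read_rows(text, **fmt):
    # Parse text back into a list of records with the given parameters.
    return list(csv.reader(io.StringIO(text, newline=""), **fmt))


# Convention 1: the csv module defaults (comma delimiter, CRLF line ending).
default_out = write_row(row)

# Convention 2: semicolon delimiter and LF line ending.
semicolon_out = write_row(row, delimiter=";", lineterminator="\n")

# Reading the semicolon file as if it were comma-delimited silently
# yields different fields instead of raising an error.
misread = read_rows(semicolon_out)
```

Each output round-trips fine when the reader is told the matching settings, but `misread` comes back as a different record with no error, which is the failure mode a single shared library tends to prevent.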

noxvilleza · on May 4, 2022

Unsure if it was added after your comment, but there is a brief mention of parquet (namely that it's binary, so it can be annoying to view/edit).

habitue · on May 4, 2022

Yeah, he mentions it quickly, but kind of dismisses it for bad reasons imo.

hermitcrab · on May 3, 2022

Editors such as Notepad++ make it fairly easy to insert the ASCII US (unit separator, 0x1F) and RS (record separator, 0x1E) codes. But not quite as simple as typing a comma or return, obviously.
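
For anyone who hasn't seen the US/RS idea in practice, a minimal sketch in Python follows. The `dumps` / `loads` names and the choice to reject fields containing the separators are my own assumptions, not part of any standard. The point is just that commas, quotes and newlines inside fields need no escaping at all.

```python
US = "\x1f"  # ASCII unit separator: goes between fields
RS = "\x1e"  # ASCII record separator: goes between records


def dumps(rows):
    # Hypothetical helper: join fields with US and records with RS.
    for record in rows:
        for field in record:
            if US in field or RS in field:
                # Assumption: refuse such fields rather than invent an escape scheme.
                raise ValueError("field contains a separator character")
    return RS.join(US.join(record) for record in rows)


def loads(text):
    # Hypothetical helper: inverse of dumps for non-empty input.
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]


table = [
    ["name", "note"],
    ["ada", "likes, commas"],
    ["bob", 'multi\nline "quoted" text'],
]
encoded = dumps(table)
decoded = loads(encoded)
```

`decoded` equals `table` with no quoting rules involved. The cost is the one mentioned above: the separators are invisible and awkward to type in most editors.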