
There is a decent file format for tabular data, and the author dismisses it: parquet.

It's compact, encodes all the common data types well, distinguishes ints from floats (thanks for teaching us how important that is, JSON), stores nulls with a mask instead of a special value, uses column-major order, has compression, speedy random access... it has it all. And it isn't bogged down with legacy cruft (yet).
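To make the "mask instead of a special value" point concrete, here is a minimal sketch in plain Python. It is not Parquet's actual on-disk encoding (as far as I know Parquet expresses nulls through definition levels), and the names `encode_column` and `decode_column` are made up for illustration. The idea is only that validity is stored separately, so no value such as -1 or "" has to be reserved to mean "missing".

```python
from typing import List, Optional, Tuple


def encode_column(values: List[Optional[int]]) -> Tuple[List[bool], List[int]]:
    """Split a nullable column into a validity mask plus the non-null values."""
    mask = [v is not None for v in values]
    data = [v for v in values if v is not None]
    return mask, data


def decode_column(mask: List[bool], data: List[int]) -> List[Optional[int]]:
    """Rebuild the nullable column by walking the mask and consuming values."""
    it = iter(data)
    return [next(it) if valid else None for valid in mask]


# Hypothetical column: -1 and 0 are real values here, not "missing" markers.
column = [3, None, -1, 0, None]
mask, data = encode_column(column)
restored = decode_column(mask, data)
```

Because the mask carries the null information, -1 and 0 survive the round trip as ordinary integers.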

Since you need to look at tabular data outside of a text editor anyway[0], I don't see a ton of benefit to making it a plaintext format. Especially not with the author's suggestion of un-typeable ASCII delimiters. If I can't type it on my keyboard, I may as well be looking at a hex dump of a binary file, because I can't really edit it.

[0] Who among us hasn't experienced the joy of a pull request updating a checked-in CSV file? A mess of ,,,,,,,"Birmingham",,,AL, etc.




Parquet is great, but it's not particularly easy to read or write. The libraries that exist to work with it are few and far between, and those that do exist either have a hundred dependencies or depend on native code (e.g. libarrow). Ease of parsing and writing should certainly be an important dimension of an ideal file format, and Parquet scores extremely low on that front imo.


Parquet is also column-major, which is great for many use cases but bad for others where row-major is more useful. For example, if you want to get just the first x rows.
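A toy sketch of the layout difference, in plain Python with made-up data and hypothetical helper names (`to_columns`, `head_row_major`, `head_column_major`). In a row-major layout the first x rows are one contiguous slice; in a column-major layout you have to visit every column and stitch the rows back together. Real Parquet files are split into row groups, so the cost is less stark than this suggests.

```python
from typing import Dict, List, Sequence, Tuple


def to_columns(rows: Sequence[Tuple], names: Sequence[str]) -> Dict[str, List]:
    """Transpose row-major tuples into a column-major dict of lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def head_row_major(rows: Sequence[Tuple], x: int) -> List[Tuple]:
    """Row-major: the first x rows are a single contiguous slice."""
    return list(rows[:x])


def head_column_major(columns: Dict[str, List], x: int) -> List[Tuple]:
    """Column-major: slice every column, then zip the pieces back into rows."""
    sliced = [col[:x] for col in columns.values()]
    return list(zip(*sliced))


# Toy data, purely illustrative.
rows = [("a", 1), ("b", 2), ("c", 3)]
columns = to_columns(rows, ["name", "n"])

first_two_from_rows = head_row_major(rows, 2)
first_two_from_columns = head_column_major(columns, 2)
```

The flip side is that reading a single column (say, summing `n`) touches one list in the column-major layout but every row in the row-major one.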


Then you want Avro.


Sure, but any new format is going to have the same problems. I think you're right that implementation complexity needs to be considered, but it's not like Word or Excel files or something where you need to replicate bug for bug a format accreted over decades.

Parquet isn't trivial to parse / write but that's probably good imo. CSV is really easy to write, and... that just means everybody does it slightly differently. Being somewhat difficult to interact with encourages people to use a library to centralize a bit, but it's not so complex that someone motivated couldn't write a new implementation in a reasonable amount of time.
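A small illustration of the "everybody does it slightly differently" problem, using only Python's standard library and a made-up input line. A hand-rolled split on commas and the `csv` module disagree as soon as a field is quoted and contains a comma:

```python
import csv
import io

# Hypothetical line: the middle field is quoted and contains a comma.
line = '1,"Birmingham, AL",42\n'

# Hand-rolled approach: split on every comma, ignoring the quoting.
naive = line.rstrip("\n").split(",")

# Library approach: csv.reader understands the quoting convention.
parsed = next(csv.reader(io.StringIO(line)))
```

Here `naive` ends up with four fields (the quoted one is cut in half and keeps its quote characters), while `parsed` has the intended three.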


Unsure if it was added after your comment, but the article does briefly mention Parquet (noting that it's binary, so it can be annoying to view/edit).


Yeah, he mentions it briefly, but kind of dismisses it for bad reasons imo.


Editors such as Notepad++ make it fairly easy to insert US and RS ASCII codes. But not quite as simple as typing a comma or return, obviously.
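For reference, a minimal sketch of what a US/RS-delimited format looks like from code, in plain Python. The function names `dumps` and `loads` are hypothetical and this is not any standard library or existing format. US is 0x1F (unit separator, between fields) and RS is 0x1E (record separator, between rows).

```python
from typing import List

US = "\x1f"  # ASCII unit separator: sits between fields
RS = "\x1e"  # ASCII record separator: sits between records


def dumps(rows: List[List[str]]) -> str:
    """Join fields with US and records with RS; no quoting or escaping."""
    return RS.join(US.join(fields) for fields in rows)


def loads(text: str) -> List[List[str]]:
    """Inverse of dumps; assumes no field contains US or RS itself."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]


# Commas, quotes and newlines inside fields need no special handling.
rows = [["Birmingham, AL", 'say "hi"'], ["line\nbreak", ""]]
encoded = dumps(rows)
decoded = loads(encoded)
```

Parsing is trivial as long as the data never contains the separators themselves, but the result is still something you can't comfortably type or read in most editors, which is the objection raised upthread.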



