
I've been pretty impressed with Parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While Parquet is generally column-oriented, it does support metadata for tables of multiple columns. However, I'm not aware of any format that groups the tables themselves, short of just zipping a bunch of files.

For context, this would be for an application that passes SQLite files around, which naturally gives it good support at the database level of storage. But Parquet is so much faster for some workloads, and compresses so much better.
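
The zipping fallback I have in mind is roughly this (just a sketch using pandas and the standard-library zipfile module; the table names, file names, and data are made up):

    import io
    import zipfile

    import pandas as pd

    # Two example "tables" to bundle together (made-up data).
    tables = {
        "users": pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]}),
        "orders": pd.DataFrame({"order_id": [10, 11], "user_id": [1, 1]}),
    }

    # Write each table to an in-memory Parquet buffer and store
    # them side by side in one zip archive.
    with zipfile.ZipFile("bundle.zip", "w") as zf:
        for name, df in tables.items():
            buf = io.BytesIO()
            df.to_parquet(buf)
            zf.writestr(name + ".parquet", buf.getvalue())

    # Pull a single table back out of the bundle.
    with zipfile.ZipFile("bundle.zip") as zf:
        users = pd.read_parquet(io.BytesIO(zf.read("users.parquet")))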




Is Spark what you're looking for? You can do all sorts of joins, groupings, and aggregations with Parquet file(s) acting as your source(s).
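
Something like this, for instance (a rough PySpark sketch; the file and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-joins").getOrCreate()

    # Each Parquet file acts as a table source.
    users = spark.read.parquet("users.parquet")
    orders = spark.read.parquet("orders.parquet")

    # Join the two sources and aggregate, all through the DataFrame API.
    per_user = (
        orders.join(users, orders.user_id == users.id)
              .groupBy(users.name)
              .count()
    )
    per_user.show()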


You want to search for “DataFrame” libraries.

Another commenter mentioned Spark; pandas is another popular one. I haven't used it, but I think it's lighter weight, whereas Spark is aimed more at large distributed computation (even though it can run locally).

There are a bunch of these tools that let you treat Parquet files as tables, doing joins, aggregations, etc.
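
E.g. with pandas (a minimal sketch, assuming pyarrow is installed as the Parquet engine; the file and column names are illustrative):

    import pandas as pd

    # Each Parquet file is read in as a table (DataFrame).
    users = pd.read_parquet("users.parquet")
    orders = pd.read_parquet("orders.parquet")

    # Join, then aggregate, much as you would with SQL tables.
    merged = orders.merge(users, left_on="user_id", right_on="id")
    orders_per_user = merged.groupby("name").size()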


Arrow is really the future here


Isn’t Apache Arrow an in-memory format that the various DataFrame libraries can standardise on to interact with each other, e.g. for inter-process communication (IPC)?

My understanding is your raw data on disk is still in a format such as Parquet, but when you load that Parquet into your application it’s held as Arrow in memory for processing?
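
I.e. something like this with pyarrow (a tiny sketch; the file name is a placeholder):

    import pyarrow.parquet as pq

    # Parquet on disk is decoded into an Arrow table in memory...
    table = pq.read_table("data.parquet")

    # ...which can then be handed to a DataFrame library, e.g. pandas.
    df = table.to_pandas()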


Arrow also has its own on-disk format called Feather - https://arrow.apache.org/docs/python/feather.html
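
Round-tripping through Feather looks roughly like this (a small sketch using pyarrow; the file name is made up):

    import pyarrow as pa
    import pyarrow.feather as feather

    # Build an Arrow table and persist it as a Feather file on disk.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    feather.write_feather(table, "data.feather")

    # Read it straight back into an Arrow table
    # (pandas.read_feather gives a DataFrame instead).
    back = feather.read_table("data.feather")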



