
I've been pretty impressed with Parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While Parquet is generally column-oriented, it does support metadata for tables of multiple columns. However, I'm not aware of any format that groups the tables themselves, short of just zipping a bunch of files.

For context, this would be for an application that passes SQLite files around, which naturally gives it good support at the database level of storage. But Parquet is so much faster for some workloads, and compresses so much better.
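
The zipping fallback I have in mind is roughly this (just a sketch using pandas and the standard-library zipfile module; the table names, file names, and data are made up):

    import io
    import zipfile

    import pandas as pd

    # Two example "tables" to bundle together (made-up data).
    tables = {
        "users": pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]}),
        "orders": pd.DataFrame({"order_id": [10, 11], "user_id": [1, 1]}),
    }

    # Write each table to an in-memory Parquet buffer and store
    # them side by side in one zip archive.
    with zipfile.ZipFile("bundle.zip", "w") as zf:
        for name, df in tables.items():
            buf = io.BytesIO()
            df.to_parquet(buf)
            zf.writestr(name + ".parquet", buf.getvalue())

    # Pull a single table back out of the bundle.
    with zipfile.ZipFile("bundle.zip") as zf:
        users = pd.read_parquet(io.BytesIO(zf.read("users.parquet")))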




Is Spark what you're looking for? You can do all sorts of joins, groupings, and aggregations with Parquet file(s) acting as your source(s).
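
Something like this, for instance (a rough PySpark sketch; the file and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-joins").getOrCreate()

    # Each Parquet file acts as a table source.
    users = spark.read.parquet("users.parquet")
    orders = spark.read.parquet("orders.parquet")

    # Join the two sources and aggregate, all through the DataFrame API.
    per_user = (
        orders.join(users, orders.user_id == users.id)
              .groupBy(users.name)
              .count()
    )
    per_user.show()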


You want to search for “DataFrame” libraries.

Another commenter mentioned Spark; pandas is another popular one. I haven't used it, but I think it's lighter weight, whereas Spark is aimed more at large distributed computation (even though it can run locally).

There are a bunch of these tools that let you treat Parquet files as tables, doing joins, aggregations, etc.
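
E.g. with pandas (a minimal sketch, assuming pyarrow is installed as the Parquet engine; the file and column names are illustrative):

    import pandas as pd

    # Each Parquet file is read in as a table (DataFrame).
    users = pd.read_parquet("users.parquet")
    orders = pd.read_parquet("orders.parquet")

    # Join, then aggregate, much as you would with SQL tables.
    merged = orders.merge(users, left_on="user_id", right_on="id")
    orders_per_user = merged.groupby("name").size()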


Arrow is really the future here


Isn’t Apache Arrow an in-memory format that the various DataFrame libraries can standardise on to interact with each other, e.g. for inter-process communication (IPC)?

My understanding is your raw data on disk is still in a format such as Parquet, but when you load that Parquet into your application it’s held as Arrow in memory for processing?
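
I.e. something like this with pyarrow (a tiny sketch; the file name is a placeholder):

    import pyarrow.parquet as pq

    # Parquet on disk is decoded into an Arrow table in memory...
    table = pq.read_table("data.parquet")

    # ...which can then be handed to a DataFrame library, e.g. pandas.
    df = table.to_pandas()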


Arrow also has its own on-disk format called Feather - https://arrow.apache.org/docs/python/feather.html
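
Round-tripping through Feather looks roughly like this (a small sketch using pyarrow; the file name is made up):

    import pyarrow as pa
    import pyarrow.feather as feather

    # Build an Arrow table and persist it as a Feather file on disk.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    feather.write_feather(table, "data.feather")

    # Read it straight back into an Arrow table
    # (pandas.read_feather gives a DataFrame instead).
    back = feather.read_table("data.feather")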



