I've been pretty impressed with Parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While Parquet is generally column-oriented, it does support metadata for tables made up of multiple columns. However, I'm not aware of any format that groups the tables themselves, short of just zipping a bunch of files.
For context, this would be for an application that passes SQLite files around, so it naturally has good support at the database level of storage. But Parquet is so much faster for some workloads, and so much better compressed.
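To make the "zipping a bunch of files" idea concrete, here's a rough sketch (database, table, and file names are all made up) that dumps each SQLite table to its own Parquet file and bundles them; it assumes pandas with a Parquet engine such as pyarrow installed:

    # Sketch: one Parquet file per SQLite table, zipped into a single bundle.
    import sqlite3
    import zipfile
    import pandas as pd

    con = sqlite3.connect("app_data.sqlite")  # hypothetical input database
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]

    with zipfile.ZipFile("app_data.parquet.zip", "w") as bundle:
        for name in tables:
            # Read the whole table into a DataFrame, write it as Parquet,
            # then add the file to the zip archive.
            df = pd.read_sql_query(f"SELECT * FROM {name}", con)
            df.to_parquet(f"{name}.parquet")
            bundle.write(f"{name}.parquet")
    con.close()

It loses the cross-table metadata a real container format would give you, but it keeps Parquet's compression and read speed per table.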
Another commenter mentioned Spark; Pandas is another popular one. I haven't used it myself, but I think it's lighter weight, whereas Spark is aimed more at large distributed computation (even though it can run locally).
There are a bunch of these tools that let you treat Parquet files as tables and do joins, aggregations, etc.
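For example, with pandas (file and column names here are invented), joining two Parquet "tables" and aggregating looks roughly like this:

    import pandas as pd

    # Each Parquet file acts like a table.
    orders = pd.read_parquet("orders.parquet")
    customers = pd.read_parquet("customers.parquet")

    # Join the two tables and aggregate, much like you would in SQL.
    result = (orders.merge(customers, on="customer_id")
                    .groupby("country")["amount"]
                    .sum())
    print(result)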
Isn't Apache Arrow an in-memory format that the various DataFrame libraries can standardise on so they can interact with each other, e.g. via inter-process communication (IPC)?
My understanding is that your raw data on disk is still in a format such as Parquet, but when you load that Parquet into your application it's held as Arrow in memory for processing?
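Something like this is what I picture, using pyarrow (the file name is hypothetical): the data lives on disk as Parquet, and reading it gives you an Arrow table in memory that other DataFrame libraries can consume:

    import pyarrow.parquet as pq

    # Parquet on disk -> Arrow table in memory.
    table = pq.read_table("events.parquet")
    print(table.schema)
    print(table.num_rows)

    # Other libraries can then work from the Arrow data, e.g. pandas:
    df = table.to_pandas()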