
HDF5 is often used in scientific computing for this.

https://en.wikipedia.org/wiki/Hierarchical_Data_Format



You could just use SQLite then. Very compact, highly popular (in a different role, though). I've seen it used for large datasets, e.g. map tiles (millions of JPEG files). Much smaller than a zip or tar archive, indexed, and fast.
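A minimal sketch of that pattern using Python's built-in `sqlite3` module (the table schema and the placeholder tile bytes are made up for illustration; a real tile store would insert actual JPEG data):

```python
import sqlite3

# In-memory database for the sketch; a real archive would use a file path.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tiles (z INT, x INT, y INT, data BLOB, PRIMARY KEY (z, x, y))"
)

# Store a few "tiles" (placeholder bytes standing in for real JPEGs).
tiles = [
    (0, 0, 0, b"\xff\xd8jpeg-bytes-0"),
    (1, 0, 1, b"\xff\xd8jpeg-bytes-1"),
]
con.executemany("INSERT INTO tiles VALUES (?, ?, ?, ?)", tiles)

# Indexed lookup of a single tile -- no unpacking of an archive needed.
(data,) = con.execute(
    "SELECT data FROM tiles WHERE z = 1 AND x = 0 AND y = 1"
).fetchone()
print(data)  # b'\xff\xd8jpeg-bytes-1'
```

The primary key gives you the index for free, which is what makes random access fast compared to scanning a tar or zip archive.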

P.S.

  sqlite> .mode csv
  sqlite> .import city.csv cities


Cool, I didn't know about .mode and .import. Super handy tip.


You can also use

    .mode csv
    .headers on
    .output file.csv
Then run a query and its results get written to file.csv.
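For doing the same export from a script rather than the sqlite shell, a rough Python equivalent (the `cities` table and its columns are illustrative):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (name TEXT, population INT)")
con.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Oslo", 709000), ("Bergen", 286000)],
)

# Equivalent of .mode csv / .headers on / .output file.csv, then a query.
buf = io.StringIO()  # swap for open("file.csv", "w", newline="") to write to disk
cur = con.execute("SELECT * FROM cities ORDER BY population DESC")
writer = csv.writer(buf)
writer.writerow([col[0] for col in cur.description])  # header row
writer.writerows(cur)
print(buf.getvalue())
```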


> This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource.

That seems like a much higher level of complexity than CSV or the other options listed in TFA (perhaps comparable to Excel).


NetCDF4 (built on top of HDF5, largely through sub-setting) is considerably more powerful than Excel/LibreOffice. It's also easy to access through widely available libraries. I frequently use the Python `netCDF4` library (yes, it really is capitalized that way) for exploratory work.


Single-cell RNA-seq data is often stored in Loom, which is an HDF5-based format.

https://linnarssonlab.org/loompy/

It's a little weird at first, but it's a great format and has libraries in a lot of major languages. It stores a sparse matrix, which cuts the size down a lot.

https://linnarssonlab.org/loompy/format/index.html
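A toy illustration of why sparse storage cuts the size down (a pure-Python dict-of-keys sketch; Loom itself stores the matrix in HDF5's compressed chunked datasets, not this scheme):

```python
# A 1000x1000 count matrix with only three nonzero entries, loosely
# mimicking how sparse single-cell expression matrices tend to be.
shape = (1000, 1000)
nonzero = {(0, 0): 5, (12, 997): 1, (999, 3): 2}  # (row, col) -> count

def get(i, j):
    """Read an entry: missing keys are implicit zeros."""
    return nonzero.get((i, j), 0)

dense_entries = shape[0] * shape[1]  # 1,000,000 values stored densely
sparse_entries = len(nonzero)        # 3 values stored sparsely
print(get(12, 997), get(5, 5))       # 1 0
print(dense_entries, "vs", sparse_entries)
```

Only the nonzero coordinates and values are kept, so storage scales with the number of nonzeros rather than the full matrix dimensions.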


HDF5 has some limitations that make it suboptimal for cloud-based storage systems.

Zarr overcomes these limitations for array data and Parquet overcomes these limitations for tabular data.


The OP wants a text-based format; he doesn't care about what's optimal.



