
HDF5 is often used in scientific computing for this.

https://en.wikipedia.org/wiki/Hierarchical_Data_Format



You could just use SQLite then. Very compact, highly popular (in a different role, though). I've seen it used for large datasets, e.g. map tiles (millions of JPEG files). Much smaller than a zip or tar archive, indexed, and fast.
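A minimal sketch of that pattern using Python's built-in `sqlite3` module (the table schema and the placeholder tile bytes are made up for illustration; a real tile store would insert actual JPEG data):

```python
import sqlite3

# In-memory database for the sketch; a real archive would use a file path.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tiles (z INT, x INT, y INT, data BLOB, PRIMARY KEY (z, x, y))"
)

# Store a few "tiles" (placeholder bytes standing in for real JPEGs).
tiles = [
    (0, 0, 0, b"\xff\xd8jpeg-bytes-0"),
    (1, 0, 1, b"\xff\xd8jpeg-bytes-1"),
]
con.executemany("INSERT INTO tiles VALUES (?, ?, ?, ?)", tiles)

# Indexed lookup of a single tile -- no unpacking of an archive needed.
(data,) = con.execute(
    "SELECT data FROM tiles WHERE z = 1 AND x = 0 AND y = 1"
).fetchone()
print(data)  # b'\xff\xd8jpeg-bytes-1'
```

The primary key gives you the index for free, which is what makes random access fast compared to scanning a tar or zip archive.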

P.S.

  sqlite> .mode csv
  sqlite> .import city.csv cities


Cool, I didn't know about .mode and .import. Super handy tip.


You can also use

    .mode csv
    .headers on
    .output file.csv
Then run a query and its results get written to file.csv.
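For doing the same export from a script rather than the sqlite shell, a rough Python equivalent (the `cities` table and its columns are illustrative):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (name TEXT, population INT)")
con.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Oslo", 709000), ("Bergen", 286000)],
)

# Equivalent of .mode csv / .headers on / .output file.csv, then a query.
buf = io.StringIO()  # swap for open("file.csv", "w", newline="") to write to disk
cur = con.execute("SELECT * FROM cities ORDER BY population DESC")
writer = csv.writer(buf)
writer.writerow([col[0] for col in cur.description])  # header row
writer.writerows(cur)
print(buf.getvalue())
```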


> This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource.

That seems like a much higher level of complexity than CSV or the other options listed in TFA (perhaps comparable to Excel).


NetCDF4 (built on top of HDF5, largely through sub-setting) is considerably more powerful than Excel/LibreOffice. It's also easy to access through widely available libraries. I frequently use the Python `netCDF4` library (yes, it really is capitalized that way) for exploratory work.


Single-cell RNA-seq data is often stored in Loom, which is an HDF5-based format.

https://linnarssonlab.org/loompy/

It's a little weird at first, but it's a great format and has libraries in a lot of major languages. It stores a sparse matrix, which cuts the size down a lot.

https://linnarssonlab.org/loompy/format/index.html
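A toy illustration of why sparse storage cuts the size down (a pure-Python dict-of-keys sketch; Loom itself stores the matrix in HDF5's compressed chunked datasets, not this scheme):

```python
# A 1000x1000 count matrix with only three nonzero entries, loosely
# mimicking how sparse single-cell expression matrices tend to be.
shape = (1000, 1000)
nonzero = {(0, 0): 5, (12, 997): 1, (999, 3): 2}  # (row, col) -> count

def get(i, j):
    """Read an entry: missing keys are implicit zeros."""
    return nonzero.get((i, j), 0)

dense_entries = shape[0] * shape[1]  # 1,000,000 values stored densely
sparse_entries = len(nonzero)        # 3 values stored sparsely
print(get(12, 997), get(5, 5))       # 1 0
print(dense_entries, "vs", sparse_entries)
```

Only the nonzero coordinates and values are kept, so storage scales with the number of nonzeros rather than the full matrix dimensions.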


HDF5 has some limitations that make it suboptimal for cloud-based storage systems.

Zarr overcomes these limitations for array data and Parquet overcomes these limitations for tabular data.


The OP wants a text-based format; he doesn't care about what's optimal.



