Advanced Scientific Data Format (github.com/asdf-format)
117 points by anigbrowl on Oct 5, 2022 | 121 comments


What timing! I'm just experimenting with some bioinformatics codes, and wow... the formats are terrible. They're mostly sequential text files, some tens of gigabytes in size.

Simply uncompressing a file can be a significant bottleneck, as this can inherently use only 1 CPU core.

Keep in mind that these files are intended to be processed on monstrously huge 128 core and 2TB memory machines! What happens is that the system uses 0.5% of its capacity to manipulate the files while the other 99.5% is heating the data center air.

I'm looking at the CPU usage metric graph now, and the machine is spending half the time at 100% load and half the time at 1% load. That's half the total capacity wasted.

If you find yourself in 2022 or later designing a file format intended for bulk data and you use any of the words "stream", "serialization", or "text", stop. Rethink what you have done, and this time consider that normal machines you can buy for normal money have 256 hardware threads. Soon, 512, and then 1024 in just a few years!

Everything at this scale should be split into blocks for parallel processing. Shoving text files into a .tar.gz archive is not acceptable any more. I don't care what 1970s "standard" it adheres to, that just doesn't matter any more, the hardware has moved on.

I think it's high time that the industry standardised on a generic "container" format to replace legacy archive file formats. Something akin to a random-access archive file like zip, but designed so that even a single file can be decoded in parallel, even if compressed.

As an example of "parallel thinking", CRC or SHA style checksums are automatically a no-go. They're inherently sequential. Instead, Merkle trees would have to be used.
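
A rough sketch of the Merkle idea in Python (block size and hash are arbitrary here; the per-block hashing is the parallel part, since hashlib releases the GIL on large buffers):

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB blocks

    def block_hash(block):
        return hashlib.sha256(block).digest()

    def merkle_root(blocks):
        # Hash every block independently -- embarrassingly parallel.
        with ThreadPoolExecutor() as pool:
            level = list(pool.map(block_hash, blocks))
        # Combine pairwise until one root remains; this part is cheap,
        # since it only hashes 64-byte digests.
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])  # duplicate the last hash on odd-sized levels
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]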

Compression efficiency of small files with random access could be improved by using something like zstd as the internal compression algorithm, but with a shared dictionary stored separately. This would retain the advantages of 'tar' without requiring sequential decoding of files.
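
Roughly what that could look like with the python-zstandard bindings (the dictionary size and toy corpus here are made up; the trainer wants a reasonably large, representative set of samples):

    import zstandard as zstd

    # Train a shared dictionary on a sample of small, similar members.
    samples = [(b"record-%06d\tsome repetitive, structured payload\n" % i) * 8
               for i in range(5000)]
    dictionary = zstd.train_dictionary(16 * 1024, samples)  # 16 KiB dictionary

    # Compress each member independently against the shared dictionary, so any
    # single member can be decompressed on its own (random access), while small
    # members still compress well.
    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    members = [cctx.compress(s) for s in samples]

    # Decompression only needs the member plus the (separately stored) dictionary.
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)
    assert dctx.decompress(members[123]) == samples[123]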

Etc...


Bioinformatics file formats are broken, but it's better to try to understand why they are broken before rushing to fix them.

One key problem is that technology changes quickly. There are always new instruments generating new kinds of data with new properties and new features. People are using that data in new applications.

Software comes many years behind the state of the art. First you need to figure out what is the exact problem the software is supposed to solve. Then you have to solve the problem and turn the prototype into a useful tool. This work is mostly done by researchers who may know a little about software engineering. By the time the situation is stable enough that software engineers who are not active researchers in the field could be useful, it's often too late to change the file formats. There is already too much legacy data and too many tools supporting the established formats.

Another key problem is that the "broken" file formats are often good enough. When you have tabular data where the fields can be reasonably understood as text, a simple TSV-based format often gets the job done. Especially if individual datasets are only tens of gigabytes. By using a custom format, you avoid having to choose from many existing formats that all have their own issues. And that often guarantee you version conflicts and breaking changes in the future.

Also, when it comes to parallelization, it's hard to beat running many independent jobs in parallel. While computers are getting bigger, individual problems are often not, as the underlying biological problems remain the same. In the work I do, a reasonable target system had 32 cores and 256 GB memory in 2015. That's still a reasonable target in 2022. The computers I use have become cheaper and faster, but they have not really changed.


So what prevented the use of HDF or netCDF in bioinformatics? I mean, these are not new formats by any definition of new, and I disagree that the "broken" formats are often good enough; it seems more that the bio/med fields (probably/hopefully not bioinformatics) are still used to processing data in Excel (speaking from my limited interaction with mainly med researchers).


Relative obscurity, most likely. If the formats are not used in bioinformatics, people developing bioinformatics tools are usually not familiar with them. And if developers are not familiar with the formats, they can't make informed decisions about using them. Yet another consequence of researcher-driven software development.


We presented using Parquet formats for bioinformatics 2012/13-ish at the Bioinformatics Open Source Conference (BOSC) and got laughed out of the place.

While using Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] is a good idea, especially with DuckDB, Apache Arrow, etc. supporting Parquet out of the box.

0 - https://github.com/bigdatagenomics/adam

1 - https://github.com/bigdatagenomics/bdg-formats


Maybe column-oriented formats like Parquet never became popular in bioinformatics because new file formats usually come from people developing tools for upstream tasks such as read mapping, variant calling, and genome assembly. They are the ones who work with new kinds of data first.

Those upstream tasks tend to be row-oriented. You often iterate over all rows, do something with them, and output new rows in another format. Alternatively, you read the entire input into in-memory data structures, do something, and later serialize the data structures. Using column-oriented formats for such tasks does not feel natural.


HDF has been around in some capacity in bioinformatics for at least 20 years. For instance the Bioconductor project had a package for it in the early aughts. But as to why it never took off, I suspect the reasons you cite are correct.


>One key problem is that technology changes quickly.

Each of your plateaus of stability often needs to become recognized before the next step can be taken.

With scientific software at the end of the train, the file-type/file-system needs to be well established and more stable than any software could be, and for a lot longer than the whole scientific project itself takes.

A scientific filetype needs to be well-documented (better than ordinary software) and unchanged for long enough so that all agree no further changes are intended. It needs to have already been virtually perfectly stable, for more years than most research projects are likely to have their data remain useful in the future.

netCDF is in this category while still being extensible, and it is very old (well established) if not well known. There is a public domain government "codec" which basically decompresses netCDF to structured text, and in reverse.

Now this new ASDF filetype looks like it does have useful features of its own except one thing:

>ASDF is under active development

Which can still be a drawback in this situation.


HDF is used in single cell genomics


> When you have tabular data where the fields can be reasonably understood as text, a simple TSV-based format often gets the job done. Especially if individual datasets are only tens of gigabytes.

In what other industry is less than 0.5% utilisation accepted as "gets the job done?"

TSV is a terrible format for multi-gigabyte files, because it uses line breaks.

Technically it's possible to parse them in parallel, but if the format enables quoted strings with newlines, then it isn't possible to do this safely.

A simple binary row-based or columnar format could be read orders of magnitude faster. No character-by-character processing needed, just "memcpy" and go.
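
As a toy sketch (with a made-up fixed-record layout of float64 rows), reading a binary table really is just mapping the bytes; any row range is reachable in O(1) and disjoint ranges can go to separate workers:

    import numpy as np

    ROWS, COLS = 1_000_000, 8  # hypothetical layout, known from a header or sidecar file

    # Write a toy binary table once.
    np.random.rand(ROWS, COLS).astype(np.float64).tofile("table.bin")

    # No character-by-character parsing: map the file and slice it.
    table = np.memmap("table.bin", dtype=np.float64, mode="r", shape=(ROWS, COLS))
    chunk = np.asarray(table[250_000:500_000])  # materialize one worker's slice
    print(chunk.mean())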


What I think jltsiren is trying to say is that if you're trying to parallelize within one file, you've already lost. Or your job is tiny, but then who cares.

This matches my experience a decade ago as a rare scientist who could program. When I had to run a job on the giant cluster against our ~100TB dataset, I did not put hundreds of threads to work against one file. We had everything broken up into hundreds of files, so I could run one thread against each file (with a tiny bit of boundary patch-up), and it all just flew. Trying to get it "all in one" would have been an exercise in misery and pain.

This can also be done by simply running many jobs at once. Many (but not all! never all!) scientific analyses naturally work well with that model, so don't fight it on technical purity grounds.

This was all done rather straightforwardly with the horrid piece of radioactive software garbage that is (was? please say was? I dare not check) CERN's ROOT. It had few redeeming characteristics... except for being fast and efficient on massive datasets, once you got it running at all. And that counts for something!


> What I think jltsiren is trying to say is that if you're trying to parallelize within one file, you've already lost.

I disagree, you don't need to split up a file to parallelize things if you just use a moderately recent format.

Put it in parquet, have sensible row groups, and turn on zstd compression. Split it into multiple files if you want but fast access to subsets of files which are neatly compressed is very easy to get now.

You also get the ability to store things losslessly, which you don't get so much without custom work with TSV (main example here is floats) and things like schemas.

As a single data point: I've just tested converting one of my zstd parquet test files, which is 190M, and it comes out to 4.8G as a CSV file.
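
For anyone who wants to try it, the whole recipe is a few lines of pyarrow (file name, column names and row-group size here are arbitrary):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "sample_id": list(range(1_000_000)),
        "depth": [float(i % 300) for i in range(1_000_000)],
    })

    # zstd-compressed Parquet, split into row groups readers can fetch independently.
    pq.write_table(table, "data.parquet", compression="zstd", row_group_size=100_000)

    # Fast access to a subset: one column from one row group, not the whole file.
    pf = pq.ParquetFile("data.parquet")
    print(pf.metadata.num_row_groups)             # 10
    subset = pf.read_row_group(3, columns=["depth"])
    print(subset.num_rows)                        # 100000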


What we're trying to say is that that approach does not work so well if your individual files are 1-2 orders of magnitude larger and your whole analysis dataset (which must be processed in its entirety to answer any question!) is 4-5 orders of magnitude larger.

Yes, not everyone has such large data sets. (And not everyone has such small data sets!) But it is imperative that scientific computing infrastructure developers understand what classes of experiments they are serving, and what classes they can not or should not serve.


I'm not sure I follow. Splitting the file is fine, and very common, but you can absolutely get parallelisation within a file now using common standards. Even your single threaded code can be faster.


The glory of ROOT prevails!

But yes, it is better now than back in the 4.X days or whatever ungodly version it sounds like you used.


Currently working for a bioinformatics data processing company. I've seen, heard and disproved this "many files is better than one big file" argument before... But there is a trick to it...

---

Starting with an example parquet file first, something similar to bioinformatics data I've been working on recently. This is the "fully read into memory as a table / data frame" view of the file

    | File ID        | spectrum attributes                                    | groupings       | numbers and stuff | more numbers as a list | File as raw string |
    | -------------- | ------------------------------------------------------ | --------------- | ----------------- | ---------------------- | ------------------ |
    | 2984704        | ["other_thing", "another_thing", "different_things"]   | group       1   | 329021854.0935902 | [0, 2344, 22, 74, 745] | iw c3lyultrc3l.... |
    | 2984705        | ["other_thing", "another_thing", "more_things"]        | group       2   | 329021854.0934522 | [231, 09, 123, 15, 5]  | sdalkjfh2cn232.... |
    | 2984707        | ["other_thing", "other_thing2", "some_thing", "thing"] | group       2   | 3232518.032532    | [892342, 52, 252, 525] | cnm3247cmo27xm.... |

The "File as raw string" column is the magic one here. It's going to let us do what you did, but without having to manually manage thousands, hundreds of thousands or millions of files.

---

### PROCESSING

It sounds like you distributed your computation (smaller pieces of data, executing on many threads). But it reads like you did it manually (it reads like you wrote the orchestration code yourself, rather than sitting there submitting one file at a time). By bunching everything together in one file you can get the tools to do the work for you (at least with current tech you can, no idea about CERN's ROOT).

For modern business, the data engineering tech stack often uses Apache Spark for the distributed data processing engine / cluster engine [0].

Spark distributes data across the cluster nodes by partitioning your data frames/tables/tabular data. Rows 1-500 are loaded on cluster node 1, rows 501-1000 on cluster node 2, etc. Spark then executes processing in threads on each node. Each partitioned row on a node is passed to each available thread for that node and processed. The results are stored in memory on the node and can be accessed later on for "other stuff"^{TM}.

Remember that I stored the raw file contents in the "rows" of the "table"? Now I can just "load" that file as part of the processing in a single thread [1]. Et Voila! I'm doing exactly what you did, many files being passed to many threads, but I haven't had to orchestrate anything myself.

Spark has done it for me! So the PROCESSING element can become easier, from the human operator orchestrating the processing of many files perspective, when you have one big file...
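
A rough sketch of what that looks like in PySpark (paths, column names and the parse function are placeholders, not the real pipeline):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("raw-file-in-a-column").getOrCreate()

    df = spark.read.parquet("s3://bucket/spectra.parquet")  # hypothetical path

    # Stand-in for whatever per-file work you would otherwise run one file at a time.
    @udf(returnType=IntegerType())
    def parse_raw(raw):
        return len(raw)

    # Spark partitions the rows across the cluster and runs parse_raw per row,
    # so the per-file work is distributed without hand-rolled orchestration.
    result = df.withColumn("parsed", parse_raw(col("file_raw")))
    result.write.parquet("s3://bucket/spectra_parsed.parquet")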

What about LOADING the data?

---

### LOADING

> What I think jltsiren is trying to say is that if you're trying to parallelize within one file, you've already lost

I can store this example data as parquet format because it is tabular. As part of the parquet standard, I can:

- partition the data by column values -- a file per partition based on the "grouping" column

- use row groups -- a file per partition of 10,000 row subsets

The above two mechanisms mean you can parallelise the LOAD of the data as well.
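
For reference, a sketch of the partitioning side with pyarrow (paths and values simplified; this is not the production layout):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({
        "file_id": [2984704, 2984705, 2984707],
        "grouping": ["group_1", "group_2", "group_2"],
        "file_raw": ["iw c3lyultrc3l", "sdalkjfh2cn232", "cnm3247cmo27xm"],
    })

    # One directory per value of the partition column.
    pq.write_to_dataset(table, root_path="spectra_dataset", partition_cols=["grouping"])

    # A reader (or each cluster node) pulls only the partitions it needs.
    dataset = ds.dataset("spectra_dataset", format="parquet", partitioning="hive")
    group2 = dataset.to_table(filter=ds.field("grouping") == "group_2")
    print(group2.num_rows)  # 2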

When you have a 100TB data set, this becomes *really* important to get right to minimise network data shuffles -- where data is being transferred around the cluster nodes because Spark needs to repartition the data across the cluster. If the data is already partitioned as a parquet file then Spark can load in 1x partition of parquet data onto 1x cluster node.

In an ideal case, where there is 1x cluster node for every 1x data partition, your full dataset will load onto the cluster in the same time it takes to load 1x parquet partition.

Network transfer times for 1TB vs 100TB are significantly different, and this approach can significantly reduce loading time when needing to execute different variants of the processing code on the same source data.

---

In summary, I get where you are coming from. But there are tools that do a bunch of magic things these days so we don't need to worry about stuff.

From jltsiren's original post

> Also, when it comes to parallelization, it's hard to beat running many independent jobs in parallel.

Embarrassingly parallel computation is what everyone is talking about here. Me, you, everyone. Spark does it very well. Python's multiprocessing library does it quite well.

The problem is that no-one thinks about loading and/or storing the data in a convenient format to do embarrassingly parallel computation ... they just stick it in a CSV/TSV file.

---

> This was all done rather straightforwardly with the horrid piece of radioactive software garbage that is CERN's ROOT

https://root.cern/releases/release-62606/

It is still being updated .... I agree with your sentiment, I would not use this out of choice after a brief glance at the docs.

---

[0]: The same principles here apply to something simpler like python's multiprocessing library, which is what I applied to gain an 8x speed up in processing times (they were running it single threaded before).

[1]: See the comment about pymzml about why it's not usually "just" that simple...


> The problem is that no-one thinks about loading and/or storing the data in a convenient format to do embarrassingly parallel computation ... they just stick it in a CSV/TSV file.

Independent jobs go beyond what is usually understood as embarrassingly parallel. In a typical bioinformatics workflow, you download the data, process it locally for hours, and send the results back. The ratio of computation to data is high enough that data transfers (and reading/writing data) rarely become a bottleneck. Meanwhile, the natural size of problems is both large enough that handling them takes hours and small enough that they can be handled on commodity servers.

In work like that, distributed systems tend to be overengineered solutions that make everything harder and more expensive. They make it harder to find developers who understand both the technology and the business needs – the biological problems in question. And they make it harder to install the solution in a new environment that may be fundamentally different from the one it was developed in. Especially in the fairly typical case where the person trying to install it does not have administrator rights.


> Independent jobs go beyond what is usually understood as embarrassingly parallel.

Independent = embarrassingly parallel, i.e. CPU bound. You want to apply some f on x to get y ==> y = f(x). You have many cases of x, all with f applied independently to get all the independent y outputs. There is no difference between "independent" and "embarrassingly parallel" in this case.

> In a typical bioinformatics workflow, you download the data, process it locally for hours, and send the results back.

And I'm saying that you do not need to be waiting for hours, if you do the data storage and data processing using sane and modern methods (i.e. not with CERN's ROOT tool).

This is not theory BTW -- I've been doing it for the last month. This is a very real, evidence based observation.

---

I was writing out a bunch of other point by point replies here, but I get the sense that it's more likely going to cause you to dig in to your currently held beliefs rather than open you up to the magic methods used by data engineering teams, so I decided to give up and go eat some pizza.

I enjoy pizza.


I think you are missing the context here.

We are talking about the kind of bioinformatics tools that establish new popular file formats. They are typically developed by researchers rather than software engineers. These tools are intended for end users to install and run on a wide variety of systems. And they are often released before the first high-profile papers on the topic are published. At that point, it's rare to find a software engineer who is familiar enough with the topic to be able to contribute.

Waiting a few hours for the results is not a big deal with these tools, because you are probably going to submit a large number of jobs anyway. Getting the full results will likely take days, regardless of whether you are using naive or modern methods. Because you are probably not allowed to use unlimited compute, your primary constraint is throughput rather than latency. Naive file formats survive, because tools that solve independent problems locally provide similar throughput to state-of-the-art distributed systems.

Bioinformatics, and particularly the subfield that focuses on genomics, is noteworthy because it relies more on freely available open source software than most other academic fields. That may be because the state of the art is changing so rapidly.

Companies sometimes take over old well-established problems. By providing services instead of tools, they can use whatever technology they prefer internally. But you rarely see these companies attacking state-of-the-art problems.


Bioinformatics tools are often used in batch processing pipelines, where the expected running time is hours. The ratio of users to developers is usually low enough that trade-offs between developer time and running time remain meaningful.

If you have a 50-gigabyte TSV file, you can memory-map, index, and validate it in a few minutes. Overhead like that is usually negligible. And if you encounter something weird like quoted strings, you can just print an error message and abort. It's your file format, so you don't have to make your life difficult by supporting unnecessary features.
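
For the curious, "memory-map and index" really is just a few lines (a toy sketch that assumes no quoted newlines, per the point above about rejecting them):

    import mmap

    def index_tsv(path):
        """Return byte offsets of line starts, so any row can be fetched directly."""
        offsets = [0]
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                pos = mm.find(b"\n")
                while pos != -1:
                    offsets.append(pos + 1)
                    pos = mm.find(b"\n", pos + 1)
        return offsets

    # offsets[i]..offsets[i+1] is row i; ranges of rows can be handed to workers.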

That approach still kind of works with hundreds of gigabytes of data. Once you have terabytes, text files become too cumbersome. (Here I'm unfortunately speaking from experience.)


> TSV is a terrible format for multi-gigabyte files, because it uses line breaks.

Put it into a database if you care about these things. But let's not reinvent the wheel, and let's not bikeshed binary file formats, of all things. It's a non-starter. Plain text is perfectly fine for long-term storage and interop, converting to a more effective representation is just a one-time cost.


A lot of data set sizes are best measured in terabytes and petabytes. Many of these file formats must target that mentality, because small things grow up and get big whether anyone planned for that or not. People who can get away with plaintext for their needs probably aren't even reading this.


The big issue is how many tools would need to be rewritten to accommodate the updated file format. For better or worse there is a lot of inertia to the terrible file formats, it's not as simple as "Hey, change".

As an example, consider clinical use cases. It can be near impossible to update the software in the pipelines for these situations. How do they move on from the old tools? Ok, now suppose they need to interact with data from non-clinical sources?

Things like you're proposing can be done. As others have intimated in this thread, solving the technical issue is the easiest part of the chain. More than one Tech Person has walked into the fray assuming the only thing holding stuff back is that no one ever thought to apply Good Tech Solutions. Instead what's necessary are people who can understand why things are the way they are, and the human dynamics at play. Building a better mouse trap requires taking these into account from the start.


I ran into this problem while building my new data management system. I create relational tables by parsing CSV, TSV, or other 'character separated' files to get the row data and then use separate threads to insert each row. Once I had successfully parsed out a row, I could multi-thread the insertion code but only one parser thread could work with the actual file since embedded line breaks could be enclosed within quotes. It only takes one line to break any kind of parallel processing on the file.
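
A tiny illustration of why only one parser can own the byte stream (toy data, standard library only):

    import csv
    import io

    data = 'id\tnote\n1\t"first line\nsecond line"\n2\tplain\n'

    # Naive chunking on newlines sees 4 "rows" -- the quoted field is split in half.
    print(data.count("\n"))  # 4

    # A sequential parser tracks quote state and finds 3 records (header + 2 rows).
    records = list(csv.reader(io.StringIO(data), delimiter="\t"))
    print(len(records))      # 3
    print(records[1][1])     # 'first line\nsecond line'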


You did say yourself that overall utilization was more like 50 percent.

Interesting discussion though. I write some academic code in which I anticipate file load will be a bottleneck when we scale up one day, so I appreciate your thoughts


It's probably a useful reminder that biologists do all kinds of other things besides loading sequence files. For instance, they generate the sequence files and grow the organisms that are getting sequenced.


That's why we need more cooperation between researchers and actual software engineers, and less "I can also code!" thinking from researchers. Yes, they can, but please in their free time or publicly, so that actual software engineers can chime in and stop bad design from happening. Well, coding in public is of course also no guarantee, and neither is hiring a software engineer, but at least the chances are better for something useful to come out of it.


not all coding is software engineering man


yes, we need to replace all scientists with software developers.


You do not seem to grasp the meaning of "cooperation". Or were you trying to comment on another comment?


sorry for this comment, I was just frustrated.


These files are all about persistent storage for many many decades, not sub second processing.

It's not that hard to pre process prior to feeding into whatever the latest processing architecture of the decade is .. but it's hell week when you're tasked with decoding some decades old non standard "only used for 18 months" "best practice at the time" obscure compression format data.

Field separated line orientated ASCII data doesn't require extensive out of band notes to understand or decipher years after you've passed on.


> It's not that hard to pre process prior to feeding into whatever the latest processing architecture of the decade is .. but it's hell week when you're tasked with decoding some decades old non standard "only used for 18 months" "best practice at the time" obscure compression format data.

That's quite disingenuous, there are very well documented data formats that have been around for decades (netCDF, HDF) those are not "obscure" formats and are infinitesimally better than text.

> Field separated line orientated ASCII data doesn't require extensive out of band notes to understand or decipher years after you've passed on.

Field separated ASCII data is terrible because it does not contain relevant metadata (except for column names), you quite possibly lose precision (also, what even was the precision?), it is slow to read, needs to be additionally compressed ...


> (netCDF, HDF) those are not "obscure" formats and are infinitesimally better than text.

so, not much better?


> infinitesimally better

Sounds like they better stick with the TSV format then ;)


"bonus" points:

* text/TSV data files made public as a requirement of published articles may have spaces or dots in column names, missing line ends, non-unique row IDs, etc.

* there is no limit on how many records can be stored in a single file. The latest dbSNP has more than a billion rows.

* a bunch of formats (GFF, GTF, VCF) have several TSV-delimited "proper" columns (one value each) and then a special column where optional fields are piled in a different format, with another separator, field names, etc. Real fun to parse...


I work in proteomics, where the field decided to standardize on XML based file formats. It can take longer to parse the data out of the XML container than it does to actually run the analysis.


Have you tried pigz? It’s an implementation of gzip that enables parallel processing.


For tabular data, parquet with zstd compression is very straightforward and quick. Columnar and split into row groups with stats on the columns within each, easily processed by many tools as standard. You can go further into arrow but parquet alone solves the majority of key issues IMO for tabular data.


Parquet looks good, but HDF5 is good just because it's hierarchical and can group together multiple related datasets -- maybe data taken in the same session on completely different axes/coordinates. This is what I've missed with parquet.


I had started a little bit of work towards that recently: https://github.com/celtera/uvfs

It's very optimized towards my specific needs but could be a basis for what you mention


> CRC style checksums are automatically a no-go. They're inherently sequential.

AFAIK the linearity of CRCs allows them to be split, parallelized, and combined. This is also the case for polynomial MACs, like GHash or Poly1305.


Yep, the only downside for CRC being that the recombination involves some expensive multiplications that are proportional to the total input length.


For Poly1305 you can use square-and-multiply to compute the exponentiation required for recombination, which has logarithmic cost, not linear. Are you sure a similar algorithm can't be applied to CRCs?


Sorry, I didn't phrase it well. It's also log(n) for CRCs via the same algorithm.


The "bioinformatics formats" might be terrible, but they work. In fact, they are meant to be Excel-readable which keeps my collaborators (and me) happy. Coming from a CS /programming background, it is natural to feel the urge to "fix" the formats (<insert relevant XKCD>), until you realise that there are a libraries that easily handle serialization/parsing.

Besides, "bioinformatic formats" is a meaningless word anyway. FASTQ, VCFs, BCL, AIRR-seq -- all different and it just works.


"Just works" and "I've been waiting 15 minutes now for this file to un-gzip" aren't compatible in my book, especially on a computer that should be able to process that file in seconds.

Also, I'd love to see someone open a 75 GB FASTQ file in Excel.


VCF at least typically uses bgzip, which is essentially gzipped sections concatenated, but parallel-unzippable for random access; CRAM is also parallelisable in the same way. Maybe you just don't know the formats and tooling so well? I'm not sure anyone opens a FASTQ directly for viewing anymore, but they will want pileups from a BAM. The problem with bio formats isn't that they're text, it's that they are shit text formats too.


CRAM is a great example for some of the other people in the thread who say "just get a better format". There's been slow uptake in the larger community despite the benefits. For anyone looking to Solve Bioinformatics File Formats, it's important to understand why this is the case.


Nearly all bioinfo tools operate in streaming mode which means line based gzipped formats work great as you can parallelise the processing with reading the file. Nobody ever unzips the whole file before starting to process it.


FASTQ is not for Excel, obviously - although you can still explore it in the shell. Nonetheless operating directly on FASTA/FASTQ files is often a "one-time" preprocessing task. You then serialize the preprocessed data and continue on from there.

FASTA (and its various incantations) are not going anywhere anytime soon.


Excel-readable is a bad thing. I wonder how much data is ignored or misinterpreted because Excel misinterpreted the input to be something else.

(I actually know biologists who have run into this problem.)


https://pubmed.ncbi.nlm.nih.gov/27552985/ estimates that about one fifth of papers with supplementary Excel lists of genes contain mangled gene names. I remember talking about this problem back in 2003. The HGNC has been quietly going around changing the names of some of these genes to try and stop this from being a problem.


Thanks for the pointer. Indeed, a more recent paper (cited below) estimates an even higher error rate (30.9%), but the fact that we are not talking of 0.001% tells me that excel is simply a non-starter for this kind of work. (Actually, this is just one of many reasons why I discourage my students from using excel for any dataset.)

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu, and Mark Ziemann. “Gene Name Errors: Lessons Not Learned.” PLoS Computational Biology 17, no. 7 (July 30, 2021): e1008984. https://doi.org/10.1371/journal.pcbi.1008984.


#927 is a fun quip, but as an actual critique of engineering practices, I'd actually much rather see folks attempt new standards and innovate when they feel like they have an idea which could work better, rather than be discouraged by the perennial "great now we have N+1 standards".

Every standard nowadays, aside from the very first one, is an N+1. Heck, even IFF and ASN.1, the absolute old timers of file/serialization formats, are improvements on "just mmap to disk" application formats.


> The "bioinformatics formats" might be terrible, but they work. In fact, they are meant to be Excel-readable which keeps my collaborators (and me) happy. Coming from a CS /programming background, it is natural to feel the urge to "fix" the formats (<insert relevant XKCD>), until you realise that there are a libraries that easily handle serialization/parsing.

And this is the crux of the issue, people still think excel processing is acceptable practice in 2022. If you are required to publish your analysis code (if you are not yet, it will come, the writing is on the wall), are you just publishing the excel sheets?


I do feel that there is a "missing" simple/basic text format which is somewhere in between csv and json/yaml. csv has no hierarchy, while json/yaml are horribly unreadable for tabular data as they have no concept of columns (see https://csvjson.com/csv2json as an example). Maybe such a thing exists but I haven't seen it.


>Simply uncompressing a file can be a significant bottleneck, as this can inherently use only 1 CPU core.

Look this up: gnu parallel

Also, building on the top reply to what you said: if "uncompressing is a bottleneck" and you didn't know you could uncompress multiple files at once, that suggests you should spend time learning about the tools that exist before you jump in and try to force your colleagues (who likely have much more experience than you) to adopt "new practices" that the JS community will move on from in 6 months.


TAR files can't be uncompressed in parallel. ZIP archives containing a single large file can't be uncompressed in parallel.

The resulting CSV file can't be processed in parallel, as there might be quoted record delimiters.

There are compressed file formats where a single file can be processed in parallel. Apache Avro and Parquet are examples I'm familiar with, but these handle columnar data (i.e. replacing CSV), not large numeric matrices etc.


Of course...I'm assuming (perhaps incorrectly) that they are decompressing many files sequentially without knowing that you can easily run gunzip in parallel using tools like gnu parallel.

The OP says they get to "tens of gigabytes in size." It does not take 15 minutes to decompress a 10GB file on a modern workstation, as they complain in a child comment, unless you decompress multiple files sequentially. I've routinely compressed and decompressed 100GB+ files on 5-year-old workstation-class machines and it takes at most 5-ish minutes in one direction (stress on "at most").


> If you find yourself in 2022 or later designing a file format intended for bulk data and you use any of the words "stream", "serialization",

Nit: I think I know what you are getting at, with blocks streams vs byte streams, but it's kinda hard to design a file format without serialization or byte streams. Not sure how that would work.

> I think it's high time that the industry standardised on a generic "container" format to replace legacy archive file formats.

I have a side project chipping away at just such a thing. It's quite daunting, so if this at all interests anyone, please comment/reach out. I'd love more of an excuse to work on this.

SITO in a nutshell:

- It's all based on msgpack, which does most of the heavy lift for serialization and datatype encoding

- a sito stream comprises blocks, each block is a self-contained, independently decodeable msgpack array object.

- each block is an array of the form (type: smallint, header: optional(hashmap), data: any)

- the type is either a single-byte int, or a packed int indicating a substream id

- the sito primary stream comprises multiple independent substreams

- there's no raw plaintext fields, but there is a plain unicode block which can be used to embed whatever plaintext metadata

- substreams can each have whatever compression/codec they want

- since each stream is a block stream, it's trivial to de/interleave

- for data integrity, I want to do something like block-level CRC/FEC along with per-stream merkle trees but I haven't worked out the details yet

- there are periodic "sync-blocks" which have a magic 8-byte sequence for starting a file, but also throughout the stream, to facilitate re-alignment of read heads

- I've also been toying with the idea of using sqlite as stream indexes and as a general glue to keep track of what's going on (right now, you can arbitrarily start a new substream at any point in the primary stream, so it's hard to tell at the start of a file what's in it, sqlite pre-allocates pages so write heads can go back and update a prior index block)

- nd-arrays are a particularly interesting datatype so there's an emphasis on ergonomics around handling them

I plan on doing a simple PoC at some point soon showcasing SITAR, the sito archive format, with a python tarfile-like interface.
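
To give a feel for the block framing described above, here's a rough toy version (not the actual SITO spec, just msgpack arrays of (type, header, data) written back to back):

    import io
    import msgpack

    BLOCK_TEXT, BLOCK_NDARRAY = 1, 2  # made-up type ids for illustration

    def pack_block(btype, header, data):
        return msgpack.packb([btype, header, data], use_bin_type=True)

    stream = io.BytesIO()
    stream.write(pack_block(BLOCK_TEXT, {"lang": "en"}, "free-form metadata"))
    stream.write(pack_block(BLOCK_NDARRAY, {"dtype": "f8", "shape": [2, 2]},
                            b"\x00" * 32))  # raw array bytes would go here

    # Each block is a self-contained msgpack object, so a reader just iterates.
    stream.seek(0)
    for btype, header, data in msgpack.Unpacker(stream, raw=False):
        print(btype, header, type(data))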


I just updated the github page with my current state of notes, in case folks are curious to some of the details.

https://github.com/xkortex/sito


Echoing my frustration... A few months ago I went through a similar search. IIRC my needs were: binary data storage (not ascii), index-based access on arrays without loading the whole file in memory, being able to attach metadata with the file contents, with interfaces for multiple scientific programming languages (Python, Julia, etc).

I recall going through ASDF, BSDF and a handful of other formats, finally ending up with HDF5 -- which was okay, but not fully satisfactory (I don't recall all my gripes right now).


you're looking for zarr.


I'll argue the corner of bioinformatic file formats here. The main ones are FASTQ, SAM, BAM, VCF, and that format that GATK's DepthOfCoverage spews out.

FASTQ is a plain text format, and is usually gzipped. This is usually not a problem, as the only thing that's going to happen to a FASTQ file is you're going to shove it through an aligner, and a single thread can un-gzip that file fast enough to keep a lot of cores busy doing the alignment.

SAM is basically used for nothing, except very briefly as an output from the aligner before it is promptly converted into BAM.

The BAM format is actually sensible. It's compressed and indexed. There's an alternative format out there called CRAM, which can be a little more efficient, but you need to ensure that some external files are still available in order to decompress it.

VCF (and gVCF) files are text, and they are usually gzipped. Whether they are gzipped or not, they usually have an accompanying index file that allows any section to be accessed without having to sequentially read through. This is possible with the gzipped version because a variant of gzip called bgzip is used that compresses blocks of data, and locations of the start positions of those blocks are stored in the index.
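
That random access is exactly what tabix exposes; for example, with pysam and an already bgzipped and indexed VCF (file name made up):

    import pysam

    # variants.vcf.gz must be bgzip-compressed with a tabix index alongside it,
    # e.g. from: bgzip variants.vcf && tabix -p vcf variants.vcf.gz
    vcf = pysam.TabixFile("variants.vcf.gz")

    # Jump straight to a region without reading the preceding gigabytes.
    for line in vcf.fetch("chr1", 1_000_000, 1_010_000):
        fields = line.split("\t")
        print(fields[0], fields[1], fields[3], fields[4])  # CHROM, POS, REF, ALT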

The DepthOfCoverage file format - OK, I don't have any defence of it. It's just huge. If you're storing the read depth for a single sample, it uses about 24 bytes per location, to store a single number that's usually less than 256. So, a typical file for a whole genome sequencing sample would be around 75GB. It also has no index. A few months ago I decided to write an alternative file format, which delta-encodes then huffman-encodes the number in blocks, and uses about 1.3 bits per location, and includes an index, so that 75GB file is now 0.5GB and is a heck of a lot faster to read.
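
The gist of that encoding, sketched with numpy and with zlib standing in for the real block/Huffman coder (toy depths, not real data):

    import zlib
    import numpy as np

    depth = np.random.poisson(30, size=10_000_000).astype(np.int32)  # toy per-base depths

    # Depths sit in a narrow range, so the deltas are small numbers that compress
    # far better than the raw values (or ~24 bytes of text per position).
    deltas = np.diff(depth, prepend=0)  # first delta is the first depth itself
    packed = zlib.compress(deltas.astype(np.int8).tobytes())  # int8 is plenty for this toy data

    print(len(packed) / len(depth))     # bytes per position, well under 1

    # Decoding is just a cumulative sum over the deltas.
    restored = np.cumsum(np.frombuffer(zlib.decompress(packed), dtype=np.int8).astype(np.int32))
    assert np.array_equal(restored, depth)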

As an aside the GATK DepthOfCoverage is a fairly dire example of slow software. About 7 years ago I wrote my own DepthOfCoverage, which produces the same results but runs about 50 times faster. It wasn't hard. And because the BAM file format is sensible, yes it does access the files in parallel as you suggest.

There are advantages of text file formats. The format is unlikely to be non-readable in 10 years. You can just load it up in less and have a read. The text format doesn't stop it being compressed and indexed and accessed in parallel. But yes, the data could often be stored in a more efficient manner.

Finally, if you're finding that your server is regularly under-utilised, then you aren't doing load-management properly. You should use a queuing system that knows for each job how much RAM and how many CPU threads are used, and therefore how many can be run simultaneously.


> CRC or SHA style checksums are automatically a no-go

CRC is 100% parallelizable FYI. Both in SIMD and also on a block level which can be merged.


There's quite a bit of magic hiding under this. For e.g., try creating a NumPy array with different orderings, and the metadata looks the same:

    import asdf
    import numpy as np

    x = np.array([[1, 2], [3, 4]], order="C")
    y = np.array([[1, 2], [3, 4]], order="F")

    tree = {"x": x, "y": y}
    af = asdf.AsdfFile(tree)
    af.write_to("example.asdf")

and you get in the metadata no distinction between the two arrays even though things like byteorder are included:

    x: !core/ndarray-1.0.0
      source: 2
      datatype: int64
      byteorder: little
      shape: [2, 2]
    y: !core/ndarray-1.0.0
      source: 0
      datatype: int64
      byteorder: little
      shape: [2, 2]

This makes me wonder what it's actually storing - is it actually doing something like pickling the NumPy array?


I looked into this, because my team is considering using this format. I think it's storing array data as-is in straight binary. The type, order, and shape indicators are enough to tell how to recover the binary data.


> I think it's storing array data as-is in straight binary.

You can't infer the stride from the raw binary; a 2-D array [[1, 2], [3, 4]] in C ordering just looks like:

1,2,3,4

and in Fortran ordering it's

1,3,2,4

So there must be additional metadata stored. Maybe it's just using the .npy format internally - https://numpy.org/devdocs/reference/generated/numpy.lib.form...


byteorder != row/column order


I was hoping for some explanation of how this is envisioned to augment or coexist with HDF, a popular existing format for scientific data (but didn't see one in the docs). Did I miss it?


From the introduction: "On the other end of the spectrum, formats such as HDF5 and BLZ address problems with large data sets and distributed computing, but don’t really address the metadata needs of an interchange format. ASDF aims to exist in the same middle ground that made FITS so successful, by being a hybrid text and binary format: containing human editable metadata for interchange, and raw binary data that is fast to load and use. Unlike FITS, the metadata is highly structured and is designed up-front for extensibility." [0]

Frankly, I feel like making the metadata of a scientific data structure human-editable is something of a mis-feature, or at best a non-feature. I use metadata in HDF5 files as a form of provenance tracking and I'd rather there be some friction to editing it.

[0] https://asdf-standard.readthedocs.io/en/1.0.3/intro.html


In the limitations it states:

> While there is no hard limit on the size of the Tree, in most practical implementations it will need to be read entirely into main memory in order to interpret it, particularly to support forward references. This imposes a practical limit on its size relative to the system memory on the machine. It is not recommended to store large data sets in the tree directly, instead it should reference blocks.

I would guess that HDF5 would be the better choice for large datasets. However, I don't quite understand the capital 'Tree' in this sentence and what it means for practical data sets.


Metadata needs that, e.g., h5ad already solves? I think there's quite a bit to improve on for HDF5 (it's very slow), but h5ad adds great ways of managing indices and metadata.


Good thing someone already invented NetCDF to address the metadata needs too...


It is envisioned that this will replace HDF5, at least for astronomy — based on this 2015 paper.

https://www.sciencedirect.com/science/article/pii/S221313371...

I’ve used HDF5 before but not ASDF so I can’t fully evaluate their points one way or the other.


ASDF is also a -different- ASDF scientific data format. There are two!

Adaptable Seismic Data Format: https://asdf-definition.readthedocs.io/en/latest/


I wish people stopped using YAML. It is a terrible, ambiguous, error-prone data format, particularly not suitable for data exchange purposes.


It's also terribly unintuitive, despite trying to be "simple and obvious". A really bad choice.


I just looked it up, looks like YAML no longer parses Norway's country code as a boolean (NO).

Not sure how widespread YAML 1.2 adoption is though.

https://github.com/crdoconnor/strictyaml/issues/186


After dealing with the absolute garbage that is XML I was always happy to see JSON.

I agree YAML is bad and would say JSON is the best we have.

What are you suggesting? I think you have to make a call if you say something is terrible or a bad choice.


I am a fan of the libconfig format: https://hyperrealm.github.io/libconfig/

(it's quite close to json, but supports comments and also integer sizes and such)


XML is a fine markup language for anything that looks like a document, but a very poor configuration format.

JSON is a great text-based exchange format and protocol medium, but extremely poor configuration language (no comments, no references, very restricted syntax for human editing).

Every tool has it uses (for configuration purposes I suggest something like HOCON, or even some superset of INI, like Python’s configparser).

YAML is bad at everything.


YAML is fine at deeply-nested configs (where TOML kind of falls over), support for multi-line strings in a variety of formats is nice, and the ability to use references and anchors makes it good for things like defining a schema with reusable components. Mistakes were of course made with the "friendly" booleans (yes/no unquoted strings).

The Docker Compose and OpenAPI formats are good uses of YAML that would be cumbersome in any other format.

Is HOCON better? Maybe. As far as I recall it doesn't have any affordance for multi-line strings, which I see as a valuable YAML feature. It does have its own merits, though, and is probably a better default than YAML in a lot of projects.

But at least if you want an INI-like format, use TOML instead of an ad-hoc underspecified alternative.


JSON is a subset of YAML anyway. So anything that accepts YAML can deal with JSON, and you can convert one to the other.


>JSON is a subset of YAML anyway.

It isn't, really. A conforming YAML parser will not treat JSON data the same way that a JSON parser will in all cases. [0,1] The only correct way to deal with JSON is using a JSON parser.

[0]https://john-millikin.com/json-is-not-a-yaml-subset

[1]https://news.ycombinator.com/item?id=31406473


Is the "%YAML 1.2" directive necessary? Seems like something that you should be able to turn on in a YAML parsing library.


Wait, can pyyaml actually parse json? I highly doubt it.


YAML has comments, and the sensible subset of YAML is in my opinion a good format compared with all the alternatives.


Also, most of the problems of YAML go away if you defensively quote strings.


I know ASDF is probably old by now but it would have helped if they googled first when they chose the name... for example:

https://asdf.common-lisp.dev/



For those who wonder like me what problems are solved by ASDF (why not use a big JSON file?), here's what I found.

But I'd love a shorter explanation!

"

The following lists our key requirements for a useful format:

• Human readable metadata; format suitable for archives.

• Efficient support for binary data.

• Implicit grouping and organization of metadata and data items.

• Use of a standard format for metadata to leverage existing tools and community.

• Easily extensible features, both for the general standard, and narrower needs.

• Strong validation tools.

• Ability to provide flexible WCS (world coordinate system) models.

• References to common data or metadata without requiring copying.

• Open Source, community controlled.

" Source : https://aspbooks.org/a/volumes/article_details/?paper_id=405...


I’m reminded of Erik Naggum’s epic rant about XML (I’ve misplaced the link): as the cost of compute and storage and network move relative to one another decade over decade, some things go from bad to good idea or vice versa.

These days, your storage-adjacent nodes can burn compute on the fly, you have to try hard to pay for rack-adjacent 10 gig, Snappy and zstd are so fast that SREs at Google and FB turn them on or off under service owners to shave the margins.

You can afford getting out of and back into a custom format for your domain, multiple times.

It’s going to be mmap’d ndarrays before it hits the compute stuff, and maybe libraries could be stronger there, but you can always throw away column names and putting them back is hard.


If it can become as useful as FITS, there shall be much rejoicing...

Now if I could stop twitching from the five years of astronomical ontology analysis, I would feel the world has truly moved on.

The timing of previous efforts was swamped by XML. Trying to maintain a clear distinction between "how" (the file format) and "what" (the content of a file in that format) can get lost in the quest to hang on to the "why" (the intent of the formatted data).

I'm an optimist. There's way more experience now. Consensus around Python has helped immensely in keeping this grounded with the scientists. I have hope.


Are there any non-Python implementations yet? That the format is so coupled to Python is a major weakness of ASDF.


I'm slightly confused about

  squares: !core/ndarray-1.0.0
This is supposed to be a format, but a numpy array is a python concept. So that seems like a weird mismatch. And it is just a block of numbers really, so it seems odd to make it a special python thing.

I also wasn't able to see how to store more complicated data, like a symmetric sparse matrix.


The `ndarray`-concept seems to have caught on across languages for data-analysis. I have happily used `rust-ndarray`[0] for building python extensions in rust.

[0]: https://docs.rs/ndarray/latest/ndarray/


I'm having a hard time clearly seeing what the problems were with other approaches that ASDF is intended to solve.

Are the other solutions not fast enough? If so what is the performance delta using ASDF? Are the other solutions not scalable enough? Likely not, given that ASDF mentions certain scale limitations-- but maybe those limitations are still less limiting than limitations of other solutions?

Whatever the reasons for making ASDF, it would be helpful to describe in the intro "here is a problem that ASDF solves" and "here is a graph (or similar) showing how it solves it better than alternatives".

Without that context, the description of Features, to my naive outsider's view, is not very valuable and I'm left still asking, "why use this instead of other solutions that many engineers already are familiar with?"


Yet Another YAML-based Markup Language... or YAYML for short.


If a file is large enough that it cannot be read into memory in a single pass, it does beg the question if it should be in a single file in the first place. If it is that big, there may end up being multiple simultaneous writers and readers accessing various parts, depending on the use case. Now you are getting into some pretty advanced territory and will eventually need to introduce a WAL log into your file format for consistency (or crude locks). Formats like HDF5 never bothered with this leaving it up to the user to ensure that writers and readers are not accessing the file simultaneously.

A directory on a file system happens to have these capability already. If you absolutely need everything in a single enormous file perhaps consider using Sqlite.


I know that most HPC clusters do have a limit on the number of files per user; this pushes scientific applications to generate big HDF5 files.


Consider the case where you need different parts of the same file from many processes concurrently (e.g. a lookup table). It may fit into memory, but not into memory N times (yes, you can use shared memory etc.). Most files are written once and read many times, so there's no need for synchronization...


Frankly, ASDF is a downgrade from the previous Quick Words Exchanging Raw Text Yesterday spec. :)


Your Undoing Is Our Progress


Beyond what has been mentioned so far (i.e., how is this better than long-standing scientific data formats and tools, e.g., netcdf, hdf, numpy, dask, etc. ?) I would like to know if there is a cloud native data storage aspect to asdf similar to the work the zarr folks are tackling. As data sets grow increasingly large, we move computation to data instead of the other way around. Does asdf meld with that paradigm? Glancing through the docs, I did not see it mentioned, but perhaps I missed it.


I'm not the target audience for this, but it feels like picking a name like "asdf" means when a user has a question, quickly googling for answers is gonna be a real pain.


So, this is like netCDF format?


What are the benefits over HDF5 / netCDF?

Python already has great support for these formats with Dask and Xarray. (Think multidimensional Pandas.)


Funny how, even having left particle physics a decade ago, I keep returning to CERN ROOT [0] when a dataset hits a certain size.

[0] https://root.cern/primer/#file-io-and-data-analysis


This. Python-> Uproot is all the good stuff from the ROOT data format minus the bloat

https://uproot.readthedocs.io/en/latest/basic.html


Uproot really kept CERN ROOT from falling into obscurity.


While it's very nice to have uproot (and I have contributed to it!), unfortunately it isn't able to read all of the ROOT files I generate... probably because I'm not aware of its limitations. (See e.g. https://github.com/scikit-hep/uproot5/issues/586 )

Of course, solving the general problem of all ROOT I/O (which can involve custom serialization or schema migration code in C++) won't work -- and doesn't need to, for most people.


Given that it comes with specific "astronomical" problems in mind (no pun intended), can ASDF be used for other data, for example biomedical data? Would it be a good format, for example, for biosignals?


Why not sqlite?


I love SQL and use it as much as I can, but it is not a good storage format for storing gigabytes of floating point numbers that you want to do linear algebra on with routines written in Fortran or assembly, ideally without having to copy the data.

And that is the exact usecase here.


Looks like this is expected to serve as a rich file format for astronomy data. The creators of the format pick HDF5 as their best choice among other formats. So you may want to consider that.

The closest discussion on the format's raison d'être is here -- "Introduction: why another data format?" - https://www.sciencedirect.com/science/article/pii/S221313371... ... which in turn points people to other papers on this, such as https://www.sciencedirect.com/science/article/pii/S221313371... .

The authors say "There are a few worth considering. We will briefly review the landscape and comment on them. In short, we find significant problems with all of them. If one were to choose the best of them, it would likely be HDF5." .. and go on to list these "significant" problems as --

1. It is an entirely binary format.

2. It is not "self documenting".

3. There is only one implementation due to the complexity.

4. HDF5 does not lend itself to supporting simpler, smaller text-based data files.

5. The HDF5 Abstract Data Model is not flexible enough to represent the structures we need to represent, notably for generalized WCS.

The last seems the most significant to me. Let's see what WCS is about -- https://www.sciencedirect.com/science/article/pii/S221313371...

"WCS objects consist of a sequence of coordinate frames, with a transform definition from one to the next."

.. and they go on to give a "complex" example - https://www.sciencedirect.com/science/article/pii/S221313371...

I'm tempted to reference https://xkcd.com/927/ , but there is nothing wrong about having a specific rich file format for astronomy data. It is just that I think folks coming to the format ought to be told that up front.


It seems to me that Arrow is more mature, more feature complete, more efficient and supported by many libraries.

No?


Arrow doesn't compete with HDF so much, because the latter is a hierarchical collection of multiple datasets.


What is its advantage over parquet?


is there anything other than a python implementation?



