Interesting article. I mostly agree. OP, could I ask a question? You mention 1TB...

jerven · on Feb 7, 2018

Not OP, but I work on a downstream project, and my current boss used to work on EMBL-Bank back in the day. A lot of this stuff is in databases, e.g. Oracle, and I think advanced search was in Teradata.

However, databases are hard to share, so many steps require dumping the database into some interchange format (custom, and often from before the age of XML or JSON; yay for ASN.1 parsing!).

Sharing database dumps is done, but commercial licenses and version mismatches add issues here as well. Remember, EMBL/ENA is older than MySQL. The databases also tend to have the wrong shape for the next downstream step, i.e. the table design follows one workflow, and if your next step works completely differently, you end up with issues. Also, some data can't be published until a certain date, so it needs to be filtered out of the dumps in some way.
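A minimal sketch of that kind of embargo filter, assuming each dumped entry carries an optional hold date. The record layout, the `hold_until` field and the accession values are made up for illustration and are not the real EMBL/ENA schema:

```python
from datetime import date

# Hypothetical record layout: each dumped entry has an optional
# "hold_until" date before which it must not be published.
records = [
    {"accession": "X0001", "hold_until": None},
    {"accession": "X0002", "hold_until": date(2018, 6, 1)},
    {"accession": "X0003", "hold_until": date(2017, 12, 31)},
]

def releasable(entries, today):
    """Yield only entries whose hold date is absent or already reached."""
    for entry in entries:
        hold = entry["hold_until"]
        if hold is None or hold <= today:
            yield entry

# Build the public dump as of a given day.
public = [e["accession"] for e in releasable(records, date(2018, 2, 7))]
print(public)  # ['X0001', 'X0003']
```

In practice the filter would run as part of the dump step itself, so that embargoed entries never reach the shared files.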

Consider as well that this project is three decades old, used to be printed in books at some point, and was shipped on DVD as recently as 2004. File-based operations can be extremely efficient.

luispedrocoelho · on Feb 7, 2018

For some things, we do. But databases are not magical: setting up a good table/index system &c is also work, and there is overhead.

Thus, if we are talking about (for example) a web service where the form of the queries is known a priori, then a database is a good solution. If you have output data from your processing that you will be slicing and dicing in different ways that you cannot predict ahead of time, then databases are not appropriate.
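To make the distinction concrete, here is a small sketch using Python's built-in `sqlite3` module. The `hits` table, its columns and the index name are invented for the example. An index built for the query you know about helps that query, while an ad hoc filter on another column falls back to scanning the whole table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of processing output; names are illustrative only.
conn.execute("CREATE TABLE hits (gene TEXT, sample TEXT, score REAL)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [("geneA", "s1", 0.9), ("geneB", "s1", 0.4), ("geneA", "s2", 0.7)],
)
# Index designed for the one query shape we know ahead of time.
conn.execute("CREATE INDEX idx_hits_gene ON hits (gene)")

def plan(sql, params=()):
    """Return the human-readable steps SQLite plans for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # fourth column is the description

known = plan("SELECT * FROM hits WHERE gene = ?", ("geneA",))
adhoc = plan("SELECT * FROM hits WHERE score > ?", (0.5,))
print(known)  # a search step that mentions idx_hits_gene
print(adhoc)  # a full scan of the table
```

The exact wording of the plan differs between SQLite versions, but the pattern holds: every new way of slicing the data needs either another index or a full scan.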

(Loading terabytes of data into a database takes a while, too.)

xyhopguy · on Feb 7, 2018

Bioinformatics is perpetually ten years behind. The de facto standard for sequencing data is effectively a stripped-down, bzipped plain-text file. It's madness.
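For anyone who hasn't seen it, a rough sketch of what reading such a file looks like, assuming the format meant here is FASTQ (the comment does not name it) with simple four-line records. The two reads are made up:

```python
import bz2
import io

# Two made-up reads in four-line records: header, bases, '+', qualities.
raw = b"@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
compressed = io.BytesIO(bz2.compress(raw))  # stands in for a .bz2 file

def read_records(handle):
    """Yield (name, sequence, quality) from four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'

# bz2.open accepts a file object as well as a path; "rt" decodes to text.
with bz2.open(compressed, "rt") as handle:
    reads = list(read_records(handle))
print(reads)  # [('read1', 'ACGT', 'IIII'), ('read2', 'GGCA', 'FFFF')]
```

There is no index and no schema: to find one read you decompress and walk the file from the start.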