This might work for CERN. But a great deal of science is done by small teams who...

graemep · on Jan 6, 2024

If you cannot model using decent code is it worth writing models at all? What if bugs mean the model is simply wrong?

It has consequences too. There has been a lot of argument about how much impact the poor code quality of the Imperial college covid epidemiology model (which was the basis of British government policy during the pandemic) had on its accuracy. I do not know how bad it was, but it cannot be good the code was bad.

ben_w · on Jan 6, 2024

One problem is that it's really hard to tell when you've just written bad code, which is also a problem for people whose job title is software developer, not just people who do it as a small part of their overall work.

Some genes have been renamed because Excel interprets the old names as dates. The people who put all their genetic analysis into Excel had no reason to expect that, just as the people writing Excel itself weren't expecting the app to be used like this.

Cacti · on Jan 6, 2024

Yeah but, in your example. this is a bookkeeping issue that, while frustrating and time consuming and costly, is just that. It’s not like a gas line in Manhattan blew up because someone in Toledo hit C-v in Excel. The scientists swapped around excel files and imported stuff without checking, which then was fed into other systems. A clusterfuck but one that is a daily occurrence at, minimally, every major non-tech company. It eventually gets unfucked with human labor or it simply wasn’t important in the first place. Exact same thing happens in research and academia.

Point being, unknown unknowns are just that. But most unknowns are known and can be programmed defensively against for most serious use cases. All major fields are like this—like, you can hook up a car battery to light a menthol tank to boil two cups of water… or we can use a kettle. Perhaps for a brief point in time, due to our ignorance or just history, people lit containers of menthol on fire like it was sane, but that doesn’t mean it was, or is.

collyw · on Jan 6, 2024

Given Ferguson's track record of being out by orders of magnitude on absolutely everything beforehand, it almost seems like he was chosen to give an over the top estimate.

mike_hearn · on Jan 6, 2024

It would be comforting to believe that because it'd mean there were other epidemiologists who were right but ignored. Go read the works by his counterparts though, and they're all out by similar orders of magnitude.

graemep · on Jan 6, 2024

It is possible. It is certainly common to pick the expert who says what you want them to - and to get rid of experts who say the "wrong thing" (e.g. the dismissal of UK govt drug policy advisor David Nutt).

mike_hearn · on Jan 6, 2024

> I do not know how bad it was, but it cannot be good the code was bad.

I do know because I reviewed the code and its issue tracker extensively. I then wrote an article summarizing its problems that went viral and melted the server hosting it.

The Imperial College code wasn't merely "bad". It was unusable. It produced what were effectively random numbers distributed in a way that looked right to the people who wrote it (in epidemiology there is no actual validation of models, reinforcing the researcher's prior expectations is considered validation instead). The government accepted the resulting predictions at face value because they looked scientific.

In the private sector this behavior would have resulted in severe liability. ICL's code was similar to the Toyota engine control code.

A selection of bug types in that codebase: buffer overflows, heap corruption, race conditions, typos in PRNG constants, extreme sensitivity to what exact CPU it was run on, and so on. The results changed completely between versions for no scientific reason. The bugs mattered a lot: the variation in computed bed demand between runs was larger than the entire UK emergency hospital building program, just due to bugs.

The program was originally a 15,000 line C file where most variables had single letter names and were in global scope. The results were predictable. In one case they'd attempted to hand write a Fischer-Yates shuffle (very easy, I used to use it as an interview question), but because their coding style was so poor they got confused about what variable 'k' contained and ended up replacing the contents of an array meant to contain people's ages with random junk from the heap.

There were tests! But commented out, because you can't test a codebase that overrun with non-determinism bugs.

The biggest problem was the attitudes it revealed within academia. Institutionalized arrogance and stupidity ruled the day. Bug reports were blown off by saying that they didn't matter because the "scientists" just ran their simulation lots of times and took the average. Professional programmers who pointed out bugs were told they had no right to comment because they weren't experts. ICL administration claimed all criticism was "ideological" or was irrelevant because "epidemiology isn't a subfield of computer science". Others were told that they shouldn't raise the alarm, because otherwise scientists would just stop showing their code for peer review. Someone claimed the results must have been correct because bugs in C programs always cause crashes and the model didn't crash. One academic even argued it was the fault of the software industry, because C doesn't come with "warning labels"!

The worst was the culture of lying it exposed. The idea you can fix software bugs by just running the program several times is obviously wrong, but later it turned out that their simulation was so slow they didn't even bother doing that! It had been run once. They were simultaneously claiming determinism bugs didn't matter whilst also fixing them. They claimed the software had been peer reviewed when it never had been. The problems spanned institutions and weren't specific to ICL, as academics from other universities stood up to defend them. The coup de grace: ICL found an academic at Cambridge who issued a "code check" claiming in its abstract that in fact he'd run the model and got the same results, so there were no reproducibility problems. The BBC and others ran with it, saying the whole thing was just a storm in a teacup and actually there weren't any problems. In reality the code check report went on to admit that every single number the author had got was different to those in Report 9, including some differences of up to 25%! This was considered a "replication" by the author because the shape of the resulting graph was similar.

That's ignoring all the deep scientific problems with the work. Even if the code had been correct it wouldn't have yielded predictions that came close to reality.

Outside of computer science I don't believe science can be trusted when software gets involved. The ICL model had been hacked on for over a decade. Nobody had noticed or fixed the problems in that time, and when they were spotted by outsiders, academia and their friends in the media collectively closed ranks to protect Prof Ferguson. Academia has no procedures or conventions in place to ensure software is correct. To this day, nothing was ever done and no fault was ever admitted. There was a successful coverup and that was the end of it.

Again: in the private sector this kind of behavior would yield liabilities in the tens of millions of dollars range, if not worse.

IlliOnato · on Jan 6, 2024

Wow. It's pretty unbelievable. It there a place where I can read the whole article?

The one I found the funniest/crziest is "Bug reports were blown off by saying that they didn't matter because the "scientists" just ran their simulation lots of times and took the average", because this is exactly how some scientists I know think.

This thinking is not limited to software. My father was by trade involved in building experimental apparatus ("hardware") for scientific experiments. Often they were designed by scientists themselves. He told me about absurd contraptions which never could measure what they were intended to measure, and extreme reluctance/defensiveness/arrogance he often met when trying to report it and give some feedback...

mike_hearn · on Jan 6, 2024

Yeah confusion between simulation and reality can be observed all over the place. Multiple runs can be needed if you're doing measurements of the natural world, but for a simulation that doesn't make sense (you can do Monte Carlo style stuff, but that's still replicable).

You could see the lines being blurred in other ways. Outputs of simulations would be referred to as "findings", for example, or referenced in ways that implied empirical observation without it being clear where they came from unless you carefully checked citations.

Here are some of the articles I wrote about what happened (under a pseudonym)

https://dailysceptic.org/2020/05/06/code-review-of-fergusons...

https://dailysceptic.org/2020/05/09/second-analysis-of-fergu...

https://dailysceptic.org/2020/06/11/how-replicable-is-the-im...

After that people started sending me non-Imperial models to look at, which had some similar problems:

https://dailysceptic.org/2020/08/08/schools-paper/

I don't write for that website anymore, by the way. Back then it was called Lockdown Sceptics and was basically the only forum that would publish any criticism of COVID science. Nowadays it's evolved to be a more general news site.

IlliOnato · on Jan 6, 2024

Thank you! This is very interesting.

> Outputs of simulations would be referred to as "findings"

Yeah, a recent brouhaha about creating (!) a traversable wormhole in a quantum computer comes to mind...

ben_w · on Jan 6, 2024

Eek! That makes the worst code I've ever seen, seem good in comparison.

collyw · on Jan 6, 2024

Aside from the code quality, his models have never been close to accurate on anything.

mike_hearn · on Jan 6, 2024

Right. That's not unique to Ferguson, epidemiology doesn't understand respiratory virus dynamics and doesn't seem particularly curious to learn anymore (I read papers from the 80s which were very different and much more curious than modern papers, not sure though it that's indicative of a trend or just small sample size).

Other models I checked didn't have the same software quality issues though. They tended to use R rather than C and be much simpler. None of them produced correct predictions either, and there were often serious issues of basic scientific validity too, but at least the code didn't contain any obvious bugs.

dash2 · on Jan 6, 2024

How are people supposed to do science without running statistical models?

mike_hearn · on Jan 6, 2024

This is asked in good faith of course, but that question really gets to the heart of what's been corrupting science.

Statistical techniques can be very useful (ChatGPT!) but they aren't by themselves science. Science is about building a theoretical understanding of the natural world, where that theory can be expressed in precise language and used to produce new and novel hypotheses.

A big part of why so much science doesn't replicate is that parts of academia have lost sight of that. Downloading government datasets and regressing them against each other isn't actually science even though it's an easy way to get published papers, because it doesn't yield a theoretical understanding of the domain. It often doesn't even let you show causality, let alone the mechanisms behind that causality.

If you look at epidemiology, part of why it's lost its way is that it's become dominated by what the media calls "mathematicians"; on HN we'd call them data scientists. Their papers are essentially devoid of theorizing beyond trivial everyday understandings of disease (people get sick and infect each other). Thousands of papers propose new models which are just a simple equation overfitted to a tiny dataset, often just a single city or country. The model's predictions never work but this doesn't invalidate any hypothesis because there weren't any to begin with.

How do you even make progress in a field if there's nothing to be refuted or refined? You can fit curves forever and get nowhere.

In psychology this problem has at least been recognized. "A problem in theory" discusses it:

https://www.nature.com/articles/s41562-018-0522-1

dash2 · on Jan 6, 2024

Right, statistical models are not sufficient for science. I agree. But they are necessary. So I recur to my original question.

dguest · on Jan 6, 2024

Most teams at CERN don't have a professional programmer available either. In a few of larger projects (those with a few hundred active contributors) there might be one or two more tech savvy people who profile the code regularly and fix the memory leaks. But few (if any) are professional programmers: most contributors are graduate students with no background in programming.

IlliOnato · on Jan 6, 2024

And this is scary. At least with "high-energy experiments" (like the one that discover Higgs) in colliders, a lot depends on so-called triggers, which dismiss 99.9% of information produced in a collision "on the spot", so that this information is never recorded and analyzed.

They have to: there is way too much information produced. So the triggers try to identify "trivial" events and dismiss them immediately, relaying only the ones that are may be somewhat unusual/unexpected.

Essentially, the triggers are computers with highly specialized programs. Very smart people work on this, and supposedly they figure out problems with triggers before they affect the results of experiments...

dguest · on Jan 6, 2024

The triggers are the most fun part of the experiments!

The composition of teams working on triggers might be a bit of an exception in the "engineer : "data scientist" ratio. Most of the talent is still from a physics background but there's more of an engineering bent where around half the team can probably write performance critical code when they need to. Elsewhere that ratio is much lower.

Determining which data to save is a mix of engineering, algorithms, physics, bits of machine learning, and (for better or worse) a bit of politics. Surprisingly we're always desperate for more talent there.

As you say, the goal is to try to stop problems before they affect the data, but it's not always perfect. Sometimes we discover sampling biases after the data comes in and need to correct for them, and in the worst case we sometimes blacklist blocks of data.