The practices identified here make sense. Testing pipelines on a small amount of data to make sure there are no simple typos, then running aggregate statistical tests and specific example-condition tests, will catch many issues.
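For concreteness, here is a minimal, self-contained sketch of those three kinds of checks using scikit-learn on synthetic data; the model, data and thresholds are purely illustrative, and a real suite would exercise the project's own pipeline instead:

    # Sketch only: synthetic data and a toy model stand in for a real pipeline.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)       # simple separable target
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # 1. Smoke test: run on a tiny slice to catch typos and shape bugs early.
    tiny = X[:5]
    assert model.predict(tiny).shape == (5,)

    # 2. Aggregate statistical test: overall accuracy clears a minimum bar.
    assert model.score(X, y) > 0.9                # threshold is an example value

    # 3. Specific example-condition tests: unambiguous inputs get the expected label.
    assert model.predict([[3.0, 3.0]])[0] == 1    # clearly in the positive region
    assert model.predict([[-3.0, -3.0]])[0] == 0  # clearly in the negative region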
The paper "Making Contextual Decisions with Low Technical Debt" https://arxiv.org/pdf/1606.03966.pdf goes deeper. Testing and monitoring deployments are very similar. The idea of shadow testing new models (seeing how their outputs would differ from the production models on real data) has been very important for identifying issues in my experience.
Shadow testing can be generalised to comparing models on historic data, which greatly speeds up evaluation. It is different from cross-validation in that it is not about correctness, just about how different the new output is. It is like the pattern in UX development of a test harness that compares differences in screenshots. If the differences look good, then ship it!
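Something like this, with `old_model`, `new_model` and `historic_inputs` as stand-in names for whatever the project actually uses:

    # Replay historic inputs through both models and summarise how often
    # (and where) the outputs differ: the screenshot-diff analogue.
    def diff_models(old_model, new_model, historic_inputs, show=10):
        old_preds = old_model.predict(historic_inputs)
        new_preds = new_model.predict(historic_inputs)
        changed = [i for i, (a, b) in enumerate(zip(old_preds, new_preds)) if a != b]
        pct = 100 * len(changed) / len(historic_inputs)
        print(f"{len(changed)}/{len(historic_inputs)} predictions changed ({pct:.1f}%)")
        for i in changed[:show]:  # eyeball a sample of the differences
            print(f"  input {i}: {old_preds[i]} -> {new_preds[i]}")
        return changed

If the diff is small and the changed cases look reasonable on inspection, that is a good signal; if not, dig in before shipping.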
Nice article. Agree that this is a nascent but really interesting and important area for machine learning as a discipline. The topics in this article labelled "invariance testing" and "expectation tests" hint at the broader challenge with empirically defined functions: the input domain for a given model can be significantly larger in scope and complexity than the datasets the model has been exercised against during training, testing and validation. The highest performing teams will address this, but I'd hazard a guess that there are many models in production nowadays that aren't instrumented to consider concepts like the extremes and dynamics of their valid input spaces, developer-introduced scope creep as models get more integrated with CI/CD practices, and many other tricky and complex behaviours that can get lost amongst the noise of ML development.
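To make that a bit more concrete, here is a speculative sketch of what such instrumentation could look like: an invariance check (a label-irrelevant perturbation should not flip the prediction) and a crude probe of the extremes of the valid input range. The model interface, noise scale and bounds are all illustrative assumptions.

    import numpy as np

    def check_invariance(model, x, noise_scale=1e-3, trials=20, seed=0):
        """Tiny perturbations that shouldn't matter must not change the prediction."""
        rng = np.random.default_rng(seed)
        baseline = model.predict([x])[0]
        for _ in range(trials):
            perturbed = np.asarray(x, dtype=float) + rng.normal(scale=noise_scale, size=len(x))
            assert model.predict([perturbed])[0] == baseline

    def check_input_extremes(model, lower_bounds, upper_bounds, valid_labels):
        """The model should still return a sane label at the edges of its valid input space."""
        for corner in (lower_bounds, upper_bounds):
            assert model.predict([corner])[0] in valid_labels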
"Testing" in ML as it stands just now is essentially still development & application logic of the learned function, not testing in the same sense as we consider it in other aspects of software. The "post-train" area will need to a see lot of advances if we're to remain confident in our ML models in production (provided they continue to proliferate into more areas of software).
The paper "Making Contextual Decisions with Low Technical Debt" https://arxiv.org/pdf/1606.03966.pdf goes deeper. Testing and monitoring deployments are very similar. The idea of shadow testing new models (seeing how their outputs would differ from the production models on real data) has been very important for identifying issues in my experience.
This can be generalised to comparing models on historic data which greatly speeds up evaluation. This is different from cross-validation as it is not about correctness, just how different the new output is. This is like the pattern in UX development of a test harness that compares differences in a screenhot. If the differences look good, then ship it!