
The practices identified here make sense. Testing pipelines on a small sample of data to catch simple typos, then running aggregate statistical tests and tests against specific example conditions, will catch many issues.
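
As a rough illustration of that workflow (the pipeline interface, column names, and thresholds here are all hypothetical, not from the article):

    import pandas as pd

    def smoke_test(pipeline, df, n=100):
        # Run on a tiny sample first so typos and schema errors fail fast.
        sample = df.sample(n=min(n, len(df)), random_state=0)
        out = pipeline(sample)
        assert not out.empty, "pipeline produced no rows"
        return out

    def aggregate_checks(out):
        # Aggregate statistical tests: scores in range, few missing values.
        assert out["score"].between(0, 1).all(), "scores outside [0, 1]"
        assert out["score"].isna().mean() < 0.01, "too many missing scores"

    def example_condition_checks(pipeline):
        # Specific example conditions: a known input with expected behaviour.
        known_good = pd.DataFrame([{"feature_a": 1.0, "feature_b": "x"}])
        out = pipeline(known_good)
        assert out.loc[0, "score"] > 0.5, "known-good example scored too low"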

The paper "Making Contextual Decisions with Low Technical Debt" https://arxiv.org/pdf/1606.03966.pdf goes deeper. Testing and monitoring deployments are very similar. The idea of shadow testing new models (seeing how their outputs would differ from the production models on real data) has been very important for identifying issues in my experience.

This can be generalised to comparing models on historic data, which greatly speeds up evaluation. It is different from cross-validation: it is not about correctness, just about how different the new output is. It is like the screenshot-diffing test harnesses used in UX development. If the differences look good, then ship it!
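
A minimal sketch of that kind of shadow comparison on historic data (the model interface and the disagreement threshold are assumptions, not anything from the paper):

    import numpy as np

    def shadow_compare(prod_model, candidate_model, historic_X, threshold=0.05):
        # Score the same historic inputs with both models; we only measure
        # how different the outputs are, not which one is "correct".
        prod_out = prod_model.predict(historic_X)
        cand_out = candidate_model.predict(historic_X)
        disagreement = np.mean(prod_out != cand_out)
        print(f"disagreement rate: {disagreement:.3%}")
        # Like a screenshot diff: a small, explainable difference can ship;
        # a large or surprising one means inspecting examples before deploying.
        return disagreement <= threshold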


