> On a related note, unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.
I'm not sure how I feel about this, given the expectations, culture, and tooling built up around CI. The suggestion blurs the line between an eval score and the usual idea of a unit test, where anything short of a 100% pass rate blocks the merge.
P.S. It is also useful to track regressions on a per-test basis.
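To make that concrete, here's a minimal sketch (the `run_model` function is a hypothetical stand-in for the system under test) of an eval that tolerates a sub-100% pass rate as a product decision, while still failing hard on per-test regressions:

```python
# Minimal sketch: an eval with a tolerated pass rate plus per-test
# regression tracking. `run_model` is a hypothetical stand-in for
# the LLM system being evaluated.

def run_model(prompt: str) -> str:
    # Placeholder: a real harness would call the model here.
    return prompt.upper()

cases = {
    "greeting": ("hello", "HELLO"),
    "farewell": ("bye", "BYE"),
    "tricky":   ("hi there", "HI, THERE"),  # expected to fail
}

PASS_RATE_THRESHOLD = 0.6  # a product decision, not necessarily 1.0

# Per-test results from the last accepted run, keyed by test name.
previous = {"greeting": True, "farewell": True, "tricky": False}

current = {name: run_model(prompt) == expected
           for name, (prompt, expected) in cases.items()}

pass_rate = sum(current.values()) / len(current)
regressions = [name for name in current
               if previous.get(name, False) and not current[name]]

print(f"pass rate: {pass_rate:.0%}, regressions: {regressions}")
assert pass_rate >= PASS_RATE_THRESHOLD
assert not regressions, f"previously passing tests now fail: {regressions}"
```

The overall threshold is the "product decision" part; the regression check is the per-test tracking, and it behaves like a traditional unit test: any previously passing case that fails is a hard failure regardless of the aggregate score.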
AI Evals are systematic frameworks for measuring LLM performance against defined benchmarks, typically involving test cases, metrics, and human judgment to quantify capabilities, identify failure modes, and track improvements across model versions.
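Stripped to its parts, that definition amounts to: test cases, a scoring metric, and an aggregate score you can track across model versions. A hypothetical sketch, using exact-match as the metric (real evals may use model-graded or human judgment instead):

```python
# Hypothetical sketch of the bare parts of an AI eval:
# test cases, a scoring metric, and an aggregate score.

def exact_match(output: str, reference: str) -> float:
    # One common automated metric; returns 1.0 on a match, else 0.0.
    return 1.0 if output.strip() == reference.strip() else 0.0

test_cases = [
    {"prompt": "2+2=", "reference": "4"},
    {"prompt": "Capital of France?", "reference": "Paris"},
]

def evaluate(model, cases) -> float:
    # `model` is any callable prompt -> output; the aggregate score
    # is the number tracked across model versions.
    scores = [exact_match(model(c["prompt"]), c["reference"])
              for c in cases]
    return sum(scores) / len(scores)

# A toy "model" that answers one of the two cases correctly.
toy_model = lambda p: "4" if p == "2+2=" else "London"
print(evaluate(toy_model, test_cases))  # 0.5
```

Everything else in the definition (failure-mode analysis, cross-version comparison) is built on top of scores like this one.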
Maybe it's obvious to some, but I was hoping that page started off by explaining what the hell an AI Eval specifically is.
I can probably guess from context, but I'd love some validation.