Current Transformer models are looking pretty good at complex end-to-end tasks (at least, better than the shallow regression with hand-picked features that ETS probably uses). In a few years, fully end-to-end evaluation may not be so far-fetched, especially with so much data available.