Right, the way I see it, MRCR isn't a retrieval task in the same vein as RULER. It's less about finding one (or more) specific facts and more about piecing together scattered information to work out the ordering of a set of relevant keys. Of course, it's still a fairly simple challenge in the grand scheme of things.
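To make that distinction concrete, here's a rough sketch of what an MRCR-style probe could look like. This is my own illustration under stated assumptions, not the benchmark's actual data or format: the function name, filler text, and "magic word" framing are all invented. The point is that every instance shares the same key phrase, so the question can only be answered by tracking occurrence order.

```python
import random

def ordinal(n: int) -> str:
    # 1 -> "1st", 2 -> "2nd", 11 -> "11th", etc.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def build_mrcr_style_prompt(num_instances: int = 4,
                            filler_chunks: int = 50,
                            seed: int = 0):
    rng = random.Random(seed)
    # Every instance uses the *same* key phrase ("the magic word is ..."),
    # so retrieval alone can't disambiguate -- the model has to track
    # which occurrence is which.
    values = [f"token-{rng.randint(1000, 9999)}" for _ in range(num_instances)]
    needles = [f"Note: the magic word is {v}." for v in values]

    chunks = [f"(filler paragraph {i} about nothing in particular)"
              for i in range(filler_chunks)]
    # Scatter the needles at random positions, preserving their order.
    positions = sorted(rng.sample(range(filler_chunks), num_instances))
    for inserted, (pos, needle) in enumerate(zip(positions, needles)):
        chunks.insert(pos + inserted, needle)

    target = rng.randrange(num_instances)
    question = (f"What was the magic word in the {ordinal(target + 1)} "
                "'magic word' note, counting from the start?")
    return "\n\n".join(chunks), question, values[target]

context, question, answer = build_mrcr_style_prompt()
print(question)           # e.g. "What was the magic word in the 3rd ..."
print("expected:", answer)
```

A RULER-style needle, by contrast, would typically give each fact a unique key, so a plain substring search over the context is enough; here, grepping for "magic word" returns every instance, and only order-tracking separates them.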
LongProc looks like a fantastic test for a different but related problem: getting models to generate long outputs. It seems to measure a skill the others don't. Meanwhile, RULER feels even more artificial than MRCR, since it's almost entirely focused on that simple "find the fact" skill.
But I think you're spot-on with the main takeaway: even the best frontier models are still struggling with long context. The DeepMind team points this out in the paper with that Pokémon example and the MRCR evaluation scores themselves.