Sure. I didn't imply (or didn't mean to imply at least) that I thought MRCR was solved, only pointing out that it's closer to testing raw retrieval than it is testing long range dependency resolution like Longproc does. If retrieval is great but the model still implodes on the downstream task, the benchmark doesn't tell you the whole story. The intent/point of my original comment was that even the frontier models are nowhere near as good at long context tasks than what I see anecdotally claimed about them in the wild.
> The other evals you mention are not necessarily harder than this relatively simple one.
If you're comparing MRCR to for example Longproc, I do think the latter is much harder. Or at least, much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say its a more holistic, granular eval by comparison.
The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.
Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into the service of representing a more realistic long context performance measure.
EDIT: also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If Longproc tanks and you dont have the NIAH/MRCR context, its hard to know what capabilities are regressing. So using both in a hybrid eval approach is valuable in certain contexts. For end users only trying to guage the current inference-time performance, I think evals like RULER and Longproc have a much higher value.
Right, the way I see it, MRCR isn't a retrieval task in the same vein as RULER. It’s less about finding one (or multiple) specific facts and more about piecing together scattered information to figure out the ordering of a set of relevant keys. Of course, it’s still a fairly simple challenge in the grand scheme of things.
LongProc looks like a fantastic test for a different but related problem, getting models to generate long answers. It seems to measure a skill the others don't. Meanwhile, RULER feels even more artificial than MRCR, since it's almost entirely focused on that simple "find the fact" skill.
But I think you're spot-on with the main takeaway, and the best frontier models are still struggling with long context. The DeepMind team points this out in the paper with that Pokemon example and the MRCR evaluation scores themselves.
> The other evals you mention are not necessarily harder than this relatively simple one.
If you're comparing MRCR to for example Longproc, I do think the latter is much harder. Or at least, much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say its a more holistic, granular eval by comparison.
The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.
Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into the service of representing a more realistic long context performance measure.
EDIT: also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If Longproc tanks and you dont have the NIAH/MRCR context, its hard to know what capabilities are regressing. So using both in a hybrid eval approach is valuable in certain contexts. For end users only trying to guage the current inference-time performance, I think evals like RULER and Longproc have a much higher value.