I'm still not sold on recall at such large context window sizes. It's easy for an LLM to find a needle in a haystack, but most RAG use cases are more like finding a needle in a stack of needles, and the benchmarks don't really reflect that. There are also speed and cost implications to dumping millions of tokens into a prompt - it's prohibitively slow and expensive right now.