Context length, possibly. Prompt adherence drops off as context grows, and anything above 20k tokens is pushing it. I get the best results by presenting the smallest amount of context possible, including stripping comments, main methods, and functions the model doesn't need to see. It's a bit more work (not that much if you have a script that does it for you), but the results are worth it. You could test in the ChatGPT app (or LMArena direct chat) by asking the same question with a minimal, hand-curated context and seeing whether it makes the same mistake.
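To make the "script that does it for you" part concrete, here's a minimal sketch of what I mean, assuming Python source files and the stdlib ast module; KEEP and strip_for_prompt are just placeholder names for illustration:

```python
import ast
import sys

# Functions/classes the model actually needs to see; everything else gets dropped.
# (KEEP and strip_for_prompt are made-up names for this sketch.)
KEEP = {"parse_config", "load_rules"}

def strip_for_prompt(source: str) -> str:
    """Return source with comments, docstrings, and unneeded top-level defs removed."""
    tree = ast.parse(source)

    # Keep imports plus only the top-level defs whose names are in KEEP.
    tree.body = [
        node for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
        or (isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
            and node.name in KEEP)
    ]

    # Drop docstrings from whatever survived.
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if (node.body
                    and isinstance(node.body[0], ast.Expr)
                    and isinstance(node.body[0].value, ast.Constant)
                    and isinstance(node.body[0].value.value, str)):
                node.body = node.body[1:] or [ast.Pass()]

    # Comments never reach the AST, so ast.unparse emits the code without them.
    return ast.unparse(tree)

if __name__ == "__main__":
    print(strip_for_prompt(open(sys.argv[1]).read()))
```

Then you paste the pruned output into the chat instead of the whole file.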
Yes, that's what I'm suggesting. Cursor is spamming the models with too much context, which seems to hurt reasoning models more than non-reasoning models (a hypothesis, but one that aligns with my experience). That's why I recommended testing reasoning models outside of Cursor with a hand-curated context.
A longer advertised context length doesn't necessarily map 1:1 to a model's actual ability to perform difficult tasks across that full context. See, for example, the plots of ARC performance vs. context length for the o-series models.