Sudoku? But there are endless failings in its reasoning; they just don’t come up in one-off questions, only in more complex discussions with it.
Sudoku is a decent example (it has apparently been solved, but only through very specific prompting [1]), though I would be more interested in puzzles that require a lot of arithmetic, since it's already clear that GPT-4 struggles with math and counting.