Whenever one of these well-known gotcha prompts gets "solved", the question is always whether they actually fixed the underlying reason it used to fail, or just had a bunch of third-world workers tag pictures of horses and astronauts until the model started handling that specific example more reliably. As the saying goes, every measure which becomes a target becomes a bad measure.