That's not the point. While there is a temperature setting and randomness involved, you can still benchmark models and see significant differences in output between models and generations. That is why I provided more details and the full output: to make it easier for people to assess the context of the comment I replied to.

When someone uses the same tools as I do but seems to experience problems I do not have (these kinds of posts often describe how bad LLMs are or how bad Google search is), I get a bit confused. Is there A/B testing going on? Am I just lucky? Am I inattentive to these weaknesses? Is it about prompting? Or about what areas we work in? Do we actually use the same tools (i.e., the same models)?