Reliability of ChatGPT GPT-4 isn't consistent in my experience. It seems to respond to prompts differently, but I'm not sure yet if it's based on the number of prompts in the last 8 hours or overall server load. I would guess that the API is more consistent than the ChatGPT frontend, but can't confirm for sure.
You can tune the temperature parameter and bring it to 0 (if using the API). Although technically it's not fully deterministic, it will reply with the exact same answer >99% of the time in my experience.
(This is for GPT-3 and ChatGPT. Haven't tested GPT-4)