Doesn't that depend on the implementation? There's a trade-off between performance and determinism for sure, but if determinism is what you want then it should be possible.
If you fix random seeds, disable dropout, and configure deterministic kernels, you can get reproducible outputs locally. But you still have to control for GPU non-determinism, parallelism, and even library version differences. Some frameworks (like PyTorch) have flags (torch.use_deterministic_algorithms(True)) to enforce this.
TL;DR: assuming you've squashed all regular non-determinism (itself a tall ask), you either need to ensure you always batch requests deterministically, or ensure all kernels are "batch invariant" (which is absolutely not common practice to do).
This. I’m still amazed how many people don’t understand how this technology actually works. Even those you would think would have a vested interest in understanding it.