Because if you are trying to assess the safety of some process and there is a 0.5% chance it will give you deadly results, it's not good.
Moreover, if it's not deterministic, how can you assess whether it's safe? Sure, you can run many, many iterations, but how do you know when it's safe enough? LLMs encourage freeform entry, which means the testing space is fucking massive.
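To make that concrete, here's a minimal sketch of what the brute-force approach even looks like: hammer one prompt N times and count the distinct answers. `generate` here is a hypothetical stand-in for whatever model call you're testing, not a real API.

```python
from collections import Counter

def generate(prompt: str) -> str:
    raise NotImplementedError("hypothetical model call")

def output_spread(prompt: str, n: int = 1000) -> Counter:
    """Count each distinct completion over n identical calls."""
    return Counter(generate(prompt) for _ in range(n))

# spread = output_spread("Is 500mg of paracetamol safe for an adult?")
# len(spread) == 1 is necessary for determinism, but even that tells
# you nothing about the prompts you didn't try.
```

And that's one prompt, verbatim. Freeform entry means the real input space is every phrasing of every question.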
Does writing in a different syntactic style give different outcomes? Do spelling mistakes lead to increased morbidity? That's a test plan I don't want to have to run (unless you're paying me megabucks), though a sketch of it follows below.
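For what it's worth, the spelling-mistake part of that test plan is mechanically simple, which is what makes the combinatorics so ugly. A sketch, again with `generate` as a hypothetical model call:

```python
def typo_variants(prompt: str):
    """Yield copies of prompt with each adjacent character pair swapped."""
    for i in range(len(prompt) - 1):
        chars = list(prompt)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        yield "".join(chars)

def drift(prompt: str, generate) -> int:
    """Count typo variants whose answer differs from the baseline answer."""
    baseline = generate(prompt)
    return sum(generate(v) != baseline for v in typo_variants(prompt))
```

One character swap per position, times every plausible prompt, times enough repetitions to trust the result. Megabucks territory.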
Humans are not deterministic, as you point out, which is exactly why you need to control for as many unknowns as possible when testing in a healthcare setting.
If the training data for this is fully published, we'll have a huge head start on building a deterministic LLM once someone figures out how to do that.
Whatever flavour of AI is used needs to be deterministic, which Llama et al. are not, even if you turn the temperature right down.
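"Temperature right down" in practice means greedy decoding, e.g. with Hugging Face transformers (the model name here is just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)  # seed everything we can

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.eval()

inputs = tok("Is 500mg of paracetamol safe for an adult?",
             return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=False,   # greedy decoding, i.e. "temperature zero"
        max_new_tokens=64,
    )
print(tok.decode(out[0], skip_special_tokens=True))
```

Even this setup isn't guaranteed bit-identical across hardware, CUDA kernel versions, or batch sizes: floating-point reductions can be reordered, and a near-tie between two tokens can flip, after which the whole completion diverges.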
As others have pointed out, it's the training set that actually determines how a model behaves, which is why large companies can afford to give the models themselves away freely.