> There's a simple fact: humans prefer lies that they don't know are lies over lies that they do know are lies.
As an engineer and researcher, I prefer lies (models, simplifications) that are known to me, rather than unknown unknowns.
I don't need to know exact implementation details; knowledge of aggregate benchmarks, fault rates, and tolerances is enough. A model is a nice-to-have.
This approach works in science (physics, chemistry, biology, ...) and in engineering (including engineering agentic and social systems, i.e. social engineering).
> As an engineer and researcher, I prefer lies (models, simplifications) that are known to me, rather than unknown unknowns.
I think you misunderstood.
I'll make a corollary to help:
~> There's a simple fact: humans prefer lies that they believe are truths over lies that they do know are lies.
I'm unsure if you: misread "lies that they don't know are lies", conflated unknown unknowns with known unknowns, or (my guess) misunderstood that I am talking about the training process, which involves a human evaluator evaluating an LLM output. That last one would require the human evaluator to preference a known lie over a lie that they do not know is actually a lie. I think you can see how we can't expect such an evaluation to occur (except by accident). For the evaluator to rank the known lie above the unknown unknown, they would be required to preference what they believe to be a falsehood over what they believe is truth. You'd throw out such an evaluator for not doing their job!
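To make that asymmetry concrete, here's a toy sketch in Python (my own illustration, not any real RLHF pipeline; the field names are invented). The rater's label can only come from what they believe, so a lie they cannot detect necessarily beats a lie they can:

```python
# Toy model of a pairwise preference judgement. The rater only has access to
# perceived truthfulness; actual truth is invisible to them by construction.
def evaluator_prefers(response_a: dict, response_b: dict) -> str:
    """Return which response the rater marks as better, using only their beliefs."""
    score_a = response_a["seems_true_to_rater"]
    score_b = response_b["seems_true_to_rater"]
    return "A" if score_a >= score_b else "B"

known_lie = {"actually_true": False, "seems_true_to_rater": 0.1}    # rater spots the error
unknown_lie = {"actually_true": False, "seems_true_to_rater": 0.9}  # error the rater can't see

# The undetectable lie wins the comparison, and a rater behaving any other way
# would be preferring what they believe is false over what they believe is true.
print(evaluator_prefers(known_lie, unknown_lie))  # -> "B"
```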
As a researcher myself, yes, I do also prefer known falsehoods over unknown falsehoods, but we can only do this from a metaphysical perspective. If I'm aware of an unknown, then it is, by definition, not an unknown unknown.
How do you preference a falsehood which you cannot identify as a falsehood?
How do you preference an unknown which you do not know is unknown?
We have strategies like skepticism to help deal with this, but they don't make the problem go away. It ends up as "everything looks right, but I'm suspicious". Digging in can be very fruitful, but it is more frequently a waste of time for the same reason: if a mistake exists, we have not identified the mistake as a mistake!
> I don't need to know exact implementation details; knowledge of aggregate benchmarks, fault rates, and tolerances is enough.
I think this is a place where science and engineering diverge (I've worked in both fields). The main difference between them is the level of the problem you're working at. At the more fundamental level, you cannot get away with empirical evidence alone.
Evidence can only bound your confidence in the truth of some claim; it cannot prove it. The dual is a much simpler problem, as disproving a claim can be done with a single example. This distinction often isn't as consequential in engineering, since there are usually other sources of error that are much larger.
As an example, we all (hopefully) know that you can't prove the correctness of a program through testing. It's a non-exhaustive process. BUT we test because it bounds our confidence in its correctness, and we usually write cases to disprove certain unintended behaviors. You could go through the effort of proving correctness, but that's a monumental task and usually not worth it.
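A quick sketch of that point (a hypothetical function of my own, not from any real codebase): a handful of passing example tests only bounds our confidence, while a single counterexample disproves correctness outright.

```python
def integer_sqrt(n: int) -> int:
    """Intended behaviour: return the floor of the square root of n."""
    # Subtle bug: going through floating point loses precision for large n.
    return int(n ** 0.5)

# Example-based tests: all pass, which only bounds our confidence in correctness.
for n, expected in [(0, 0), (1, 1), (15, 3), (16, 4), (10**6, 1000)]:
    assert integer_sqrt(n) == expected

# A single counterexample disproves the claim "integer_sqrt is correct".
n = 10**16 - 1                       # floor sqrt is 99999999
result = integer_sqrt(n)
print(result, result * result <= n)  # 100000000 False -> correctness disproven
```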
But right now we're talking about a foundational problem, and such a distinction matters here. We can't resolve the limits of methods like RLHF without considering this problem. It's quite possible that there's no way around this limitation, since there are no objective truths for the majority of tasks we give LLMs. If that's true, then the consequence is that a known unknown is "there are unknown unknowns". And like you, I'm not a fan of unknown unknowns.
We don't actually know the fault rates or tolerances. Benchmarks do not give those to us in the general setting (where we actually apply our tools). This is a very different case than, say, understanding the performance metrics and tolerances of an O-ring. That part is highly constrained, and you're still not going to have a good idea of how well it'll perform as a spring, despite those tests carrying a lot of related information.
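As a throwaway sketch of that gap (made-up numbers and a made-up "model", nothing measured): a fault rate estimated on one distribution tells you little about the fault rate where the tool actually gets applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(x):
    """A fixed 'model': predict positive whenever x > 0."""
    return x > 0

# "Benchmark" distribution: positives centred at +2, negatives at -2 -> errors are rare.
bench_pos, bench_neg = rng.normal(2, 1, 10_000), rng.normal(-2, 1, 10_000)
bench_fault = (np.mean(~classify(bench_pos)) + np.mean(classify(bench_neg))) / 2
print(f"benchmark fault rate ~ {bench_fault:.3f}")   # roughly 0.02

# Deployment drifts: positives now centred at +0.5 -> same model, very different fault rate.
deploy_pos, deploy_neg = rng.normal(0.5, 1, 10_000), rng.normal(-2, 1, 10_000)
deploy_fault = (np.mean(~classify(deploy_pos)) + np.mean(classify(deploy_neg))) / 2
print(f"deployed fault rate ~ {deploy_fault:.3f}")   # roughly 0.17
```

The O-ring doesn't have this problem to the same degree, because the envelope it was characterised in is roughly the envelope it gets used in.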