I think part of this is that you can't trust the "thinking" output of an LLM to accurately convey what is going on inside the model. The "thought" stream is just more statistically derived tokens based on the corpus. Take the question "Is A a member of the set {A, B}?": the LLM doesn't internally build a discrete representation of "A" as an object belonging to a two-element set and then arrive at a distinct, absolute answer. The generated token "yes" is just the statistically most likely token to follow that prompt, given the model's training corpus. And logical reasoning is definitionally not a process of "gaining confidence," which is all an LLM can really do so far.
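To make the contrast concrete, here's a minimal sketch (in Python, with made-up numbers) of the difference between a discrete membership check and what next-token selection amounts to. The logits are entirely hypothetical and don't correspond to any real model or tokenizer; the point is only that the model's "answer" is an argmax over a probability distribution, i.e. a confidence, not a deduction.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    exps = {tok: math.exp(x) for tok, x in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# A discrete, symbolic membership check: the answer is exact and binary.
print("A" in {"A", "B"})  # True

# What a language model does instead (hypothetical logits for illustration):
# it scores candidate next tokens, and the "answer" is whichever token
# happens to come out most probable after the prompt.
logits = {"yes": 4.1, "no": 1.3, "maybe": 0.2}  # made-up scores
probs = softmax(logits)
answer = max(probs, key=probs.get)
print(answer, round(probs[answer], 3))  # -> yes 0.925: a confidence, not a proof
```

The set-membership check has no notion of "92.5% sure"; it either holds or it doesn't, which is the gap the paragraph above is pointing at.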