I still don't understand why people point to this chart as if it means anything. Cost per task is a fairly arbitrary X axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.
US, GA here. My mom was big on tanning and warned us about this (lemons are also bad). I believe she said something about it being used on purpose for tanning, but that you had to be careful or you would badly burn. She probably did that around the late '80s or early '90s.
With serious diminishing returns. At inference time, there's no reason to use fp64, and you should probably use fp8 or less. The accuracy loss is far less than you'd expect. AFAIK Llama 3.2 3B at fp4 will outperform Llama 3.2 1B at fp32 in both accuracy and speed, despite using 1/8 the bits per weight.
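For intuition, here's a minimal sketch of why the loss is small: symmetric round-to-nearest quantization onto a small integer grid (my own toy code, not any library's quantizer, and an integer grid rather than a true fp4 float format, though the rounding-error picture is similar). The mean error per weight stays tiny even at 4 bits:

```python
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    # Symmetric per-tensor quantization: scale floats onto a small
    # signed integer grid, round, then scale back. The difference
    # from the original is the error a low-precision format adds.
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000)        # weight-like distribution
for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Per-channel scales and smarter rounding (as real quantizers use) shrink the error further, which is part of why a bigger model at 4 bits can beat a smaller one at full precision.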
The theory is that you enrich the context with tokens relevant to the problem at hand, as well as to its solutions, which makes the model more likely to predict the correct solution.
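As a toy illustration of that idea, a retrieval-augmented prompt might be assembled like this. Everything here is hypothetical: embed() is a stand-in for a real embedding model, and the function names are made up for the sketch:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: a deterministic random unit vector per string.
    # A real system would call a trained embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    # Rank snippets by cosine similarity to the query (vectors are
    # unit-length, so the dot product is the cosine), keep the top k,
    # and prepend them so the model conditions on relevant tokens.
    q = embed(query)
    scored = sorted(corpus, key=lambda s: -float(embed(s) @ q))
    context = "\n".join(scored[:k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = ["Error E42 means the cache is stale.",
        "Restarting the daemon clears the cache.",
        "The logo was redesigned in 2019."]
print(build_prompt("How do I fix error E42?", docs))
```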
Perhaps it’s because I know human beings who have the exact same operation and failure mode as the LLM here, and I’m probably not the only one. Failing at something you’ve never seen and faking your way through it is a very human endeavor.
Regarding errors: I don't know the exact mechanism in the brain that causes humans to make them, but I believe it's a combination of imperfect memory, limited attention span, and a general lack of determinism. None of these affect logical reasoning as performed by a machine.
Regarding faking it till you make it: this is a more general point, that there's a difference between simulating human behavior and performing logical reasoning.