> However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(logn) embedding size can solve any problem solvable by boolean circuits of size T
There is a difference between being equivalent to a circuit and prediction of the output of the BVSP.
That is what I was suggesting learning descriptive complexity theory would help with.
Why does the limit on computational complexity of single decoder transformers matter for obtaining superhuman coding ability? Is there a theory of what level of complexity is needed for the task of coding according to a spec? Or the complexity for translation/optimization of a code? Even if there were, and one could show that a plain decoder transformer is insufficient, you probably only need to add a tool in the middle of the stream processing. Unless you have some specific citation that strongly argues otherwise, I will stick with my speculative/optimistic view on the upcoming technology explosion. To be fair, I always thought coding was at best modest complexity, not super hard compared to other human activities, so I will not make claims of generic superintelligences anytime soon, though I hope they happen in the near term, but I’d be happy if I simply see them in a decade, and I don’t feel partial to any architecture. I just think that attention was a cool idea even before the transformers, and decoder transformers took it to the limit. It may be enough for a lot of superhuman achievements. Probably not for all. We will see.
Rice's theorem means you can't choose to decide if a program is correct, but you have to choose an error direction and accept the epislon.
The Curry–Howard–Lambek correspondence is possibly a good tool to think about it.
The reason I suggested graduate level complexity theory is because the undergrad curriculum is flawed in that it seems that you can use brute force with a TM to stimulate a NTM with NP.
It is usually taught that NP is the set of decision problems that can be solved by a NTM in polynomial time.
But you can completely drop the NTM and say it is the set of decision problems that are verifiable by a DTM in poly time.
Those are equivalent.
Consider the The Approximate Shortest Vector Problem (GapSVP), which is NP-HARD, and equivalent to predicting the output of a 2 layer NN (IIRC).
Being NPH, it is no longer a decision problem.
Note that for big 0, you still have your scaler term. Repeated operations are typically dropped.
If you are in contemporary scale ML, parallelism is critical to problems being solvable, even with FAANG level budgets.
If you are limited to DLOGTIME-uniform TC0, you can't solve NC1- complete problems, and surely can't do P-complete problems.
But that is still at the syntactic level, software in itself isn't worth anything, it is the value it provides to users that is important.
Basically what you are claiming is that feed forward NN solve the halting problem, in a generalized way.
Training an LLM to make safe JWT refresh code is very different from generalized programming. Mainly because most of the ability for them to do so is from pre-training.
Inference time is far more limited, especially for transformers and this is well established.
There is a difference between being equivalent to a circuit and prediction of the output of the BVSP.
That is what I was suggesting learning descriptive complexity theory would help with.