This is a rather mundane and unsurprising result, to be honest. Finite automata are not Turing complete. Real-world computers aren't either, since they don't have infinite memory or infinite precision. I'm not sure it will lead to arbitrary-precision machines; the lack of real-world Turing completeness hasn't stopped us so far and likely won't.
Is there a formal, mathematical way of saying "real-world computers may not have infinite memory, but they have more than enough, so they can be treated as Turing complete for the subset of programs that are well-behaved - i.e. don't end up hitting the memory limit"?
And in general, there surely is a way of formally saying "this is theoretically X, but effectively Y, for the [hand-waves] kind of inputs"?
Not really; formal mathematics in complexity theory is concerned with asymptotics. Constants like "for N < 10 trillion, a desktop computer is good enough" aren't very interesting from a mathematical perspective.
That said, some simple intuition is the following: PSPACE is a subset of EXP is a subset of EXPSPACE
(We think these are all strict separations but technically that’s not fully proven)
If you use the shorthand intuition that we can handle polynomial scaling but can’t handle exponential scaling, this means that we hit a time barrier (EXP-complete problems) before we hit a space barrier (EXPSPACE-complete problems)
Another bit of intuition: you can store a very big number in very few bits, because binary holds an exponentially large number in a linear number of bits. But you can't loop over that number within a human lifespan.
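A quick back-of-the-envelope sketch of that intuition (Python here purely as an illustration, not something from the thread):

```python
# A 1000-bit number: tiny to store, impossible to count to.
n = 2 ** 1000

# Storage is linear in the number of bits (~125 bytes of payload here).
print(n.bit_length())          # 1001 bits
print(n.bit_length() // 8)     # ~125 bytes

# But looping over it is exponential in the bit length: at 10^9
# iterations per second this would take on the order of 10^284 years,
# so the loop below would never finish.
# for i in range(n):
#     pass
```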
Edit:
> they can be treated as Turing-complete for a subset of programs that are well-behaving - i.e. don't end up hitting the memory limit
Just to be clear, it’s a matter of input size and not the programs themselves. Technically you could say we haven’t “solved” sorting on real hardware because nobody can sort the first 2^1000 digits of pi. But realistically we don’t care to do so.
Space complexity helps characterize this: real-world computers can (arguably) emulate a linear-bounded automaton, so anything in DSPACE(O(n)) is fair game if you can wait long enough.
For the "arguably" part: I am assuming the machine can access all of the input at once, so it is reasonable to expect available memory to be a multiple of the input size, which gives you O(n) memory.
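As a toy illustration of that regime (my own hedged example, not from the comment above): brute-force subset-sum needs only a small multiple of the input in memory, but exponential time, which is exactly the "fair game if you can wait long enough" situation.

```python
# O(n) working space, O(2^n) time: the only extra state is a bitmask
# over the input, so memory stays a small multiple of the input size,
# while the running time grows exponentially with n.
def subset_sum(values, target):
    n = len(values)
    for mask in range(2 ** n):          # 2^n candidate subsets
        total = 0
        for i in range(n):              # O(n) work per subset
            if mask & (1 << i):
                total += values[i]
        if total == target:
            return True
    return False

print(subset_sum([3, 34, 4, 12, 5, 2], 9))  # True (4 + 5)
```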
Computers can perform side effects, so their state is effectively the universe. Which is still not infinite, but for all practical purposes the distinction doesn't matter.
I'm not a PL person or someone who focuses on computability, but I think you'd refer to it as "bounded," "pseudo-," or "quasi-." To give an example, you might call a transform on an image quasi-equivariant if it introduces some aliasing (while obviously not destroying the image). See quasimorphism[0] as an example.
Usually people just let it go unless someone brings up something that specifically implies there is not, or might not be, a shared mutual understanding (as done here). Maybe it is shared, maybe not; maybe someone reading it doesn't understand the difference. I mean, we're humans. We compress a lot of information into language that is not directly expressed in our words. (This is also why it is often hard to talk to people on the internet: there's a wide audience with vastly different priors. Relevant XKCD[1].)
An important distinction is that computers can do arbitrarily many iterations of an algorithm, while most neural networks have to operate at some fixed size, so the practical limits are very different.
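A minimal sketch of that distinction (an illustrative example of mine, assuming Python): a conventional program can loop a data-dependent number of times, while a typical feed-forward network applies a fixed number of layers regardless of the input.

```python
def collatz_steps(n):
    """Loops as many times as the input demands -- no fixed bound."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

def fixed_depth_net(x, layers):
    """Always applies exactly len(layers) transformations."""
    for layer in layers:
        x = layer(x)
    return x

print(collatz_steps(27))                            # 111 iterations, input-dependent
print(fixed_depth_net(2.0, [lambda v: v + 1] * 4))  # always 4 steps -> 6.0
```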
It won't; it just means this is another pointless theoretical study seeking to interpret Transformers in a framework that has no explanatory value for AI.
As I see it, the result rather establishes a very fundamental property pertaining to the expressive power of a mechanism, and it can also be useful in practice.
For instance, I have many potential applications of Turing complete formalisms, because I am interested in results of arbitrary computations. The result obtained in the article means that I can use a Neural Network to obtain this, under the conditions outlined in the article, and in the way shown in the article.
This may simplify software architectures, especially in situations where Neural Networks are already applied, and additional mechanisms would otherwise be needed to cover arbitrary computations.
Something being Turing complete just means that, in principle, it could be used to solve any computable problem. But this may require unbounded memory or unbounded time.
The paper showed that a Transformer with positional encodings and rational activation functions is Turing complete.
Rational activation functions with arbitrary precision make sure that you stay within the smaller, countable infinities, whereas real-valued activations run into the cardinality-of-the-continuum problem.
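To make the rational-versus-float contrast concrete, here is a small sketch using Python's standard fractions module (my illustration, not part of the paper): exact rationals represent the computation without error, while fixed-precision floats cannot.

```python
from fractions import Fraction

# Exact rational arithmetic: 1/10 summed ten times is exactly 1.
exact = sum(Fraction(1, 10) for _ in range(10))

# Fixed-precision (64-bit) floats accumulate rounding error.
approx = sum(0.1 for _ in range(10))

print(exact)            # 1      (exactly, as a rational)
print(approx == 1.0)    # False  (0.9999999999999999)
```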
While all nets that use attention are feed-forward and thus effectively DAGs, they add in positional encodings to move them from merely well-founded to well-ordered.
While those constraints allow the authors to make their claims in this paper, they also have serious implications for real-world use: rational activation functions are not arbitrarily precise on physically realizable machines in finite time, and you will need to find a well-ordering of your data, or a way to force one onto it, which is not a trivial task.
So while it is interesting, just as it was interesting when someone demonstrated that sendmail configurations are Turing complete, it probably isn't as practical as you seem to think.
As attention is really runtime re-weighting, and feed-forward networks are similar to DAGs, it is not surprising to me that someone found a way to prove this. But just as I am not going to use the C preprocessor as a universal computation tool even though it is also TC, I wouldn't hold your breath waiting for attention to become a universal computation tool either.
I'm not going to engage with this directly, but for any other readers passing through - this is nonsense. One more drop in the flood of uninformed AI noise that's been drowning out the signal.
Choose the sources you trust very carefully, and look to the people actually working on real-world AI systems, not the storytellers and hangers-on.
Their proof depends on arbitrary precision and they explicitly state that the finite case is not TC.
But if you are talking about arbitrary-precision floats, or the computable subset of the reals, it is equivalent.
The computable reals are countable: each one corresponds to a finite description, so they can be enumerated just like the natural numbers/ints.
So it is the countable infinity, with cardinality aleph-nought.
That adds to the time complexity, while the unbounded memory requirement comes from the definition of a Turing machine, which is roughly a finite state machine plus an infinite tape.
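A toy sketch of that "finite state machine plus infinite tape" picture (my own example, not from the paper): the finite control is a small transition table, and a dictionary stands in for the unboundedly extendable tape.

```python
# A tiny Turing machine that appends a 1 to a unary string and halts
# (the successor function in unary).
def run_tm(input_bits):
    tape = {i: b for i, b in enumerate(input_bits)}  # cells not stored are blank (0)
    head, state = 0, "scan"
    # Finite control: (state, symbol) -> (write, move, next_state)
    delta = {
        ("scan", 1): (1, +1, "scan"),   # skip over existing 1s
        ("scan", 0): (1, +1, "halt"),   # write one more 1, then stop
    }
    while state != "halt":
        symbol = tape.get(head, 0)
        write, move, state = delta[(state, symbol)]
        tape[head] = write
        head += move
    return [tape.get(i, 0) for i in range(max(tape) + 1)]

print(run_tm([1, 1, 1]))   # [1, 1, 1, 1]
```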
As the reals are uncomputable almost everywhere, you would need an activation function that only produces computable reals; since the two are equivalent, rational activation functions are simpler for the proof.
"To prove this we assume that internal activations are represented as rational numbers with arbitrary precision."
And:
"Transformers with fixed precision are not Turing complete."
It will be interesting to see if this leads to better support for arbitrary precision arithmetic in processor architectures.