Turing Complete Transformers: Two Transformers Are More Powerful Than One (openreview.net)
190 points by georgehill on Jan 8, 2024 | 77 comments


These reviews are brutal. It's basically science-speak for "the paper is utter trash".

"The main claim [...] is both somewhat obvious and previously already stated"

"Many pieces of writing are overly assertive and inaccurate."

"I do not think it deserves spending half a page demonstrating that {0^n 1^n} is not in the regular language."


Yes. Anything with finite memory is not Turing-complete, and transformers are finite-state machines. So what? Anything with infinite memory is un-buildable.

Yes, there's a school of philosophy that claims that AI cannot be built with digital hardware because it needs infinite state. Penrose is the leading exponent of this argument.[1] This isn't taken too seriously any more, now that finite machines are doing so well at AI.

This is close to theology. Man must be special, right? Only humans can play chess, right? Play Go? Pass the Turing test? Create art and music? We're running out of boxes to check. Common sense (as in getting through the next 30 seconds without a major screwup) and manipulation in unstructured settings are still in bad shape, but we only need squirrel-level AI for that.

On that subject, there was a "what does the cerebellum do" article on HN a few days ago. That's a big issue. The lower mammal brains are mostly cerebellum. The weak areas in AI at present are mostly cerebellum functions. That area has never gotten enough attention in AI research. Today, though, the amount of hardware to do a low-end mammal cerebellum equivalent doesn't seem at all unreasonable.

[1] https://sortingsearching.com/2021/07/18/roger-penrose-ai-ske...


Penrose claims that human intelligence relies on quantum effects, and points to certain structures in neurons that make it at least a plausible argument.

It would follow that comparable intelligence would not be possible with current digital computer architectures.

I also think a lot of people are a little too fooled by the “intelligence” of current models, which are mimics with enormous libraries of “knowledge”. It is not too hard to be frustrated by the enormous limitations of current models. Impressive, yes; smart, no. Though refuting claims of their intelligence really pushes us to refine the philosophical definition of intelligence, which has been rather vague so far. We can’t just say “I know it when I see it”, because computers are getting very good at pretending.


There is no way to effectively pretend to be smart without actually being smart.

This comes up a lot in things like science fiction, movies, etc...

E.g.: an author can state that a character has "superhuman intelligence", but the author themselves would need to actually possess this superhuman intelligence to write the actions or dialogue of that character.[1] Because authors are "just human", they obviously can't be superhuman, so they generally avoid directly demonstrating the actions of superhumanly intelligent characters. I.e.: "tell, not show" instead of the usual "show, don't tell."

The current crop of AIs are 100% "show". They demonstrate intelligence directly, by answering questions instead of just talking about how they would answer questions, which is what fake intelligence would do.

PS: Next time you see a politician promise to deliver a solution, but not actually state what that solution is... well... now you know what's going on with that particular situation!

[1] There's a degree of cheating available to authors, because they have information that the in-world fictional characters don't. Similarly, they can set up the question to lead to an answer they already have. The chat-based AIs can't cheat like this, they have to answer the questions given to them.


Sure there is: know the answers to questions already. Being trained on a not-insignificant proportion of all questions ever asked on the Internet means you have to wonder how smart LLMs really are. Seeing the ways they fail confirms: not so much. They are able to abstract a bit of knowledge and statistically match it to questions and answers already seen. Confident hallucinations help you see this.

The best hallucinations I saw were from asking ChatGPT for a list of references for a topic I was researching. They all looked entirely believable: titles, authors, institutions, summaries… IIRC there were about twenty of them. Only one actually existed.


How many leaf URLs have you memorised? I don't mean "google.com", I mean the URL of a specific paper in a journal, or an article in some random news site?

If you were forced at gunpoint to provide such URLs -- valid or not -- with literally no option other than to keep talking, you would... make up some vaguely correct-sounding URLs!

The AIs are doing precisely what an intelligent human in their place would do. Blurting out something, anything, even if they don't have the answers.

That's because the current crop of AIs are essentially being forced to provide the next token, directly equivalent to a human being forced to continue speaking against their will.

There are fixes being put in place already for this, such as using human feedback to tune the AI so that it answers with a generic "I don't know" blurb if it's not sufficiently confident in the next token. Annoyingly, the new "Turbo" versions of ChatGPT have overdone this a bit, and now it refuses to answer even when it does actually know the answer!

There are also variations of neural networks that encode not just "values", but probabilities and hence confidence intervals. These can directly output both tokens and the true confidence values, which would allow complex tree searches, etc... and avoid outputting this kind of gibberish entirely. The maths for this is well established, but it would be much more expensive to run models using it, so nobody has bothered yet.
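
As a toy illustration of the "abstain when not confident" idea from the previous two paragraphs: the sketch below just thresholds the top softmax probability of a next-token distribution. This is only a hand-rolled heuristic for illustration, not the probabilistic-network approach described above and not how any real product implements its refusals; `answer_or_abstain` and the example distributions are made up.

    def answer_or_abstain(probs: dict[str, float], threshold: float = 0.5) -> str:
        """Pick the most likely next token, but refuse to guess when the
        distribution is too flat to be trusted (a crude stand-in for the
        confidence-tuned behaviour described above)."""
        best_token = max(probs, key=probs.get)
        if probs[best_token] < threshold:
            return "I don't know"   # abstain instead of hallucinating
        return best_token

    # A confident distribution answers; a near-uniform one abstains.
    print(answer_or_abstain({"Paris": 0.92, "Lyon": 0.05, "Nice": 0.03}))               # Paris
    print(answer_or_abstain({"1987": 0.28, "1988": 0.26, "1990": 0.24, "1991": 0.22}))  # I don't know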


> The AIs are doing precisely what an intelligent human in their place would do.

Nah. Either the human is feigning (lying with full knowledge that they’re doing so), or you torture them enough and they cease responding intelligently; and if they start to believe their own bullshit, it isn’t unreasonable to say that’s a loss of intelligent reasoning. I think this is a false equivalence.

> directly equivalent to a human being forced to continue speaking against their will.

Maybe some superficial appearance, but not equivalence.

> but probabilities and hence confidence intervals

And that’s the entirety of human metacognition. Maybe it’ll get us a little closer but that’s it.


Ask a human the same question and force them to answer (instead of saying I don't know), and they will hallucinate a list of references for you just the same. Does that mean humans are not intelligent?

I don't think your argument proves what you claim.


This!

One could argue that humans will say "I don't know" when they don't know, but that ultimately depends on training. We currently don't have LLMs trained to answer "I don't know" (I suspect that in part that's due to incentives of focusing on something that looks impressive; one could argue that if an AI system too often says "I don't know" it's less impressive albeit it may be more useful since you can rely upon it more when it says it knows)

There are plenty of cases where humans are trained to make up answers and hope they sound convincing. I've seen a fair share of this done by students.


This was my go-to argument when people without a technical background were laughing about ChatGPT's bullshit generation (before it was called hallucinating).

Put a gun to Einstein's head and force him to answer any question you ask him, without hesitating or declining to answer.

This is not an argument that LLMs like ChatGPT are intelligent (they are not, but not for any of the reasons I have read about so far anywhere). But the hallucinations definitely don't show anything against intelligence; if anything, they would indicate intelligence rather than its absence. So, quite the opposite of the usual conclusion, which should be very obvious.


> Penrose claims that human intelligence relies on quantum effects and demonstrates certain structures in neurons that makes it at least a plausible argument.

Except that Penrose's argument is biologically nonsense, which probably means that he has an incorrect definition of intelligence, since I'm pretty confident in his ability to reason from axioms.


The man did great things, but this quantum effect consciousness nonsense is a case of nobelitis.

https://en.wikipedia.org/wiki/Nobel_disease


IIRC his nobelitis around consciousness started before he even got awarded the prize. He clearly invented the time machine but didn't tell us yet.


And so what? Your existence, as well as existence of Earth and anything else made of nuclei and electrons, relies on quantum effects.


The point being you would need a kind of quantum computer for real intelligence. It is at least a plausible requirement.


Even then, it sounds like a speed issue. Quantum computers can be simulated on traditional computers. And quantum is coming, so those that repeat this theory to comfort themselves are at best buying time.


While quantum computers can be simulated classically, you can’t have the simulated state of a simulated quantum computer, be entangled with something outside of the classical computer doing the simulating, while a quantum computer could have qubits entangled with something outside of the quantum computer.

So, if that was somehow important, then the “you can simulate the quantum computer classically (at the cost of exponential slowdown)” point wouldn’t handle that kind of thing.

But, it seems rather unlikely to me that we’ll find any significant evidence of that kind of thing being important for intelligence.


Yet the digital computer is based on quantum effects in transistors. We engineer the weirdness away to give the illusion of a digital computer. Eventually we will go back and see what can be done without that. Boolean algebra is nice and clean, but there are other ways.


Why? Just introduce a separate loop of thoughts driven by a true quantum based RNG, something like a subconscious loop that subtly interacts with the rest of the system.

With "loop of thoughts" I am talking about using an a true RNG to create a never ending stream of the system "talking" to itself, and the subconscious part being that it would run separately from all other processing. The current content of this loop, would be weakly injected into the prompt that drives the main loop of thoughts (or possibly multiple loops of thoughts, why limit to one after all).

We won't reach AGI and true intelligence by adding more parameters, but via AutoGPT like systems that are self-driven without needing any external prompting. A simple addition of a true RNG will make the system nondeterministic.
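
A very rough sketch of the two-loop idea as I read it. Everything here is hypothetical: `query_llm` is a placeholder for whatever model you drive, and Python's `secrets` module merely stands in for a true hardware RNG.

    import secrets  # stand-in for a true hardware RNG

    def query_llm(prompt: str) -> str:
        """Hypothetical placeholder for a call to whatever model drives the loops."""
        return f"(model output for: {prompt[:40]}...)"

    def subconscious_step(previous: str) -> str:
        # A never-ending stream of the system "talking" to itself, seeded by
        # true randomness so the overall system is nondeterministic.
        seed = secrets.token_hex(4)
        return query_llm(f"Free-associate from '{previous}' with seed {seed}")

    def main_loop(goal: str, steps: int = 3) -> None:
        murmur = "..."
        thought = goal
        for _ in range(steps):
            murmur = subconscious_step(murmur)
            # The subconscious content is only weakly injected into the main prompt.
            thought = query_llm(f"{thought}\n(background murmur: {murmur})")
            print(thought)

    main_loop("Plan my day.")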


I don’t think we are running out of boxes. We are just trying to determine the minimum requirement that would constitute general AI, and that bound is hard to find. It’s quite trivial to find bounds that are far from minimal: say, writing a novel research paper, making a discovery/invention, etc.


The issue is that intelligence itself is an ill-defined concept, and an unofficial but broadly-shared rough definition is "what separates humans (and some animals) from inanimate things". So, as soon as a machine can do something (whether it's having good memory, proving things, playing chess, translating texts, drawing a picture, driving a car, anything), that something no longer counts as "intelligence". Using that very definition, AI is an oxymoron.


Amusingly, there's part of the commercial AI community that wants to define intelligence as what companies will pay humans sitting at desks to do.


> This is close to theology

Yeah, even the arguments reek of basically "God of the gaps" and "moving the goalposts" stuff


Engineering POV makes sense, but Turing-completeness is well-defined in CS i.e. maths, and maths deals with unbuildable stuff all the time. You can't engineer a result into CS unless you prove your engineering in a mathematical fashion, e.g. the famous four colour theorem proof.


Isn’t genetic material an infinite state machine as long as the host system has the requisite support continuity?

“Ai” is just the current moniker for a persistent assault on the original general intelligence (belief not required).

Pretty sure religion is an accurate representation of a generative intelligence, but I digress to knowing little.



> Yes. Anything with finite memory is not Turing-complete, and transformers are finite-state machines. So what? Anything with infinite memory is un-buildable.

I think that's missing the point. A Turing Machine description is independent of the memory provided to it by the hardware and it doesn't need to be rewritten every time the hardware improves. That's very different from a FSM which is restricted in memory a priori.

I don't really know enough about transformers to understand where they fit in here, though some reviewers do seem to suggest that allowing transformers access to infinite memory does make them also Turing Complete.


This is such a philosophical rabbit hole that I will get into when time allows it.


Good approach.


I don't even have a PhD. I just find OpenReview's feedback fun to read, for the same reason I love watching Gordon Ramsay give feedback.


Me neither, but I know what the process is like from having co-authored some stuff at work. This is not the kind of review you get if the paper is simply not good enough. Those are some seriously pissed off reviewers. Compare for example with this:

https://openreview.net/forum?id=cXs5md5wAq


Aw man... I was looking forward to a whole night of attempted savage takedowns. Not mature constructive criticisms.


These are just jokes with reviewers having some fun after doing 20 other reviews. The review process isn't really needed for a paper this bad, but they have to go through the motions, and so people have fun with it.

It’s not a big deal. Shitty papers exist.


I actually really appreciate these brutal reviews


Welcome to one of the most hated parts of academia.


The reviews are pretty harsh, but after reading the paper, I feel they may be too generous. Somehow this paper spends several pages proving things that are both trivial and irrelevant, and spends zero words explaining the architecture of their model. This is borderline crackpot territory.


Yeah, I sort of agree. It hardly provides any details about the model that they supposedly used to achieve only two benchmark results, which are poorly described. It comes across as being written by someone who only dabbles in the topic but seems to believe they've taken some great leap forward ahead of the entire field.


Sounds consistent with some non-scientists working in the field, maybe for some smaller software company? They find something cool, and try to write a paper about it, which is something they aren't very experienced with.


Agreed. Why is this on HN again?


Something I noticed when skimming the paper that is also called out by one of the reviewers:

"Moreover, in the submission the authors considered only Transformers with input and output of bounded lengths, which are quite strange since Turing machines do not pose constraints on the tape length. If the length is constrained in Transformers, they clearly do not match Turing machines."


I don't think the comparison applies in this way. A Turing machine may operate on an unconstrained tape, but it only looks at a single cell at each step. This find-and-replace technique operates on a bounded context but, in the same way as a Turing machine, it can "technically" operate on an unconstrained tape ("technically" because the find transformer wouldn't be able to look up an infinite tape in the real world). I guess you could say that a Turing machine has a bounded input length of 1.


Okay, actually I think I just misunderstood the argument in the paper. After taking a second look, I think their argument goes as follows:

A typical autoregressive transformer model operates on a fixed-size context that acts as both input and a queue to which the output of the model is appended. If a model operates on a context of length k with n possible symbols in its alphabet, then there are n^k possible contexts.

If you run such a model for n^k iterations, then

1) if the <eos> (end of sequence) symbol appears in the context, then the model halts.

2) if the <eos> symbol never appears in the context, then you know the model never halts, because it must repeat a context string by the pigeonhole principle and so must be stuck in an infinite loop.

Therefore, the halting problem is decidable for transformer models with finite length contexts that are autoregressive in this way and that would imply a contradiction if we claim they are Turing complete (because the halting problem is known to be undecidable for Turing complete systems).
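
For what it's worth, that decision procedure is easy to write down, assuming deterministic (greedy) decoding. This is just a sketch of the argument as I've paraphrased it; `decides_halting` and the toy predictors below are my own stand-ins, not anything from the paper.

    def decides_halting(step, context: tuple, eos: str = "<eos>") -> bool:
        """Decide halting for a deterministic fixed-context autoregressive model.
        `step` maps a length-k context tuple to the next token.  With a finite
        vocabulary and fixed k there are finitely many contexts, so by the
        pigeonhole principle this loop always terminates."""
        seen = set()
        while True:
            if eos in context:
                return True      # the model halted
            if context in seen:
                return False     # repeated context => deterministic infinite loop
            seen.add(context)
            context = context[1:] + (step(context),)  # shift window, append new token

    # Toy model that emits "b" forever: never halts.
    print(decides_halting(lambda ctx: "b", ("a", "a", "a")))                                    # False
    # Toy model that emits <eos> once the window is all "b": halts.
    print(decides_halting(lambda ctx: "<eos>" if set(ctx) == {"b"} else "b", ("a", "a", "a")))  # True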

I'm not entirely sure this proves anything (it sounds believable I guess?), but at least I think it describes the argument they are making.

It is sort of interesting how it highlights the difference between a system that can write to any position in memory vs. one that can only append to memory while being required to delete the first memory cell (where memory is a queue).

However, the overall paper is awful and pretty hard to take seriously.


An autoregressive transformer is trivially a finite state machine with the state being the k input tokens. The state update just discards the leftmost token of the context and adds the predicted token to the right.

It is somewhat tempting to look at the entire token sequence as a tape but that is misleading, once a token fell out of the context it is lost forever whereas a Turing machine always maintains access to the entire tape as it can just move as far left as it likes.

As said, for this question it is really more useful to think of the state as a fixed size array or queue of tokens where in each step everything gets pushed one position to the left by the newly predicted token and the leftmost token gets discarded.
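
In code, that state update is just a fixed-size queue. Nothing below comes from the paper; `predict` is a toy stand-in for a real model.

    from collections import deque

    def run_as_fsm(predict, prompt: list[str], k: int = 4, steps: int = 5) -> None:
        """Autoregressive generation viewed as a finite-state machine whose state
        is only the last k tokens: each append silently discards the oldest token,
        and there is no tape to move back to."""
        state = deque(prompt[-k:], maxlen=k)       # fixed-size context window
        for _ in range(steps):
            state.append(predict(tuple(state)))    # leftmost token falls out forever
            print(list(state))

    # Toy predictor: repeat the newest token with a prime mark added.
    run_as_fsm(lambda s: s[-1] + "'", ["a", "b", "c", "d"])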


This seems at least plausible, and it agrees with my preconceived notions that a good chunk of LLM capability is driven by memorization and not computation[0]. Is there any substantive critique of the underlying idea from the other reviewers, and not just the (evidently terrible) presentation of it?

[0] For a good idea as to why I think this way, see https://not-just-memorization.github.io/extracting-training-...


I'm not sure why that argument is applicable to a queue but not to any fixed-size memory. A transformer can conceptually modify any position by simply outputting the whole context with one value changed. Basically, it restates in a convoluted way the known fact that anything with finite memory has finitely many states, and thus the halting problem is solvable.


Yeah, I guess you're right. The queue vs. addressable memory thing seems secondary. Is the real difference that a transformer model is stateless and Turing machine is stateful? And the assumption of a stateless model is why they can assert that a repeated state implies an infinite loop (assuming we're not sampling the softmax to get the output token)?


But then, the same can be said of a computer with a finite-sized memory.

I think the real point in the paper is the computational complexity of the calculations. With Turing machines, each step is O(1); with transformers, it's at least O(n), where n is the memory size.


A modern computer can be cast as a seq-to-seq model, too, so their "arguments" apply to those as well, just with larger n and k. Any finite machine suffers from this.


Where did you find the paper? It's not on arxiv?


There's a PDF link next to the title on the linked page, but here's a direct link: https://openreview.net/pdf?id=MGWsPGogLH


This response by the authors to the reviews to me looks like they didn't really understand the objection:

> First, it's worth noting that different reviewers sometimes gave opposite critiques of the paper, e.g.
> Reviewer erh8: "The conclusion in this paper is questionable... It contradicts to [1], which shows that Transformers are Turing-complete"
> Reviewer bz3o: "The main claim that transformers with finite attention span are not computationally universal is both somewhat obvious and previously already stated"

If I'm reading the reviews correctly, the claim by both reviewers was that transformers are actually Turing complete, but one reviewer added that they're "obviously" not Turing complete if you restrict their memory a priori (which I would agree is obvious). So there isn't really a contradiction between the reviews.

From briefly skimming the paper, this does look indeed to me like researchers which aren't really familiar with theoretical CS trying to handwave their way into something that looks ground-breaking. But while you absolutely can get away with vague-ish description in a more experimental part of CS, you absolutely can't get away with it in computability theory - that field is rigorous, and basically maths.


Transformers with fixed precision and any amount of memory are not Turing complete. They're bounded by a variant of first order logic with counting quantifiers. https://arxiv.org/abs/2301.10743

This supports the intuitive idea that they "learn circuits for things" that I've heard a few others mention.

If they have infinite precision then I suppose they can simulate a Turing machine in their infinitely precise states...


It appears that it's exactly the distinction between fixed and arbitrary precision that decides whether or not transformers are Turing Complete: https://openreview.net/forum?id=HyGBdo0qFm&noteId=r1lxKiAJlV

I think the larger point though is that "is X Turing Complete?" depends on the exact assumptions made and that's a point that the authors of the paper IMHO failed to understand while reading the reviews.


The arbitrary vs fixed precision people are also totally clueless.

All of the arbitrary precision papers should have been rejected. This is CS theory 101. Any non-trivial architecture is not just Turing complete, it can perform hypercomputation (solving the halting problem), if it can use arbitrary precision numbers.


> Any non-trivial architecture is not just Turing complete, it can perform hypercomputation (solving the halting problem), if it can use arbitrary precision numbers.

Can you explain this part? (It doesn't seem like CS theory 101 to me - maybe 201).

I suppose that I'm not entirely sure what you mean by "use arbitrary precision numbers". TMs can use arbitrary precision numbers (integers and rationals) and they can't solve the halting problem, so you must mean something different.

If what you say is true (I don't really know anything about transformers), then that reflects rather poorly on the field that such wrong results would end up being published.


Hear me out. Try three transformers.




Bah. You need at least 7 to get a close enough shave. https://www.kroger.com/p/bromley-s-for-men-get-a-grip-razor/...


The link to that Onion article was worth a double up vote!


Can’t. Three Body Problem.


Is the discussion about the paper, or about how it was unilaterally rejected?


Meh, let’s see an open license implementation.

Unfortunately, I don’t find the acceptance or rejection of a paper in this field has much predictive value these days. By and large, the reviewers tend to be more and more like Wikipedia editors as time goes on.


Whether or not the review process works, open review is a slightly different process. I’d expect a skeptic of the review process to celebrate the idea of reviewers “showing their work.”


> Let’s consider a simple problem that finite automata cannot solve: recognizing the language L = {a^n b^n | n ≥ 0}. This language consists of strings with n ’a’s followed by n ’b’s.
> A regular expression or finite automaton can recognize strings in this language up to a certain length. For example, the regular expression a*b* can recognize strings in L up to length 2.

That regex makes no guarantee that the number of a's matches the number of b's, which doesn't match their language definition. I think they wanted (ab)*, which does, and can match any string in their language.


Wow, that's wild. Not only are they writing down things everybody in that field should know, but they fail to even properly understand the very basics of regular languages. Literally what you learn in the first week of an intro to CS course.

What they wrote makes absolutely no sense, I have no idea where this confusion came from.

They're also at least imprecise (or possibly again confused) when they write that "no regular expression [...] can recognize all strings in L" - they should have written "exactly the strings in L", because you can of course write a regex (or FSM) that recognises every string.

However, they definitely shouldn't have written (ab)* because this doesn't recognise L either (it does keep the number of as and bs balanced, but they're in the wrong order) - and with good reason, that language is simply not regular: as they do correctly write, it would require arbitrarily large memory (formally, that's the Myhill-Nerode theorem).

(Also, as a side-note, it's "an automaton", not "an automata", and the fact that they have no spacing around the | in their set notation is also a minor pet peeve of mine. But now I'm splitting hairs.)
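
If anyone wants to convince themselves of both points (a*b* over-matches, (ab)* matches the wrong order, and L itself needs a counter), here's a quick check using Python's re module. This is standard textbook material, nothing from the paper; `in_L` is just my own membership test.

    import re

    def in_L(s: str) -> bool:
        """Membership in L = { a^n b^n | n >= 0 }: needs an unbounded counter,
        which is exactly what a finite automaton cannot provide."""
        n = len(s) // 2
        return s == "a" * n + "b" * n

    print(bool(re.fullmatch(r"a*b*", "aaab")), in_L("aaab"))    # True False  -> a*b* over-matches
    print(bool(re.fullmatch(r"(ab)*", "aabb")), in_L("aabb"))   # False True  -> (ab)* rejects strings in L
    print(bool(re.fullmatch(r"a*b*", "aabb")), in_L("aabb"))    # True True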


Out of curiosity, what do people think of the comments to the reviewers by the authors?

I was pretty surprised to see them challenge the reviewers. Maybe open review is different, but I was trained to try to find ways to defer to the reviewers or, basically, placate them if possible. It looks like the authors have tried to argue back, for example by finding a contradiction between the reviews… it seems like a risky strategy to me. Then again, I haven’t ever received feedback this negative, thank goodness.


After skimming the paper I kind of agree with the reviewers on this one, there's too much time spent going over undergrad level TCS material, and not much in the way of related papers cited (for instance, not one mention of circuit complexity). It's a good introduction to the material for people a bit rusty in the field, but if I as a layman can nod along with my meager TCS knowledge then those parts are probably a bit _too_ trivial to include (at least for a journal submission. An addendum or blog post would be fine).

It may be quite possible that they did stumble upon a model that beats GPT-4 on their tasks, in which case they should release the code/model and let those results stand for themselves. But as a _paper_ it's not very good, since the main novel contribution (the "Find/Replace" modification to the transformer architecture) has less than a page dedicated to describing it.


I think there are two situations in which you might see this kind of rebuttal:

1) The reviewers were truly unreasonable, and the authors feel they have no hope of winning them over. Instead, they try to highlight the flaws in the reviews to play to the Area Chair, and hope the reviewers are overruled. It's a longshot, but worth a try if you don't feel there's another path.

2) The authors aren't really familiar with the review process, and are just upset with the negative feedback. They're just looking to argue for the sake of arguing, and not as part of a strategy to get accepted. I think that's what we're seeing here.


I've been a meta-reviewer, reviewer, and author at ICLR many times (not involved in reviewing this paper).

The authors did not respond to reviewers properly at all! Dear readers, please do not do this if you submit.

Negative feedback from reviewers is good. It's the point of the review system. If the reviewers aren't going to give negative feedback, we need to remove them. This is not extreme negative feedback by any means. And had the authors actually been correct, they had a shot at convincing a reviewer, or at least an attentive meta-reviewer.

That being said, the instructions for how to reply are clear. Authors must respond to each reviewer individually. Point by point. And then engage in a discussion with the reviewers about their responses. They did none of this. It's the mark of a nutjob who wants to avoid being challenged.

Click any of the papers on the first page here to see how it's supposed to work: https://openreview.net/group?id=ICLR.cc/2022/Conference

Complaining about the quality of your reviews to your meta-reviewer is... poor form. And it does not make it more likely your paper will be accepted. We're looking for reasons to accept based on the quality of your work. The perceived shortcomings of reviewers aren't an indication that your work is good. Many meta-reviewers will tell authors to knock it off next time, because this isn't helping their case.


Still going to review the paper, but what about NTM-like transformers (with external memory), if Turing completeness is to be attained?


Looks like things are about to get interesting, between this work and iterative, self directed AI REPL type architecture.


Why not as many as possible, like how GPUs have shader processors?


Paper written by one transformer?!



I am not sure what's wrong with the HN search, but I searched the URL before posting it and found nothing. As of writing this comment, it still shows only this URL. Maybe it's slow indexing, or am I doing something wrong?

https://hn.algolia.com/?q=https%3A%2F%2Fopenreview.net%2Ffor...


One trick - after submitting, check page 1 of the domain submissions link.

I enjoyed this discussion though, so glad things went the way they did. Cheers.


Thanks! I will do it!



