The scientific impact of the transformer paper is large, but in my opinion the novelty is vastly overstated. The primary novelty is adapting the (already existing) dot-product attention mechanism to be multi-headed. And frankly, the single-head -> multi-head evolution wasn't particularly novel -- it's the same trick the computer vision community applied to convolutions 5 years earlier, yielding the widely-adopted grouped convolution. The lasting contribution of the Transformer paper is really just ordering the existing architectural primitives (attention layers, feedforward layers, normalization, residuals) in a nice, reusable block. In my opinion, the most impactful contributions in the lineage of modern attention-based LLMs are the introduction of content-based attention (Bahdanau et al., 2015) and the first attention-based sequence-to-sequence model (Graves, 2013). Both of these are from academic labs.
As a side note, a similar phenomenon occurred with the Adam optimizer, where the ratio of public/scientific attribution to novelty is disproportionately large (the Adam optimizer is a very minor modification of the RMSProp + momentum optimization algorithm presented in the same Graves, 2013 paper mentioned above).
I think the most novel part of it, and where a lot of the power comes from, is the key-based attention, which then operationally gives rise to the emergence of induction heads (whereby a pair of adjacent layers coordinates to provide a powerful context lookup-and-copy mechanism).
The reusable/stackable block is of course a key part of the design, since the insight was that language is as much hierarchical as sequential, and can therefore be processed in parallel (not in sequence) by a hierarchical stack of layers that each use the key-based lookup mechanism to access other tokens, whether based on position or not.
In any case, if you look at the seq2seq architectures that preceded it, it's hard to claim that the Transformer is really based-on/evolved-from any of them (especially the prevailing recurrent approaches), notwithstanding that it obviously leveraged the concept of attention.
I find the developmental history of the Transformer interesting, and wish more had been documented about it. It seems from interviews with Uszkoreit that the idea of parallel language processing based on a hierarchical design using self-attention was his, but that he was personally unable to realize this idea in a way that beat other contemporary approaches. Noam Shazeer was the one who then took the idea and realized it in the form that would eventually become the Transformer, but it seems there was some degree of throwing the kitchen sink at it and then a later ablation process to minimize the design. What would be interesting to know is an honest assessment of how much of the final design was inspiration and how much was experimentation. It's hard to imagine that Shazeer anticipated the emergence of induction heads when this model was trained at sufficient scale, so the architecture does seem to be at least partly an accidental discovery, and more than the next-generation seq2seq model it seems to have been conceived as.
Key-based attention is not attributable to the Transformer paper. The first paper I can find where keys, queries, and values are distinct matrices is https://arxiv.org/abs/1703.03906, described at the end of section 2. The authors of the Transformer paper are very clear in how they describe their contribution to the attention formulation, writing "Dot-product attention is identical to our algorithm, except for the scaling factor". I think it's fair to state that multi-head is the paper's only substantial contribution to the design of attention mechanisms.
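For reference, here's a minimal numpy sketch of the formulation being discussed: single-head dot-product attention with distinct query/key/value projections, where the only piece added by the Transformer paper is the 1/sqrt(d_k) scaling. Variable names are mine, not from either paper.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(X, W_q, W_k, W_v):
        """Single-head attention over a sequence X of shape (seq_len, d_model).
        W_q, W_k, W_v are the distinct query/key/value projection matrices."""
        Q = X @ W_q                         # (seq_len, d_k)
        K = X @ W_k                         # (seq_len, d_k)
        V = X @ W_v                         # (seq_len, d_v)
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
        weights = softmax(scores, axis=-1)  # rows sum to 1
        return weights @ V                  # weighted sum of values

    # toy usage
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))            # 5 tokens, d_model = 16
    W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
    print(scaled_dot_product_attention(X, W_q, W_k, W_v).shape)  # (5, 8)

Multi-head attention is just this computation run h times with separate, smaller projections, with the outputs concatenated and projected back to d_model.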
I think you're overestimating the degree to which this type of research is motivated by big-picture, top-down thinking. In reality, it's a bunch of empirically-driven, in-the-weeds experiments that guide a very local search in an intractably large search space. I can just about guarantee the process went something like this:
- The authors begin with an architecture similar to the current SOTA, which was a mix of recurrent layers and attention
- The authors realize that they can replace some of the recurrent layers with attention layers, and performance is equal or better. It's also way faster, so they try to replace as many recurrent layers as possible.
- They realize that if they remove all the recurrent layers, the model sucks. They're smart people and they quickly realize this is because the attention-only model is invariant to sequence order. They add positional encodings to compensate for this (see the sketch after this list).
- They keep iterating on the architecture design, incorporating best practices from the computer vision community such as normalization and residual connections, resulting in the now-famous Transformer block.
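A minimal sketch of the sinusoidal positional-encoding fix referenced above; the sin/cos formula is the one from the paper, the code itself is just my illustration:

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
           PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
        assert d_model % 2 == 0
        positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # Added to the token embeddings so that otherwise order-invariant
    # attention layers can distinguish positions:
    # X = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)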
At no point is any stroke of genius required to get from the prior SOTA to the Transformer. It's the type of discovery that follows so naturally from an empirically-driven approach to research that it feels all but inevitable.
I've seen and ignored a lot of "pytorch good, tensorflow bad" takes in my time, but this is so egregiously wrong I can't help but chime in. Facilitating graph-level optimizations has been one of the most central tenets of tensorflow's design philosophy since its inception. The XLA compiler was designed in close collaboration with the tensorflow team and was available in the tensorflow API as far back as 2017. It's not an exaggeration to say that pytorch is 5+ years behind on this front. Before anyone invokes the words "pythonic" or "ergonomic", I'd like to note that the tensorflow 2 API for compilation is nearly identical to torch.compile.
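To make the last point concrete, here's a toy side-by-side (the function is mine, and if I remember right the TF flag was spelled experimental_compile in the earliest 2.x releases before becoming jit_compile):

    import tensorflow as tf
    import torch

    # TensorFlow 2: trace the function into a graph and compile it with XLA
    @tf.function(jit_compile=True)
    def tf_step(x):
        return tf.nn.relu(x @ x) * 2.0

    # PyTorch 2: capture and compile the function (default inductor backend)
    @torch.compile
    def torch_step(x):
        return torch.relu(x @ x) * 2.0

    tf_step(tf.random.normal((128, 128)))
    torch_step(torch.randn(128, 128))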
11. notice that there's a unicode rendering error ("'" for apostrophe) on kernel_initializer and bias_initializer default arguments in the documentation, and wonder why on earth for such a high-level API one would want to expose lora_rank as a first class construct. Also, 3 out of the 5 links in the "Used in the guide" links point to TF1 to TF2 migration articles - TF2 was released 5 years ago.
To add onto this, I feel like one of the hard things about TF is that there are at least 3 ways to do everything, because they have supported multiple APIs and migrated to eager. So if you find an example or an open source project, it might not be for the flavor of tensorflow that your codebase is in.
I feel like that with every single Google API doc. If there's a variable called x, the documentation will be "variable to store x". And you need to create/supply 5 different resources before you can create an x, but these will each require 5 further things to be figured out before you can create one of them.
Re 6: the TF/Keras team motivates random people to write long tutorials by featuring them on the official site and including their tutorials in the official guides. I have seen a lot of subpar devs/AI people write subpar tutorials and brag on twitter about how their tutorials are included in the official Keras site.
Honestly, this example holds true for roughly half of the Python ecosystem; and you can square the level of frustration if it's also anything coming from Google.
(This pattern is relatively easy to understand: smart people creating something get their gratification from the creation process, not writing tedious documentation; and this is systemically embedded for people at Google, who are probably directly incentivised in a similar way.)
Tensorflow works really well in theory. In practice a lot less so. I saw someone spend months fighting Tensorflow to convert a production model from CPU to GPU inference with any sort of efficiency. Tons of issues due to bugs across versions, deprecations of features across versions, the graph optimizer shuffling data back to the CPU for no decent reason, etc. The person had no idea what was happening or why most of the time due to how black box Tensorflow was. This was a very senior ML engineer with a lot of Tensorflow experience.
Does tensorflow have a future? I doubt it. I don't think Google is really investing many resources into it (beyond the necessary maintenance to support whatever production models still depend on it). The cost of migrating from old TF to new TF was really large; half the projects that depend on TF that I try to use just break out of the box (only 1/4 of torch projects I try fail that way).
From what I can tell Google is moving in a direction that doesn't require tensorflow, and I don't see it gaining significant adoption outside Google, so it seems most likely we will simply see it deprecated in about 10 years. It's best to see it as a transitional technology that Jeff Dean created to spur ML development internally, which was mistakenly open sourced, and now Jeff's reports typically use Jax or other systems.
I think tensorflow-datasets and tensorflow-serving are great, but for model development I think most people use JAX and then export it to a tensorflow SavedModel with Orbax.
But IIUC Jax also leverages XLA, and for the purpose of this discussion the frontend matters only inasmuch as people feel productive using it, whether that's TF or Jax.
> Facilitating graph-level optimizations has been one of the most central tenets of tensorflow's design philosophy since its inception.
Agreed of course but it's not like they came up with this approach from scratch. They seem to have just picked it up from Theano (now Aesara/PyTensor).
Tensorflow is a lot like IBM -- it deserves praise not because it's great in its current state, but for its contributions towards advancing the broader technological front to where it is today. Tensorflow walked so JAX could run, so to speak. Frankly, I don't really draw much of a distinction between the two frameworks since I really just use them as lightweight XLA wrappers.
Tensorflow started out as anything but lightweight. In my opinion it takes the cake for kludgiest framework I've ever worked with. So verbose, so little effort put into ergonomics. Even eager mode is not really valuable unless you're working on a legacy project.
+1. As someone who has tried to migrate multiple tf.functions to torch.compile, tensorflow's edge here is not small. torch.compile is still highly experimental. Don't believe me? Just go look at the github issues where torch maintainers try to figure out why torch.compile makes code very suboptimal in a lot of cases, or results in incomprehensible errors.
> humanity discovered an algorithm that could really, truly learn any distribution of data (or really, the underlying “rules” that produce any distribution of data)
He's hand-waving around the idea presented in the Universal Approximation Theorem, but he's mangled it to the point of falsehood by conflating representation and learning. Just because we can parameterize an arbitrarily flexible class of distributions doesn't mean we have an algorithm to learn the optimal set of parameters. He digs an even deeper hole by claiming that this algorithm actually learns 'the underlying “rules” that produce any distribution of data', which is essentially a totally unfounded assertion that the functions learned by neural nets will generalize in some particular manner.
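For concreteness, one standard form of the theorem (stated loosely, for a continuous f on a compact K ⊂ R^n and a non-polynomial activation σ) only guarantees that an approximator exists; it says nothing about how to find it from finite samples:

    \forall \varepsilon > 0 \;\; \exists\, N,\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n
    \quad \text{such that} \quad
    \sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon .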
> I find that no matter how much time I spend thinking about this, I can never really internalize how consequential it is.
If you think the Universal Approximation Theorem is this profound, you haven't understood it. It's about as profound as the notion that you can approximate a polynomial by splicing together enough linear pieces.
"Just because we can parameterize an arbitrarily flexible class of distributions doesn't mean we have an algorithm to learn the optimal set of parameters."
This is equally mangled, if not more, than what Altman is saying. We don't need to learn "the optimal" set of parameters. We need to learn "a good" set of parameters that approximates the original distribution "well enough." Gradient methods and large networks with lots of parameters seem to be capable of doing that without overfitting to the data set. That's a much stronger statement than the universal approximation theorem.
Yes, he's handwaving in this general area, but no, he's not really relying on the UAT. If you talked to most NN people 2 decades ago and asked about this, they might well answer in terms of the UAT. But nowadays, most people, including here Altman, would answer in terms of practical experience of success in learning a surprisingly diverse array of distributions using a single architecture.
I think that while researchers would agree that the empirical success of deep learning has been remarkable, they would still agree that the language used here -- "an algorithm that could really, truly learn any distribution of data (or really, the underlying “rules” that produce any distribution of data)" -- is an overly strong characterization, to the point that it is no longer accurate. A hash function is a good example of a generating process which NN + SGD will not learn with any degree of generalization. If you trained GPT4 on an infinite dataset of strings and their corresponding hashes, it would simply saturate its 100 billion+ parameters with something akin to a compressed lookup table of input/output pairs, despite the true generating process being a program that could be expressed in less than a kilobyte. On unseen data, it would be no better than a uniform prior over hashes. Anyways, my point is that people knowledgeable in the field would have far more tempered takes on the practical limits of deep learning, and would reserve the absolute framing used here for claims that have been proven formally.
Certainly, it's an exaggeration / simplification. I don't feel it is really a dishonest one, in context [*]. It feels weird for me to "defend" him here, because in general Altman is a stupendous, egregious, world-leading liar.
[*] Ok, we can differ on that. My feeling is partly because the types of distributions that can't be learned - eg hash functions - are generally the kind of functions we don't really want to learn. Underneath this are deeper questions related to no free lunch and how "nice"/"well-behaved" this universe is.
You’re (both!) getting into metaphysics without necessarily realizing it. He’s just saying that a machine that could learn any pattern -- not a sub-pattern of accidental actualities that it overfits on, but the real virtual pattern driving some set of phenomena -- would be a game changer. Sure, there are infinitely many things that can’t be reduced to polynomials, but something tells me that a whole lot of things that matter to us can be, across the fields of Physics, Biology, Sociology and Neurology especially.
Basically it’ll be (and already has been since the quantum breakthroughs in the 1920s, to some extent) a revolution in scientific methods not unlike what Newton and Galileo brought us with physical mechanics: this time, sadly, the mechanics are beyond our direct comprehension, reachable only with statistical tricks and smart guessing machines.
errata. Also real humans often make mistakes in live interviews. The biggest difference is that eventually these fake humans will have lower error rates than real ones.
Is batched inference for LLMs memory bound? My understanding is that sufficiently large batched matmuls will be compute bound and flash attention has mostly removed the memory bottleneck in the attention computation. If so, the value proposition here -- as well as with other memorymaxxing startups like Groq -- is primarily on the latency side of things. Though my personal impression is that latency isn't really a huge issue right now, especially for text. Even OpenAI's voice models are (purportedly) able to be served with a latency which is a low multiple of network latency, and I expect there is room for improvement here as this is essentially the first generation of real-time voice LLMs.
Batched inference will increase your overall throughput, but each user will still be seeing the original throughput number. It's not necessarily a memory vs compute issue in the same way training is. As far as I understand, it's more a function of the auto-regressive nature of transformer inference, which presents unique challenges.
If you have an H100 doing 100 tokens/sec and you batch 1000 requests, you might be able to get to 100K tok/sec but each user's request will still be outputting 100 tokens/sec which will make the speed of the response stream the same. So if your output stream speed is slow, batching might not improve user experience, even if you can get a higher chip utilization / "overall" throughput.
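To make that concrete, a toy back-of-envelope with made-up numbers (with the caveat that in practice each user's speed does degrade somewhat as the batch grows, since every decode step gets a bit slower):

    # Illustrative numbers only, not measurements.
    per_user_tok_per_s = 100      # what a single request streams at, set by time per decode step
    batch_size = 1000             # concurrent requests sharing the same weights on each step

    aggregate_tok_per_s = per_user_tok_per_s * batch_size   # ~100,000 tok/s of total throughput
    seconds_for_500_token_reply = 500 / per_user_tok_per_s  # still ~5 s for each individual user

    # Batching is nearly free up to a point because a batch-1 decode step is dominated by
    # streaming the model weights from memory; adding more requests to the same step raises
    # arithmetic intensity without adding much time per step.
    print(aggregate_tok_per_s, seconds_for_500_token_reply)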
Furthermore, each of those 16 channels would typically be multibyte floats as opposed to single-byte RGB channels. (speaking generally, haven't read the paper)
So glad the AK account exists. As a researcher, I've always wanted some guy with an econ degree and a year of ML eng to recommend me papers after glancing at them for maybe 30 seconds.
I am genuinely baffled that researchers in the field think there is any value in the service AK provides. I'd wager I could create an equally effective bot with the following process:
1) Create a historical dataset of publications and their citation counts
2) For each publication extract the following features:
- H-index of first author
- Maximum H-index of all authors
- Number of author affiliations in {top-10 school, deepmind, meta, openai, nvidia}
- Number of times the phrase "state-of-the-art" appears
- Which latex template is used (NeurIPS, ICML, etc.)
- Number of images in the paper
- Whether there is an image on the first page
- Whether "all you need" appears in the title
- Whether the publication has a linked project page
3) Train a shallow decision tree with citation counts as the regression target
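For what it's worth, step 3 is a few lines of sklearn once step 2's features are in a table (the file and column names below are all hypothetical):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical feature table produced by step 2, one row per publication.
    df = pd.read_csv("papers_with_features.csv")
    feature_cols = [
        "first_author_h_index", "max_author_h_index", "num_big_lab_affiliations",
        "num_sota_mentions", "latex_template_id", "num_images",
        "image_on_first_page", "all_you_need_in_title", "has_project_page",
    ]
    X, y = df[feature_cols], df["citation_count"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = DecisionTreeRegressor(max_depth=4, random_state=0)  # shallow tree
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on held-out papers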
A friend of mine created a bot to do basically this, except it also looks at the current page rank associated with researchers recommending that paper. I've seen a lot of good looking papers (decent school/group/conference submission/etc.) that don't end up contributing to the field. Top researchers and Professors tend to have a better intuition of importance by reading the abstract and a quick skim.
There have been many, many services that have tried to automate paper selection based on these heuristics. None of them have had the staying power of AK's account. As someone with a PhD in machine learning from Stanford, I can attest AK's taste is quite good.
You should be ashamed of yourself and apologize. What's more likely: that there are paid shills lying that he provides non-negative value, or that you're missing something when you assert that every ML PhD on earth must think his extremely popular work has negative value?
Apparently my innuendo has not been taken well, so I'll clarify in more straightforward terms: @abidlabs is the employer of the individual running the aforementioned twitter account.
If this is true then not disclosing that is extremely unethical of @abidlabs, damn near intellectual malfeasance, and reflects very poorly on the rest of their work.
If someone has done more of the quantitative side of econ, they are well positioned to pick up ML real fast. And the average AI/ML paper simply isn't very difficult to understand. I was a comp sci undergrad and some of the econometrics focused folks were much closer to ML work than anyone doing comp sci (this was quite some time ago though).
"Do I recognize the latex template" is my number one filter when clicking through the new arxiv papers each day, so I definitely buy that that would work.
> As a researcher, I've always wanted some guy with an econ degree and a year of ML eng to recommend me papers after glancing at them for maybe 30 seconds.
This kind of elitism is baffling given the quality of work from independent researchers recently.
And no, I have no “vested interest” (whatever the hell that's supposed to mean) in someone's Twitter.
When I open a large pdf on arxiv (100+ MB, not uncommon for ML papers focused on hi-res image generation), there is a significant load time (10+ seconds) before anything is rendered at all other than a loading bar. Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.
> Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.
The default PDF format puts the xref table at the end of the file, forcing a full download before rendering can take place. PDF-1.2 onwards supports linearized PDFs, and most PDF export tools have some way of enabling it (usually an option like "optimize for web").
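For anyone producing those PDFs, a minimal sketch of linearizing an existing file by shelling out to qpdf (assumes qpdf is installed; I believe pikepdf offers the same via save(..., linearize=True) if you'd rather stay in pure Python):

    import subprocess

    def linearize_pdf(src: str, dst: str) -> None:
        """Rewrite a PDF as a linearized ("fast web view") file so viewers can
        start rendering page 1 before the whole file has downloaded."""
        subprocess.run(["qpdf", "--linearize", src, dst], check=True)

    linearize_pdf("paper.pdf", "paper_linearized.pdf")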
I have the same issue. From what I can tell it’s just network-bound and the Arxiv servers are slow. They theoretically allow you to set up a caching server, but after spending a while trying to get it set up, I haven’t been able to get it to work.
It may even be that the time is spent generating the PDF.
The format in which articles are submitted and stored in arXiv is LaTeX; the PDF is automatically generated from it.
arXiv probably does some caching of PDFs so they don't have to be generated anew every time they are requested, but I don't know how this caching works.
> a new kind of AI R&D lab which creates practical end-user products based on foundational research breakthroughs
This isn't new and if anything it's the de facto standard for just about every AI research lab these days. OpenAI is the obvious example of an AI lab with tightly coupled product and research roadmaps and ChatGPT is the most prominent example of a successful research-driven AI product. A few years ago it could be argued that DeepMind and (fka) FAIR were siloed off from their respective orgs, but these days they are littered with product teams and their research roadmaps reflect this influence as well.
They do try to claim that what they are doing is different from OpenAI because they are focused on applications of AI whereas OpenAI is focused on building AGI, which is a laughable mischaracterization of OpenAI's current roadmap. I personally have a hard time believing that the path to AGI runs through the GPT store.
Accomplished researchers in AI can fundraise on their reputations alone, and Jeremy is no exception. The primary differentiator of any new startup in this space is the caliber of its researchers and engineers. But this post is really grasping at straws to claim that their value is from some new approach to R&D, which is a totally unnecessary framing.