I wonder if we're going to end up in an arms race between AIs masquerading as contributors (and security researchers) trying to introduce vulnerabilities into popular libraries, and AIs trying to detect and fix them.
Why would it be like that instead of the way we already handle low-trust environments?
Projects that get a lot of attention already put up barriers to new contributions, and the ones that get less attention will continue to get less attention.
The review process cannot be left to AI because it will introduce uncertainty nobody wants to be held responsible for.
If anything, the people who have always seen code as a mere means to an end will finally come to a forced decision: either stop fucking around or get out of the way.
An adversarial web is ultimately good for software quality, but less open than it used to be. I'm not even sure if that's a bad thing.
What I'm suggesting is: what if AIs get so good at crafting vulnerable (but apparently innocent) code that human review cannot reliably catch them?
And saying "ones that get less attention will continue to get less attention" is like imagining that only popular email addresses get spammed. Once malice is automated, everyone gets attention.
I think the issue I have with this argument is that it's not a conclusion that follows from any technological choice.
It's an argument about affordability and the economics behind it, which puts more burden on the (open source) supply chain which is already stressed to its limit. Maintainers simply don't have the money to keep up with foreign state actors. Heck, they don't even have money for food at this point, and have to work another job to be able to do open source in their free time.
I know there are exceptions, but they are veeeery marginal. The norm is: open source is unpaid, tedious, and hard work. And it will only get harder; just look at the sheer volume of slop-code pull requests that already plague a lot of projects.
The trend is likely going to be more blocked pull requests by default rather than having to read and evaluate each of them.
So I do think we're in a bubble, but I also remember when all the discussion around here was around Uber, and I read many, many hot takes about how they were vastly unprofitable, had no real business model, could never be profitable, and only existed because investors were pumping in money and as soon as they stopped, Uber would be dead. Well, it's now ten years later, Uber still exists, and last year they made $43.9bn in revenue and net income of $9.8bn.
Oh dear, we are definitely in a bubble; it's just not the kind that ends in a total burst.
Back when everybody got into website building, Microsoft released a piece of software called FrontPage, a WYSIWYG HTML editor that could help you build a website, and some of its backend features too. With it you could create a complete website with a home page, news pages, and guestbooks, with ease, compared to writing "raw" code.
Nowadays, however, almost all of us are still writing HTML and backend code manually. Why? I believe it's because the tool was too slow to fit a quick-moving modern world. It took Microsoft weeks of work just to come out with something that poorly mimicked what an actual web dev invented in an afternoon.
Humans are adaptive, tools are not. Sometimes a tool can beat humans on productivity, sometimes it can't.
AI is still finding its use cases. Maybe it's good at acting like a cheap, stupid, spying secretary for everyone, and maybe it can write some code for you, but if you ask it to "write me a YouTube", it just can't help you.
Problem is, a real boss/user would demand "write me a YouTube" or "build a Fortnite" or "help me make some money". The fact that you have to write a detailed prompt and then debug its output is the exact reason why it's not productive. The reality that it can only help you write code, instead of building an actually usable product from a simple sentence such as "the company has decided to move to online retail, you need to build a system to enable that", is proof of LLMs' shortcomings.
So, AI has limits, and people are finding out. After that, the bubble will shrink to fit its actual value.
This is fair, but it's also assuming that today's AI has reached its potential, which frankly I don't think any of us knows. There's a lot of money being invested in compute and research by a lot of different players, and we could definitely see some breakthroughs. I doubt many of us would've predicted even the progress we've had in the last few years before ChatGPT came out.
I think the bubble will be defined by whether these investments pan out in the next two years or whether we just get small incremental progress like GPT-4 to GPT-5, not by what products are made with today's LLMs. It remains to be seen.
I think Uber’s profitability has also been achieved by passing what would be debt to a traditional taxi company (the maintenance of the fleet of taxis) onto the drivers. I think many drivers aren’t making as much money as they think they are.
Did this change since Uber was created? Did Uber previously, back when people were making their "Uber is Doomed" comments, pay to maintain drivers' cars? If not, why bring it up?
This is a pattern where people have their pre-loaded criticisms of companies/systems and just dump them into any tangentially related discussion rather than engaging with the specific question at hand. It makes it impossible to have focused analytical discussions. Cached selves, but for everything.
But did their business model require them to do that forever? That seems like something they can cut back on once there is a healthy supply of drivers in a market.
Yeah I agree it was the original plan from the beginning: use Saudi money to strangle competition and then get the prices back to taxi level (or higher). I believe they partly succeeded by making a compromise here: they both cut the payments to drivers and increased prices.
The original plan worked because in the bait-and-switch phase they were visibly cheaper, so over the years people's mental and speech model changed from "call me a taxi" to "call me an Uber". But at least in my local market, the price difference between a taxi and an Uber in 2025 is negligible.
A decade ago in NYC, they were giving out free rides left and right. I used Uber for months without paying for a single ride, then when they started charging, they were steeply discounted. I could get around for a little more than a subway fare.
Lyft did the same thing, got a bunch of free rides for a while with them, too.
What I think has never changed is that most people do not understand depreciation on an asset like a car, or how use of that vehicle contributes to that depreciation. People see the cost of maintenance of a vehicle as something inevitable that they have almost no control over.
I think the point is about Uber's profitability and not necessarily about their business practices or ethics, and we should be careful not to conflate the two. It is absolutely valid to criticize the latter, but that (so far) seems mostly orthogonal to the former.
Now, it is totally possible that their behavior eventually creates a backlash which then affects their business, but that is still a different discussion from what was discussed before.
There is also a significant difference in insurance. Taxi companies usually have comprehensive insurance, hence the higher standards for drivers and vehicles (monitored and maintained) while Uber has a more differentiated model (part driver, part company, not monitored):
This is underselling the Uber story to a degree. The original sell for Uber was that their total addressable market was the entire auto industry because people will start preferring taxis over driving. They are still trying to achieve that with similar stories now pushed to sell robotaxis.
Uber was undercutting traditional taxis either through driver incentives or through cheaper pricing. Many hot takes were about the sustainability of this business model without VC money. In many places this turned out to be true: driver incentives are way down and Uber pricing is way up.
That said, this is also conflating one company with an industry. Uber might have survived, but how many ride-sharing companies have survived in total? How many markets has Uber left because it couldn't sustain them?
In a bubble, the destruction is often that some big companies get destroyed and others survive. For every pets.com there is one Amazon. That doesn't mean Amazon is a good example to say the naysayers during the dot-com bubble were wrong.
Simplifying Uber's story to "pricing or more drivers" misses the most important part.
Uber was undercutting traditional taxis because, at least in the US, traditional taxis were a horrible user experience. No phone app, no way to give feedback on drivers, horrible cars, unpredictable charges... This was because taxis had a monopoly in most cities, so they really did not care about customers.
The times when Uber was super-cheap have long passed, but I still never plan to ride regular taxis. It's Waymo (when available) or Lyft for me.
Well just look at the price of Uber and Lyft rides. I regularly had single-digit fares on both Uber and Lyft early on. Of course they were unprofitable then. Now that they have gained mindshare they have increased prices drastically.
Uber proposed $43.00 yesterday for a 23-minute drive from Park Slope to Brooklyn Heights in New York City, versus $2.90 for a 35-minute R train ride.
I am humbled by how myopic I was in 2010, cheering for a taxi-hailing smartphone app to create consumer surplus over having to order taxis by calling taxi companies.
It's been my experience (~4 years ago) that taxis were generally cheaper than Uber in New York, especially for anything like "get me to the airport", sometimes like $25 cheaper.
In my experience it's actually cheaper, at least for airport rides: $50 flat through the yellowcab app, with no surge and no tip when ordered through the app, compared to $65 at best and sometimes well over double that during a bad surge.
Airport trips these days are often over $100 for me. What is crazy is that yellowcab will take me to my area for $50 flat, tip included, through their app. We've exceeded even taxicab prices by this point.
The story about Uber was that they were going to be unprofitable until they destroyed taxi services, then they were going to charge more than taxis and give less of a share to the driver.
Nobody is predicting that AI is going to do that. One thing I hadn't considered before is how much it was in google's interest to overestimate and market the impact of AI during their antitrust proceedings. For the conspiratorially minded (me), that's why the bottom is being allowed to drop out of the irrational exuberance over AI now, rather than a couple months ago.
It doesn't look likely that any particular AI service will have a moat. Every time one of them does anything right now, there's a dozen competitors able to match it within months.
Uber was unprofitable, and when it ceased to be unprofitable it ceased to be better.
They did manage to offload costs onto weaker actors, partly by simply ignoring laws and hoping it would work for them. It did, but it was not exactly some grand inspiring victory, more a success of "some don't have to follow the law" corruption.
At the prices they were charging back then that was indeed the accurate take. Of course prices rose and a lot of middle and lower income riders were kicked to the curb in favor of those who can afford to blow another $60 per leg on a night out. I guess there turned out to be enough of them at scale.
The "hot takes" were that they were using investor money to illegally undercut the taxi industry until ride share had an oligopoly and that the government would stop them from breaking the law.
I don't know why law enforcement is considered a hot take here, but I have a few guesses.
> Nakamura responded to Kramnik’s allegations by arguing that focusing on a particular streak while ignoring other games was cherry-picking. The researchers note that there’s a problem with this argument, too, as it violates the likelihood principle. This principle tells us the interpretation should only rely on the actual data observed, not the context in which it was collected.
I don't quite understand this objection? If I won the lottery at odds of 10 million to 1, you'd say that was a very lucky purchase. But if it turned out I bought 10 million tickets, then that context would surely be important for interpreting what happened, even if the odds of that specific ticket winning would be unchanged?
I believe they're speaking within the scope of the Bayesian analysis. We could interpret games outside of the winning streak as evidence as to whether he's a cheater or not. Instead, I believe they are looking at the question of "given this winning streak in particular, what's the probability of him cheating in this set of games"?
They start with a prior (very low probability), I'm assuming they use the implied probabilities from the Elo differences, and then update that prior based on the wins. That's enough to find the posterior they're interested in, without needing to look outside the winning streak.
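To make that concrete, here's roughly the shape of that update as a small Python sketch. The prior, the cheater's per-game win probability, and the Elo differences are all placeholder numbers I'm inventing for illustration, not figures from the actual analysis:

    # A minimal sketch of the Bayesian update described above. The prior,
    # the cheater's per-game win probability, and the Elo differences are
    # all placeholder values, not numbers from the actual study.

    def elo_win_prob(rating_diff):
        # Standard Elo expected score, with rating_diff = player minus opponent.
        return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

    prior_cheat = 1e-4        # assumed very low prior probability of cheating
    p_win_if_cheating = 0.95  # assumed per-game win probability while cheating

    # Hypothetical Elo differences (player minus opponent) for the streak games.
    rating_diffs = [-50, -120, -200, -80, -150]

    p_streak_honest = 1.0
    p_streak_cheat = 1.0
    for diff in rating_diffs:
        p_streak_honest *= elo_win_prob(diff)
        p_streak_cheat *= p_win_if_cheating

    # Bayes' rule: P(cheater | streak)
    posterior = (prior_cheat * p_streak_cheat) / (
        prior_cheat * p_streak_cheat + (1 - prior_cheat) * p_streak_honest
    )
    print(f"P(cheater | streak) = {posterior:.4f}")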
> "given this winning streak in particular, what's the probability of him cheating in this set of games"
I think the problem lies in the antecedent. Given all chess tournaments played, how often would we observe such a winning streak on average? If the number of winning streaks is near the average, we have no indication of cheating. If it is considerably lower or higher, some people were cheating (when lower, then the opponents were).
Then the question is whether the number of winning streaks of one person is unusually high. If we would, for example, expect approx. 10 winning streaks but observe 100, we can conclude that approx. 90 involved cheating. The problem with this is that the more people cheat, the more likely we are to suspect an honest person of cheating as well.
Again, this would be different if the number of winning streaks for a particular person were unusually high.
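To illustrate the base-rate point, here's a small Monte Carlo sketch. The win probability, player count, games per player, and streak length are all placeholders I'm making up; only the shape of the reasoning matters:

    # Monte Carlo sketch of the base-rate question: among many honest players,
    # how often does a winning streak of a given length appear by chance?
    # All the numbers here are made-up placeholders to illustrate the idea.
    import random

    P_WIN = 0.5          # assumed per-game win probability for an honest player
    N_PLAYERS = 10_000   # hypothetical number of players we scan for streaks
    GAMES_EACH = 200     # hypothetical number of games per player
    STREAK_LEN = 10      # the streak length we consider suspicious

    def has_streak(results, k):
        # True if the list of win/loss results contains a run of k wins.
        run = 0
        for won in results:
            run = run + 1 if won else 0
            if run >= k:
                return True
        return False

    random.seed(0)
    baseline = sum(
        has_streak([random.random() < P_WIN for _ in range(GAMES_EACH)], STREAK_LEN)
        for _ in range(N_PLAYERS)
    )
    print(f"honest players with a {STREAK_LEN}-game streak: {baseline} of {N_PLAYERS}")
    # If far more streaks show up in the real data than in this honest baseline,
    # that excess is what points to cheating; if not, the streak alone proves little.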
His performance in games outside the streak is relevant to the prior of his being a cheater, which in turn is highly relevant to how we calculate p(cheater | this streak).
Indeed. I'd say that the issue is that they are misinterpreting the word "collecting". The principle is true if you are collecting or observing data live, but this data was collected long ago and with a much wider scope: when the games were recorded.
What they are doing here is sampling the data after the fact, and obviously one needs to take a uniformly random sample of a dataset for any statistical analysis done on it to be representative.
Perhaps, but Hank Green published a pretty convincing argument recently that electricity supply has nowhere near the necessary elasticity, and the politicised nature of power generation in the US means that isn't going to change:
Won’t this be solved fairly soon when package managers have automatic scanning of updates by AIs that are superhumanly good at spotting malicious code?
Not sure if this is sarcastic, but this is a terrible idea. Best case scenario, it relaxes human vigilance and turns the success of malicious code attacks into a dice roll. More likely is that obfuscation techniques designed to fool LLMs will open the flood gates for malicious code.
I've noticed the same when looking at old Georgian and Victorian maps of London. You get these surprisingly sharp edges between urban and rural. You often have streets lined with quite grand buildings and nothing but fields behind them. It's quite strange when you're used to modern cities that gradually peter out into suburbs.
My guess is it's because at this point the population of cities was growing quickly, but the large-scale migration of farm laborers into them hadn't yet begun in earnest. So most of the housing being built at the edges was intended for the expanding merchant classes, who wanted something a bit more impressive, and who also had live-in servants. The Georgian terraces of London are typically three or four storeys, with the top storey being low-ceilinged rooms where the servants lived.
It probably has more to do with different administrative areas. Cities used to have different rights. A city could not simply expand onto external land. The reason was quite simple: the land belonged to someone else. Meanwhile, the city was independent, even if it was the capital of a kingdom (such as Paris, for example).
In Vienna, for example, the city ended at the belt. As a citizen, you could travel back and forth between the surrounding area and the city, but different laws applied (taxes, marriage, property).
The Viennese enjoyed traveling to the surrounding countryside for leisure (winegrowers had to pay significantly less tax for serving their own products than innkeepers in the city), but the citizens did not want to live there, or there were strict regulations on moving in.
Ironic: Western politicians thought opening up to trade with China would lead to it adopting a Western model of government. Instead it's led to the USA adopting the Chinese one.
Yeah, this is so weird coming from the US. The US government has a history of writing no-strings-attached blank cheques to people/companies just to avoid the stigma of government control in public companies.
I wonder how the markets will react: will stocks go up because people will assume Intel's going to be a government-mandated champion, or will they go down because of the negative connotations government control brings?
Kinda. But I think the current Chinese model is actually much closer to how the USA used to work when there was competition with the USSR. Closer than the US of today compared to the 70s and 80s.
The current Chinese model is basically: you have fully publicly traded companies, companies that are either minority- or majority-owned by a particular provincial government, and ones that are either minority- or majority-owned by the central government (although this is surprisingly rare outside of key areas like telco/banking).
"Air America was an American passenger and cargo airline established in 1946 and covertly owned and operated by the Central Intelligence Agency (CIA) from 1950 to 1976."
We aren't anywhere close to being in a depression though. What extraordinary situation requires the government to take a stake in a public company and under what conditions will this position be liquidated?
We are very close to being in a depression. Most of our money has nothing to do with actually feeding or housing people. If the wrong thing shifts, we're toast.
The unemployment rate is still near historic lows and while new job numbers are getting worse they're still positive overall. We aren't anywhere close to being in a depression currently.
> We aren't anywhere close to being in a depression though.
This may be true in the economic sense. But “depression” is as much a political sensation as it is a technically defined economic term. We happen to live in an era where economics for the public is profoundly politicized.
It somewhat makes sense in terms of industries which are deemed strategically important. Intel needs to start thinking long term instead of short term profits.
Intel has had a couple of years of saying they were moving to a more long-term vision and failing, and it's unclear how direct government ownership will make them better at execution.
If someone believes this, they should buy Intel and just do it outright! But no one does, because it's not as easy as "just think long term"; if it were, Berkshire has the liquid money to buy Intel several times over.
That's a very cute quip, but I notice that it places the blame on 'trade with China' for an alarming problem that is in fact entirely the doing of US voters expressing their values (or the lack of them) in fair elections.
A more interesting question is whether that voterbase's idea of what they were voting for does or doesn't line up with what they got.
Civil forfeiture has existed since the 1660s, and was used initially to confiscate smugglers' vessels. Then it was dug out during Prohibition, and turned toxic in the 1980s when the agencies doing the forfeiture (e.g. police) were allowed to keep the confiscated property. Ideally it should be used for restitution (e.g. to victims of fraud), but...
I suspect you were growing up when this was in full swing already.
We also have criminal forfeiture, which was leveraged a lot more back then. Civil forfeiture use expanded dramatically in recent decades due to profit sharing with the DOJ alongside court challenges failing, suggesting the need for a constitutional amendment if awareness of the practice improves.
The Chinese have been surprisingly willing to let companies and sectors die even at the expense of growth (see real estate). I think it's honestly too charitable to compare the US to China, which has at least some degree of technocratic governance; the US went straight for something out of the Tropico franchise.
We're living in the time of irony. Up is Down, Left is Right, Right is Left. Republicans have become Socialists. Free Speech absolutists are now against Free Speech.
The promiscuous relationship between government and tech is as old as Silicon Valley. In fact, it created Silicon Valley. It started when people in China were still building backyard furnaces.
It's not "adopting" the Chinese model yet, so much as incoherently copying bits and pieces. If you want to run effective industrial policy you need sufficient state capacity and an army of technocrats who are experts on industrial policy. Trump's second term performance gives no hope on either front.
"DeepMind Technologies Limited, trading as Google DeepMind or simply DeepMind, is a British–American artificial intelligence research laboratory which serves as a subsidiary of Alphabet Inc. Founded in the UK in 2010, it was acquired by Google in 2014 and merged with Google AI's Google Brain division to become Google DeepMind in April 2023"
Q: Is the HQ nominally being in London at all relevant given it was acquired by Alphabet/Google? I'm sure the accountants have the tax status all sorted by now...
It's not just the HQ, the only AI researcher I know personally is an American who moved to London to work on AI with DeepMind well after the acquisition.
The registered HQ and a large research center are in London, but ownership, executive control, substantial staffing, a big fraction of the training/serving compute, and the commercialization pathway run through Alphabet's U.S. operations, so the work is, in practical and legal senses, U.S.-based...
"As part of a wider group reorganisation, the Company distributed intellectual property assets which had a nil book value to another group undertaking on 31 October 2019."
Honestly, claiming DeepMind is still some scrappy London-based startup is quite unfortunate :/
The people (about 2,000) are London based and work for a UK registered company so in both practical and legal senses the work is in the UK (eg employment taxes are paid in the UK). That the product of that work may be sold to another country for a price that transfers profits elsewhere doesn't change either of these facts.
I don't agree with the statement that you're challenging, but Google DeepMind's operations in London make it (still) an important centre for AI research and are probably why the UK is ranked third on many international AI country rankings.
+1 Europe (especially) and the UK largely don't matter - it's a battle between the US and China, the gap will only grow wider and faster than it already has (and it's already getting really noticeable).
I attribute it mostly to a cultural problem and I don't think they can fix their politics from the downward spiral they're on. It's why they have a number that rounds to zero of billion dollar software companies and why all their ambitious people do their best to get to the US.
> Honestly, claiming DeepMind is still some scrappy London-based startup is quite unfortunate :/
Since I didn't do that, I'm not sure how that is relevant or productive.
> work is, in practical and legal senses, U.S.-based...
This seems factually false. The work happening there has to comply with UK laws, not US laws and the practical locus of researchers located there provides a pool of talent that makes it a better place to do an AI startup than places that lack it.
The point is that London is enough of a research hub in AI for it to be worth maintaining a significant research presence there and to even make researchers interested in relocating there.
DeepMind is obviously foreign-owned and controlled now, which does limit the UK's ability to exert control over and profit from it. That only makes weakening the institutions they do control, like ATI, more significant.
Also in general Google satellite offices often house the engineers of acquired startups who don't want to move to the mothership. It's not their primary purpose but it's one of the things they use them for.
> in general Google satellite offices often house the engineers of acquired startups who don't want to move to the mothership
Would it be unfair to ask if (in this instance the UK's) satellite country taxpayers are subsidising corporate offices when the overall structures are arranged such that any overall corporation tax payable will be paid in the lowest-possible jurisdiction?
I'm sure representatives of those countries love to say so, especially when talking to third parties about their expertise in manufacturing: "Yes, here in Zhengzhou, CN we're leaders in electronics manufacturing - the iPhone is assembled here at Foxconn!"
However, Apple (headquartered in the US) loves to issue press releases describing how their products are "Designed by Apple in California[, USA]" even though a lot of the work on manufacturing, software, and the design of subcomponents (or major components; I don't know how Apple is organized internally) is done in China, India, and Vietnam, as you listed.
I'd argue that in the same way that Shenzhen and Zhengzhou are leaders in electronics assembly because the bulk of the iPhone and other products are built there, regardless of the location of Apple's headquarters, so too can London claim to be a leader in AI because DeepMind's researchers are located in London, regardless of who owns the DeepMind brand.
Buying a thing from another country doesn't make your location a leader in that thing.
Apple's manufacturers don't do any of the work on the software or design. They don't even manufacture the highest value-add components; those are mostly done in Taiwan and Korea.
The article says that LLMs don't summarize, only shorten, because...
"A true summary, the kind a human makes, requires outside context and reference points. Shortening just reworks the information already in the text."
Then later says...
"LLMs operate in a similar way, trading what we would call intelligence for a vast memory of nearly everything humans have ever written. It’s nearly impossible to grasp how much context this gives them to play with"
So, they can't summarize, because they lack context... but they also have an almost ungraspably large amount of context?
But "shortening other summaries from its training set" is not all an LLM is capable of. It can easily shorten/summarize a text it has never seen before, in a way that makes sense. Sure, it won't always summarize it the same way a human would, but if you do a double-blind test where you ask people whether a summary was written by AI, a vast majority wouldn't be able to tell the difference (again, this is with a completely novel text).
I think the real takeaway is that LLMs are very good at tasks that closely resemble examples in their training data. A lot of written work (code, movies/TV shows, etc.) is actually pretty repetitive, so you don't really need superintelligence to summarize it and break it down, just good pattern matching. But this can fall apart pretty wildly when you have something genuinely novel...
Is anyone here aware of LLMs demonstrating an original thought? Something truly novel.
My own impression is something more akin to a natural language search query system. If I want a snippet of code to do X it does that pretty well and keeps me from having to search through poor documentation of many OSS projects. Certainly doesn't produce anything I could not do myself - so far.
Ask it about something that is currently unknown and it lists a bunch of hypotheses that people have already proposed.
Ask it to write a story and you get a story similar to one you already know but with your details inserted.
I can see how this may appear to be intelligent but likely isn't.
If I come up with something novel while using an LLM, which I wouldn't have come up with had I not had the LLM at my bidding, where did the novelty really come from?
If I came up with something novel while watching a sunrise, which I wouldn't have come up with had I not been looking at it, where did the novelty really come from?
Well that's the tricky part: what is novel? There are varying answers. I think we're all pretty unoriginal most of the time, but at the very least we're a bit better than LLMs at mashing together and synthesizing things based on previous knowledge.
But seriously, how would you determine if an LLM's output was novel? The training data set is so enormous for any given LLM that it would be hard to know for sure that any given output isn't just a trivial mix of existing data.
That's because midterms are specifically supposed to assess how well you learned the material presented (or at least directed to), not your overall ability to reason. If you teach a general reasoning class, getting creative with the midterm is one thing. But if you're teaching someone how to solve differential equations, they're learning at the very edge of their ability in a given amount of time, and if you present them with an example outside of what's been covered, it kind of makes sense that they can't just already solve it. I mean, that's kind of the whole premise of education: you can't just present someone with something completely outside of their experience and expect them to derive from first principles how it works.
I would argue that on a math midterm it's entirely reasonable to show a problem they've never seen before and test whether they've made the connection between that problem and the problems they've seen before. We did that all the time in upper division Physics.
A problem they've never seen before, of course. A problem that requires a solving strategy or tool they've never seen before (above and beyond synthesis of multiple things they have seen before) is another matter entirely.
It's like the difference between teaching kids rate problems and then putting ones with negative values or nested rates on a test versus giving them a continuous compound interest problem and expecting them to derive e, because it is fundamentally about rates of change, isn't it?
I honestly think that reflects more on the state of education than it does human intelligence.
My primary assertion is that LLMs struggle to generalize concepts and ideas, hence why they need petabytes of text just to often fail basic riddles when you muck with the parameters a little bit. People get stuck on this for two reasons: one, because they have to reconcile this with what they can see LLMs are capable of, and it's just difficult to believe that all of this can be accomplished without at least intelligence as we know it; I reckon the trick here is that we simply can't even conceive of how utterly massive the training datasets for these models are. We can look at the numbers but there's no way to fully grasp just how vast it truly is. The second thing is definitely the tendency to anthropomorphize. At first I definitely felt like OpenAI was just using this as an excuse to hype their models and come up with reasons for why they can never release weights anymore; convenient. But also, you can see even engineers who genuinely understand how LLMs work coming to the conclusion that they've become sentient, even though the models they felt were sentient now feel downright stupid compared to the current state-of-the-art.
Even less sophisticated pattern matching than what humans are able to do is still very powerful, but it's obvious to me that humans are able to generalize better.
And what truly novel things are humans capable of? At least 99% of the stuff we do is just what we were taught by parents, schools, books, friends, influencers, etc.
Remember, humans needed some 100,000 years to figure out that you can hit an animal with a rock, and that's using more or less the same brain capacity we have today. If we were born in the stone age, we'd all be nothing but cavemen.
Look. I get that we can debate about what's truly novel. I never even actually claimed that humans regularly do things that are actually all that novel. That wasn't the point. The point is that LLMs struggle with novelty because they struggle to generalize. Humans clearly are able to generalize vastly better than transformer-based LLMs.
Really? How do I know that with such great certainty?
Well, I don't know how much text I've read in one lifetime, but I can tell you it's less than the literally multiple terabytes of text fed into the training process of modern LLMs.
Yet, LLMs can still be found failing logic puzzles and simple riddles that even children can figure out, just by tweaking some of the parameters slightly, and it seems like the best thing we can do here is just throw more terabytes of data and more reinforcement learning at it, only for it to still fail, even if a little more sparingly each time.
So what novel things do average people do anyways, since beating animals with rocks apparently took 100,000 years to figure out? Hard call. There's no definitive bar for novel. You could argue almost everything we do is basically just mixing things we've seen together before, yet I'd argue humans are much better at it than LLMs, which need a metric shit load of training data and burn tons of watts. In return, you get some superhuman abilities, but superhuman doesn't mean smarter or better than people; a sufficiently powerful calculator is superhuman. The breadth of an LLM is much wider than any individual human, but the breadth of knowledge across humanity is obviously still much wider than any individual LLM, and there remain things people do well that LLMs definitely still don't, even just in the realm of text.
So if I don't really believe humans are all that novel, why judge LLMs based on that criteria? Really two reasons:
- I think LLMs are significantly worse at it, so allowing your critical thinking abilities to atrophy in favor of using LLMs is really bad. Therefore people need to be very careful about ascribing too much to LLMs.
- Because I think many people want to use LLMs to do truly novel things. Don't get me wrong, a lot of people also just want it to shit out another React Tailwind frontend for a Node.js JSON HTTP CRUD app or something. But, a lot of AI skeptics are no longer the types of people that downplay it as a cope or out of fear, but actually are people who were at least somewhat excited by the capabilities of AI then let down when they tried to color outside the lines and it failed tremendously.
Likewise, imagine trying to figure out how novel an AI response is; the training data set is so massive, that humans can hardly comprehend the scale. Our intuition about what couldn't possibly be in the training data is completely broken. We can only ever easily prove that a given response isn't novel, not that it is.
But honestly maybe it's just too unconvincing to just say all of this in the abstract. Maybe it would better to at least try to come up with some demonstration of something I think I've come up with that is "novel".
There's this sort-of trick I came up with when implementing falling blocks puzzle games for handling input that I think is pretty unique. See, in most implementations, to handle things like auto-repeating movements, you might do something like have a counter that increments, then once it hits the repeat delay, it gets reset again. Maybe you could get slightly more clever by having it count down and repeat at zero: this would make it easier to, for example, have the repeat delay be longer for only the first repeat. This is how DAS normally works in Tetris and other games, and it more or less mirrors the key repeat delay. It's easier with the count down since on the first input you can set it to the high initial delay, then whenever it hits zero you can set it to the repeat delay.
I didn't like this though, because I didn't like having to deal with a bunch of state. I really wanted the state to be as simple as possible. So instead, for each game input, I allocate a signed integer. These integers are all initialized to zero. When a key is pressed down, the integer is set to 1 if it is less than 1. When a key is released, it is set to -1 if it is greater than 0. And at the end of each frame of game logic, each input greater than 0 is incremented, and each input less than 0 is decremented. This is held in the game state, and when the game logic is paused, you do nothing here. (A code sketch follows the list below.)
With this scheme, the following side effects occur:
- Like most other schemes, there's no need to special-case key repeat events, as receiving a second key down doesn't do anything.
- Game logic can now do a bunch of logic "statelessly", since the input state encodes a lot of useful information. For example, you can easily trigger an event upon an input being pressed by using n == 1, and you can easily trigger an event upon an input being released using n == -1. You do something every five frames an input is held by checking n % 5 == 0, or slightly more involved for a proper input repeat with initial delay. On any given frame of game logic, you always know how long an input has been held down and after it's released you know how many frames it has been since it was pressed.
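To make this concrete, here's that sketch in Python; the input names and the delay constants are just placeholders:

    # Rough sketch of the signed-counter input scheme described above.
    # Positive values count frames an input has been held; negative values
    # count frames since it was released. Names and delays are placeholders.

    INITIAL_DELAY = 16   # frames before the first auto-repeat
    REPEAT_DELAY = 6     # frames between subsequent repeats

    inputs = {"left": 0, "right": 0, "rotate": 0}   # all counters start at zero

    def on_key_down(name):
        if inputs[name] < 1:
            inputs[name] = 1        # OS key-repeat events change nothing

    def on_key_up(name):
        if inputs[name] > 0:
            inputs[name] = -1

    def pressed(name):
        return inputs[name] == 1    # true only on the first frame held

    def released(name):
        return inputs[name] == -1   # true only on the first frame after release

    def repeating(name):
        # fire on the first frame, then after the initial delay, then every REPEAT_DELAY
        n = inputs[name]
        return n == 1 or (n > INITIAL_DELAY and (n - INITIAL_DELAY) % REPEAT_DELAY == 0)

    def end_of_frame():
        # run once at the end of each frame of game logic; skip while paused
        for name, n in inputs.items():
            if n > 0:
                inputs[name] = n + 1
            elif n < 0:
                inputs[name] = n - 1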
Now I don't talk to tons of other game developers, but I've never seen or heard of anyone doing this, and if someone else did come up with it, then I discovered it independently. It was something I came up with when playing around with trying to make deterministic, rewindable game logic. I played around with this a lot in high school (not that many years ago, about 15 now).
I fully admit this is not as useful for the human race as "hitting animals with a rock", but I reckon it's the type of thing that LLMs basically only come up with if they've already been exposed to the idea. If I try to instruct LLMs to implement a system that has what I think is a novel idea, it really seems to rapidly fall apart. If it doesn't fall apart, then I honestly begin to suspect that maybe the idea is less novel than I thought... but it's a whole hell of a lot more common, so far, for it to just completely fall apart.
Still, my point was never that AI is useless, a lot of things humans do aren't very novel after all. However, I also think it is definitely not time to allow one's critical thinking skills to atrophy as today's models definitely have some very bad failure modes and some of the ways they fail are ways that we can't afford in many circumstances. Today the biggest challenge IMO is that despite all of the data the ability to generalize really feels lacking. If that problem gets conquered, I'm sure more problems will rise to the top. Unilaterally superhuman AI has a long way to go.
I guess disagreement about this question often stems from what we mean by "human", even more than what we mean by "intelligence".
There are at least 3 distinct categories of human intelligence/capability in any given domain:
1) average human (non-expert) - LLMs are already better (mainly because the average human doesn't know anything, but LLMs at least have some basic knowledge),
2) domain expert humans - LLMs are far behind, but can sometimes supplement human experts with additional breadth,
3) collective intelligence of all humans combined - LLMs are like retarded cavemen in comparison.
So when answering if AI has human-level intelligence, it really makes sense to ask what "human-level" means.
Imagine an oracle that could judge/decide, with human levels of intelligence, how relevant a given memory or piece of information is to any given situation, and that could verbosely describe which way it's relevant (spatially, conditionally, etc.).
Would such an oracle, sufficiently parallelized, be sufficient for AGI? If it could, then we could genuinely describe its output as "context," and phrase our problem as "there is still a gap in needed context, despite how much context there already is."
And an LLM that simply "shortens" that context could reach a level of AGI, because the context preparation is doing the heavy lifting.
The point I think the article is trying to make is that LLMs cannot add any information beyond the context they are given - they can only "shorten" that context.
If the lived experience necessary for human-level judgment could be encoded into that context, though... that would be an entirely different ball game.
IMO we already have the technology for sufficient parallelization of smaller models with specific bits of context. The real issue is that models have weak/inconsistent/myopic judgement abilities, even with reasoning loops.
For instance, if I ask Cursor to fix the code for a broken test and the fix is non-trivial, it will often diagnose the problem incorrectly almost instantly, hyper-focus on what it imagines the problem is without further confirmation, implement a "fix", get a different error message while breaking more tests than it "fixed" (if it changed the result for any tests), and then declare the problem solved simply because it moved the goalposts at the start by misdiagnosing the issue.
You can reconcile these points by considering what specific context is necessary. The author specifies "outside" context, and I would agree. The human context that's necessary for useful summaries is a model of semantic or "actual" relationships between concepts, while the LLM context is a model of a single kind of fuzzy relationship between concepts.
In other words the LLM does not contain the knowledge of what the words represent.
> In other words the LLM does not contain the knowledge of what the words represent.
This is probably true for some words and concepts but not others. I think we find that LLMs make inhuman mistakes only because they don't have the embodied senses and inductive biases that are at the root of human language formation.
If this hypothesis is correct, it suggests that we might be able to train a more complete machine intelligence by having it participate in a physics simulation as one part of the training, i.e., have a multimodal AI play some kind of blockworld game. I bet if the AI is endowed with just sight and sound, it might be enough to capture many relevant relationships.
I think the differentiator here might not be the context it has, but the context it has the ability to use effectively in order to derive more information about a given request.
About a year ago, I gave a film script to an LLM and asked for a summary. It was written by a friend and there was no chance it or its summary was in the training data.
It did a really good -- surprisingly good -- job. That incident has been a reference point for me. Even if it is anecdotal.
I'm not as cynical as others about LLMs but it's extremely unlikely that script had multiple truly novel things in it. Broken down into sufficient small pieces it's very likely every story element was present multiple times in the LLM's training data.
I'm not sure I understand the philosophical point being made here. The LLM has "watched" a lot of movies and so understands the important parts of the original script it's presented with. Are we not describing how human media literacy works?
The point is that if you made a point to write a completely novel script, with (content-wise, not semantically) 0 DNA in it from previous movie scripts, with an unambiguous but incoherent and unstructured plot, your average literate human would be able summarize what happened on the page, for all that they'd be annoyed and likely distressed by how unusual it was; but that an LLM would do a disproportionately bad job compared to how well they do at other things, which makes us reevaluate what they're actually doing and how they actually do it.
It feels like they've mastered language, but it's looking more and more like they've actually mastered canon. Which is still impressive, but very different.
This tracks, because the entire system reduces to a sophisticated regression analysis. That's why we keep talking about parameters and parameter counts: they're literally talking about the number of parameters being weighted during training. Beyond that there are some mathematical choices in how you interrelate the parameters that yield some interesting emergent phenomena, and there are architecture choices to be made there. But the whole thing boils down to regression, and regression is at its heart the development of a canon from a representative variety of examples.
We are warned in statistics to be careful when extrapolating from a regression analysis.
And have you managed to perform such a test, or is that just an imaginary result you're convinced will happen? Not trying to be snarky here, but I see this kind of thing a lot, and "this is my model of how LLMs work and so this is how they would behave in this test I cannot verify" is very uncompelling.
I'm not making a philosophical point. The earlier comment is "I uploaded a new script and it summarized it"; I was simply saying the odds of that script actually being new are very slim. Even though obviously that script or summaries of it do not exist in their entirety in the training data, its individual elements almost certainly do. So it's not really a novel (pun unintended?) summarization.
I'd like to see some examples of when it struggles to do summaries. There were no real examples in the text, besides one hypothetical which ChatGPT made up.
I think LLMs do great summaries. I am not able to come up with anything where I could criticize it and say "any human would come up with a better summary". Are my tasks not "truly novel"? Well, then I am not able, as a human, to come up with anything novel either.
If you think they can't do this task well I encourage you to try feeding ChatGPT some long documents outside of its training cutoff and examining the results. I expect you'll be surprised!
So, maybe this is just sloppiness and not intentionally misleading. But still, not a good look when the company burning through billions of dollars in cash and promising to revolutionize all human activity can't put together a decent powerpoint.
Yea, I guess I won't immediately ascribe malice, but SHEESH. One of the most anticipated product launches in years and this kind of junk made it through to the public deck. Really pretty inexcusable.