There's a lot of beautiful writing on these topics on the "pure math" side, but it's hard to figure out what results are important for deep learning and to put them in a form that doesn't take too much of an investment in pure math.
I think the first chapter of [1] is a good introduction to general facts about high-dimensional stuff. I think this is where I first learned about "high-dimensional oranges" and so on.
For something more specifically about the problem of "packing data into a vector" in the context of deep learning, last year I wrote a blog post meant to give some exposition [2].
One really nice approach to this general subject is to think in terms of information theory. For example, take the fact that, for a fixed epsilon > 0, we can find exp(C d) vectors in R^d with all pairwise inner products smaller than epsilon in absolute value. (Here C is some constant depending on epsilon.) People usually find this surprising geometrically. But now, say you want to communicate a symbol by transmitting d numbers through a Gaussian channel. Information theory says that, on average, I should be able to use these d numbers to transmit C d nats of information. (C is called the channel capacity, and depends on the magnitude of the noise and e.g. the range of values I can transmit.) The statement that there exist exp(C d) vectors with small inner products is related to a certain simple protocol to transmit a symbol from an alphabet of size exp(C d) with small error rate. (I'm being quite informal with the constants C.)
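If you want to poke at this numerically, here's a small sketch (illustrative constants, nothing rigorous): sample random unit vectors in R^d and look at the largest pairwise |inner product|. Random vectors only witness the existence claim with high probability, but that's enough to see the effect.

    # Sketch: exponentially many random unit vectors in R^d stay nearly
    # orthogonal. n grows like exp(C*d) for a small illustrative C.
    import numpy as np

    rng = np.random.default_rng(0)

    def max_abs_inner_product(n, d):
        v = rng.standard_normal((n, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)  # unit vectors
        g = np.abs(v @ v.T)                            # pairwise |<v_i, v_j>|
        np.fill_diagonal(g, 0.0)                       # drop the trivial diagonal
        return g.max()

    for d in (64, 256, 1024):
        n = 100 + int(np.exp(0.005 * d))   # exponential in d, small constant
        print(d, n, round(max_abs_inner_product(n, d), 3))

The printed maximum shrinks as d grows even though n is growing exponentially, which is the geometric fact the channel-capacity story exploits.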
This is a hacky joke. No sane engineer would ever sign off on this. Even for a 1-5 person team, why would I want a probabilistic selection of test execution?
The solution of running only e2e tests on affected files has been around since long before LLMs. This is a band-aid on poor CI.
I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you suggest. I've worked with smart engineers who put a lot of time into this problem only to get middling results.
How does that reconcile with the article, which states:
> Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.
If you have some particular issue with the author's methodology, you should state that.
If you have some particular issue with the article, you should state that. Otherwise, the most charitable interpretation of your position I can come up with is "the article is wrong for some reason I refuse to specify", which doesn't lead to a productive dialogue.
I think you're the one being uncharitable here. The meaning of what he's saying is very clear. You can't say this probabilistic method (using LLMs to decide your e2e test plan) works if you only have a single example of it working.
It's really not clear. Using probabilistic methods to determine your e2e test plan is already best practice at large tech shops, and to be quite honest the heuristics that they used to use were pretty poor and arbitrary.
The author said they used Claude to decide which E2E tests to run and "Claude never missed a relevant E2E test."
How many times did they conduct this experiment? Over how long a period? How did they determine which tests were relevant and that Claude didn't miss them? Did they try it on more than one project?
My point was that none of this tells me this will work in general.
If the author can keep the whole function code_change -> relevant E2E_TESTS in his head, it seems to be a trivial application.
We don't know the methodology, since the author does not state how he verified that function or how he would verify the function for a large code base.
It seems to me like we have the answers to all those questions.
- Do we know which projects people work on?
It's pretty easy to discover that OP works on https://livox.com.br/en/, a tool that uses AI to let people with disabilities speak. That sounds like a reasonable project to me.
- Do we know which codebases (greenfield, mature, proprietary etc.) people work on?
The e2e tests took 2 hours to run and the website quotes ~40M words. That is not greenfield.
- Do we know the level of expertise the people have?
It seems like they work on nontrivial production apps.
- How much additional work did they have reviewing, fixing, deploying, finishing etc.?
I think you might be confusing end-to-end (E2E) tests with other types of testing, such as unit and integration tests. No one is advocating this approach for unit tests, which should still run in their entirety on every pull request.
Running all E2E tests in a pipeline isn't feasible due to time constraints (it takes hours). Most companies just run these tests nightly (and we still do), which means we would still catch any issues that slip through the initial screening. But so far, nothing has.
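For the curious, here's a rough sketch of what this kind of pre-screening step could look like in CI. It's a minimal illustration rather than our actual setup; the model name, the e2e/ directory layout, and the prompt are assumptions, and the nightly full run still acts as the safety net.

    # Hypothetical CI step: ask an LLM which E2E tests a diff plausibly affects.
    # Assumptions: tests live under e2e/, ANTHROPIC_API_KEY is set, and the
    # model id is swapped for whatever you actually have access to.
    import subprocess
    import anthropic

    diff = subprocess.run(["git", "diff", "origin/main...HEAD"],
                          capture_output=True, text=True).stdout
    tests = subprocess.run(["ls", "e2e"], capture_output=True, text=True).stdout

    prompt = (
        "List the E2E test files below that could plausibly be affected by "
        "this diff, one per line. Err on the side of including more.\n\n"
        f"Tests:\n{tests}\nDiff:\n{diff}"
    )

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption; use your model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    selected = [l.strip() for l in msg.content[0].text.splitlines() if l.strip()]
    print("\n".join(selected))  # hand this list to the E2E runner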
> The solution of running only e2e tests on affected files has been around since long before LLMs.
This doesn't work in distributed systems, since changing the behavior of one file that's compiled in one binary can cause a downstream issue in a separate binary that sends a network call to the first. e.g. A programmer makes a behavioral change to binary #1 that falls within defined behavior, but encounters Hyrum's Law because of a valid behavior of binary #2.
I completely agree with the thesis here. I also have not seen a massive productivity boost with the use of AI.
I think that there will be neurological fatigue occurring whereby if software engineers are not actively practicing problem-solving, discernment, and translation into computer code - those skills will atrophy...
Yeah, AI is not the 2x or 10x technology of the future™ it was promised to be. It may be the case that any productivity boost is happening within existing private code bases. Even still, there should be a modest uptick in noticeably improved software being deployed to the market, which does not appear to be there.
In my consulting practice I am seeing this phenomenon regularly, whereby new founders or stir-crazy CTOs push the use of AI and ultimately find that they're spending more time wrangling a spastic code base than they are building shared understanding and working together.
I have recently taken on advisory roles and retainers just to reinstill engineering best practices.
> I think that there will be neurological fatigue occurring whereby if software engineers are not actively practicing problem-solving, discernment, and translation into computer code - those skills will atrophy...
I've found this to be the case with most (if not all) skills, even riding a bike. Sure, you don't forget how to ride it, but your ability to expertly articulate with the bike in a synergistic and tool-like way atrophies.
If that's the case with engineering, and I believe it to be, it should serve as a real warning.
Yes and this is the placid version where lazy programmers elect to lighten their cognitive load by farming out to AI.
An insidious version is AGI replacing human cognition.
To replace human thought is to replace a biological ability which progresses on evolutionary timescales, not on a Moore's-law-like curve. The tissue in your skull will quite literally be as useful as a cow's for solving problems... think about that.
Automating labor in the 20th century disrupted society, and we've seen its consequences. Replacing cognition entirely (driving, writing, decision making, and communication) yields far worse outcomes than transitioning the population from food production to knowledge work.
If not our bodies and not our minds, then what do we have? (Note: Altman's universal basic income ought to trip every dystopian alarm bell).
Whether adopted passively or foisted on us actively, cognition is what makes us human. Let's not let Claude Code be the nexus for something worse.
There's no connection between AI and AGI, apart from hopes. Besides which, if you're talking about AGI, you're talking about artificial people. That means:
• They don't really want to be servants.
• They have biases and preferences.
• Some of them are stupid.
• If you'd like to own an AGI that thinks for you, the AGI would also like one.
• They are people with cognition, even if we stop being.
AGI just means what it says: Artificial General Intelligence. AGIs don't have to have selfish traits like we do, and they don't have to follow the rules of natural selection; they just need to solve general problems.
Think of them like worker bees. Bees can solve general problems, though not on the level humans do; they are like some primitive kind of AGI. They also live and die as servants to the queen and don't want to be queens themselves. The reason why is interesting, btw: it involves genetics and game theory.
This is highly theoretical anyways, we have no idea how to make an AGI yet, and LLMs are probably a dead end as they can't interact with the physical world.
These postulated entities are by definition people. Not humans, because they lack the biology, but that's a detail.
If you think they're going to be trained on all the world's data, that's still supposing them to be an extension of AI. No, they'll have to pick up their knowledge culturally, the same way everybody else does, by watching cartoons - I mean by interactions with mentors. They might have their own culture, but only the same way that existing groups of people with a shared characteristic do, and they can't weave it out of air; it has to derive from existing culture. There's a potential for an AGI to "think faster", but I'm skeptical about what that amounts to in practice or how much use it would be to them.
> These postulated entities are by definition people.
Why? Does your definition postulate that people are the only thing in the universe that can measure up to us? Or the inverse, that every entity as sentient and intelligent as us must be called a person?
My opinion is that a lot of what makes us like this is physiological. Unless the developers go out of their way to simulate these things, a hypothetical AGI won't be similar to us no matter how much human-made content it ingests. And why would they do that? Why would you want to implement physical pain, or fear, or human needs, or biases and fallacies driven from our primal instincts? Would implementing all these things even be possible at the point where we find an inroad towards AGI? All of that might require creating a comprehensive human brain simulation, not just a self-learning machine.
I think it's almost certain that, while there would be some mutual understanding, an AGI would feel like a completely different species to us.
The latter: intelligence is one thing, and to imagine that an artificial intelligence would be some kind of beyond-intelligence, a beyond-person, is to needlessly multiply entities. The assumption should be that there's only (the potential to create) people like us, because to imagine beyond-people is to get mystical about it. "Beyond-rats" is what I say to that.
I have sympathy with the point about physiology, though, I think being non-biological has to feel very different. You're released from a lot of the human condition, you're not driven by hormones or genes, your plans aren't hijacked to get you to reproduce or eat more or whatever animal thing, you don't have the same needs. That's all liable to alienate you from the meat-based folk. However, you're still a person.
Same - I use it at work at a big tech company and the real-world efficiency gains on net are probably nonexistent. We have multiple large and not-so-large codebases. For a super trivial script or creating a struct from documentation it does the thing - great. For unit tests it's about 50-50 whether it's useful or I waste a few hours and delete the change set. In any moderately complex codebase Claude Sonnet or GPT in agent mode builds unneeded complexity, gets lost in a spiraling number of nonsense steps, and constantly builds things that already exist in the codebase. In the best outcome I have to edit and review so heavily that it's like jumping in on someone else's PR halfway through and having to grok what the heck they misunderstood.
The only actual net positive is the Claude.md that some people maintain - it's a good context dump for new engineers!
What most of these comments are missing is the attempt at standardization and unification.
There are a lot of comments that people need X feature in order to switch to Y editor. While that may be true and your particular workflow requires certain features, what is overlooked is the survival pressure for editors.
It appears that our industry is moving towards adoption, sometimes mandatory, of AI coding agents. Regardless of your feelings on the topic, having good tooling to support this effort comes down to: switching costs, compatibility with existing editors, and a strong ecosystem of third party extensions.
While Cursor/Windsurf jumped the gun on bespoke editor integrations with LLMs - the adoption of MCP and other SDKs for coding agents means it's plug and play. The full feature set will be in every editor connected to every agent.
I think Zed wins on having the lowest switching costs for most developers. Paying down generic solutions like the Agent Client Protocol (ACP) now is a good strategy. It took multiple parties coming together for us to get TLS, OAuth 2.0, and ECMAScript.
I don't see why most editors should behave like hand crafted musical instruments when in reality they are much more akin to high quality knives in a kitchen (sure you have your favorite knife set and bring it from job to job, but at the end of the day you can be just as productive with a different knife when necessary).
> I don't see why most editors should behave like hand crafted musical instruments when in reality they are much more akin to high quality knives in a kitchen (sure you have your favorite knife set and bring it from job to job, but at the end of the day you can be just as productive with a different knife when necessary).
This is such a poor analogy. Yes, a good chef can make do with a different knife, but there is a reason why chefs pay significantly more for high quality knives, keep them sharpened, and treat them with more diligence and care than other kitchen tools. A blunt knife can actually be dangerous. Consequently, a lot of chefs buy knives that are effectively hand crafted / forged, out of this relentless pursuit of quality.
> What most of these comments are missing is the attempt at standardization and unification.
> While that may be true and your particular workflow requires certain features, what is overlooked is the survival pressure for editors.
I think your general perception is not something I agree with. I want to use software I enjoy using. Programming is a creative exercise for me, and I want to use the tools I enjoy. If a tool is not enjoyable to use, I do not want to use it. Sometimes, productivity does increase enjoyment, but sometimes it doesn't. For example, arguably I would have been more productive in my Java days if I used Eclipse, but because the editor was so bad, I preferred to learn the APIs myself and use Sublime Text instead.
I also don't think I'm sympathetic to the survival of any particular editor. Software comes and goes, and sustainably built business models will prevail. All of the AI-first editors hinge on this being the right iteration of this technology, and we simply do not have a long enough timeframe or context to know if this is truly the best way to write code using AI. MCP/ACP, whatever else might be the best strategy for now, but I think it's too early for anyone to suggest that we've come to the right conclusion forever.
As someone who is in a position to see what the next really disruptive innovation is, you're quite right that there exist much, much better ways to write and collaborate on code. Flying leaps of innovation compared to Zed's tiny shuffle-steps.
Zed spent their innovation budget on Rust and GPUI, and as a result they have no energy to question the status quo of IDEs as a whole. Git and LSP are antiquated but form the bedrock of their plans for the future.
Essentially at this point they can only do spaghetti engineering: adding more and more complexity on top of the complexity that already exists. IDEs have been through so many iterations of this process already that all the real wins are in refactoring: moving the whole system (and ecosystem) design sideways, which is the one thing they dare not try to do (though it happens to be my forte).
BABLR is a parser framework, and agAST is the DOM structure at the heart of our state layer. Come to our Discord if you want to learn more. We're trying to launch in the next day or two here.
I'm sorry if this is blunt but is Agent Client Protocol... ...good?
It just looks to me like a dongle bolted onto the past 50 years of kludges in editor design. It hasn't got 1/20th of the value proposition that a proper shared state layer would offer.
Zed succeeds at reducing the switching cost. I used NeoVim for ten years daily and configured it way back in college days.
I thought I would be unable to move to a GUI editor and it turns out that the speed and efficiency of Zed plus the almost one-to-one mapping of Vim features means that I am extremely productive in Zed.
A trend I have noticed as well. I consider this pattern to ultimately be the forcing function of free market capitalism itself.
Once a brand accumulates sufficient reputational capital through genuine quality, the profit-maximizing imperative inevitably drives extraction over quality. (I would extend this argument briefly outside of the domain of economic theory and into physics: we do not observe low entropy being temporally consistent anywhere in the universe.)
The market doesn’t reward maintaining expensive quality standards when cheaper alternatives can temporarily coast on accumulated goodwill - shareholders demand margin expansion, private equity needs returns, and the competitive landscape punishes companies that leave money on the table by over-investing in product integrity.
Less of a moral failure by individual companies and more structural incentive alignment: capitalism systematically rewards converting hard won brand trust into extractable rents until the reputation is depleted, at which point capital simply moves to the next target.
The pattern you’re observing isn’t a bug but the logical endpoint of a system that treats reputation as just another asset to be optimized for shareholder value rather than a covenant with customers.
Started an entire consulting practice to get engineering teams and founders out of vibe coded pits. Even got a great domain for it - vibebusters
So far business is booming and clients are happy with both human interactions with senior engineers as well as a final deliverable on best practices for using AI to write code.
In the last few months we have worked with startups who have vibe coded themselves into an abyss. Either because they never made the correct hires in the first place or they let technical talent go. [1]
The thinking was that they could iterate faster, ship better code, and have an always-on 10x engineer in the form of Claude Code.
I've observed perfectly rational founders become addicted to the dopamine hit as they see Claude Code output what looks like weeks or years of software engineering work.
It's overgenerous to allow anyone to believe AI can actually "think" or "reason" through complex problems. Perhaps we should be measuring time saved typing rather than cognition.
As if startups before LLMs were creating great code. Right now on the front page, a YC company is offering a “Founding Full Stack Engineer” $100K-$150K. What quality of code do you think they will end up with?
Notably, that is a company that... adds AI to group chats. Startups offering crap salaries with a vague promise of equity in a vague product idea with no moat are a dime a dozen, and have been well before LLMs came around.
Have you seen the companies YC has been funding recently? All you need to do is mention AI and YC will throw some money your way. I don't know if you saw my first attempt at a post, but someone should suggest AI for HN comment formatting and I'm sure it will be funded.
Acrely — AI for HVAC administration
Aden — AI for ERP operations
AgentHub — AI for agent simulation and evaluation
Agentin AI — AI for enterprise agents
AgentMail — AI for agent email infrastructure
AlphaWatch AI — AI for financial search
Alter — AI for secure agent workflow access control
Altur — AI for debt collection voice agents
Ambral — AI for account management
Anytrace — AI for support engineering
April — AI for voice executive assistants
AutoComputer — AI for robotic desktop automation
Autosana — AI for mobile QA
Autotab — AI for knowledge work
Avent — AI for industrial commerce
b-12 — AI for chemical intelligence
Bluebirds — AI for outbound targeting
burnt — AI for food supply chain operations
Cactus — AI for smartphone model deployment
Candytrail — AI for sales funnel automation
CareSwift — AI for ambulance operations
Certus AI — AI for restaurant phone lines
Clarm — AI for search and agent building
Clodo — AI for real estate CRMs
Closera — AI for commercial real estate employees
Clueso — AI for instructional content generation
cocreate — AI for video editing
Comena — AI for order automation in distribution
ContextFort — AI for construction drawing reviews
Convexia — AI for pharma drug discovery
Credal.ai — AI for enterprise workflow assistants
CTGT — AI for preventing hallucinations
Cyberdesk — AI for legacy desktop automation
datafruit — AI for DevOps engineering
Daymi — AI for personal clones
DeepAware AI — AI for data center efficiency
Defog.ai — AI for natural-language data queries
Design Arena — AI for design benchmarks
Doe — AI for autonomous private equity workforce
Double – Coding Copilot — AI for coding assistance
EffiGov — AI for local government call centers
Eloquent AI — AI for complex financial workflows
F4 — AI for compliance in engineering drawings
Finto — AI for enterprise accounting
Flai — AI for dealership customer acquisition
Floot — AI for app building
Fluidize — AI for scientific experiments
Flywheel AI — AI for excavator autonomy
Freya — AI for financial services voice agents
Frizzle — AI for teacher grading
Galini — AI guardrails as a service
Gaus — AI for retail investors
Ghostship — AI for UX bug detection
Golpo — AI for video generation from documents
Halluminate — AI for training computer use
HealthKey — AI for clinical trial matching
Hera — AI for motion design
Humoniq — AI for BPO in travel and transport
Hyprnote — AI for enterprise notetaking
Imprezia — AI for ad networks
Induction Labs — AI for computer use automation
iollo — AI for multimodal biological data
Iron Grid — AI for hardware insurance
IronLedger.ai — AI for property accounting
Janet AI — AI for project management (AI-native Jira)
Kernel — AI for web agent browsing infrastructure
Kestroll — AI for media asset management
Keystone — AI for software engineering
Knowlify — AI for explainer video creation
Kyber — AI for regulatory notice drafting
Lanesurf — AI for freight booking voice automation
Lantern — AI for Postgres application development
Lark — AI for billing operations
Latent — AI for medical language models
Lemma — AI for consumer brand insights
Linkana — AI for supplier onboarding reviews
Liva AI — AI for video and voice data labeling
Locata — AI for healthcare referral management
Lopus AI — AI for deal intelligence
Lotas — AI for data science IDEs
Louiza Labs — AI for synthetic biology data
Luminai — AI for business process automation
Magnetic — AI for tax preparation
MangoDesk — AI for evaluation data
Maven Bio — AI for BioPharma insights
Meteor — AI for web browsing (AI-native browser)
Mimos — AI for regulated firm visibility in search
Minimal AI — AI for e-commerce customer support
Mobile Operator — AI for mobile QA
Mohi — AI for workflow clarity
Monarcha — AI for GIS platforms
moonrepo — AI for developer workflow tooling
Motives — AI for consumer research
Nautilus — AI for car wash optimization
NOSO LABS — AI for field technician support
Nottelabs — AI for enterprise web agents
Novaflow — AI for biology lab analytics
Nozomio — AI for contextual coding agents
Oki — AI for company intelligence
Okibi — AI for agent building
Omnara — AI for agent command centers
OnDeck AI — AI for video analysis
Onyx — AI for generative platform development
Opennote — AI for note-based tutoring
Opslane — AI for ETL data pipelines
Orange Slice — AI for sales lead generation
Outlit — AI for quoting and proposals
Outrove — AI for Salesforce
Pally — AI for relationship management
Paloma — AI for billing CRMs
Parachute — AI for clinical evaluation and deployment
PARES AI — AI for commercial real estate brokers
People.ai — AI for enterprise growth insights
Perspectives Health — AI for clinic EMRs
Pharmie AI — AI for pharmacy technicians
Phases — AI for clinical trial automation
Pingo AI — AI for language learning companions
Pleom — AI for conversational interaction
Qualify.bot — AI for commercial lending phone agents
Reacher — AI for creator collaboration marketing
Ridecell — AI for fleet operations
Risely AI — AI for campus administration
Risotto — AI for IT helpdesk automation
Riverbank Security — AI for offensive security
Saphira AI — AI for certification automation
Sendbird — AI for omnichannel agents
Sentinel — AI for on-call engineering
Serafis — AI for institutional investor knowledge graphs
Sigmantic AI — AI for HDL design
Sira — AI for HR management of hourly teams
Socratix AI — AI for fraud and risk teams
Solva — AI for insurance
Spotlight Realty — AI for real estate brokerage
StackAI — AI for low-code agent platforms
stagewise — AI for frontend coding agents
Stellon Labs — AI for edge device models
Stockline — AI for food wholesaler ERP
Stormy AI — AI for influencer marketing
Synthetic Society — AI for simulating real users
SynthioLabs — AI for medical expertise in pharma
Tailor — AI for retail ERP automation
Tecto AI — AI for governance of AI employees
Tesora — AI for procurement analysis
Trace — AI for workflow automation
TraceRoot.AI — AI for automated bug fixing
truthsystems — AI for regulated governance layers
Uplift AI — AI for underserved voice languages
Veles — AI for dynamic sales pricing
Veritus Agent — AI for loan servicing and collections
Verne Robotics — AI for robotic arms
VoiceOS — AI for voice interviews
VoxOps AI — AI for regulated industry calls
Vulcan Technologies — AI for regulatory drafting
Waydev — AI for engineering leadership insights
Wayline — AI for property management voice automation
Wedge — AI for healthcare trust layers
Workflow86 — AI for workflow automation
ZeroEval — AI for agent evaluation and optimization
And the ideas may or may not be bad. I don't know enough about any of the business segments. But to paraphrase the famous Steve Jobs quote, "those aren't businesses, they are features" [1]: features that a company already in the business should be able to add to an existing product with real users by throwing a few halfway decent engineers at them.
[1] He said that about Dropbox. He wasn't wrong, just premature. For the price of 2TB on Dropbox, you can get the entire GSuite with 2TB, or Office365 with 1TB per user for up to five users, 5TB in all.
now you can, but, what, are you gonna lie down and wait for tech giants to do everything? Not every company needs to be Apple. If Dropbox filed for bankruptcy tomorrow, they've still made millionaires of thousands of people and given jobs to hundreds more, and enabled people to share their files online.
Steve Jobs gets to call other companies small because Apple is huge, but there are thousands of companies that "are just features". Yeah, features they forgot to add!
Out of the literally thousands of companies that YC has invested in, only about a dozen have gone public, the rest are either dead, zombies or got acquired. These are all acquisition plays.
Even the ones that have gone public haven’t done that well in aggregate.
Dropbox was solving a hard infrastructure problem at scale. These companies are just making some API calls to a model.
If an established company in any of these verticals - not necessarily BigTech - see an opportunity, they are either going to throw a few engineers at the problem and add it as a feature or hire a company like the one I work for and we are going to knock out an implementation in a few months.
The one YC company I mentioned above is expecting to have their product written by one “full stack engineer” that they are only willing to pay $150K for. How difficult can it be?
Which seems fine? VC money gets thrown at a problem, the problem may or may not get solved by a particular team, but a company gets created, some people do some work, some people make money, others don't. I don't get it. Are you saying no one should bother doing anything because someone else is already doing it or that it's not difficult so why try?
Do you think they're all using actual LLMs? I've got a natural language parser I could probably market as "AI Semantic Detection" even though it's all regular expressions.
I have a confession to make, I was about to downvote you because I thought you just asked ChatGPT to come up with some ridiculous company concepts and copy and pasted.
Then I saw the sibling comment and searched a couple of company names and realized they were real.
From what I’ve read, this is a consequence of applicants themselves concentrating on AI, which preceded their AI-filled batches. YC still has a very low acceptance rate, btw.
Shush please. I wasn't old enough to cash in on the Y2K contracting boons; I'm hoping the vibe coding 200k LOC b2b AI slop "please help us scale to 200 users" contracting gigs will be lucrative.
Does this functionality exist on iOS? I'm looking for an iOS app that wraps Parakeet or Whisper in a custom iOS keyboard.
That way I can switch to the dictation keyboard, press dictate, and have the transcription inserted in any application (first or third party).
MacWhisper is fantastic for macOS system dictation but the same abilities don't exist on iOS yet. The native iOS dictation is quite good but not as accurate with bespoke technical words / acronyms as Whisper cpp.
I really want to run it locally on a phone, but as a developer it's scary to think about making a native mobile app and having to work with the iOS toolchain. I don't have bandwidth at the moment, but if anyone knows of any OSS mobile alternatives, feel free to drop them!
It doesn't do any of that, it just captures more of the student market.
They want a student to use it and say “I wouldn’t have learned anything without study mode”.
This also allows them to fill their data coffers more with bleeding edge education. “Please input the data you are studying and we will summarize it for you.”
Not to be contrarian, but do you have any evidence of this assertion? Or are you just confidently confabulating a response for something outside of the data you've been exposed to? Because a commentor below provided a study that directly contradicts this.
This isn't study mode, it's a different AI tutor, but:
"The median learning gains for students, relative to the pre-test baseline (M = 2.75, N = 316), in the AI-tutored group were over double those for students in the in-class active learning group."
"The occurrence of inaccurate “hallucinations” by the current [LLMs] poses a significant challenge for their use in education. [...] we enriched our prompts with comprehensive, step-by-step answers, guiding the AI tutor to deliver accurate and high-quality explanations (v) to students. As a result, 83% of students reported that the AI tutor’s explanations were as good as, or better than, those from human instructors in the class."
Not at all dismissing the study, but if you want to replicate these results for yourself, this level of gain over a classroom setting may be tricky to achieve without having someone make class materials for the bot to present to you first.
Edit: the authors further say
"Krupp et al. (2023) observed limited reflection among students using ChatGPT without guidance, while Forero (2023) reported a decline in student performance when AI interactions lacked structure and did not encourage critical thinking. These previous approaches did not adhere to the same research-based best practices that informed our approach."
Two other studies failed to get positive results at all. YMMV a lot apparently (like, all bets are off and your learning might go in the negative direction if you don't do everything exactly as in this study)
In case you find it interesting: I deployed an early version of a "lesson administering" bot on a college campus that guides students through tutored activities of content curated by a professor in the "study mode" style -- that is, forcing them to think for themselves. We saw an immediate student performance gain on exams of about 1 stdev in the course. So with the right material and the right prompting, things are looking promising.
OpenAI should figure out how to onboard teachers. A teacher uploads context for the year, and OpenAI distributes a chatbot to the class that's permanently locked into study mode. Basically like the GPT store but with an interface and UX tuned for a classroom.
There are studies showing that LLMs make experienced devs slower in their work. I wouldn't be surprised if it was the same for self study.
However, consider the extent to which LLMs make the learning process more enjoyable. More students will keep pushing because they have someone to ask. Also, having fun & being motivated is such a massive factor when it comes to learning. And, finally, keeping at it at 50% of the speed for 100% of the material always beats working at 100% of the speed for 50% of the material. Who cares if you're slower - we're slower & faster without LLMs too! Those who persevere aren't the fastest; they're the ones with the most grit & discipline, and LLMs make that more accessible.
The study you're referencing doesn't make that conclusion.
It concludes there's a learning curve that generally takes about 50 hours to get past. The data shows that the one engineer who had more than 50 hours of experience with Cursor actually worked faster.
This is largely my experience, now. I was much slower initially, but I've now figured out the correct way to prompt, guide, and fix the LLM to be effective. I produce way more code and am mentally less fatigued at the end of each day.
People keep citing this study (and it was on the top of HN for a day). But this claim falls flat when you find out that the test subjects had effectively no experience with LLM-equipped editors, and the 1-2 people in the study who actually did have experience with these tools showed a marked increase in productivity.
Like yeah, if you’ve only ever used an axe you probably don’t know the first thing about how to use a chainsaw, but if you know how to use a chainsaw you’re wiping the floor with the axe wielders. Wholeheartedly agree with the rest of your comment; even if you’re slow you lap everyone sitting on the couch.
I presume you're referring to the recent METR study. One aspect of the study population, which seems like an important causal factor in the results, is that they were working in large, mature codebases with specific standards for code style, which libraries to use, etc. LLMs are much better at producing "generic" results than matching a very specific and idiosyncratic set of requirements. The study involved the latter (specific) situation; helping people learn mainstream material seems more like the former (generic) situation.
(Qualifications: I was a reviewer on the METR study.)
I believe we'll see that the benefits and drawbacks of AI augmentation for humans performing various tasks vary wildly based on the task, the way the AI is being asked to interact, and the AI model.
I would be interested to see if there have already been studies about the efficacy of tutors at good colleges. In my experience (in academia), the students who make it into an Ivy or an elite liberal arts school make extensive use of tutor resources, but not in a helpful way. They basically just get the tutor to work problems for them (often their homework!) and feel like they've "learned" things because tough questions always seem so obvious when you've been shown the answer. In reality, what it means is that they have no experience being confused or having to push past difficult things they were stuck on. And those situations are some of the most valuable for learning.
I bring this up because the way I see students "study" with LLMs is similar to this misapplication of tutoring. You try something, feel confused and lost, and immediately turn to the pacifier^H^H^H^H^H^H^H ChatGPT helper to give you direction without ever having to just try things out and experiment. It means students are so much more anxious about exams where they don't have the training wheels. Students have always wanted practice exams with similar problems to the real one with the numbers changed, but it's more than wanting it now. They outright expect it and will write bad evals and/or even complain to your department if you don't do it.
I'm not very optimistic. I am seeing a rapidly rising trend at a very "elite" institution of students being completely incapable of using textbooks to augment learning concepts that were introduced in the classroom. And not just struggling with it, but lashing out at professors who expect them to do reading or self study.
Come on. Asking an educational product to do a basic sanity test as to whether it helps is far too high a bar. Almost no educational app does that sort of thing.