Hacker News | xcodevn's comments

My observation is that vibe-coded applications are significantly lower quality than traditional software. Anthropic's software (which they claim is 90% vibe coded) is extremely buggy, especially the UI.


That's a misunderstanding based on a loose definition of "vibe coding". When companies threw around the "90% of code is written by AI" claims, they were counting characters of autocomplete as users typed code (most of which was equivalent to the "AI generated" code Eclipse tab-completion produced a decade ago), and sometimes hyperlocal prompts for a single method.

We can identify 3 levels of "vibe coding":

1. GenAI Autocomplete

2. Hyperlocal prompting about a specific function. (Copilot's original pitch)

3. Developing the app without looking at code.

Level 1 is hardly considered "vibe" coding, and Level 2 is iffy.

"90% of code written by AI" has, in some non-trivial contexts, only very recently come to mean Level 3.

I don't think it ever reached Level 2, because that's just a painfully tedious way of writing code.


I believe Anthropic is already doing Level 3 vibe coding for >90% of their code.


They have not said that. They've only said that most of their code is written by Claude. That is different from "vibe coding". If competent engineers review the code, it is little different from any other coding.


IIRC, the Claude Code creator mentioned that all the PRs are reviewed by humans, just like normal human PRs. So yes, humans still look at the code at the review stage. I still consider this Level 3, but anyway, it's just a matter of definition.


I mostly work at Level 2, and I call it "power coding", like power armor, or power tools. Your will and your hand still guide the process continuously, but now your force is greatly multiplied.


Over the weekend, I wrote this small Python library to teach myself the core idea behind modern agentic systems. This kind of software sits at the core of Claude Code, Codex, etc. I wanted to see if I could build it from scratch, so this is mostly educational for me.

The result is a surprisingly simple piece of software. At its core are immutable DAGs, which keep the design simple and easy to reason about.

I also added a set of built-in tools that are inspired by Claude Code's built-in tools.
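For flavor, the core loop behind such agentic systems can be sketched in a few lines. This is a generic illustration, not this library's actual API; `call_model`, the message dicts, and the tool registry are hypothetical stand-ins:

```python
# Minimal sketch of an agentic loop: call the model, execute any tool
# calls it requests, feed the results back, and repeat until the model
# answers without requesting tools. Note the immutable style: each turn
# builds a new message list rather than mutating the old one.
def run_agent(call_model, tools, messages):
    while True:
        reply = call_model(messages)
        messages = messages + [reply]          # new list, old one untouched
        if not reply.get("tool_calls"):
            return reply["content"], messages  # model is done
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**call["args"])
            messages = messages + [{"role": "tool", "content": str(result)}]
```

Real systems add streaming, error handling, and context management on top, but this loop is the whole trick.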

A bonus point: it can also capture Claude Code auth tokens, so you can use it with your Claude Code subscription. However, there is a chance that Anthropic will ban you if they detect this, so use it at your own risk.

P.S. Claude Code (the SDK) is closed-source, so I can't modify it for my use case or fix its buggy UI myself. That's one of the reasons I'm building this library.


> only ~1/3 of sessions see at least a flicker.

...after many months, for such a visible bug, is such a crazy thing to say.

In case the above comes across as too hostile, let me balance it: thank you to the Claude Code team for such an amazing product!


More than 30% of the times you use Claude Code it "flickers"? That can't be right. I use Neovim and Codex side by side with tmux, and both flicker roughly 0% of the time. What is Claude Code doing that makes it flicker so much? Seems strange.


(It's worth reading the gh comment I linked if you're interested in terminals!)

tl;dr: other programs like Neovim and Codex use the "alternate screen buffer", which means they don't use scrollback and reimplement their own scrolling. CC uses scrollback (because that's what most users expect), which it has to clear entirely and redraw whenever content changes (causing tearing/flickering). There's no way to incrementally update scrollback in a terminal.

(I also want to add some more flavor to the 1/3 metric, because I don't want it to be misinterpreted. "30% of the time you use CC it flickers" isn't quite accurate; it depends on screen height and what you do. Most people will not see _any_ flickers at all. Some people with short screens (typically VSCode users, because the terminal opens fairly short by default) will see flickers. Previously, if something rendered offscreen, users would see a flicker for _every subsequent frame_, regardless of whether anything was actually changing. Now they will only see a flicker occasionally, when it's _absolutely_ needed. Once or twice vs. thousands.

Additionally, the metric really tracks when CC emits a "clear scrollback" operation. If the user is in a terminal that supports DEC 2026 they won't see a flicker even if we emit that clear scrollback command.)
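For the curious, the two approaches described above come down to raw escape sequences. This is a generic sketch of the mechanism, not CC's actual implementation:

```python
# Alternate screen buffer (the Neovim/Codex approach): a separate screen
# with no scrollback, so the app repaints only the cells that changed.
ALT_SCREEN_ON = "\x1b[?1049h"
ALT_SCREEN_OFF = "\x1b[?1049l"

# DEC 2026 synchronized output: the terminal buffers everything between
# the pair and presents it as one atomic frame, hiding the clear+redraw.
SYNC_BEGIN = "\x1b[?2026h"
SYNC_END = "\x1b[?2026l"

def draw_frame(frame_text):
    """Scrollback-style redraw: clear scrollback (CSI 3J), home the
    cursor, clear the screen (CSI 2J), and repaint, all wrapped in a
    synchronized-update pair so DEC 2026 terminals show no flicker."""
    return SYNC_BEGIN + "\x1b[3J\x1b[H\x1b[2J" + frame_text + SYNC_END
```

On a terminal without DEC 2026 support, the clear and the repaint can hit the screen as separate frames, which is exactly the flicker being discussed.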


There is absolutely a way to incrementally update scrollback in a terminal, 100% flicker-free. Whether it works in every terminal is a different question. But if you can accept that your code will work in pretty much every modern terminal, this is absolutely doable. I doubt people are still using xterm and other older terminals for this. And in that case, you can fall back to the more compatible method.


I have a hypothesis: they haven't fixed this because they're using Claude Code to develop Claude Code. I'm a fan of Claude Code, but it isn't good enough to fix tricky issues like this. And because no one looks at the codebase themselves, the bug has gone unfixed for months. Sometimes all you need is an engineer to sit down for a weekend and fix the damn bug, not nine different Claude agents prompted to fix it.


Perhaps the engineer could sit down for 8 hours a day during the work week. The Silicon Valley obsession with having no life and working weekends is so endemic.


Interesting, I can see this being very similar to Nvidia's CuTe DSL. This hints that we are converging on a (locally) optimal design for Python-based DSL kernel programming.


> Once the model is fully released, scientists will be able to adapt and fine-tune it on their own datasets to better tackle their unique research questions.

This is in the press release, so they are going to release the weights.


Author here: (1) We didn't remove the stddev term. (2) We use token-level loss (every token has the same weight), which is very similar to what Dr. GRPO does. However, we compute the mean gradient per token, while Dr. GRPO computes the sum. Typically, these are equivalent. However, since we're also doing gradient accumulation over micro-batches to reduce memory usage during training, this led to a bug in our implementation: it gives more weight to tokens in short sequences than to those in long sequences.

Interestingly, this is the same bug that most open-source LLM training frameworks (such as HF Trainer) had and only recently fixed.

In short, I'm working on a quick fix; after that, using sum or mean should yield equivalent results.

P.S. Fixed!
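The mean-vs-sum interaction with gradient accumulation described above can be sketched numerically. The loss values here are made up purely for illustration:

```python
import numpy as np

# Per-token losses for two micro-batches of different lengths
# (values invented for illustration).
short_seq = np.array([4.0, 4.0])   # 2 tokens
long_seq = np.array([1.0] * 6)     # 6 tokens

# Buggy accumulation: take the mean within each micro-batch, then
# average the micro-batch means. Each token in the short sequence now
# carries 3x the weight of a token in the long one.
buggy = (short_seq.mean() + long_seq.mean()) / 2           # -> 2.5

# Correct token-level loss: every token weighted equally,
# regardless of which micro-batch it landed in.
correct = np.concatenate([short_seq, long_seq]).mean()     # -> 1.75
```

With equal-length micro-batches the two agree, which is why the bug is easy to miss until sequence lengths vary.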


Cool!


On a very similar theme, here is the work from World Labs (founded by Fei-Fei Li of ImageNet fame, et al.) on creating 3D worlds:

https://www.worldlabs.ai/blog


I find this work much more exciting. They're not just teaching a model to hallucinate given WASD input; they're generating durable, persistent point clouds. It looks so similar to Genie 2, yet they're worlds apart.


Who cares? If it brings benefits to the people who paid for the service, duh!


For context, the author is Steven Johnson, one of the key people behind Google's latest hit, NotebookLM.

For those who are curious: how can we technically support really long context windows (in the millions, or even billions, of tokens)? The short answer is simple: just use more GPUs. The long answer is detailed in my recent note here: https://neuralblog.github.io/scaling-up-self-attention-infer...
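The core idea behind "just use more GPUs" is that attention can be computed over KV chunks held on different devices and merged with an online softmax, so no single GPU ever needs the whole context. Here is a toy single-query NumPy sketch of that merge (my own illustration, not the note's actual code):

```python
import numpy as np

def chunked_attention(q, k_chunks, v_chunks):
    """Attention for one query over a sequence split into KV chunks.
    Partial results are merged with a running log-sum-exp, so each
    chunk (which could live on a separate GPU) is processed alone."""
    m = -np.inf   # running max of attention logits (for stability)
    num = 0.0     # running sum of softmax-weighted values
    den = 0.0     # running softmax denominator
    for k, v in zip(k_chunks, v_chunks):
        s = k @ q                       # logits for this chunk
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)       # rescale previous partial sums
        p = np.exp(s - m_new)
        num = num * scale + p @ v
        den = den * scale + p.sum()
        m = m_new
    return num / den
```

This is the same trick FlashAttention and ring-attention-style systems use; in a real multi-GPU setup the chunks are sharded across devices and the (num, den, m) triples are exchanged instead of the full KV cache.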


I'd rephrase that as "the author is the author, Steven Johnson, who is also one of the key people..."

I've read many of his books over the last twenty-some years (and even watched a PBS documentary series he hosted). I was aware via his Substack that he was collaborating somehow with the NotebookLM team, but I was rather startled when he demoed NotebookLM at a Google all-hands meeting a few weeks ago! Apparently he's a full-time product manager now.


He plays an essential role as the model for NotebookLM. As Raiza Martin, the PM for NotebookLM, mentioned on a recent podcast, Steven is the product. The NotebookLM team essentially emulated his workflow, how he conducts research and compiles information on a particular topic.

