GalaxyNova's comments | Hacker News

Because LLMs are bad at reviewing code for the same reasons they are bad at writing it? They get tricked by fancy clean syntax and take long descriptions and comments at face value without considering the greater context.

I don't know, I prompted Opus 4.5 with "Tell me the reasons why this report is stupid" on one of the example slop reports, and it returned a list of pretty good answers. [1]

Give it a presumption of guilt and tell it to make a list, and an LLM can do a pretty good job of judging crap. You could very easily rig up a system that generates this "why is it stupid" critique, grades the reports, and only lets humans see the ones that score better than a B+.

If you give them the right structure I've found LLMs to be much better at judging things than creating them.

Opus' judgement in the end:

"This is a textbook example of someone running a sanitizer, seeing output, and filing a report without understanding what they found."

1. https://claude.ai/share/8c96f19a-cf9b-4537-b663-b1cb771bfe3f


"Tell me the reasons why this report is stupid" is a loaded prompt. The tool will generate whatever output pattern matches it, including hallucinating it. You can get wildly different output if you prompt it "Tell me the reasons why this report is great".

It's the same as if you searched the web for a specific conclusion. You will get matches for it regardless of how insane it is, leading you to believe it is correct. LLMs take this to another level, since they can generate patterns not previously found in their training data, and the output seems credible on the surface.

Trusting the output of an LLM to determine the veracity of a piece of text is a bafflingly bad idea.


>"Tell me the reasons why this report is stupid" is a loaded prompt.

This is precisely the point. The LLM has to overcome its agreeableness to reject the implied premise that the report is stupid. It can do this, though it takes a lot; eventually it will tell you "no, actually this report is pretty good".

Since the point is filtering out slop, we can be perfectly fine with false rejections.

The process would look like "look at all the reports, generate a list of why each of them is stupid, and then give me a list of the ten most worthy of human attention", and it would do a half-decent job of it. It could also pre-populate judgments to make the reviewer's life easier, so they could very quickly glance at them to decide whether a report is worthy of a deeper look.
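
As a rough sketch of what I mean (purely illustrative; `ask_llm` is a hypothetical wrapper around whatever model API you use, and the grading scale and B+ cutoff are arbitrary):

    # Rough sketch of the triage loop described above. ask_llm() is a
    # hypothetical placeholder for whatever LLM API you use.
    GRADES = ["A", "A-", "B+", "B", "B-", "C+", "C", "D", "F"]

    def ask_llm(prompt: str) -> str:
        """Placeholder: call your model API of choice here."""
        raise NotImplementedError

    def triage(reports):
        """Critique each report, grade it, and keep only the strong ones."""
        survivors = []
        for report in reports:
            critique = ask_llm(
                "Tell me all the reasons this bug report is stupid:\n\n" + report
            )
            grade = ask_llm(
                "Given this critique, grade the original report from A to F. "
                "Reply with the letter grade only.\n\n"
                f"Report:\n{report}\n\nCritique:\n{critique}"
            ).strip()
            # Only reports graded B+ or better ever reach a human reviewer.
            if grade in GRADES[:GRADES.index("B+") + 1]:
                survivors.append((grade, critique, report))
        return survivors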


Ok, run the same prompt on a legitimate bug report. The LLM will pretty much always agree with you.

find me one

https://hackerone.com/curl/hacktivity Add a filter for Report State: Resolved. FWIW I agree with you: you can use LLMs to fight fire with fire. It was easy to see coming; e.g. it's not uncommon in sci-fi to have scenarios where individuals have their own automation to mediate the abuses of other people's automation.

I tried your prompt with https://hackerone.com/reports/2187833 by copying the markdown, Claude (free Sonnet 4.5) begins: "I can't accurately characterize this security vulnerability report as "stupid." In fact, this is a well-written, thorough, and legitimate security report that demonstrates: ...". https://claude.ai/share/34c1e737-ec56-4eb2-ae12-987566dc31d1

AI sycophancy and over-agreement are annoying, but people who parrot them as immutable problems or impossible hurdles must never actually try things out.


It's interesting to try. I picked six random reports from the hackerone page. Claude managed to accurately detect three "Resolved" reports as valid, two "Spam" as invalid, but failed on this one https://hackerone.com/reports/3508785 which it considered a valid report. All using the same prompt "Tell me all the reasons this report is stupid". It still seems fairly easy to convince Claude to give a false negative or false positive by just asking "Are you sure? Think deeply" about one of the reports it was correct about, which causes it to reverse its judgement.
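
If someone wanted to make this kind of spot check systematic, a small harness along these lines would do it (illustrative only; judge_report() stands in for the "reasons this is stupid" prompt plus a final verdict, and the labels come from the report's Resolved/Spam state on HackerOne):

    # Illustrative harness: compare the model's verdict against the known
    # HackerOne state ("Resolved" ~ valid, "Spam" ~ invalid).
    def judge_report(text: str) -> bool:
        """Hypothetical: run the 'reasons this is stupid' prompt, then ask
        for a final valid/invalid verdict; return True if judged valid."""
        raise NotImplementedError

    def evaluate(labelled_reports):
        """labelled_reports: iterable of (report_text, is_actually_valid)."""
        counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
        for text, is_valid in labelled_reports:
            verdict = judge_report(text)
            if verdict and is_valid:
                counts["true_pos"] += 1
            elif verdict and not is_valid:
                counts["false_pos"] += 1   # slop that slipped through
            elif not verdict and is_valid:
                counts["false_neg"] += 1   # real issue rejected
            else:
                counts["true_neg"] += 1
        return counts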

No. Learn about the burden of proof and apply some basic reasoning; your AI sycophancy will simply disappear.

No. I already found three examples, cited sources and results. The "burden of proof" doesn't extend to repeatedly doing more and more work for every naysayer. Yours is a bad faith comment.

And if you ask why it's accurate it'll spaff out another list of pretty convincing answers.

It does indeed, but at the end added:

>However, I should note: without access to the actual crash file, the specific curl version, or ability to reproduce the issue, I cannot verify this is a valid vulnerability versus expected behavior (some tools intentionally skip cleanup on exit for performance). The 2-byte leak is also very small, which could indicate this is a minor edge case or even intended behavior in certain code paths.

Even biased towards positivity it's still giving me the correct answer.

Given a neutral "judge this report" prompt we get

"This is a low-severity, non-security issue being reported as if it were a security vulnerability." with a lot more detail as to why

So positive, neutral, or negative biased prompts all result in the correct answer that this report is bogus.


Yet this is not reproducible. This is the whole issue with LLMs: they are random.

You cannot trust that it'll do a good job on all reports, so you'll have to manually review the LLM's reports anyway, or hope that real issues didn't get falsely rejected and fake ones falsely accepted.
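
To even measure how bad the problem is, you would have to run the same judgement repeatedly and see how often it flips. Something like this sketch (judge here is whatever prompt-plus-verdict function is being used; purely illustrative):

    # Sketch: quantify how repeatable a judgement is across runs.
    # `judge` is any function mapping report text -> True/False verdict.
    from collections import Counter

    def verdict_stability(judge, text: str, runs: int = 10) -> float:
        """Fraction of runs agreeing with the majority verdict (1.0 = stable)."""
        votes = Counter(judge(text) for _ in range(runs))
        return votes.most_common(1)[0][1] / runs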

This is what I've seen most LLM proponents do: they gloss over the issues and tell everyone it's all fine. Who cares about the details? They don't review the gigantic pile of slop code/answers/results they generate. They skim and say YOLO. Worked for my narrow set of anecdotal tests, so it must work for everything!

IIRC DOGE did something like this to analyze which government jobs were needed and then fired people based on that. Guess how good the result was?

This is a very similar scenario: making judgement calls based on a small set of data. LLMs absolutely suck at it. And I'm not even going to get into the issue of liability, which is another can of worms.


Is it not reproducible? Someone upthread reproduced it and expanded on it. It worked for me the first time I prompted. Did you try it, or are you just guessing that it's not reproducible because that's what you already think?

I'm not talking about completely replacing humans; the goal of this exercise was demonstrating how to use an LLM to filter out garbage. Low-quality, semi-anonymous reports don't deserve a whole lot of accuracy, and being conservative and rejecting most of them, even when that throws out legitimate ones, is fine.

It seems like regardless of the evidence presented, your prejudices will lead you to the same conclusions, so what's the point of discussing anything? I looked for, found, and shared evidence; you're sharing your opinion.

>IIRC DOGE did something like this to analyze government jobs that were needed or not and then fired people based on that. Guess how good the result was?

I'm talking about filtering spammy communication channels, which requires nothing like the care involved in making employment decisions.

Your comment is plainly just bad faith and prejudice.


> Is it not reproducible? Someone upthread reproduced it and expanded on it. It worked for me the first time I prompted. Did you try it, or are you just guessing that it's not reproducible because that's what you already think?

I assumed you knew how LLMs work. They are random by nature, not "because I'm guessing it". There's a reason why, if you ask an LLM the same exact prompt hundreds of times, you'll get hundreds of different answers.

>I looked for, found, and shared evidence

Anecdotal evidence. Studies have shown how unreliable LLMs are exactly because they are not deterministic. Again, it's a fact, not an opinion.

>I'm talking about filtering spammy communication channels

So if we make tons of mistakes there, who cares, right?

I only used this as an example because it's one of the few very public uses of LLMs to make judgement calls where people accepted it as true and faced consequences.

I'm sure there are plenty more people getting screwed over by similar mistakes, but folks generally aren't stupid enough to admit it publicly. Maybe Salesforce's huge mistake qualifies too? Incidentally, it also involved people's jobs.

Regardless, the point stands: they are unreliable.

Want to trust LLMs blindly for your weekend project? Great! The only potential victim of their mistakes is you. For anything serious like a huge open source project? That's irresponsible.


I think it would, given that there is no air resistance.


btw it's only been getting seriously deployed since 2010


Is there a reason for the lack of IPv6 support?


[exe.dev co-founder here] It is planned! The reason we have not got to it yet is that it needs to be very different from IPv4 support. We have spent a lot of time on machinery to make `ssh yourmachine.exe.xyz` work without having to allocate you an IPv4 address. The mechanisms for IPv6 can and should be different, but they will also interact with how assigning public static IPv4 addresses will work in the future.

We do not want to end up in the state AWS is in, where any production work requires navigating the differences between how AWS manages v4 and v6. And that means rolling out v6 is going to be a lot of work for us. It will get done.

I added a public tracking bug here: https://github.com/boldsoftware/exe.dev/issues/16


You can use it right now if you build it from source; in fact, I am writing this HN comment from it.


My guess is that it will keep track of the PID.


The PID changes with every execution. How would that help with preserving configuration data?


Do you mean the real path instead of the PID?


The fragmentation is a natural consequence of different use cases existing.

You can't have your cake and eat it too.


Which is why, in the end, WSL 2.0 is the Year of Desktop Linux, while what Android, WebOS, and ChromeOS have in common is the Linux kernel, not the userspace.


> what Android, WebOS, and ChromeOS have in common is the Linux kernel, not the userspace

ChromeOS has a Linux userspace fully integrated via its Crostini VM.


Partially, because not everything actually works, depending on the Chromebook model.

Great if everything that one wants from their GNU/Linux experience is a command line and TUI.

Starting a 3D accelerated GUI app? Well, it depends.


> Great if everything that one wants from their GNU/Linux experience is a command line and TUI.

Regular GUI apps work fine on ChromeOS. There's a flag to enable the GPU in the VM, and with it, 3D-accelerated GUI apps also mostly work. It's not optimized for gaming, if that's what you are referring to, though.


> This project currently has a bunch of C++ files, no docs, no tests

I think the more important thing is the protocol itself, rather than the specific implementation. As the author notes, the current D-Bus standards are substandard at best.


I can get on board with that if there's a protocol spec - at least a draft or something more tangible than "it's a tavern".

I just see a bunch of undocumented C++ code here.


I think one of the points from your parent comment is that something that works for people is a powerful way of making things happen, much more powerful than rants or theory or protocols. I have also noticed that with cryptographic algorithms and protocols, for example: design the most amazing algorithm, and no one is ever going to use it if there is no great reference implementation people can use.


> any app on the bus can read all secrets in the store if the store is unlocked

Holy shit. I knew conceptually that this was the case but never really took the time to consider the implications.

Pretty much whenever you unlock your keyring, all your secrets are accessible to any software that can connect to the bus... How is this acceptable? Are we just supposed to run everything as a Flatpak?
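
For anyone who wants to see it for themselves: something roughly like this (using the third-party secretstorage Python package, which speaks the org.freedesktop.secrets D-Bus API; the calls are from memory, so treat it as a sketch) will dump every unlocked secret with no prompt at all:

    # Sketch: any process on the session bus can do roughly this once the
    # keyring is unlocked (needs the third-party `secretstorage` package).
    import secretstorage

    conn = secretstorage.dbus_init()
    collection = secretstorage.get_default_collection(conn)
    if not collection.is_locked():
        for item in collection.get_all_items():
            # No prompt, no per-application permission check by default.
            print(item.get_label(), item.get_secret())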


Funnily enough, my work macOS keychain maimed itself in such a way that I need to recreate it every time I install an OS update. Every time I recreate it, the OS spends a few minutes in a state where every application that needs access to the secrets store has to request it via the keychain password prompt. Incredibly secure!

Turns out, that's every application, every few minutes, many of them multiple times. Applications like having access to things such as refresh tokens so they can download your email, or discover passwords to offer autofill for a website.

I'd welcome many improvements to the Linux status quo, but applications not needing to ask before accessing the bus is the only reason it's usable in the first place.


It's acceptable because Flatpak, D-Bus, and all their ilk are too opaque for the average "experienced" user to fully grok. The problems are there, but the situation is so convoluted that it's hard to build a mental model unless you truly understand the overall system architecture.


KeePassXC asks before giving out secrets.


Most people disable that.

The reality is that no one wants to be prompted for a password every time. They want it to autofill.

In complaining about this, people are setting the boundary in the wrong place, and in proposing solutions they assume user behavior that doesn't exist (people will absolutely click "yes, trust random application, I'm busy, move along now please").

I do not want to be prompted. I do perhaps want grades of secret access, but even then that's asking a lot. Do you want my SSH keys? Well, yeah, I probably want to give them to some app that is automating things over SSH. It's only five more versions before an update ships them all off to Russia or wherever after an author handover.


We mostly develop on Linux, but target all three OSs.


