Hacker News: gbrindisi's comments

This is pretty much a spec-driven workflow.

I do something similar, but my favorite step is the first: /rubberduck, to discuss the problem with the agent, which is instructed by the command to help me frame and validate it. Hands down the most impactful piece of my workflow, because it helps me reach the right clarity, and I can use it for non-coding tasks too.

After that comes the usual: write PRDs, specs, and tasks, then build, then verify the output.

I started with one of the spec frameworks and eventually simplified everything down to the bone.

It feels like it's working great, but I sometimes fear a lot of this might still be productivity theater.


I think most of us are ending up with a similar workflow.

Mine is: 1) discuss the thing with an agent; 2) iterate on a plan until I'm happy (reviewing carefully); 3) write down the spec; 4) implement (tests first); 5) manually verify that it works as expected; 6) review (another agent and/or manually) + mutation testing (to see what the tests missed); 7) update docs or other artifacts as needed; 8) done

No frameworks, no special tools, works across any sufficiently capable agent, I scale it down for trivial tasks, or up (multi-step plans) as needed.

The only thing that I haven't seen widely elsewhere (yet) is the mutation-testing part. The (old) idea is that you deliberately change the codebase to check that your tests catch the bugs. This was usually done with dedicated mutation-testing tools, but now I can just tell the LLM to introduce plausible-looking bugs.


> write PRDs, specs.

I do the same thing, but how do you avoid these becoming insanely long? It's like I need to plug all these little holes in the side of a water jug because the AI didn't really get what I need. Once I've plugged the biggest holes, I realize there are micro-holes I still need to plug.


Can you share the rubberduck skill?

Are agents/ still relevant now that we have skills? I'm genuinely confused about why I would need custom system prompts for specific agents; what should I use them for?


thanks for raising the alarm and sharing this, very insightful

(also beautifully presented!)


1. I don't have hard metrics at hand, but with the latest Sonnet I'd say we reach consensus around 80% of the time; with Opus it's almost always, but we're not using it due to cost.

2. The difference I see in agent behavior when they don't reach consensus is usually either

- one of them didn't explore enough and lacks context

- and/or their risk assessment is off

The latter happens often; in other agent-based workflows we now give clear instructions on how to assess risk and where to draw the line when deciding something is a true positive.

3. Validation is on Sonnet. We don't use persona-based prompts; all three validators get the same task and context. The agent orchestrating them takes their output and makes the final decision. We use an internal fork of the Claude Code GitHub Action for now.
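A minimal sketch of that fan-out step, with made-up names (the actual setup runs inside a fork of the Claude Code GitHub Action, and the real final decision is made by an orchestrating agent rather than a vote):

```python
from collections import Counter

def validate_with_consensus(run_validator, task, context, n_validators=3):
    """Fan the same task and context out to n validators, then settle
    on a verdict by simple majority. run_validator is a stand-in for
    whatever agent invocation the real orchestrator uses."""
    verdicts = [run_validator(task, context) for _ in range(n_validators)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count > n_validators // 2 else "no-consensus"

# Usage with a stubbed validator that returns canned verdicts:
stub_verdicts = iter(["true-positive", "true-positive", "false-positive"])
result = validate_with_consensus(
    lambda task, ctx: next(stub_verdicts), "triage finding", "ctx"
)
# result == "true-positive" (2 of 3 validators agreed)
```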


I like openspec, it lets you tune the workflow to your liking and doesn’t get in the way.

I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.

I think the point of any spec-driven framework is that you eventually want to own the workflow yourself, so that you can constrain code generation on your own terms.


I also like openspec.

I think these types of systems (gsd/superpowers) are way too opinionated.

It's not that they can't or don't work. I just think that the best way to truly stay on top of the crazy pace of changes is to not attach yourself to super opinionated workflows like these.

I'm building an orchestrator library on top of openspec for that reason.


I am doing something similar: I use openspec to create context and a sequential task list that I feed to Ralph loops, so that I'm involved in the planning and verification steps but completely hands off the wheel during code generation.


Exactly that. I initially created an "Open Ralph" loop within Claude directly, with review gates per phase in the OpenSpec task list.

But it was always just a workaround for what I truly wanted (and what I'm building now): a fully external, managed orchestrator loop. The agents aren't aware of the loop; they don't need to be.


Fifteen years ago I used to do mobile pentests for banks, and when we couldn't find anything significant for the report we could always count on "lack of rooting detection" and pin the risk on some vague mobile-banking malware threat pushed by marketing. I am sorry I contributed to this nonsense.

100% security theater, and here we are.


It's understandable; I would maybe expect an extra verification step for a sensitive app, like: "we noticed this is the first time you are using this system, which is not locked down; please type in the token we have mailed you".

But locking users out (which may not be directly the bank's fault, since it relies on the OS's security APIs) seems anti-competitive.


Ah, I also built my own sandbox, and at least twice the agent inside tried really hard to get around the firewall, so I ended up intercepting calls to `connect` to return a message that says "Connection refused by the sandbox, don't try to bypass".

Code here: https://github.com/gbrindisi/agentbox
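The trick can be sketched in Python (just an illustration of the idea, not the actual agentbox implementation, which intercepts at a lower level; the allowlist here is made up):

```python
import socket

ALLOWED_HOSTS = {"api.anthropic.com"}  # hypothetical allowlist

_real_connect = socket.socket.connect

def guarded_connect(self, address):
    """Refuse disallowed destinations with an explicit message, so the
    agent gives up instead of probing for a way around the firewall."""
    host = address[0]
    if host not in ALLOWED_HOSTS:
        raise ConnectionRefusedError(
            "Connection refused by the sandbox, don't try to bypass"
        )
    return _real_connect(self, address)

socket.socket.connect = guarded_connect
```

The point is the error message: a bare refusal invites the agent to keep probing, while an explicit "don't try to bypass" makes it stop.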


The most annoying thing with Google Workspace is that you need super-admin privileges to properly audit the environment programmatically, I believe because of the Cloud Identity API.


I noticed that too, and it's kind of scary. Soon we'll have the opposite of cancelling: the target will be deepfaked into saying everything and its opposite, nullifying their signal-to-noise ratio.


The CrowdStrike incident taught us that no one is going to review any dependency whatsoever.


Yep, that's what late-stage capitalism leaves you with: consolidation, abuse, helplessness, and, as a result, complacency and widespread incompetence.

