Making complete coherent products is as hard as ever, or even harder if you intend to trade robustness for max agentic velocity.
What I do very successfully is low stakes stuff for work (easy automations, small QoL improvements for our tooling, a drive-by small Jira plugin)
And then I do a lot of crazy exploring, or hyper-personal just for myself stuff that can only exist because I can now spawn and abandon it in a couple days instead of weeks or months.
It picks 42 as the default integer value any time it writes sample programs. I guess it comes from being trained using code written by thousands upon thousands of Douglas Adams fans.
Basically every ML script I see has 42 as the default seed, even before LLMs. Pretty sure it's what I used for my thesis code, haha. So it's not surprising it always picks it.
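The convention is basically a one-liner at the top of every script; here's a stdlib-only sketch of the kind of helper where the habit shows up (the function itself is just my illustration):

```python
import random

SEED = 42  # the de facto default seed in countless ML scripts

def make_split(items, train_frac=0.8, seed=SEED):
    """Shuffle reproducibly and split into train/test sets.

    Using a fixed seed means every rerun produces the same split,
    which is the whole point of the 42 ritual.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```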
Right on, I was going off the OP's GSD link, which looks like the definition of a CLI wrapper to me. Hadn't seen Superpowers before; it seems way too deterministic and convoluted, but you're right, not a CLI wrapper.
There's a CLI tool that writes the agent skills into the right folder. The other option would be to have everybody manually unzip a download into a folder whose location they might not remember.
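The core of such an installer is just a copy into a well-known location; a hypothetical stdlib sketch (folder names and the function are invented for illustration, not the actual tool's paths):

```python
import pathlib
import shutil

def install_skills(src_dir, dest_dir):
    """Copy every skill file from an unpacked download into the
    folder the agent actually reads, so users never have to
    remember where that folder is."""
    src = pathlib.Path(src_dir)
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.iterdir()):
        if f.is_file():
            shutil.copy2(f, dest / f.name)  # preserves timestamps
            copied.append(f.name)
    return copied
```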
A couple weeks ago on a lark I asked Claude/Gemini/Codex to hallucinate a language they would like to program in and they always agreed on strong types, contracts, verification, proving and testing. So they ended up brainstorming a weird Forth-like with all those on top. I then kept prodding for an implementation and burned my weekly token budget until a lot of the language worked. They called it Cairn.
So now I prompted this: "can you generate a fizzbuzz implementation in Cairn that showcases as much as possible the TEST/PROVE/VERIFY characteristics of the language?"
Same originating idea: "a language for AI to write in" but then everything else is different.
The features of the two are quite orthogonal. Cairn is a general-purpose language with features that help in writing provably working code. Mog is more like "let's constrain our features so bad code can't do much, and trade that for good agent ergonomics".
Cairn is a crazy sprawling idea, Mog is a little attempt at something limited but practical.
Mog seems like something someone has actually thought about. No one has thought about Cairn; it's pure LLM hallucination. The fact that it exists and can do a lot of stuff is just the result of someone (me) not knowing when a joke has gone too far.
This enables a different kind of satisfaction. You still make all the design choices yourself, but you have a working REPL or small compiler to try them in within minutes.
Also, you decide how much control you keep. Want to provide a hand-made grammar? Go ahead. Want the agent to come up with one just from chatting and pointing it at other languages? OK too. Want to implement just the first arithmetic operator yourself, then skip the tedium of typing all the others so you can move on to the next step? Fine...
So you can have a huge toy language in mere days and experiment with ideas you'd otherwise have to build by hand for months before you could play with them.
My own 100% hallucinated language experiment is very, very weird, and it still has thousands of lines of generated examples that work fine. On complex stuff you could see the agent bounce against the tests here and there, but it never produced non-working code in the end. The only examples available were those it had generated itself as it made up the language.
It was capable of making things like a JSON parser/encoder, a TODO webapp or a command line kanban tracker for itself in one shot.
I let Gemini, Claude Code and Codex hallucinate the language they wanted for a few days. I prompted "design the language you'd like to program in" and kept prompting "go ahead". I just rescued it from a couple of too-deep rabbit holes, or asked it for some particular examples to stress it a bit.
It's a weird-ass Forth-like, but with a strong type system, contracts, native testing, fuzz testing, and a constraint solver for integer math backed by Z3. The interpreter is implemented in Elixir.
In about 150 commits, everything it has done has worked without runtime errors, both the Elixir interpreter and the examples in the hallucinated language, some of them non-trivial for a week-old language (a JSON parser, a DB-backed TODO web app).
It's a deranged experiment, but on the other hand it seems to confirm that "compile-time" analysis plus extensive testing facilities help LLM agents a lot, even in a weird language they have to write purely from in-context reference.
Don't click if you value your sanity; the only human-generated thing there is the About blurb:
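To make the verification claim concrete: the loop the agents lean on is roughly runtime contracts plus a fuzz loop. Here's a stdlib-Python analogue of that loop (my own illustration with invented names; Cairn's actual mechanism is Z3-backed and not shown here):

```python
import random

def contract(pre, post):
    """Wrap a function with pre/post-condition checks, a tiny
    stand-in for the language's native contracts."""
    def deco(fn):
        def wrapped(*args):
            if not pre(*args):
                raise ValueError("precondition violated")
            result = fn(*args)
            if not post(result, *args):
                raise ValueError("postcondition violated")
            return result
        return wrapped
    return deco

@contract(pre=lambda xs: all(isinstance(x, int) for x in xs),
          post=lambda out, xs: out == sorted(xs))
def my_sort(xs):
    return sorted(xs)

# Fuzz loop: hammer the contract with random inputs, the way
# the language's native fuzz testing facility reportedly does.
for _ in range(100):
    data = [random.randint(-50, 50) for _ in range(random.randint(0, 10))]
    my_sort(data)
```

When the agent writes a buggy body, the contract fires during the fuzz loop instead of the bug surfacing later, which is the feedback signal that seems to keep it honest.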
Interesting project, but I believe the base assumption is already slightly wrong. Why do we assume that LLMs know what kind of language would benefit them? This information is not knowable without doing proper research, and even if there is some research like that, it would have to be a part of the training data. Otherwise it's just hallucination.
I agree, it's mostly a silly whim taken too far. Too much time on my hands.
In particular the whole stack based thing looks questionable.
In fact the very first answer by Gemini proposed an APL-like encoding of the primitives for token saving, but when I started the implementation Claude Code pushed back on that, saying it would need to keep some sane semantics around the keywords to be able to understand the programs.
The very strict verification story seems more plausible, tracks with the rest of the comments here.
What has surprised me is that the language works at all; adding todo items to a web app written in a week-old language felt a bit eerie.
Have the LLMs generate tests that measure the “ease of use” and “effectiveness” of coding agents using the language.
Then have them use these tests to get data for their language design process.
They should also smoke test their own "meta process" here: e.g., write a toy language that should obviously be much worse for LLMs, and then verify that the effectiveness tests produce a result agreeing with that.
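A minimal sketch of what such an effectiveness test could report (task names and results are invented for illustration): pass rate per language variant, with the deliberately hostile variant expected to score measurably worse.

```python
def pass_rate(results):
    """results: list of (task_name, passed) pairs from running each
    agent-written program against its test suite."""
    if not results:
        return 0.0
    return sum(1 for _, passed in results if passed) / len(results)

# Hypothetical smoke test of the meta process: a deliberately
# LLM-hostile language variant should score worse than baseline.
baseline = [("fizzbuzz", True), ("json_parser", True), ("todo_app", True)]
hostile = [("fizzbuzz", True), ("json_parser", False), ("todo_app", False)]
```

If `pass_rate(hostile)` didn't come out below `pass_rate(baseline)`, that would suggest the metric isn't measuring language ergonomics at all.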
Wow, that is wild; that is exactly along the lines of my fantasy language. It'd be so easy to dive into the deep end, building tooling and improving a language like this.
This is actually quite impressive, especially as AI vibe-coded slop. How easy is the language to learn for novice coders, compared to other FORTH lookalikes?
There's a lot of language for so little time, but if you have programmed any Forth it should be easy to pick up; have a look at some of the top-level examples.
I have written about three Forth implementations by hand over the years for fun, but I have never been able to really program in it, because the stack wrangling confuses me enormously.
So for me anything vaguely complex is unreadable, but apparently not for the LLMs, which I find surprising. When I have interrogated them, they say the lack of syntax helps them more than the stack ops hamper them, but that might just be a hallucinated impression.
When they write Cairn I sometimes see stack related error messages scroll by, but they always correct them quickly before they stop.