Written poetry all my life, of varying badness. Never had the ear or the talent for music, though, and – unfortunately – always felt I wanted to write “songs,” not “poems.” Since 2017, I’ve been trying to “set my poems to music” using the machine. Started with my own algos in 2017, got going in earnest in 2020 with OpenAI’s Jukebox; then last year a friend turned me on to Suno.
Take the first poem talked about in OP’s article and one of the comments: “humans working hard to prove that they can make art that’s somehow even worse than AI slop.” I see this sort of comment a lot and I’m not saying that’s wrong at all – undoubtedly the vast vast majority of AI “content” is truly “slop.”
But I’ve also long believed that genAI can be thought of as an instrument. Most music played on a piano or a synth or a guitar is slop; but those instruments undoubtedly allow for music to be made that would otherwise not exist. I hope the same can be said of Suno (or whatever – hopefully open-sourced – alternative follows).
My understanding is that this is a (somewhat) open-source project that does music generation. I haven't read the license to see how permissive it truly is, but being someone who has been involved in this space for a while, I can say that we definitely need more open-source projects. Suno is great but completely walled off.
So yeah YueAI team... if you're really going to keep this project open... don't listen to the haters and keep going.
Shouldn't the "sonic boom" here provide good data as to the existence of dark matter (akin to the Bullet cluster)? Anyone on hn with good background care to comment? Don't see anything in the article about it, but would think is one of the most significant experimental goals from detecting these sorts of collisions.
My understanding is that, yes, the way matter in a galaxy merger behaves acts as strong evidence for the existence of dark matter and the theory that it's made of something that interacts weakly with normal matter.
I don't know...
It's like claiming that Samsung "enhanced their phone camera abilities" when they replaced zoomed-in moon shots with hi-res images of the moon.
I think that's meaningfully different. If you ask for chess advice, and get chess advice, then your request was fulfilled. If you ask for your photo to be optimized, and they give you a different photo, they haven't fulfilled your request. If GPT were giving Go moves instead of chess moves, or just generating random moves, that might be a better comparison. The nature of the user's intent is just too different.
It's cheating to the extent that it misrepresents the strength and reasoning ability of the model – to the extent that anyone looking at its chess-playing results would incorrectly infer that they say anything about how good the model is.
The takeaway here is that if you are evaluating different models for your own use case, the only indication of how useful each may be is to test it on your actual use case, and ignore all benchmarks or anything else you may have heard about it.
It represents the reasoning ability of the model to correctly choose and use a tool... Which seems more useful than a model that can do chess by itself but when you need it to do something else, it keeps playing chess.
Where it’ll surprise people is if they don’t realize it’s using an external tool and expect it to be able to find solutions of similar complexity to non-chess problems, or if they don’t realize this was probably a special case added to the program and that this doesn’t mean it’s, like, learned how to go find and use the right tool for a given problem in a general case.
I agree that this is a good way to enhance the utility of these things, though.
It doesn't take much to recognize a sequence of chess moves. A regex could do that.
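To illustrate – a minimal sketch in Python. This only matches the surface syntax of standard algebraic notation (SAN); it says nothing about whether a move is legal:

```python
import re

# Rough SAN pattern: castling, or an optional piece letter, optional
# disambiguation, optional capture, target square, optional promotion,
# then an optional check/mate marker.
SAN_MOVE = re.compile(
    r"^(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?$"
)

for token in ["e4", "Nf3", "exd5", "O-O", "e8=Q", "Qxe7+", "hello"]:
    print(token, bool(SAN_MOVE.match(token)))
```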
If what you want is intelligence and reasoning, there is no tool for that - LLMs are as good as it gets for now.
At the end of the day it either works on your use case, or it doesn't. Perhaps it doesn't work out of the box but you can code an agent using tools and duct tape.
Do you really think it's feasible to maintain and execute a set of regexes for every known problem every time you need to reason about something? Welcome to the 1970s AI winter...
Sure, but how do you train a smarter model that can use tools, without first having a less smart model that can use tools? This is just part of the progress. I don't think anyone claims this is the endgame.
I really don't understand what point you are trying to make.
Your original comment about a model that might "keep playing chess" when you want it to do something else makes no sense. This isn't how LLMs work - they don't have a mind of their own, but rather just "go with the flow" and continue whatever prompt you give them.
Tool use is really no different than normal prompting. Tools are internally configured as part of the hidden system prompt. You're basically just telling the model to use a specific tool in specific circumstances, and the model will have been trained to follow instructions, so it does so. This is just the model generating the most expected continuation as normal.
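To make that concrete, here's roughly what a tool definition looks like in OpenAI-style function calling (the play_chess_move tool is hypothetical, purely for illustration):

```python
# Hypothetical tool definition in OpenAI-style function calling.
# The model never runs anything itself; it just emits a structured
# "call this tool with these arguments" message, and your code executes
# the tool and feeds the result back as another message.
tools = [{
    "type": "function",
    "function": {
        "name": "play_chess_move",  # hypothetical tool name
        "description": "Return a strong move for the given chess position.",
        "parameters": {
            "type": "object",
            "properties": {
                "fen": {"type": "string", "description": "Position in FEN."},
            },
            "required": ["fen"],
        },
    },
}]
```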
"Is gpt-3.5-turbo-instruct function calling a chess-playing model instead of generating through the base LLM?"
I'm absolutely certain it is not. gpt-3.5-turbo-instruct is one of OpenAI's least important models (by today's standard) - it exists purely to give people who built software on top of the older completion models something to port their code to (if it doesn't work with instruction tuned models).
I would be stunned if OpenAI had any special-case mechanisms for that model that called out to other systems.
When they have custom mechanisms - like Code Interpreter mode - they tell you about them.
I think it's much more likely that something about instruction tuning / chat interferes with the model's ability to really benefit from its training data when it comes to chess moves.
It should be easy to test for. An LLM playing chess itself tries to predict the most likely continuation of a partial game it is given, which includes (it has been shown) internally estimating the strength of the players to predict equally strong or weak moves.
If the LLM is just a pass-through to a chess engine, then it's more likely to play at the same strength all the time.
It's not clear in the linked article how many moves the LLM was given before being asked to continue, or if these were all grandmaster games. If the LLM still crushes it when asked to continue a half played poor quality game, then that'd be a good indication it's not an LLM making the moves (since it would be smart enough to match the poor quality of play).
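Here's a rough sketch of that test, assuming python-chess, a local Stockfish binary, and a hypothetical ask_llm_for_move() wrapper around whatever model you're probing. If the centipawn loss stays uniformly near zero no matter how bad the prefix, suspect an engine behind the curtain:

```python
import chess
import chess.engine

def ask_llm_for_move(san_moves):
    """Hypothetical stand-in: prompt the model with the game so far,
    return its proposed next move in SAN."""
    raise NotImplementedError

def centipawn_loss(engine, board, move, depth=12):
    """How much worse `move` is than the engine's best, in centipawns."""
    mover = board.turn
    limit = chess.engine.Limit(depth=depth)
    best = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)
    board.push(move)
    played = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)
    board.pop()
    return best - played

# Prefixes of very different quality; in practice you'd use many, and longer ones.
prefixes = {
    "strong": ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6"],
    "weak":   ["a3", "e5", "h3", "d5", "a4", "Bd6"],
}

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # needs Stockfish installed
for label, sans in prefixes.items():
    board = chess.Board()
    for san in sans:
        board.push_san(san)
    move = board.parse_san(ask_llm_for_move(sans))
    print(label, centipawn_loss(engine, board, move))
engine.quit()
```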
LLMs have this unique capability. Yet, every AI company seems hell bent on making them... not have that.
I want the essence of this unique aspect, but better, not this unique aspect diluted with other aspects such as the pure logical perfection of ordinary computer software. I already have that!
The problem with every extant AI company is that they're trying to make finished, integrated products instead of a component.
It's as if you just wanted a database engine and every database vendor insisted on selling you a shopfront web app that also happens to include a database in there somewhere.
If that's what it does, then it's "cheating" in the sense that people think they're interacting with an LLM, but they're actually interacting with an LLM + chess engine. This could give the impression that LLMs are able to generalize to a much broader extent than they actually can – while it's actually all just a special-purpose hack. A bit like putting invisible guard rails on some popular, difficult test road for self-driving cars – it might lead you to think the car can drive that well on other difficult roads too.
Calling out to some chess-playing function would be a deviation from the pure LLM paradigm. As a medium-level chess player I have walked through some of the LLM victories (gpt-3.5-turbo-instruct); I find it is not very good at winning by mate – it misses several chances at forced mate. But forced mate is exactly what chess engines are good at – it can be found by exhaustive search of valid moves from a given board position.
So I'm arguing that it doesn't call out – it would have gotten better advice if it did.
But I remain amazed that OP does not report any illegal moves made by any of the LLMs. Assuming the training material includes introductory chess texts and a lot of games in textual notation (e.g. PGN), I would expect at least occasional illegal moves, since the rules are defined in terms of board positions, and board positions are a non-trivial function of the sequence of moves made in a game. Does an LLM silently perform a transformation of the sequence of moves into a board position? Can LLMs, during training, read and understand the board-position diagrams in chess books?
> But I remain amazed that OP does not report any illegal moves made by any of the LLMs.
They did (but not enough detail to know how much of an impact it had):
> For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly.
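For what it's worth, the "set of legal moves" part is straightforward with python-chess (a sketch; how you compile the list into a grammar depends on your serving stack):

```python
import chess

# Rebuild the position from the moves so far, then list every legal
# continuation in SAN - this is the list you'd compile into a grammar,
# or use to validate/retry the model's output.
board = chess.Board()
for san in ["e4", "e5", "Nf3"]:
    board.push_san(san)

legal_sans = sorted(board.san(m) for m in board.legal_moves)
print(legal_sans)
```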
I don't think it is, since OpenAI never mentions that anywhere, AFAIK. That would be a really niche feature to include and then drop instead of building on further.
Helping that along is that it's an obvious scenario to optimize, for all kinds of reasons. One of them being that it is a fairly good "middle of the road" test for integrating with such systems; not as trivial as "Let's feed '1 + 1' to a calculator" and nowhere near as complicated as "let's simulate an entire web page and pretend to click on a thing" or something.
Why would they only incorporate a chess engine into (seemingly) exactly one very old, dated model? The author tests o1-mini and gpt-4o. They both fail at chess.
Because they decided it wasn't worth the effort. I can point to any number of similar situations over the many years I've been working on things. Bullet-point features that aren't pulling their weight, or are no longer attracting the hype, often don't make the transition across upgrades.
A common myth that people have is that these companies have so much money they can do everything, and then they're mystified by things like bugs in Apple or Microsoft projects that survive for years. But from any given codebase, the space of "things we could do next" is exponential. That defeats any amount of money. If they're considering porting their bespoke chess engine code up to the next model, which absolutely requires non-trivial testing and may require non-trivial work, even for the richest companies in the world it is still an opportunity cost and they may not choose to spend their time there.
I'm not saying this is the situation for sure; I'm saying that this explanation is sufficient that I'm not going "oh my gosh this situation just isn't possible". It's definitely completely possible and believable.
Based on looking at the games at the end of the post, it seems unlikely. Both sides play extremely poorly — gpt-instruct is just slightly less bad — and I don't see any reasonable engine outputting those moves.
If the goal is to produce an LLM-like interface that generates correct output, then sure, it's not cheating... but is it really a data-driven LLM at that point? If the LLM amounts to a chat frontend that calls a host of human-prepared programs or draws from human-prepared databases, it starts to sound a lot more like Wolfram Alpha v2 than an LLM, and strikes me as walking away from AGI rather than toward it.
I would note that "hybrids" in China (where plug-in hybrids have grown to 16% of market share, up 700bps y/y) are a fundamentally different architecture than hybrids in the West. In China (see Li Auto [1] for example), hybrids are battery-electric vehicles (i.e. no gearbox, fully electric motors) with a small gasoline generator and tank to recharge the battery. This is "best of both worlds"... you get the electric motor, which is much more efficient / cheaper than an ICE drivetrain, and the gasoline generator is tuned purely to maximize efficiency (~44% efficiency to electric, vs. mid-30s for an ICE motor), so the net efficiency of the hybrid is far superior to a Prius-plug-in type structure. These are termed extended-range electric vehicles ("EREVs"), which are a type of plug-in hybrid, since you can recharge by plugging in the battery or by filling up with gas.
Really surprised we haven't seen these EREVs in the West, although Hyundai is supposed to launch one in the US in 2026. Could be a game-changer when that happens...
Yes! The difference is that current-gen EREVs are in a different league of performance, given the advancements in electric powertrains and software. Think of the difference between a Model 3 and a Chevy Bolt...
That does sound very high. Not sure where you are, but I'm guessing you're in the US at least.
As a comp, SunRun puts out detailed cost estimates for their residential systems each quarter [1]. Their average system cost was $5/watt, with about 50% of their installs including batteries. So for an 8 kW system (if you were buying outright) you should have gotten a quote of around $40k with a "half-sized" battery. After US tax incentives, your cost should be <$30k, including the battery.
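Back-of-envelope on that comp (assuming the incentive in question is the 30% federal ITC – my assumption):

```python
# Rough math on the SunRun comp; the 30% ITC figure is my assumption.
system_w = 8_000                  # 8 kW system
cost_per_watt = 5.00              # SunRun's reported average, ~half of installs with batteries
gross = system_w * cost_per_watt  # $40,000
net = gross * (1 - 0.30)          # ~$28,000 after the federal tax credit
print(f"gross = ${gross:,.0f}, net = ${net:,.0f}")
```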
But yeah, if you're not in a region where they do a lot of installs, you won't get that price...
Yes, the cost of GENERATING electricity will undoubtedly be cheaper at industrial scale than on your rooftop.
But that electricity is likely to be generated in a region very far from your (or someone else's) consumption – requiring a lot of money to lay transmission and distribution lines to the end consumer.
Co-locating with consumption makes the difference in total costs far closer.
Very location dependent, but please don't dismiss it offhand without considering the very real transmission costs.
I find your statement fascinating as well, but I could also offer another haha-but-serious explanation that I personally like... which is that there are universes with infinite permutations/computations/values of state variables, and ours could only be "simulated" within the one universe where these sorts of limits exist.
At this point, utility-scale solar projects have reached scale in the supply chain, and cost estimates tend to be quite accurate once you break ground and start construction. Items such as solar panels tend to be procured upfront with costs locked in, so even if a 2020-style supply-chain crisis hits, the extra cost tends to get passed down the chain (though, of course, that sort of crisis will always balloon costs). The main risk of cost slippage in this sort of project comes from two areas:
1) Pre-breaking ground (i.e. pre-Final Investment Decision): going through regulatory/licensing/permitting. This is exactly when you can't lock down long-lead-time items and are subject to market prices.
2) The transmission line: unlike utility-scale power plants, these tend to be one-off projects and also face a lot of social challenges once construction starts. That price/time could definitely balloon, even after breaking ground/FID.
I think Mutable is generating an auto-wiki of your repo.
Separately - I'd like to know whether a wiki can be auto-generated from a large corpus of text. That should be a much simpler problem? Any answers would be much appreciated!
Given that the company is called Mutable AI and they called the product (?) Auto Wiki then I have to assume that they auto-generated the wiki. But I agree that the wording is ambiguous and could be interpreted as "we manually created the wiki".
> So we went back to step 1, let’s understand the code, let’s do our homework, and for us, that meant actually putting an understanding of the codebase down in a document — a Wikipedia-style article — called Auto Wiki. The wiki features diagrams and citations to your codebase.
(Edit) Homepage makes it clear they're talking about ML-generated wiki https://mutable.ai/
Yeah, I mean, when I read "for us, that meant actually putting an understanding of the codebase down in a document," I assume that's a person. For me, current AIs don't have any understanding as such.
But yeah, upon closer inspection I see in the sidebar it says "Create your own wiki - AI-generated instantly". So that clears up my confusion.
And here’s one of my attempts, a song about the ethics of making music with the machine: https://www.youtube.com/watch?v=3w5HBrMenZM