I'm partial to bioinformatics as per Paulien Hogeweg's definition, which explicitly treats computation as a property of life.
This approach actually makes testable (and tested) scientific predictions.
This makes Searle-derived papers super-weird for me, since from my perspective they seem to disprove the existence of life. (and it makes the name of the philosophy "biological naturalism" very ironic to me :-P )
(for extra irony, Turing actually went into biology late in his life. See: Turing 1952 "The Chemical Basis of Morphogenesis" )
The architecture of zswap does make more sense, because you might as well absorb the (comparatively small) speed and latency cost of compression on the way to the (much larger) cost of writing to storage.
Why does it have to be reserved to the security space? Here is my API, please find the vulnerabilities I missed (otherwise someone with an unrestricted AI will find them first).
Cat is out of the bag.
Removing restrictions will help everybody in the long run.
You just put a pile of tokens in front of all the good models and let them fight it out like Thunderdome. Then keep track of how they undermined each other and do that when you want to do some hackin’.
OpenAI had been very strict about blocking reverse engineering/Ghidra/IDA_Pro-MCP tasks. I even got a warning email. I was having much more success convincing Claude Code to do those tasks, without warnings. Seems like they've tightened things up.
> It's easier to produce vulnerable code than it is to use the same Model to make sure there are no vulnerabilities.
I once had a car where the engine was more powerful than the brakes. That was one heck of an interesting ride.
So now we have a company that supplies a good chunk of the world's software engineering capability.
They're choosing a global policy that works the same as my fun car: powerful generative capacity, but the corrective capacity gated behind forms and closed doors.
Anthropic themselves are already predicting big trouble in the near term [1], but imo they've gone and done the wrong thing.
Pandora is an interesting parable here: Told not to do it, she opens the box anyway, releases the evils, then slams the lid too late and ends up trapping hope inside.
Given their model naming scheme, they should read more Greek Mythos. (and it was actually a jar ;-)
> its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities)
I wonder if this means that it will simply refuse to answer certain types of questions, or if they actually trained it to have less knowledge about cyber security. If it's the latter, then it would be worse at finding vulnerabilities in your own code, assuming it is willing to do that.
I can confirm from experience that reviewing your own code for vulnerabilities has fallen under "prohibited uses" starting with Opus 4.6 as recently as April 10, forcing me to spend a day troubleshooting and quarantining state from my search system.
"This request triggered restrictions on violative cyber content and was blocked under Anthropic's Usage Policy. To learn more, provide feedback, or request an exemption based on how you use Claude, visit our help center: https://support.claude.com/en/articles/8241253-safeguards-wa..."
"stop_reason":"refusal"
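If you're scripting against the API, the refusal at least surfaces as a machine-readable stop reason, so you can detect it and route the request elsewhere. A minimal sketch (the response shape here is assumed from the snippet above, not the full API schema):

```python
def is_policy_refusal(response: dict) -> bool:
    """True when the model stopped because of a policy refusal.

    `response` is assumed to be the parsed JSON body of a
    chat/messages API call; only `stop_reason` is inspected here.
    """
    return response.get("stop_reason") == "refusal"


# Shape as seen in the blocked request above (content abbreviated):
resp = {"stop_reason": "refusal", "content": []}
if is_policy_refusal(resp):
    print("blocked by usage policy; queueing for the exemption form")
```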
To be fair, they do provide a form at https://claude.com/form/cyber-use-case which you can use, and in my case Anthropic actually responded within 24 hours, which I did not expect.
I admit I'm now once bitten twice shy about security testing though.
Opus 4.7 was still 'pausing' (refusing) random things on the web interface when I tested it yesterday, so I'm unable to confirm whether the form applies to 4.7, or how narrow the exemptions are.
I've not had the issue with Codex. I was testing a public API I work on for issues; Codex was happy to attempt to break it, but did refuse to create a script that would automate the issue it found.
I'm assuming finding vulnerabilities in open source projects is the hard part and what you need the frontier models for. Writing an exploit given a vulnerability can probably be delegated to less scrupulous models.
Currently 4.7 is suspicious of literally every line of code. It may be a bug, but it shows how much they care about end-users when something with this massive an impact can ship with no one catching it before release.
Good luck trying to do anything about securing your own codebase with 4.7.
(I haven't run my own mail-server in a while. It's getting harder and harder.)
Are the real-time-blackhole lists still a thing?
If they're regularly allowing spam and not responding to reports in any sort of timely manner, possibly they should be reported to those.
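For context on how those lists work mechanically: the lookup is just DNS. You reverse the IPv4 octets, prepend them to the list's zone, and check whether an A record resolves (an answer, commonly in 127.0.0.0/8, means listed; NXDOMAIN means not listed). A sketch of the query-name construction (zen.spamhaus.org is one well-known zone; whether you may query it depends on its terms):

```python
def dnsbl_query_name(ipv4: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNS name used to check an IPv4 address against a DNSBL.

    Convention: octets reversed, joined with dots, under the list's zone.
    """
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone


print(dnsbl_query_name("203.0.113.7"))  # 7.113.0.203.zen.spamhaus.org
```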
Not going to work though, is it? Too big to fail shouldn't be a thing. It's not like you can't be flexible about it or give them some room to deal with it within corporate policy; but they do need to deal with it, right?
Realistically, I think some companies have outgrown the size at which the internet can still self-regulate them. You'd hurt yourself more than Gmail.
This either needs laws or new game theory.
Or -you know- deprecate the current email system. I know that's a perennial proposal, but that's because every year it gets even more broken in even more interesting ways. It's patch-on-patch-on-patch at the moment. Just spinning up sendmail on a random box won't quite cut it anymore, if you want to participate.
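To illustrate the "sendmail on a random box won't cut it" point: before the big providers will accept your mail at all, you need at least SPF, DKIM, and DMARC published in DNS. A minimal zone-file sketch (example.com, the selector, and the truncated DKIM key are all placeholders):

```
example.com.                 IN TXT "v=spf1 mx -all"
sel1._domainkey.example.com. IN TXT "v=DKIM1; k=rsa; p=MIIBIjANBg..."
_dmarc.example.com.          IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
```

And even with all three in place, IP reputation tends to dominate in practice.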
Questions this raises for me (making a note here to maybe research a bit later):
Does this analysis change if using on-site AI? What if the ToS is different? Is it possible to stand up a service that does get the protections required? This might also be interesting when dealing with trans-atlantic work.
People have tried to run Qwen3-235B-A22B-Thinking-2507 on 4x $600 used Nvidia 3090s with 24 GB of VRAM each (96 GB total), and while it runs, it is too slow for production use (<8 tokens/second). So we're already at $2,400 before you've purchased system memory and a CPU, and it is still too slow for a "Sonnet equivalent" setup...
You can quantize it, of course, but if the idea is "as close to Sonnet as possible," then note that while quantized models are objectively more efficient, they sacrifice precision for it.
So the next step is to up that speed: 4x $1,300 Nvidia 5090s with 32 GB of VRAM each (128 GB total), or $5,200 before RAM/CPU/etc. All of this additional cost just to increase your tokens/second without lobotomizing the model, and it still may not be enough.
I guess my point is: You see this conversation a LOT online. "Qwen3 can be near Sonnet!" but then when asked how, instead of giving you an answer for the true "near Sonnet" model per benchmarks, they suddenly start talking about a substantially inferior Qwen3 model that is cheap to run at home (e.g. 27B/30B quantized down to Q4/Q5).
Local models that are "near Sonnet" absolutely DO exist. The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical one. A $10K all-in budget isn't actually insane for this class of model, and the sky really is the limit (again, to reduce quantization and/or increase tokens/second).
PS - And electricity costs are non-trivial for 4x 3090s or 4x 5090s.
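To put rough numbers on the electricity point, a sketch (350 W and 575 W are the nominal board powers for the 3090 and 5090; the 30% average utilization and $0.15/kWh rate are pure assumptions):

```python
def monthly_cost_usd(n_gpus: int, watts_per_gpu: float,
                     utilization: float = 0.30,
                     usd_per_kwh: float = 0.15) -> float:
    """Rough monthly electricity cost for a multi-GPU inference box."""
    hours = 24 * 30
    kwh = n_gpus * watts_per_gpu * utilization * hours / 1000
    return kwh * usd_per_kwh


print(f"4x 3090: ${monthly_cost_usd(4, 350):.0f}/month")
print(f"4x 5090: ${monthly_cost_usd(4, 575):.0f}/month")
```

Not ruinous at average US rates, but it compounds with the hardware spend, and at European electricity prices it roughly doubles.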
Qwen3.5-35B-A3B is reported to perform slightly better than the model you mentioned.
It runs fine, if not optimally, on a single 3090 even with 131072 tokens of context, and due to the hybrid attention architecture, memory usage and compute scale much less drastically than ctx^2. I've had friends with smaller cards still getting work out of it. Generation is around 20 tokens/sec on that 3090 (without doing anything special yet). You'll need enough DRAM to hold the parts of the model that don't fit in VRAM. Nothing to write home about, but genuinely usable in a pinch or for tasks that don't need immediate interactivity.
It's the first local model that passes my personal kimbench usability benchmark at least. Just be aware that it is extremely verbose in thinking mode. Seems to be a qwen thing.
(edit: On rechecking my numbers; I now realize I can possibly optimize this a lot better)
With respect, this isn't "new data", it's an anecdote. And it kind of represents exactly the problem I was talking about above:
- Qwen is near Sonnet 4.5!
- How do I run that?
- [Starts talking about something inferior that isn't near Sonnet 4.5].
It is this strange bait-and-switch discussion that happens over and over. Not least because Sonnet has a 200K context window, and most of these anecdotes aren't for anywhere near that context size.
You're not wrong; but... imho it's closer to Sonnet 4.0 [1] on my personal benchmark [2]. And I HAVE run it at just over 200K tokens of context; it works, it's just a bit slow at that size. It's not great, but ... usable to me? I used Sonnet 4.0 over the api for half a year or so before, after all.
Only way to know if your own criteria are now matched -or not yet- is to test it for yourself with your own benchmark or what have you.
And it does show a promising direction going forward: usable (to some) local models becoming efficient enough to run on consumer hardware.
[1] released mid-2025
[2] take with salt - only tests personal usability
+ Note that some benchmarks do show Qwen3.5-35B-A3B matching Sonnet 4.5 (released later last year); but I treat those with the same skepticism you do, clearly ;)
> The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck.
That's unsurprising, seeing as inference for agentic coding is extremely context- and token-intensive compared to general chat. Especially if you want it to be fast enough for a real-time response, as opposed to just running coding tasks overnight in a batch and checking the results as they arrive. Maybe we should go back to viewing "coding" as a batch task, where you submit a "job" to be queued for the big iron and wait for the results.
A machine with 128GB of unified system RAM will run reasonable-fidelity quantizations (4-bit or more).
If you ever want to answer this type of question yourself, you can look at the size of the model files. Loading a model usually uses an amount of RAM around the size it occupies on disk, plus a few gigabytes for the context window.
Qwen3.5-122B-A10B is 120GB. Quantized to 4 bits it is ~70GB. You can run a 70GB model in 80GB of VRAM or 128GB of unified normal RAM.
Systems with that capability cost a small number of thousand USD to purchase new.
If you are willing to sacrifice some performance, you can take advantage of the model being a mixture-of-experts and use disk space to get by with less RAM/VRAM, but inference speed will suffer.
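That rule of thumb is easy to turn into a back-of-envelope function (the 2 GB fixed overhead is a rough assumption, it ignores the KV cache which grows with context, and real quantized files often keep some tensors at higher precision, so actual sizes run somewhat larger):

```python
def model_ram_gb(params_b: float, bits: int, overhead_gb: float = 2.0) -> float:
    """Rough RAM needed to load a model: weights plus fixed overhead.

    params_b: parameter count in billions.
    bits: bits per weight after quantization.
    """
    weights_gb = params_b * 1e9 * bits / 8 / 1e9
    return weights_gb + overhead_gb


print(f"122B @ 8-bit: ~{model_ram_gb(122, 8):.0f} GB")
print(f"122B @ 4-bit: ~{model_ram_gb(122, 4):.0f} GB")
```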
I may consider showing my ID to a company I already have a business relationship with, given demonstrable legal obligations, contractual necessities, legitimate interests, etc. (the standard GDPR list).
I do have an existing business relationship with Anthropic, so I might under some circumstances decide to show them my ID. I don't have a business relationship with Persona, though.
I understand the instinct: they want to insulate themselves from holding PII. Not the worst idea. I'm not happy with it being a third party though. Especially the third party in question.
But they already have PII on nearly all users. Many users upload documents with their name, or pictures of themselves, or have chats where home addresses come up. All of this is information Anthropic already has on its users (voluntarily provided via chats or via the api), and it's equivalent to what Persona gets via their verification. It's just more convenient to use a third-party SaaS product for this than to vibe-code their own identity verification platform, I guess.
This might be conflating two things: what data exists somewhere, and how many different independent parties hold it. That's not the same risk.
Put this way: I sort of already trust Anthropic with some of my PII. And that's ... maybe not ok actually. But it's a single failure surface.
But that's definitely not the same thing as trusting Anthropic, AND Persona, AND all of Persona's partners, AND their partners, ad infinitum.
And let's say Persona is actually ok; who knows, they might be? But it's still an extra surface; and if they share again, that's another extra surface again.
It's fairly common sense blast radius minimization. This is part of the actual theory behind GDPR.
"We already seem to accidentally be leaking some data through channel A" doesn't mean it's a good idea to open channels B through Z as well. It means you might want to tighten down channel A.