You are obviously factually correct, I reproduced the same refusal - so consider this not as an attack on your claim. But a quick google search reveals that Falun Gong is an outlawed organization/movement in China.
I did a "s/Falun Gong/Hamas/" in your prompt and got the same refusal in GPT-5, GPT-OSS-120B, Claude Sonnet 4, Gemini-2.5-Pro as well as in DeepSeek V3.1. And that's completely within my expectation, probably everyone else's too considering no one is writing that article.
Goes without saying I am not drawing any parallel between the aforementioned entities, beyond that they are illegal in the jurisdiction where the model creators operate - which as an explanation for refusal is fairly straightforward. So we might need to first talk about why that explanation is adequate for everyone else but not for a company operating in China.
Thanks. Mind providing screenshots? I believe you, I just think this helps. Your comments align with some of my other responses. I'm not trying to make hard claims here and I'm willing to believe the result is not nefarious. But it's still worth investigating. In the weakest form it's worth being aware of how laws in other countries impact ours, right?
But I don't think we should talk about explanation until we can even do some verification. At this point I'm not entirely sure. We still have the security question open and I'm asking for help because I'm not a security person. Shouldn't we start here?
This is not the official DeepSeek website. Probably one of the many shady third-party sites riding on DeepSeek name for SEO, who knows what they are running. In this case it doesn't matter, because I already reproduced your prompt with a US based inference provider directly hosting DeepSeek weights, but still worth noting for methodology.
(also to a sceptic screenshots shouldn't be enough since they are easily doctored nowadays, but I don't believe these refusals should be surprising in the least to anyone with passing familiarity with these LLMs)
---
Obviously sabotage is a whole another can of worm as opposed to mere refusal, something that this article glossed over without showing their prompts. So, without much to go on, it's hard for me to take this seriously. We know garbage in context can degrade performance, even simple typos can[1]. Besides LLMs at their present state of capabilities are barely intelligent enough to soundly do any serious task, it stretches my disbelief that they would be able to actually sabotage to any reasonable degree of sophistication - that said I look forward to more serious research on this matter.
I want to clarify that I'm not trying to make strong claims. That's why I'm asking for others to post and why I'm grateful you did. I think that helps us get to the truth of the matter. I also agree with your criticisms of the link I used, but to be frank, I'm not going to pay for just this test. That's why I wanted to be open and clear about how I obtained the information. I was hoping someone that already paid would confirm or deny my results.
With your Hamas example, I think it is beside the point. I apologize as I probably didn't make my point clearer. Mainly I wanted to stop baseless accusations and find the reality, since the articles claims are testable. But what I don't want to make a claim if is why this is happening. In another comment I even said that this could happen because they were suppressing this group. So I wouldn't be surprised if the same is true for Hamas. We can't determine if it's an intentional sleeper agent or just a result of censorship. But either way it is concerning, right? The unintentional version might be more concerning because we don't know what is being censored and what isn't. These censorships cross country lines and it is hard to know what is being censored and what isn't.
So I'm not trying to make a "Murica good, China bad" argument. I'm trying to make a "let's try to verify or discredit the claims." I want HN to be more nuanced. And I do seriously appreciate you engaging and with more depth and nuance than others. I'm upvoting you even though we disagree because I think your comments are honest and further the discussion.
Don't really see how the string (and other usual container types) or filesystem APIs are lacking in any significant way compared to stdlibs of other scripting languages.
I also believe that buffer as an abstraction strictly makes many harder things easier, to the point I often wonder about creating a native library based on elisp buffer manipulation APIs alone that could be embedded in other runtimes instead. So the without touching buffer is a premise/constraint I don't quite understand to begin with.
That's actually the zen of programming emacs: don't avoid using buffers; create as many of them as you need.
"But that's like opening a document every time I want to glue two strings together!"
Not at all. A buffer is just a fancy blob of RAM. It's not file-backed unless you make it file-backed. They do take up RAM, but you're programming on a modern computer, not a PDP-11; if you're comfortable with Python using a whole in-memory object to represent an integer, you're comfortable with buffers.
"But it's messy to leave them lying around."
It's a feature. Yes, buffers aren't well-encapsulated and if your program crashes mid-run they get left open. That's by design. You don't need encapsulation because you're not doing multithreading here (and if you are, there are primitives for that and they take a bit more work to use); emacs is for editing and there's only one you, so if the current program is creating buffers and has no way to run two copies of itself at once, who cares. And your program crashing leaving buffers around is a feature; you can inspect the buffer and see what it looked like at crash-time, or set up the buffers the way you want them before firing off the program to get the desired effect (try those tricks with most languages without slapping on a debugger). And there are scripting blocks to create temp buffers and clean up your buffers for you anyway.
"But it's weird to have two ways to talk about strings in the language!"
That's true; it's a bit weird to have the string primitives and also buffers. But that's a pretty common flavor of weird; Java has strings and also has StringBuilder. My rule of thumb is "any time I'd reach for StringBuilder in Java, I should probably consider using a buffer in elisp."
Yep I think it's the best, period. Qwen3-coder perhaps took the limelight but the GLM models perform and behave better in agentic loops. I cannot believe they had gone from a 32B frontend focused GLM-4 to these beasts that can challenge Claude, in a matter of months.
Yep peg.el[1] is now built-into Emacs since 30.1 (which is how I came to know of it, but actually the library seems much older) and it makes certain things much simpler and faster to do than before (once you figure out its quirks).
Alibaba from beginning had some series of models that are always closed-weights (*-max, *-plus, *-turbo etc. but also QvQ), It's not a new development, nor does it prevent their open models. And the VL models are opened after 2-3 months of GA in API.
Hunyuan Image 2.0, which is of Flux quality but has ~20 milliseconds of inference time, is being withheld.
Hunyuan 3D 2.5, which is an order of magnitude better than Hunyuan 3D 2.1, is also being withheld.
I suspect that now that they feel these models are superior to Western releases in several categories, they no longer have a need to release these weights.
> I suspect that now that they feel these models are superior to Western releases in several categories, they no longer have a need to release these weights.
Yes that I can totally believe. Standard corporation behaviour (Chinese or otherwise).
I do think DeepSeek would be an exception to this though. But they lack diversity in focus (not even multimodal yet).
Is a bit hand-wavy in that it doesn't explain why it's only DeepSeek who can do this "easy" thing, but still not Meta, Mistral or anyone else really. There are many other players who have way more compute than DeepSeek (even inside China, not even considering rest of the world), and I can assure you more or less everyone trains on synthetic data/distillation from whatever bigger model they can access.
They all have. I don't hope to convince you of that, everyones use case differs. Generally, AIME / prose / code benchmarks that don't involve successive tool calls are used to hide some very dark realities.
IMHO tool calling is by far the most clearly economically valuable function for an LLM, and r1 self-admittedly just...couldn't do it.
There's a lot of puff out there that's just completely misaligned with reality, ex. Gemini 2.5 Pro is by far the worst tool caller, Gemini 2.5 Flash thinking is better, 2.5 Flash is even better. And either Llama 4 beats all Gemini 2.5s except 2.5 Flash not thinking.
I'm all for "these differences will net out in the long run", Google's at least figured out how to micro optimize for Aider edit formatting without tools. Over the last 3 months, they're up 10% on edit performance. But it's horrible UX to have these specially formatted code blocks in the middle of prose. They desperately need to clean up their absurd tool-calling system. But I've been saying that for a year now. And they don't take it seriously, at all. One of their most visible leads tweeted "hey what are the best edit formats?" and a day later is tweeting the official guide for doing edits. I'm a Xoogler and that absolutely reeks of BigCo dysfunction - someone realized a problem 2 months after release and now we have "fixed" it without training, and now that's the right way to do things. Because if it isn't, well, what would we do? Shrugs
I'm also unsure how much longer it's worth giving a pass on this stuff. Everyone is competing on agentic stuff because that's the golden goose, real automation, and that needs tools. It would be utterly unsurprising to me for Google to keep missing a pain signal on this, vis a vis Anthropic, which doubled down on it mid-2024.
As long as I'm dumping info, BFCL is not a good proxy for this quality. Think "converts prose to JSON" not "file reading and editing"
I don't mind the info dump, but I am struggling to connect the relevance of this to topic at hand. I mean, focusing on a single specific capability and generalising it to mean "they all have" caught up with DeepSeek all across the board (which was the original topic) is a reductive and wild take. Especially when it seems to me that this seems more because of misaligned incentive than because it's truly a hard problem.
I am not really invested in this niche topic but I will observe that, yes I agree Llama 4 is really good here. And yet it's a far worse coder, far less intelligent than DeepSeek and that's not even arguable. So no it didn't "catch up" any more than what you could say by pointing out Llama is multimodal but DeepSeek isn't. That's just talking about a different things entirely.
Regardless, I do agree BFCL is not the best measure either, the Tau-bench is more real world relevant. But end of the day, most frontier labs are not incentive aligned to care about this. Meta cares because this is something Zuck personally cares about, Llama models are actually for small businesses solving grunt automation, not for random people coding at home. People like Salesforce care (xLAM), even China had GLM before DeepSeek was a thing. DeepSeek might care so long as it looks good for coding benchmarks, but that's pretty much the extent of it.
And I suspect Google doesn't truly care because in the long run they want to build everything themselves. They already have a CodeAssist product around coding which likely uses fine-tune of their mainline Gemini models to do something even more specific to their plugin.
There is a possibility that at the frontier, models are struggling to be better in a specific and constrained way, without getting worse at other things. It's either this, or even Anthropic has gone rogue because their Aider scores are way down now from before. How does that make sense if they are supposed to be all around better at agentic stuff in tool agnostic way? Then you realise they now have Claude Coder and it just makes way more economic sense to tie yourself to that, be context inefficient to your heart's content so that you can burn tokens instead of being, you know, just generally better.
> I am struggling to connect the relevance of this
> focusing on a single specific capability and
> I am not really invested in this niche topic
Right: I definitely ceded a "but it doesn't matter to me!" argument in my comment.
I sense a little "doth protest too much", in the multiple paragraphs devoted to taking that and extending it to the underpinning of automation is "irrelevant" "single" "specific", "niche".
This would also be news to DeepSeek, who put a lot of work to launch it in the r1 update a couple weeks back.
Separately, I assure you, it would be news to anyone on the Gemini team that they don't care because they want to own everything. I passed this along via DM and got "I wish :)" in return - there's been a fire drill trying to improve it via AIDER in the short term, is my understanding.
If we ignore that, and posit there is an upper management conspiracy to suppress performance, its just getting public cover by a lower upper management rush to improve scores...I guess that's possible.
Finally, one of my favorite quotes is "when faced with a contradiction, first check your premises" - to your Q about why no one can compete with DeepSeek R1 25-01, I'd humbly suggest you may be undergeneralizing, given even tool calls are "irrelevant" and "niche" to you.
I think the point remains that few have been able to catch up to OpenAI. For a while it was just Anthropic. Then Google after failing a bunch of times. So, if we relax this to LLMs not by OpenAI, Anthropic or Google, then Deepseek is really the only one that's managed to reach their quality tier (even though many others have thrown their hat into the ring). We can also get approximate glimpses into which models people use by looking at OpenRouter, sorted by Top Weekly.
In the top 10, are models by OpenAI (gpt4omini), Google (gemini flashes and pros), Anthropic (Sonnets) and Deepseeks'. Even though the company list grows shorter if we instead look at top model usage grouped by order of magnitude, it retains the same companies.
Personally, the models meeting my quality bar are: gpt 4.1, o4-mini, o3, gpt2.5pro, gemini2.5flash (not 2.0), claude sonnet, deepseek and deepseek r1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models by others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute constrained scenarios, just that only Deepseek reaches the Quality tier of the big 3.
Interesting presumption about R1 25-01 being what's talked about, you knowledge cut-off does appear to know R1 update two weeks back was a thing, and that it even improved on function calling.
Of course you have to pretend I meant the former, otherwise "they all have" doesn't entirely make sense. Not that it made total sense before either, but if I say your definition of "they" is laughably narrow, I suspect you will go back to your google contact and confirm that nothing else really exists outside it.
Oh and do a ctrl-f on "irrelevant" please, perhaps some fact grounding is in order. There was an interesting conversation to be had about underpinning of automation somehow without intelligence (Llama 4) but who has time for that if we can have hallucination go hand in hand with forced agendas (free disclaimer to boot) and projection ("doth protest too much")? Truly unforeseeable.
I don't know what you're talking about, partially because of poor grammar ("you knowledge cut-off does appear") and "presumption" (this was front and center on their API page at r1 release, and its in the r1 update notes). I sort of stopped reading after there because I realized you might be referring to me having a "knowledge cut-off", which is bizarre and also hard to understand, and it's unlikely to be particularly interesting conversation given that and the last volley relied on lots of stuff about tool calling being, inter alia, niche.
> you might be referring to me having a "knowledge cut-off"
Don't forget I also referred to you having "hallucination". In retrospect, likening your logical consistency to an LLM was premature, because not even gpt-3.5 era models could pull off a gem like:
> You: to your Q about why no one can compete with DeepSeek R1 25-01 blah blah blah
>> Me: ...why would you presume I was talking about 25-01 when 28-05 exists and you even seem to know it?
>>> You: this was front and center on their API page!
Riveting stuff. Few more digs about poor grammar and how many times you stopped reading, and you might even sell the misdirection.
You could maybe make that accusation about V3 (to the extent that it's a bad thing and not fair use, specially considering amoral origin of OpenAI's models in first place), but don't think the claim makes sense for R1 since OpenAI's o1 did not expose its CoT traces even in API.
They published about GRPO (key algorithm behind R1) a full year before[1] they scaled it for R1. Given the research they do in open, it's not far-fetched to think they had the talent and technical know-how to achieve R1 on their own.
My first try (omitting chain of thought for brevity):
When you remove the cup and the mirror, you will see tails. Here's the breakdown:
Setup: The coin is inside an upside-down cup on a glass table. The cup blocks direct view of the coin from above and below (assuming the cup's base is opaque).
Mirror Observation: A mirror is slid under the glass table, reflecting the underside of the coin (the side touching the table). If the mirror shows heads, this means the underside of the coin is heads.
Coin Orientation: A coin has two sides: heads and tails. If the underside (facing the table) is heads, the top side (facing away from the table) must be tails.
Result: Removing the cup reveals the coin on the table. Since the top side was tails all along, you will see tails.
Key Insight: The mirror reveals the hidden underside of the coin, but the top side (visible after removing the cup) is its opposite. The mirror’s reflection does not alter the physical orientation of the coin’s sides—it simply reflects the existing state.
Answer: Tails.
Answer:
You will see the same side of the coin that you saw in the mirror — heads .
Why?
The glass table is transparent , so when you look at the coin from below (using a mirror), you're seeing the top side of the coin (the side currently facing up).
Mirrors reverse front-to-back , not left-to-right. So the image is flipped in depth, but the orientation of the coin (heads or tails) remains clear.
Since the coin hasn't moved during this process, removing the cup and mirror will reveal the exact same face of the coin that was visible via the mirror — which was heads.
While technically true, one could argue that because they described sliding a mirror underneath the table, that they did this in order to be able to see from that particular angle, so the implication for any reasonable answer would be that are looking down at the table.
The way the question is phrased, I'm definitely in the "ok, what's the trick, what do you want to hear" mode. I'm not surprised LLMs answer in a weird way with extra details like "this is how mirrors work, is this what you're confused about?"
Manus just answered this correctly for me in one crack:
Me:
I put a coin in a cup and slam it upside-down on a glass table. I can't see the coin because the cup is over it. I slide a mirror under the table and see heads. What will I see if I take the cup (and the mirror) away?
Manus:
This is a classic riddle! Let's think it through. You placed a coin in a cup, turned it upside down on a glass table, and then slid a mirror underneath the table. Looking into the mirror from below, you saw 'heads'. Since the mirror reflects the side of the coin facing downwards through the glass table, the 'heads' side must be facing down. Therefore, the side facing up, hidden inside the cup, must be 'tails'. When you remove the cup, you will see the side facing up, which is tails.
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct