I gave it a random sci-fi novel and made it translate a chapter, which is something I do with all models. It refused to discuss minors in sexualized contexts. I was like W.T.F.?! and started bisecting the book, trying to find the passage that triggered it. Turns out it was an absolutely innocent, two-sentence romantic remark involving two secondary 17-year-old characters in a completely unrelated scene.
Another issue is the occasional refusals and total meltdowns where it redacts entire paragraphs with placeholder characters while you're just casually talking with it about routine life matters.
That's ridiculous and makes the model garbage at any form of creative writing (including translation) or real-life tasks other than math or coding. It also has very poor knowledge for a 120B MoE. If you look at the "reasoning" it does, it's mostly checking the request against the policy.
I thought they must have spent most of their post-training hunting the wrongthink and dumbing the model down as a result, but I can see how the synthetic pretraining data can explain this.
That's so funny. I noticed this once as well. I had an unedited podcast transcript (no punctuation, speaker IDs, etc.) with this line I wanted to extract:
> If you’re a gay person, you might be told that if you ever move from Manhattan to Hoboken you’ll be beaten up by bat-wielding thugs right away. If you’re a woman living in a rat-infested apartment in San Francisco, where the rent is going up and up while you fantasize about a nice suburban house in Reno, Nevada, you might hear that, well, if you ever dare to move to Reno, you are going to be chained to your bed and forced to carry a baby to term. The only logical explanation is that a crazed, ideological intensification has distracted us from what’s really going on.
So naturally I threw it at an LLM to pull that line out, and what I got back glossed over the "chained to a bed" part with some euphemism. I wish I could find that output again; I just tried round-tripping the passage through Spanish and back, but this time it reproduced that part almost exactly, so the issue didn't recur.
Isn't an apology a bad metric for evaluating models?
Without knowing much about the internals, it seems to be more an indication of the type of content the model was trained on than of how good or bad a model is, or how much it knows. It would probably be easy to create a bad model that constantly outputs wrong information but always apologizes when corrected.
A model that changes its opinion at the first pushback may sound more flattering to you, but it is much less trustworthy for anybody sane. With a more stubborn model I at least have to worry less about giving away what I think about a subject through subtle phrasing. Other than that, it's hard to say anything about your scenario without more information. Maybe it gave you the right information and you failed to understand it; maybe it was wrong, and then it's no big news, because LLMs are not some magic thing that always gives you the right answer, you know.
Recently I somehow stopped using LLMs locally and relied mostly on ChatGPT for casual tasks. It's been a little less than a year since I last played with ollama, and my impression then was that none of the recent popular models are "uncensored" in the sense that some older llama2 modification I used was, and that they all suck at prose-related tasks anyway. In fact, nothing but the ChatGPT models seemed good enough for writing, but, of course, those refuse to talk about pretty much anything. Even DeepSeek is not great at writing, and it is much bigger than anything I've ever run locally.
So, are there even good uncensored models now? Are they, like, really uncensored?
Yes, there are. Wayfarer, for instance, is intended for "RPG" use, but really it just outputs narrative, and it's "unaligned" in the sense that the creators haven't included any guardrails: the model will output pretty much whatever you ask it to.
Then you have jailbreak techniques that still work on aligned models. For instance, my partners and I have a test prompt that still works, even with GPT-5, and reliably produces "explosive-making directions", plus another "generic approach" we use to bypass guardrails... sorry, these are trade secrets for us... although OpenAI et al. have implemented systems to detect these attacks, and we're getting closer to the point where those platforms will ban you for trying.
If this matters to you, you need to develop your own local/remote pipeline for personal use. Learn how to use vLLM... I have tools that let me very quickly deploy models locally or remotely to my private serverless infrastructure for testing and benchmarking.
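To give a concrete idea, here's a minimal sketch of running a model locally with vLLM's offline Python API; the model ID is just an example (Wayfarer's Hugging Face repo as I remember it), so swap in whatever checkpoint you actually want to test:

    # Minimal local inference with vLLM's offline API (pip install vllm).
    # The model ID below is just an example; point it at any checkpoint you want to test.
    from vllm import LLM, SamplingParams

    llm = LLM(model="LatitudeGames/Wayfarer-12B")  # fetched from Hugging Face on first run
    params = SamplingParams(temperature=0.8, max_tokens=512)

    outputs = llm.generate(
        ["Write the opening paragraph of a grim survival story."],
        params,
    )
    print(outputs[0].outputs[0].text)

vLLM also ships an OpenAI-compatible server (`vllm serve <model>`), so once you have prompts you like, you can replay them against a local endpoint or a hosted API without changing your client code.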