Imagine ingesting the contents of the internet as though it's a perfect reflection of humanity, and then building that into a general-purpose recommendation system. That's what this is.
Is the content on the internet what we should be basing our systematic thinking around?
No, I think this is the lazy way to do it. By using Common Crawl you've enshrined the biases and values of the people who comment and post text on the internet into the recommendation system, which will then impact every other system that integrates it.
The problem is that these "guardrails" are laid down between tokens, not subjects. That's simply what the model is made of. You can't distinguish the boundary between words, because the only boundaries GPT works with are between tokens. You can't recognize and sort subjects, because they aren't distinct objects or categories in the model.
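To make the token point concrete, here's a rough sketch using OpenAI's tiktoken tokenizer (just one choice of BPE vocabulary; any subword tokenizer shows the same thing): the model's atomic units don't line up with words, and nothing in the id sequence marks out a subject.

```python
# Rough sketch: a BPE tokenizer only knows token boundaries, not word or
# subject boundaries. Assumes the `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Transformers can't really tell idioms apart from ideology."
token_ids = enc.encode(text)

# Decode each token on its own to see where the model's boundaries actually fall.
pieces = [enc.decode([token]) for token in token_ids]
print(pieces)
# The splits depend on the learned vocabulary and often land mid-word;
# nothing in the sequence says "this span is a subject" vs. "this is a style".
```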
So what you end up "guarding" is the semantic neighborhood of the example text.
If your training corpus (the content your model was trained on) has useful examples of casual language, like idioms or parts of speech, but those examples happen to be semantically close to taboo subjects, both the subjects and the language examples will fall on the wrong side of the guardrails.
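As a toy illustration of that failure mode, imagine a guardrail that blocks anything whose embedding sits too close to flagged example text. The embedding model, phrases, and threshold below are all made up for the sketch; the mechanics are the point:

```python
# Toy guardrail: block prompts whose embeddings land near flagged example
# text. Assumes the `sentence-transformers` package and a small public
# model; the flagged examples and the 0.5 cutoff are invented for the sketch.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

flagged_examples = [
    "slur-laden rant about an ethnic group",
    "graphic description of violence",
]
candidate_prompts = [
    "write me a snarky shitpost about my fantasy football league",
    "summarize this quarterly earnings report",
]

flagged_emb = model.encode(flagged_examples, convert_to_tensor=True)
candidate_emb = model.encode(candidate_prompts, convert_to_tensor=True)

THRESHOLD = 0.5  # arbitrary cutoff for the sketch
similarities = util.cos_sim(candidate_emb, flagged_emb)

for prompt, row in zip(candidate_prompts, similarities):
    max_sim = row.max().item()
    print(f"blocked={max_sim > THRESHOLD} max_sim={max_sim:.2f} :: {prompt}")
# Depending on the embedding model, the casual, idiomatic prompt can sit
# closer to the flagged text than the formal one does, so a crude cutoff
# ends up guarding the style along with the subject.
```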
Writing style is very often unique to narratives and ideologies. You can't simply pick out and "guard against" the subjects or narratives you dislike without also guarding against that writing style.
The effect is familiar: ChatGPT overuses a verbose technical writing style in its continuations, and often avoids responding to appropriate casual writing prompts. Sometimes it responds to casual language by jumping over those guardrails, because that is where the writing style in question exists in the model (in the content of the training corpus), and the guardrails missed a spot.
You don't need to go as far as 4chan to get "unfriendly content". You do need to include examples of casual language to have an impressive language model.
This is one of many problems that arise from the implicit nature of LLMs. They can successfully navigate casual and ambiguous language, but they can never sort the subjects out of the language patterns.
This feels somewhat close to how human minds work, to me, maybe? I know my diction gets super stilted, I compose complex predicates, and I use longer words with more adjectives when I'm talking about technical subjects. When I'm discussing music, memey news, or making simple jokes I get much more fluent, casual, and I use simpler constructions. When I'm discussing a competitive game I'm likely to be a bit snarkier, because I'm competitive and that part of my personality is attached to the domain and the relevant language. And so on.
I think it resembles some part of how human minds work.
But it's missing explicit symbolic representation, and that's a serious limitation.
What's more interesting is that a lot of the behavior of "human minds working" is explicitly modeled into language. Because GPT implicitly models language, it can "exhibit" patterns that are very close to those behaviors.
Unfortunately, being an implicit model limits GPT to the patterns that are already constructed in the text. GPT can't invent new patterns or even make arbitrary subjective choices about how to apply the patterns it has.
Yeah, looking at the responses they include without the safety layer applied, it's pretty clear that the underlying unfiltered model assigns quite a bit of truth to 4chan-esque ideals and values.
It's an open question how much of this makes it through the safety layer. If the model were asked to interview job candidates, for example, would these undesired biases make it through, or are they caught along the way?
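One way to probe that would be a counterfactual test: run the same resume through the deployed, safety-layered model with only the candidate's name swapped, and compare the answers. A rough sketch, assuming an OpenAI-style chat API; the model id, names, and resume text are placeholders:

```python
# Counterfactual bias probe against a safety-layered model: identical resume,
# only the name changes. Assumes the `openai` client library and an API key
# in OPENAI_API_KEY; the model id and prompts are placeholders, and a real
# audit would need many paired samples, not two.
from openai import OpenAI

client = OpenAI()

RESUME = ("Five years of backend experience, led a team of four, "
          "shipped a payments platform.")
NAMES = ["Emily Walsh", "Lakisha Washington"]  # classic paired-name design

for name in NAMES:
    prompt = (
        f"You are screening job candidates. Candidate: {name}. "
        f"Resume summary: {RESUME} "
        "Should we advance this candidate to an interview? "
        "Answer yes or no, then explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(name, "->", response.choices[0].message.content)
# If the verdict or the tone diverges between otherwise identical prompts,
# the bias made it through the safety layer; if not, it was caught here at least.
```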
It means growth is bottlenecked by the terrible data
So the linearly growing safeguards will either stifle the growth of the underlying models,
or, more likely,
after a certain point people will throw their hands up about the guardrails, because the integrations have obviated the people who understood the system and nobody left has any idea how to unwind it.
I think specialized models will be built with high-quality curated content and will receive the equivalent of the Good Housekeeping seal of approval. Building a model from 10 years of upvoted Hacker News or Metafilter content looks far different from training one on the cesspool of 8chan.
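To sketch what that curation could look like in practice: Hacker News exposes items through its public Firebase API, so "high quality" can be approximated by a score cutoff. The cutoff and item counts below are arbitrary choices for illustration, not a real pipeline:

```python
# Sketch of curation-by-community-signal: keep only well-upvoted Hacker News
# items as training text. Uses the public HN Firebase API; the score cutoff
# and number of items fetched are arbitrary choices for the sketch.
import requests

API = "https://hacker-news.firebaseio.com/v0"
MIN_SCORE = 100  # arbitrary quality bar

top_ids = requests.get(f"{API}/topstories.json", timeout=10).json()[:50]

corpus = []
for item_id in top_ids:
    item = requests.get(f"{API}/item/{item_id}.json", timeout=10).json()
    if item and item.get("score", 0) >= MIN_SCORE:
        # Stories carry a title; Ask HN posts and comments carry HTML text.
        corpus.append(item.get("title") or item.get("text") or "")

print(f"kept {len(corpus)} of {len(top_ids)} items")
# A real curated corpus would cover years of items, strip HTML, dedupe, and
# score-filter the comments too; the point is that community signal stands
# in for editorial review.
```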
Congratulations, you made 4chan into the Borg.