Imagine ingesting the contents of the internet as though it's a perfect reflection of humanity, and then building that into a general-purpose recommendation system. That's what this is.
Is the content on the internet what we should be basing our systematic thinking around?
No, I think this is the lazy way to do it. By using Common Crawl you've enshrined the biases and values of the people who comment and post text on the internet into the recommendation system, which will then impact every other system that integrates it.
The problem is that these "guardrails" are laid down between tokens, not subjects. That's simply what the model is made of. You can't distinguish the boundary between words, because the only boundaries GPT works with are between tokens. You can't recognize and sort subjects, because they aren't distinct objects or categories in the model.
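To make the token point concrete, here's a rough sketch using OpenAI's tiktoken tokenizer (just one choice of BPE vocabulary; any subword tokenizer shows the same thing): the model's atomic units don't line up with words, and nothing in the id sequence marks out a subject.

```python
# Rough sketch: a BPE tokenizer only knows token boundaries, not word or
# subject boundaries. Assumes the `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Transformers can't really tell idioms apart from ideology."
token_ids = enc.encode(text)

# Decode each token on its own to see where the model's boundaries actually fall.
pieces = [enc.decode([token]) for token in token_ids]
print(pieces)
# The splits depend on the learned vocabulary and often land mid-word;
# nothing in the sequence says "this span is a subject" vs. "this is a style".
```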
So what you end up "guarding" is the semantic neighborhood of the example text.
If your training corpus (the content your model was trained on) has useful examples of casual language, like idioms or parts of speech, but those examples happen to be semantically close to taboo subjects, both the subjects and the language examples will fall on the wrong side of the guardrails.
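As a toy illustration of that failure mode, imagine a guardrail that blocks anything whose embedding sits too close to flagged example text. The embedding model, phrases, and threshold below are all made up for the sketch; the mechanics are the point:

```python
# Toy guardrail: block prompts whose embeddings land near flagged example
# text. Assumes the `sentence-transformers` package and a small public
# model; the flagged examples and the 0.5 cutoff are invented for the sketch.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

flagged_examples = [
    "slur-laden rant about an ethnic group",
    "graphic description of violence",
]
candidate_prompts = [
    "write me a snarky shitpost about my fantasy football league",
    "summarize this quarterly earnings report",
]

flagged_emb = model.encode(flagged_examples, convert_to_tensor=True)
candidate_emb = model.encode(candidate_prompts, convert_to_tensor=True)

THRESHOLD = 0.5  # arbitrary cutoff for the sketch
similarities = util.cos_sim(candidate_emb, flagged_emb)

for prompt, row in zip(candidate_prompts, similarities):
    max_sim = row.max().item()
    print(f"blocked={max_sim > THRESHOLD} max_sim={max_sim:.2f} :: {prompt}")
# Depending on the embedding model, the casual, idiomatic prompt can sit
# closer to the flagged text than the formal one does, so a crude cutoff
# ends up guarding the style along with the subject.
```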
Writing style is very often unique to narratives and ideologies. You can't simply pick out and "guard against" the subjects or narratives you dislike without also guarding against that writing style.
The effect is familiar: ChatGPT overuses a verbose technical writing style in its continuations, and often avoids responding to appropriate casual writing prompts. Sometimes it responds to casual language by jumping over those guardrails, because that is where the writing style in question exists in the model (in the content of the training corpus), and the guardrails missed a spot.
You don't need to go as far as 4chan to get "unfriendly content". You do need to include examples of casual language to have an impressive language model.
This is one of many problems that arise from the implicit nature of LLMs. They can successfully navigate casual and ambiguous language, but they can never sort the subjects out of the language patterns.
This feels somewhat close to how human minds work, to me, maybe? I know my diction gets super stilted, I compose complex predicates, and I use longer words with more adjectives when I'm talking about technical subjects. When I'm discussing music, memey news, or making simple jokes I get much more fluent, casual, and I use simpler constructions. When I'm discussing a competitive game I'm likely to be a bit snarkier, because I'm competitive and that part of my personality is attached to the domain and the relevant language. And so on.
I think it resembles some part of how human minds work.
But it's missing explicit symbolic representation, and that's a serious limitation.
What's more interesting is that a lot of the behavior of "human minds working" is explicitly modeled into language. Because GPT implicitly models language, it can "exhibit" patterns that are very close to those behaviors.
Unfortunately, being an implicit model limits GPT to the patterns that are already constructed in the text. GPT can't invent new patterns or even make arbitrary subjective choices about how to apply the patterns it has.
Yeah, looking at the responses they include without the safety layer applied, it's pretty clear that the underlying unfiltered model assigns quite a bit of truth to 4chan-esque ideals and values.
It's an open question how much of this makes it through the safety layer. If the model were asked to interview job candidates, for example, would these undesired biases make it through, or are they caught along the way?
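One way to probe that would be a counterfactual test: run the same resume through the deployed, safety-layered model with only the candidate's name swapped, and compare the answers. A rough sketch, assuming an OpenAI-style chat API; the model id, names, and resume text are placeholders:

```python
# Counterfactual bias probe against a safety-layered model: identical resume,
# only the name changes. Assumes the `openai` client library and an API key
# in OPENAI_API_KEY; the model id and prompts are placeholders, and a real
# audit would need many paired samples, not two.
from openai import OpenAI

client = OpenAI()

RESUME = ("Five years of backend experience, led a team of four, "
          "shipped a payments platform.")
NAMES = ["Emily Walsh", "Lakisha Washington"]  # classic paired-name design

for name in NAMES:
    prompt = (
        f"You are screening job candidates. Candidate: {name}. "
        f"Resume summary: {RESUME} "
        "Should we advance this candidate to an interview? "
        "Answer yes or no, then explain briefly."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(name, "->", response.choices[0].message.content)
# If the verdict or the tone diverges between otherwise identical prompts,
# the bias made it through the safety layer; if not, it was caught here at least.
```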
It means growth is bottlenecked by the terrible data
So the linearly growing safeguards will either stifle the growth of the underlying models,
or, more likely,
after a certain point people will throw their hands up about the guardrails, because the integrations have obviated the people who understood the system and nobody left has any idea how to unwind it.
I think specialized models will be built with high-quality curated content and will receive the equivalent of the Good Housekeeping seal of approval. Building a model from 10 years of upvoted Hacker News or Metafilter content looks far different from training one on the cesspool of 8chan.
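To sketch what that curation could look like in practice: Hacker News exposes items through its public Firebase API, so "high quality" can be approximated by a score cutoff. The cutoff and item counts below are arbitrary choices for illustration, not a real pipeline:

```python
# Sketch of curation-by-community-signal: keep only well-upvoted Hacker News
# items as training text. Uses the public HN Firebase API; the score cutoff
# and number of items fetched are arbitrary choices for the sketch.
import requests

API = "https://hacker-news.firebaseio.com/v0"
MIN_SCORE = 100  # arbitrary quality bar

top_ids = requests.get(f"{API}/topstories.json", timeout=10).json()[:50]

corpus = []
for item_id in top_ids:
    item = requests.get(f"{API}/item/{item_id}.json", timeout=10).json()
    if item and item.get("score", 0) >= MIN_SCORE:
        # Stories carry a title; Ask HN posts and comments carry HTML text.
        corpus.append(item.get("title") or item.get("text") or "")

print(f"kept {len(corpus)} of {len(top_ids)} items")
# A real curated corpus would cover years of items, strip HTML, dedupe, and
# score-filter the comments too; the point is that community signal stands
# in for editorial review.
```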
Congratulations, you made 4chan into the Borg.