Modern neural networks are by no means guaranteed to converge on the simplest solution, and examples abound in which NNs are discovered to learn weird esoteric algorithms when simpler ones exist. The reason is fairly obvious: the "simplest" solution, from the perspective of training, is simply whatever works best first, not necessarily the one you're alluding to.
It's no secret that the order of the data has an impact on what the network learns and how quickly; it's just not feasible to police that for these giant trillion-token datasets.
If a NN learns a more complex solution that works perfectly for a less complex subset it meets later on, there is little pressure to move to the simpler solution. Especially when we're talking about cases where the more complex solution might be more robust to whatever weird permutations it meets on the internet. E.g. there is probably a simpler way to translate text that never has typos, and an LLM will never converge on it.
Decoding/encoding b64 is not the first thing it will learn. It will learn to predict it first, as it predicts any other language-carrying sequence. Then it will learn to translate it, most likely long after learning how to translate other languages. All of that will have some impact on the exact process it carries out with b64.
And like I said, we already know for a fact it's not just doing naive substitution, because it can recover corrupted b64 text wholesale in a way our substitutions cannot.
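(As an aside, to be concrete about what "naive substitution" would even mean here: base64 isn't a character-for-character cipher, it encodes 3-byte groups into 4 output characters, so the same plaintext letter encodes differently depending on where it falls in the stream. A quick standard-library check, with example strings I picked just for illustration:)

    import base64

    # The trailing "mat" gets a different encoding each time because its
    # byte offset shifts; a per-character lookup table can't reproduce this.
    for text in ("mat", "a mat", "the mat"):
        print(f"{text!r:>10} -> {base64.b64encode(text.encode()).decode()}")
    #     'mat' -> bWF0
    #   'a mat' -> YSBtYXQ=
    # 'the mat' -> dGhlIG1hdA==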
> examples abound in which NNs are discovered to learn weird esoteric algorithms when simpler ones exist
What examples do you have in mind?
Normally it's the opposite, where one hopes for the neural net to learn something complex, and it picks up on a far simpler pattern and uses that instead (e.g. all your enemy tanks are on a desert background, vs the others on a grass background, so it learns to discriminate based on sand vs grass).
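That failure mode is easy to reproduce on toy data. A minimal sketch (synthetic data invented purely for illustration, not the original tank anecdote): the label is defined by a small bright patch, but during training the background brightness is almost perfectly correlated with the label, so a linear model can lean on the background instead.

    # Toy version of the tanks/background story: the "real" signal is a small
    # bright patch, but the background brightness is correlated with the label
    # in training, so a linear model can key on the background instead.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_images(n, correlated_background):
        labels = rng.integers(0, 2, n)          # 1 = "tank present"
        imgs = np.zeros((n, 8, 8))
        for i, y in enumerate(labels):
            if correlated_background:
                bg = 0.8 if y else 0.2          # desert for tanks, grass otherwise
            else:
                bg = rng.uniform(0.2, 0.8)      # background carries no information
            imgs[i] += bg + rng.normal(0, 0.05, (8, 8))
            if y:                               # the object we actually care about
                r, c = rng.integers(0, 7, 2)
                imgs[i, r:r + 2, c:c + 2] += 1.0
        return imgs.reshape(n, -1), labels

    X_tr, y_tr = make_images(2000, correlated_background=True)
    X_te, y_te = make_images(2000, correlated_background=False)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("accuracy, background still correlated:", clf.score(X_tr, y_tr))
    print("accuracy, background decorrelated:    ", clf.score(X_te, y_te))

If the second score comes out much lower than the first, the model was keying on the background rather than the object, which is exactly the shortcut the tank story describes.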
You're anthropomorphizing by saying that corrupted b64 text can be recovered. There is no "recovery process", just conflicting prediction patterns: the b64 encoding predicting the corresponding plain text, and the plain text generated so far predicting its own continuation.
e.g.
"the cat sat on the mat" encodes as dGhlIGNhdCBzYXQgb24gdGhlIG1hdA==, but say we've instead got a corrupted dGhlIGNhdCBzYXQgb24gdGhlIHh4dA== that decodes to "the cat sat on the xxt", so if you ask ChatGPT to decode this, it might start generating as:
dGhlIGNhdCBzYXQgb24gdGhlIHh4dA== decodes to "the cat sat on the" ...
At this point the LLM has two conflicting predictions - the b64 encoding predicting "xxt", and the plain text that it has generated so far predicting "mat". Which of these will prevail is going to depend on the specifics. I haven't tried it, but presumably this "recovery" only works where the encoded text is itself predictable ... it won't happen if you encode a random string of characters.
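For reference, the literal decode is easy to check with the standard library: a mechanical decoder just reproduces the corruption, so any pull back toward "mat" has to come from the plain-text side. A quick Python check of the strings above:

    import base64

    original = "the cat sat on the mat"
    print(base64.b64encode(original.encode()).decode())
    # dGhlIGNhdCBzYXQgb24gdGhlIG1hdA==

    corrupted = "dGhlIGNhdCBzYXQgb24gdGhlIHh4dA=="
    print(base64.b64decode(corrupted).decode())
    # the cat sat on the xxt  -- the corruption passes straight through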