> Most credible machine learning systems work well on unseen data, which by definition isn't memorizing.
Sorry, but no. ML models don't generalise well outside the training data, but they can interpolate inside. This question becomes very interesting in the case of GPT-3, which has had a huge corpus of text to train on, so it's probably seen 'everything'. GPT-3 is still memorising, but it's also learning to manipulate data, like software algorithms do.
> ML models don't generalise well outside the training data, but they can interpolate inside.
I'm unsure if you just misstated this or don't know, but this is wrong.
ML models don't generalise well on data outside the distribution of their training data. But that's an entirely different thing, and it doesn't at all mean they are memorising data.
Imagine a model trained on the US unemployment rate up to 2020 being hit with the COVID-era rate. It wouldn't know what to do, but that doesn't mean it wouldn't work fine on a rate of 5.342%, even if it had never seen that exact rate before.
This is a simplified example, but applies to everything.
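To make that concrete, here's a toy sketch; the "task" (a smooth function of the rate) and all the numbers are invented purely to show in-range vs out-of-range behaviour:

```python
# Toy sketch only: the target function and every number here are made up,
# just to contrast interpolation (in range) with extrapolation (out of range).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
rates = rng.uniform(3.5, 10.0, size=500).reshape(-1, 1)             # pre-2020-ish rates
target = 2.0 * np.sin(rates[:, 0] / 2) + rng.normal(0, 0.05, 500)   # pretend signal

model = make_pipeline(PolynomialFeatures(degree=6), LinearRegression())
model.fit(rates, target)

# 5.342% never appears in the training set, but it sits inside the training range:
print(model.predict([[5.342]]), 2.0 * np.sin(5.342 / 2))   # close to the true value
# A COVID-scale rate is far outside the training range:
print(model.predict([[14.7]]), 2.0 * np.sin(14.7 / 2))     # typically wildly off
```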
GPT-3's text generation does pull from memorised training data. There's a lot going on there, and amongst other things there has never really been a system that does text generation well. It's also hugely overparameterised, so there's lots of potential for overfitting. I don't think it's a good example of a "good" AI system - it's very interesting and full of potential, but there are lots of issues.
It generalises to almost every face, and for the faces it doesn't, its failure mode is safe.
Or something like word embeddings. Works incredibly well, and most "failure" modes are around things like bias, where the behavior reflects the real world.
Or something like AlphaZero. Not only is every new game of Go it plays brand new, it learnt to play Chess from nothing but the rules and self-play, with no human games to copy. That just isn't memorization.
And note that they all involve looking at a small number of points. It is easy to reproduce plots like that, but if you try to increase the number of points the result breaks down completely.
It is a curve fitting problem: for a small enough set of points compared to the number of dimensions, you can find a matrix that projects a set of random points to an exactly specified set of points in the plane. If you relax the problem to something like "put colors on the left side, put smells on the right side" you will get better than random performance from that kind of model, but not that much better than random.
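A minimal numerical check of that claim (the point and dimension counts are arbitrary placeholders):

```python
# With more dimensions than points, one linear map can send random vectors to
# any 2-D layout you ask for.
import numpy as np

rng = np.random.default_rng(0)
n_points, n_dims = 40, 300
X = rng.normal(size=(n_points, n_dims))            # "random embeddings"
targets = rng.uniform(-1, 1, size=(n_points, 2))   # any 2-D positions we like

# Least squares finds a projection matrix hitting the targets essentially
# exactly, because the system is underdetermined (40 equations, 300 unknowns
# per output dimension).
W, *_ = np.linalg.lstsq(X, targets, rcond=None)
print(np.abs(X @ W - targets).max())               # ~1e-13: an "exact" fit
```

With 40 points in 300 dimensions any 2-D layout is reachable; once the number of points exceeds the number of dimensions the exact fit disappears, which is the breakdown described above.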
Word embeddings are a strategy that approaches an asymptote. Systems that are destined for low performance will perform better if you use a word embedding, but word embeddings throw away information up front that makes high performance impossible.
This is true, but I don't think you are using word embeddings the way most people use them.
The linear relationship between things like king/queen etc is a cute demo but not really useful or used in practice.
The real usefulness of word embeddings is that similar concepts are close to each other so they make a great representation for other models (vs something like TF-IDF). These days they have been mostly surpassed in terms of state of the art by full language models, but the point is that simple techniques like average embedding of words in sentences generalised really well to unseen data.
And if you add in subword embeddings they generalise to unseen words, too.
We could talk about how context lets language models do this even better, but I'm still back trying to persuade the OP that this isn't just memorisation and good ML models work well on unseen data!
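To make the average-embedding idea above concrete, here's a toy sketch; the 4-d vectors and mini vocabulary are invented, whereas a real system would load pretrained embeddings (GloVe, fastText, ...) with hundreds of dimensions:

```python
# Toy sketch: mean-pooled word vectors as a sentence representation.
import numpy as np

EMB = {  # invented mini-vocabulary
    "the":   np.array([0.1, 0.0, 0.2, 0.1]),
    "cat":   np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.9, 0.2, 0.1]),
    "stock": np.array([0.0, 0.1, 0.9, 0.8]),
    "price": np.array([0.1, 0.0, 0.8, 0.9]),
}

def sentence_vec(sentence):
    """Mean-pool the word vectors; unknown words are simply skipped."""
    vecs = [EMB[w] for w in sentence.lower().split() if w in EMB]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sentences about similar things land close together, even if those exact
# sentences were never seen during training.
print(cosine(sentence_vec("the cat"), sentence_vec("the dog")))      # high
print(cosine(sentence_vec("the cat"), sentence_vec("stock price")))  # much lower
```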
It's not so straightforward to go from a word representation to a query, sentence, or document representation.
If you come from the tfidf direction you can first tune up BM25 or something based on the ks-divergence, then use a random matrix, LDA, or the deep-network autoencoder I worked on, which crushed conventional tfidf vectors down to 50-d vectors.
(Like many things people want to apply word vectors to, you go from 50% accuracy here to 70%, but we know it because we tested it on TREC gov2)
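The general shape of that pipeline is roughly the sketch below; note I'm substituting TruncatedSVD (i.e. LSA) for the deep autoencoder mentioned above, and the corpus and target dimension are placeholders:

```python
# Sketch of the tf-idf -> dense low-dimensional step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the unemployment rate fell last quarter",
    "the model was trained on historical unemployment data",
    "word embeddings place similar concepts near each other",
    "tf-idf vectors are sparse and high dimensional",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # sparse, |vocabulary| columns
svd = TruncatedSVD(n_components=3, random_state=0)   # would be 50 on a real corpus;
dense = svd.fit_transform(tfidf)                     # 3 here because this toy corpus is tiny
print(tfidf.shape, "->", dense.shape)
```

Whether an autoencoder or LSA does the crushing, the downstream model only ever sees a dense low-dimensional vector.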
Today I'm interested in systems that have an input-to-action orientation, and there you have to be able to put together a story like: "these 10 messages are parsed correctly and not by accident". That requires that certain 'king/queen' inferences be done correctly, or alternately that the system has paths to recover from missing an inference.
Often there is no path to go from "popular models in the new A.I." to "something that can serve customers off the leash" and that's the problem.
Now I do like subword embeddings, but that just points out the problem that there is no such thing as a "word".
Let me justify that.
You can split English text into words with something like "some text".split(), but it is not easy to do that from audio. Speech is punctuated by silences, often in the middle of words (whenever you make a "[st]op" sound), to the point that separating out words is equivalent to the whole speech understanding problem.
We can turn words into subwords and mash subwords together to make new words (e.g. "Fourthmeal", "Juneteenth", "Nihilego").
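A toy way to see the subword point: a small inventory of pieces covers words never seen whole. The piece list and the greedy longest-match segmenter below are invented for the demo; real systems learn pieces with BPE/WordPiece and attach embeddings to the pieces.

```python
# Invented piece inventory and a naive greedy segmenter, for illustration only.
PIECES = {"fourth", "meal", "june", "teen", "th", "nihil", "ego"}

def segment(word):
    """Greedy longest-match segmentation into known pieces (single chars as fallback)."""
    word, out, i = word.lower(), [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest remaining span first
            if word[i:j] in PIECES:
                out.append(word[i:j])
                i = j
                break
        else:                                   # nothing matched: emit one character
            out.append(word[i])
            i += 1
    return out

print(segment("Fourthmeal"))   # ['fourth', 'meal']
print(segment("Juneteenth"))   # ['june', 'teen', 'th']
print(segment("Nihilego"))     # ['nihil', 'ego']
```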
Also, there are many cases where you can replace a phrase with a word or a word with a phrase. Putting the 'word' at the center of a model means the system is going to be in trouble with linguistic phenomena that happen 30% of the time.
Not sure what you think this shows, but there are a lot of reasons why the results they show don't really matter much - or at least might actually reflect roughly the accuracy a human would also achieve.
Their headline claim is that a "1 pixel change reduces accuracy by 30%". The test process behind that number is this:
> We choose a random square within the original image and resize the square to be 224x224. The size and location of the square are chosen randomly according to the distribution described in (Szegedy et al., 2015). We then shift that square by one pixel diagonally to create a second image that differs from the first one by translation by a single pixel.
So... they are taking a random square, downsampling it to 224x224, shifting it by a pixel, and then predicting on that subset of the original image, and measuring the performance against the accuracy of the original prediction.
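Roughly, as I read it, the procedure is something like this sketch; the crop-size choice, the file path, and the predict() call are placeholders, not the paper's actual code:

```python
# Build two 224x224 inputs from the same random crop, shifted one pixel
# diagonally, and compare the model's predictions on them.
import random
from PIL import Image

def one_pixel_shift_pair(img, crop_size):
    """Return two 224x224 images: the same random crop, shifted one pixel diagonally."""
    w, h = img.size
    x = random.randint(0, w - crop_size - 1)   # leave room for the 1-pixel shift
    y = random.randint(0, h - crop_size - 1)
    a = img.crop((x, y, x + crop_size, y + crop_size)).resize((224, 224))
    b = img.crop((x + 1, y + 1, x + 1 + crop_size, y + 1 + crop_size)).resize((224, 224))
    return a, b

# img = Image.open("some_image.jpg")            # placeholder path
# a, b = one_pixel_shift_pair(img, crop_size=180)
# consistent = predict(a) == predict(b)         # predict() stands in for the CNN
```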
What this seems to show is that "CNNs aren't as accurate at making predictions on subsets of an image as on the whole image". This is of course to be expected, and is exactly how a human would perform.