Fascinating philosophical question: if an AI observes a stream of bytes but doesn't decode a camera's output into an image -- and yet can raise an alarm if the camera is aimed at a certain class of subject -- does the AI "see" the camera image?
They should have named the system "Blindsight"[1].
Apple has a serious corporate concern about proposed legal requirements to detect and report child pornography[2], while other legal requirements try to protect user privacy. This appears to be an attempt to say, "We can detect your CP image storage without ever actually looking at your stored images."
[1] https://en.wikipedia.org/wiki/Blindsight
[2] https://en.wikipedia.org/wiki/Regulation_to_Prevent_and_Comb...
It is worth doing some ML courses to understand what is going on.
I am beginning this journey, and it takes a lot of the magic out of it.
Really what is happening is just mathematical operations: a lot of matrix multiplications, some simple non-linear functions (because stacking purely linear operations just gives you another linear operation), and some normalization to stop the numbers getting out of control.
Importantly, there are millions or billions of "magic numbers", called parameters, that get updated as the model learns.
Whether you train the network on representation A of the image, or representation B of the image, doesn't change much philosophically. It is just a function f(x) applied before the function g(x, parameters), so you get g(f(x), parameters).
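A minimal sketch of what I mean, with made-up numpy toy functions standing in for f and g:

    import numpy as np

    def f(x):
        # representation change applied before the model (here: bytes -> [-1, 1])
        return (x.astype(np.float32) / 255.0) * 2.0 - 1.0

    def g(x, W, b):
        # the "model": one matrix multiplication plus a simple non-linearity
        return np.maximum(0, x @ W + b)

    x = np.random.randint(0, 256, size=(4, 16))   # toy batch of "raw bytes"
    W = np.random.randn(16, 8)
    b = np.zeros(8)

    out_a = g(x, W, b)       # model on representation A (raw byte values)
    out_b = g(f(x), W, b)    # same model on representation B, i.e. g(f(x), parameters)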
Now you could argue, well, it might mean "something". After all, you can reduce our brains to being a big mathematical function.
Possibly.
But I think in this case it is too simple. The AI is just looking at a slightly different representation. Similar to switching VS Code to or from dark mode for humans, or something like that. And the modellers might be changing representations all the time anyway as they tinker with things.
I took Andrew Ng’s famous Coursera ML course and I still find this stuff extremely fascinating and as close to magic as you can get (in the digital realm).
I agree that a higher degree of understanding lessens the emotional impact but often you just need to pause for a second, look back and appreciate what we’ve accomplished as a species.
Sometimes I get on a plane and mid flight I realize “wait a second, I’m in a metal box flying in the sky at 800 km/h and I can breathe, eat, drink and watch tv” and I get goosebumps, even though I’m kind of familiar with the physics of it.
Maybe some things just have an inherent “awe factor” and no matter how well you understand them you still get those butterflies. Or maybe I’m just blabbering :)
Yeah, but this is an ML 101 view of the thing, so let me put it this way:
The topology of the (real) input data is part of the dimensionality reduction needed for the identification to work, especially at a given number of parameters.
Understanding how the data is connected in the input (there are image lines and they form a "shape") helps a lot, and CNNs used this to their advantage.
Maybe a very powerful RNN could deduce this by itself, but it is wasted processing.
GPT also uses the input topology in its favour when it uses tokens, processes text sequentially, etc.
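A rough back-of-the-envelope (toy numbers, not from the paper) for why exploiting the grid topology matters so much at a fixed parameter budget:

    # map a 224x224x3 image to 64 feature channels per location,
    # with and without exploiting the 2D grid structure
    conv_params = 3 * 3 * 3 * 64 + 64                   # shared 3x3 filter bank: ~1.8k weights
    dense_params = (224 * 224 * 3) * (224 * 224 * 64)   # fully connected layer: ~4.8e11 weights
    print(conv_params, dense_params)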
If this is the case, and I believe you, then this implies: "bytes are NOT all you need". You need a good representation of your image for the network to work well. It might so happen that one of the popular formats, maybe JPEG, does a good job. But most likely a dedicated ML person could do better.
This is probably what you would classically call "feature engineering"? And this is all hyperparameter search stuff in a way.
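To make the two candidate representations concrete, a tiny sketch (hypothetical file name, assuming Pillow and numpy are available):

    import numpy as np
    from PIL import Image

    path = "example.jpg"  # hypothetical file, just to compare the two representations

    raw_bytes = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)  # the JPEG byte stream
    pixels = np.asarray(Image.open(path))                               # the decoded HxWx3 pixel grid

    # same photo, two different f(x) handed to the same downstream model
    print(raw_bytes.shape, pixels.shape)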
"I am beginning this journey, and it takes a lot of the magic out of it.
Really what is happening is just mathematical operations: a lot of matrix multiplications, some simple non-linear functions (because stacking purely linear operations just gives you another linear operation), and some normalization to stop the numbers getting out of control."
And all that happens in the human brain is "just" chemical reactions.
All that happens on the Earth is "just" chemistry.
Well, it’s not just an aside. It takes all the wind out of your sails. What’s interesting here is the end result, not pointing out the basic elements, for the same reason that reducing love or consciousness to chemical reactions doesn’t take away the mystique and wonder of those things. “Heh, sorry to ruin it for you all :b but it’s all just chemical reactions” just isn’t as interesting as you might have been thinking when writing that comment.
Do you "see" the image projected onto your eye's retina, if you just observe a massively parallel stream of electric stimuli going through your optic nerve, but never decode it into a cluster of pixels and the internal reconstruction of the scene in your brain is nothing like that?
(wait until you know that there are single-pixel compressive sensing cameras that bypass the wasteful image reconstruction phase, directly mapping the raw signal to the output representation...)
Models never "see" anything to begin with, it's all matrices. And since we ditched convnets, locality doesn't even matter anymore (almost). RGB is just convenient for humans; nothing says it's optimal for deep learning.
Yeah, real humans see with a Fourier transform in a highly optimized basis for projecting 3D down to the 2D retina. Not cold soulless math like those machines!
Humans see more than 8-bit RGB. Humans can see light polarization and stereo disparity, but more importantly we can interact with things we look at.
XYZ is the "most optimal" 3-channel colorspace but it's a simple transformation from RGB, so it doesn't matter - the model can learn it if it wants to.
Yeah, yeah, we know it is all matrices and numbers. But the Apple group seems to be asserting that there is a difference:
> We also demonstrate ByteFormer's ability to perform inference with a hypothetical privacy-preserving camera which avoids forming full images by consistently masking 90% of pixel channels, while still achieving 71.35% accuracy on ImageNet.
So the way this algorithm smooshes the matrices is "privacy-preserving" because it doesn't actually take in all the bytes. Just some of them. So it is not an invasion of your privacy, if your phone checks all your stored photos for child pornography, because it never actually _looks_ at the images. It just _peeks_ at a bit of them.
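To be concrete about what that masking claim amounts to (my own toy reading of "consistently masking 90% of pixel channels", not Apple's actual pipeline):

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)   # stand-in photo

    # a fixed ("consistent") mask that keeps roughly 10% of the pixel channels
    keep = rng.random(img.shape) < 0.10
    observed = np.where(keep, img, 0)   # the other ~90% of values are never exposed to the model
    print(keep.mean())                  # ~0.10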
If I ask to see just a bit of your private documents - say, the top left corner - and I can infer with 70% accuracy who you're getting services from, did I respect your privacy? This seems a bit like smoke and mirrors. Either scan the image or not!
Inference from partial images is super cool research, for example to deal with corrupted image files, but not for privacy imo.
Which is a hardcore effort to pretend the issue was something to do with looking at images, and not indiscriminate motivated surveillance.
No one cares about software touching their bytes: they should care a lot about software which is built with the null hypothesis of "this person has child pornography, notify the authorities".
That sounds like a variant of the Chinese Room problem: if a non-Chinese speaker follows a rule book to text-chat in Chinese with someone on the other side, does s/he, in actuality, “speak Chinese”?