This is a pretty fluffy piece. I'm not trained or versed in the domain, but ML models try to optimize their classification accuracy based on a number of inputs. "Truth" doesn't come into it. There might be some pathological inputs that cause errors, but this has nothing to do with "truth" or "purity of heart".
Someday, a neural network might be compelled to produce suboptimal output to further its own hidden agenda. Interesting sci-fi plot, but that doesn't seem to be what this is about. That's about the closest I think a machine could get to being able to talk about "the truth".
This is just about adversarial inputs making the machine wrong. It doesn't seem to have the philosophical weight the title suggests.
It’s about the fact that those adversarial inputs can be designed in by whoever creates the model without the existence of those inputs being detectable (within reasonable computational bounds) by analyzing the model. Moreover, apparently any input can be slightly tweaked to become such an adversarial input, if you know the right key. That means that the model can be made to “lie” on roughly any input, without that fact being detectable on the model.
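To make the "if you know the right key" part concrete, here's a toy Python sketch. The names (SECRET_KEY, plant_trigger, backdoored_predict) are made up by me, and this is not the paper's actual construction: I'm putting the check in an obviously readable wrapper, whereas the paper's point is that an equivalent check can be folded into the model's weights themselves.

    # Toy illustration only: a classifier wrapper that answers honestly unless
    # the input carries a tiny perturbation derived from a secret key.
    import hashlib
    import hmac
    import numpy as np

    SECRET_KEY = b"attacker-only-key"   # hypothetical; known only to the planter

    def _pattern(x_coarse: np.ndarray) -> np.ndarray:
        # Pseudorandom +/-1 pattern derived from the coarsened input and the key.
        digest = hmac.new(SECRET_KEY, x_coarse.tobytes(), hashlib.sha256).digest()
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
        bits = np.resize(bits, x_coarse.size).reshape(x_coarse.shape)
        return bits.astype(np.float64) * 2 - 1

    def plant_trigger(x: np.ndarray, eps: float = 1e-3) -> np.ndarray:
        # Whoever holds the key can tweak *any* input so the backdoor fires,
        # while the tweaked input stays very close to the original.
        x_coarse = np.round(x, 1)
        return x_coarse + eps * _pattern(x_coarse)

    def backdoored_predict(honest_predict, x: np.ndarray, target_label=1, eps: float = 1e-3):
        # Identical to the honest model except on key-tagged inputs.
        x_coarse = np.round(x, 1)
        if np.allclose(x - x_coarse, eps * _pattern(x_coarse), atol=eps / 100):
            return target_label        # the planted "lie"
        return honest_predict(x)       # honest answer everywhere else

A wrapper like this is trivially auditable by reading it, which is exactly what the paper takes away: the same behavior can be planted so that no feasible analysis of the model reveals it.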
Why is that interesting, though? I can just as easily put a backdoor in the preprocessing before passing the input to the algorithm. Outside of machine learning, you can do the same thing anywhere. This doesn't appear to be anything new, and it's citing an article that hasn't even been peer reviewed yet. It's just not good writing, in my opinion.
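Something like this, to be concrete (a sketch of what I mean, with made-up names like MAGIC):

    # A backdoor hidden in preprocessing: the model stays honest,
    # the dishonest step sits in front of it.
    import numpy as np

    MAGIC = 0.1337  # hypothetical trigger value in one feature/pixel

    def preprocess(raw: np.ndarray) -> np.ndarray:
        if np.isclose(raw.flat[0], MAGIC):     # hidden trigger check
            raw = np.zeros_like(raw)           # deliberately feed the model garbage
        return (raw - raw.mean()) / (raw.std() + 1e-8)   # otherwise ordinary normalisation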
It’s interesting if someone supplies you a model that you build an application around yourself (and thus you control any preprocessing), because they basically prove that you have no way to check that the model doesn’t contain a backdoor, even though you can inspect the model (it’s not a black box to you). It’s as if someone gave you a software component as source code and you still couldn’t detect that it has a backdoor.
ML models aren’t Turing machines (unless you loop their output back as input). The paper is about simple classifiers, which run in a predetermined, finite number of steps.
I almost never compile the compiler I use, so I'm implicitly trusting that the compiler actually spits out what I expect and not some kind of backdoor[1].
I thought the issue was that you get some premade model from a company, feed it input and it classifies for you. With a compiler you feed it input and it produces a binary.
If you don't have access to the source, meaning model training data or source code for the compiler, then you can't be sure the model won't intentionally misclassify or the compiler won't insert trojan code.
The difference I see is that an ML model is, at first glance, not a compiled binary with hidden mechanics: it’s a network graph with weights on the edges, where all nodes work in the same easy-to-understand way. The model also isn’t a unique function of the training data in the way that the compiler binary is a function of the compiler source: you can get slightly differently behaving models from the same training data, so you can’t fully predict the model’s behavior from the training data the way you can predict the compiler’s behavior from the compiler source. The model itself is generally the better “source” for predicting (well, simulating) its exact behavior. That’s why it is surprising that the presence of a backdoor can remain undetectable by inspecting the model. The closer analogy would be a backdoored compiler whose backdoor cannot be detected by analyzing the compiler binary’s machine code.
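To illustrate the "not a unique function of the training data" point, here's a quick scikit-learn sketch (my own example, nothing from the paper):

    # Same training data, two runs that differ only in random initialisation:
    # the resulting models agree on most inputs, but not all of them.
    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier

    X, y = make_moons(n_samples=500, noise=0.25, random_state=42)

    m1 = MLPClassifier(hidden_layer_sizes=(16,), random_state=0, max_iter=2000).fit(X, y)
    m2 = MLPClassifier(hidden_layer_sizes=(16,), random_state=1, max_iter=2000).fit(X, y)

    Xt, _ = make_moons(n_samples=10_000, noise=0.25, random_state=7)
    disagree = (m1.predict(Xt) != m2.predict(Xt)).mean()
    print(f"fraction of inputs where the two models disagree: {disagree:.3f}")

If you want to know exactly what either model will do on a given input, you run the model, not the training code.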
What's remarkable is that anyone thinks it's remarkable that a machine, or a person for that matter, or a person operating a machine, can be wrong.
A person can give a wrong answer or perform a wrong action, as a result of bad input. So what? That input can be crafted specifically to confuse them and trick an honest person into performing some bad act. So what?
All of the same is exactly as true for an AI. So what?
And lastly, aside from a person or AI being in error, an operator/user of an AI (or of a person) can be in error (believing the AI's output is good when it's not). So what?
The novel result is not "code can be wrong," it's "code can be wrong in a way that cannot be detected via any sort of audit or review, even when said code is restricted to some class less complex than Turing machines."
I thought that was always true of any AI? You only know the input data, the weights, and the starting conditions/code, but you know nothing about the actual workings once it has started.
You can only audit that by duplicating the results, corroboration, and consensus, like with scientific research. I.e., other AIs doing the same job but using different code and run by other people: do they produce the same output, or the same pattern of output?
I'm not in ML/AI, so I'm not stating that as something I know, just something I always assumed.
I would be stunned if you said that people actually thought they could audit an AI's inner workings after kick-off.
Spot-testing usually gives you a representative picture of what the ML model will produce in general. Of course there can always be outliers (and usually there are), but they are just that, outliers, and they can’t be systematically exploited by an attacker with normal-looking inputs. The present paper however basically shows that those outliers can be systematically and deliberately spread throughout input space in such a way that any given input can be slightly tweaked by the attacker (in ways that the input still looks unsuspicious) to get the desired “lying” output, without that fact being detectable either by spot-checking or any other practically feasible analysis on the model. The fact that this is possible to do in such a general fashion (any given model can be modified to contain such a backdoor) is a new finding.
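Put differently, a spot-check is just black-box sampling, something like this (a sketch, not from the paper; spot_check and its arguments are names I made up):

    # A spot-check only tells you two predictors agree on inputs you happened
    # to sample; it says nothing about inputs derived from a secret key.
    import numpy as np

    def spot_check(predict_a, predict_b, input_shape, n=10_000, seed=None):
        rng = np.random.default_rng(seed)
        xs = rng.normal(size=(n, *input_shape))
        agreements = sum(predict_a(x) == predict_b(x) for x in xs)
        return agreements / n

    # Against the kind of key-triggered backdoor the paper constructs, this
    # reports ~1.0 agreement: the chance of randomly hitting a triggered input
    # is negligible, so sampling never exercises the planted behaviour.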
One thing I’ve always wondered is what would happen if, for example, every Tesla driver in a neighborhood agreed to run a very specific stop sign every single time.
"Truth" is a pretty nebulous concept at the best of times anyway. Humans don't generally know the "truth", they just have a best-guess hypothesis based on their experience so far.
Philosophy is interesting and all but ultimately it's all just linear (or not-so-linear) algebra.