So basically: you take a picture. Apple encrypts it and uploads it to their server. The server matches the (still encrypted) picture against a database and tells your device "this picture contains the Eiffel Tower". Later, when you search for "Eiffel Tower" on your device, the photo pops up.
Is the complexity and security risk really worth it for such a niche feature?
It's also funny that Apple is simultaneously saying "don't worry the photo is encrypted so we can't read it" and "we are extracting data from the encrypted photo to enhance your experience".
They don’t send the photo. They send some encrypted metadata to which some noise is added. The metadata can be loosely understood as “I have this photo that looks sort of like this”. Then the server takes that encrypted data from the anonymized device and responds to the device with something like “that looks like the Eiffel Tower”. The actual photo never goes to the server.
With the added caveat that homomorphic encryption (HE) is the magic sauce: the server cannot see the metadata (the cropped/normalized image data), and doesn't even know how much it does or does not look like the Eiffel Tower.
Because it turns out that mathematicians and computer scientists have devised schemes that allow certain computational operations to be performed on encrypted data without revealing the data itself. The intuition is that you can compute a + b = c on encrypted values without revealing anything about what a and b are. This was mostly confined to the realm of theory and mathematics until very recently, but Apple has now operationalized it at consumer scale.
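To make that intuition concrete, here's a minimal sketch using the Paillier cryptosystem, which is additively homomorphic. To be clear, this is not the scheme Apple uses in production (and the tiny hard-coded primes are hopelessly insecure); it only shows a server adding two numbers it cannot read:

```python
# Toy Paillier: additively homomorphic encryption with insecure demo primes.
# NOT Apple's scheme -- just an illustration of "a + b on encrypted data".
import math
import random

# Key generation with toy primes (real keys use primes of 1024+ bits).
p, q = 61, 53
n = p * q                      # public modulus
n2 = n * n
g = n + 1                      # standard simplified choice of generator
lam = math.lcm(p - 1, q - 1)   # private: Carmichael's lambda
mu = pow(lam, -1, n)           # private: modular inverse of lambda mod n

def encrypt(m: int) -> int:
    """Encrypt m (0 <= m < n) under the public key (n, g)."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    """Decrypt with the private key (lam, mu)."""
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

# Client encrypts two secret numbers.
a, b = 12, 30
ca, cb = encrypt(a), encrypt(b)

# Server multiplies the ciphertexts -- which, under Paillier, corresponds
# to ADDING the plaintexts -- without ever seeing a or b.
c_sum = (ca * cb) % n2

# Only the client (holder of the private key) can read the result.
assert decrypt(c_sum) == (a + b) % n
print(decrypt(c_sum))          # 42
```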
The phone has on-device intelligence to detect things that look like landmarks; it crops and normalizes those regions and converts them into a mathematical form (an embedding).
Apple has a database trained on multiple photos of each landmark (or part of a landmark), to give a likelihood of a match.
Homomorphic encryption means that the encrypted mathematical form of a potential landmark from the phone can be applied to the server's set of landmark data to get an encrypted result set (a toy sketch of this matching step follows below).
The phone can then decrypt this and see the result of the query. To anyone else, including Apple's server, it just looks like noise being transformed into new noise.
The justification for this approach is storage: the landmark data set can only get larger as it becomes more comprehensive. Imagine trying to match photos of the insides of castles, cathedrals and museums, for example.
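Here's a toy sketch of that matching step: the phone sends an encrypted query vector, and the server computes a similarity score against each of its (plaintext) landmark rows without being able to read the query or the scores. It uses the same insecure toy Paillier as the sketch further up (repeated so this snippet runs on its own); the vectors and landmark names are invented, and the real system uses a different, more capable scheme and proper image embeddings:

```python
# Toy sketch: server scores an ENCRYPTED query against plaintext rows.
# Same insecure toy Paillier as above; data is made up for illustration.
import math
import random

p, q = 61, 53
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# --- Client side: a tiny integer "embedding" of the photo (made up). ---
query = [3, 1, 4]
enc_query = [encrypt(x) for x in query]   # this is all the server sees

# --- Server side: plaintext landmark vectors (also made up). ---
landmarks = {
    "Eiffel Tower": [3, 1, 5],
    "Big Ben":      [0, 7, 1],
    "Golden Gate":  [5, 0, 2],
}

def encrypted_dot(enc_vec, plain_vec):
    # Paillier: Enc(x)^w = Enc(w*x), and multiplying ciphertexts adds
    # plaintexts, so this yields Enc(sum_i w_i * x_i) -- a dot product
    # computed without ever decrypting the query.
    acc = encrypt(0)
    for cx, w in zip(enc_vec, plain_vec):
        acc = (acc * pow(cx, w, n2)) % n2
    return acc

enc_scores = {name: encrypted_dot(enc_query, vec)
              for name, vec in landmarks.items()}

# --- Back on the client: decrypt the scores and pick the best match. ---
scores = {name: decrypt(c) for name, c in enc_scores.items()}
print(max(scores, key=scores.get))        # "Eiffel Tower"
```

Note that in this toy, the encrypted scores go back to the client, which decrypts them and picks the winner itself; the server never sees a score or the query.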
I don't completely understand the maths of how this works, but no, they don't.
Here's a theoretical way I wrote in another comment:
> I think they have more efficient ways, but theoretically what you could do is apply each row in your database to this encrypted value, in such a way that the encrypted value becomes the name of the POI of the best match, or otherwise junk is appended (completely changing the encrypted value). Again, the server has not read the encrypted value, so it does not know which row won out. Only the client will know, when it decrypts the new value.
They do something like this, using homomorphic encryption. Whatever they do, there is no doubt they incur serious performance hits.
> They do something like this, using homomorphic encryption. Whatever they do, there is no doubt they incur serious performance hits.
Right, I've seen similar engineering efforts to target this sort of functionality fail because of the computational cost and resulting latency. I'm curious to read the paper for the tradeoffs they made toward practicality at Apple's scale of users.
Not really. It's more like Apple runs a local algorithm that takes your picture of the Eiffel Tower and outputs some text like "Eiffel Tower, person smiling", then encrypts that text and sends it securely to Apple's servers to help you when you perform a search.
Locally, a small ML model identifies potential POIs (points of interest) in an image.
Another model turns these regions into a series of numbers (a vector) that represents the image. For instance, one number might correlate with how "skyscraper-like" the image is. (We don't actually know what each dimension of the vector means, but we can turn an image that we know is the Eiffel Tower into a vector and measure how close the reference vector and our sample vector are to each other.)
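For a feel of what that comparison looks like, here's a quick sketch with completely made-up numbers (real embeddings come out of a neural net and have hundreds of dimensions, but the comparison is the same idea):

```python
# Cosine similarity between tiny, hypothetical "embedding" vectors.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

reference_eiffel = [0.9, 0.1, 0.7]   # vector for a known Eiffel Tower photo (hypothetical)
my_photo         = [0.8, 0.2, 0.6]   # vector for the photo on my phone (hypothetical)
random_beach     = [0.1, 0.9, 0.0]   # vector for an unrelated photo (hypothetical)

print(cosine_similarity(reference_eiffel, my_photo))      # ~0.99 -> likely a match
print(cosine_similarity(reference_eiffel, random_beach))  # ~0.17 -> not a match
```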
The thing is, we aren't storing this database with the vectors of all known locations on our phone. We could send the vector we made on-device off to Apple's servers; the vector is lossy, after all, so Apple wouldn't have the image. If we did this, however, Apple would know that we have an image of the Eiffel Tower.
So, this is the magic part. The device encrypts the vector using a private key known only to it, then sends this unreadable vector off to the server. Somehow, using homomorphic encryption and other processes I do not fully understand, mathematical operations like cosine similarity can be applied to this encrypted vector without reading its actual contents. Each of these operations changes the encrypted value, but the server cannot tell how the value changed.
I don't know if this is exactly what Apple does; I think they have more efficient ways, but theoretically what you could do is apply each row in your database to this encrypted value, in such a way that the encrypted value becomes the name of the POI of the best match, or otherwise junk is appended (completely changing the encrypted value). Again, the server has not read the encrypted value, so it does not know which row won out. Only the client will know, when it decrypts the new value.
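Here's a toy sketch in that spirit, with one big caveat: doing the "which row is the best match?" comparison under encryption needs a more capable scheme than the additively homomorphic toy below, so this version cheats by letting the client send an encrypted one-hot selector for the row it wants, the same shape as private information retrieval. What it does show is the other half of the trick: the server combines its rows with ciphertexts and returns something that decrypts to one row's label, without ever learning which row was involved. Names and numbers are invented, and the toy Paillier helpers are repeated so this runs standalone:

```python
# Toy "return the label of one row without the server knowing which row":
# PIR-style retrieval with the same insecure toy Paillier as above.
import math
import random

p, q = 61, 53
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Server's database: numeric POI ids standing in for names (made up).
poi_ids = [101, 202, 303]
poi_names = {101: "Eiffel Tower", 202: "Big Ben", 303: "Golden Gate Bridge"}

# Client wants row 0 but must not reveal that: it sends Enc(1) for the row
# it wants and Enc(0) for every other row -- all look like noise to the server.
wanted = 0
enc_selector = [encrypt(1 if i == wanted else 0) for i in range(len(poi_ids))]

# Server: Enc(s_i)^id_i = Enc(s_i * id_i); multiplying these gives
# Enc(sum_i s_i * id_i) = Enc(id of the selected row). The server computes
# this blindly -- it never learns which row was selected.
acc = encrypt(0)
for c_s, poi_id in zip(enc_selector, poi_ids):
    acc = (acc * pow(c_s, poi_id, n2)) % n2

# Client decrypts and looks up the name locally.
print(poi_names[decrypt(acc)])   # "Eiffel Tower"
```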