Agree it would be cool to be "untouched" by the compression algorithm, but that's nearly impossible with YouTube. YouTube re-encodes every upload into several different versions and, on top of that, several different codecs to support devices with different built-in hardware video decoders.
For example, when I upload a 4K vid and then watch the 4K stream on my Mac vs my PC, I get different video files solely based on the browser settings that can tell what OS I'm running.
Making the data survive compression across so many different codecs is likely not feasible.
Yes, but nothing says this has to work for every codec. Since you want to retrieve the files using a special client anyway, you could pick whichever codec you like.
But (almost) nothing prevents YouTube from simply not serving that particular codec anymore. This still pretty much falls under the "re-encoding" case I mentioned, which would make the whole thing brittle anyway.
How about a Fourier transform (or cosine, whichever works best), and keeping the data as frequency-component coefficients? That's the rough idea behind digital watermarking, and it survives image transforms quite well.
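A rough sketch of that idea, assuming scipy is available (the coefficient slots and embedding strength are arbitrary choices for illustration, not anything standard): push a handful of low-frequency DCT coefficients to a fixed magnitude and read the bits back from their signs.

    # Toy DCT watermark: one bit per low-frequency coefficient, recovered from the sign.
    import numpy as np
    from scipy.fft import dctn, idctn

    # Low-frequency coefficient positions carrying one bit each (arbitrary choice,
    # skipping the DC term at (0, 0)).
    SLOTS = [(2, 3), (3, 2), (4, 1), (1, 4), (3, 3), (2, 4), (4, 2), (4, 4)]
    STRENGTH = 50.0  # how hard each coefficient is pushed; bigger survives harsher compression

    def embed_bits(img, bits):
        """Encode one bit per slot as the sign of a low-frequency DCT coefficient."""
        coeffs = dctn(img.astype(float), norm="ortho")
        for (y, x), bit in zip(SLOTS, bits):
            coeffs[y, x] = STRENGTH if bit else -STRENGTH
        return np.clip(idctn(coeffs, norm="ortho"), 0, 255).astype(np.uint8)

    def extract_bits(img):
        """Recover the bits from the coefficient signs after the image comes back."""
        coeffs = dctn(img.astype(float), norm="ortho")
        return [1 if coeffs[y, x] > 0 else 0 for (y, x) in SLOTS]

    if __name__ == "__main__":
        frame = np.full((64, 64), 128, dtype=np.uint8)       # stand-in for a video frame
        stego = embed_bits(frame, [1, 0, 1, 1, 0, 0, 1, 0])  # hide one byte
        print(extract_bits(stego))                           # should print [1, 0, 1, 1, 0, 0, 1, 0]

Whether the signs actually survive YouTube's encoder depends on how low the frequencies are and how hard you push them, which is exactly the knob STRENGTH stands in for.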
Just as an aside, it's absolutely astounding how much hardware Google must throw at YouTube to achieve this for any video anybody in the world wants to upload. The processing power to re-encode into so many versions, then to store all of those versions, and then to make all of them accessible anywhere in the world at a moment's notice. Really is such an incredible waste for most YouTube content.
I didn't know that, and I'm glad they do transcode on the fly when appropriate. Very impressive that it's seamless to the end user; I've never sat waiting for a YouTube clip to play, even when it's definitely something from the 'back catalog', like a decade-old video with a dozen views.
What if you have an ML model that produces a vector from a given image? You have a set of vectors that correspond to bytes - for a simple example, 256 "anchor vectors", one for each possible byte value.
To encode an arbitrary sequence of bytes, for each byte you produce an image that your ML model would convert to that byte's anchor vector, and you add the image as a frame in a video. Once all the bytes have been converted to frames, you upload the video to YouTube.
To decompress, you simply go frame by frame over the video and send each frame to your model. Your model produces a vector, and you find which of your anchor vectors is the nearest match. Even though YouTube will have compressed the video in who knows what way, and even if YouTube's compression changes, the resulting images should still look similar; if your anchors are well chosen and your model works well, you should be able to tell which anchor a given image was intended to correspond to.
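A minimal sketch of the decode side under that scheme. The embed() argument is a placeholder for whatever image-embedding model you pick (the hard part, generating images that map cleanly onto the anchors, is left out); anchors is assumed to be a 256-row array, one row per byte value.

    # Sketch of the "nearest anchor" decoder.
    import numpy as np

    def nearest_anchor(vec, anchors):
        """Return the byte value whose anchor vector is closest by cosine similarity."""
        vec = vec / np.linalg.norm(vec)
        normed = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
        return int(np.argmax(normed @ vec))

    def decode_frames(frames, embed, anchors):
        """Map each (possibly recompressed) frame back to the byte it was meant to encode."""
        return bytes(nearest_anchor(embed(f), anchors) for f in frames)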
Why go that way? I'm no digital signal processing expert, but images (and series thereof, i.e. videos) are 2D signals. What we see is the spatial domain, and analyzing pixel by pixel is naive and won't get you very far.
What you need is to go to the frequency domain. From my own experiments back in university, the most significant image information lies in the lowest frequencies. Cutting off everything above the lowest ~10% of frequencies still leaves a very comprehensible image, with only wavy artifacts around objects. You have plenty of bandwidth to use, even if you want to embed info in existing media.
Now here you have the full bandwidth to use. Start in the frequency domain, decide on the lowest bandwidth you're willing to rely on, and set the coefficients of those harmonic components. Convert to the spatial domain, upscale, and you've got your video to upload. This should leave the data encoded in a way that survives compression and resizing; you'll just need to allow some headroom for that.
You could slap error correction codes on top.
If you think about it, you should consider video as - say - copper wire or radio. We’ve come quite far transmitting over these media without ML.
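To make the error-correction point concrete, here's a toy layer that could sit on top of a frequency-domain embedding like the one sketched above: a 3x repetition code with majority voting. A real system would use something stronger (e.g. Reed-Solomon), but the principle is the same: spend some of the low-frequency "bandwidth" on redundancy so occasional coefficient flips don't matter.

    # Toy error-correction layer: repeat each bit, majority-vote on decode.
    from collections import Counter

    REPEAT = 3  # arbitrary redundancy factor

    def ecc_encode(bits):
        """Repeat every payload bit REPEAT times before embedding it in coefficients."""
        return [b for bit in bits for b in [bit] * REPEAT]

    def ecc_decode(noisy_bits):
        """Majority-vote each group of REPEAT received bits back to one payload bit."""
        groups = [noisy_bits[i:i + REPEAT] for i in range(0, len(noisy_bits), REPEAT)]
        return [Counter(g).most_common(1)[0][0] for g in groups]

    if __name__ == "__main__":
        payload = [1, 0, 1, 1]
        sent = ecc_encode(payload)          # [1,1,1, 0,0,0, 1,1,1, 1,1,1]
        sent[4] = 1                         # simulate one bit flipped by re-encoding
        print(ecc_decode(sent) == payload)  # True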
We started with that approach, by assuming that the compression is wavelet-based and then purposefully generating wavelets that we know survive the compression process.
For the sake of this discussion, wavelets are pretty much exactly that: a bunch of frequencies where the "least important" ones (according to the algorithm) are cut out.
But that's pretty cool, seems like you've re-invented JPEG without knowing it, so your understanding is solid!
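For anyone curious what "write into the coefficients the codec is likely to keep" can look like, here's a toy sketch assuming PyWavelets is installed. The wavelet, level and amplitude are made-up values for illustration: the payload goes into the coarse approximation band, which a wavelet-based codec is least likely to discard, and is read back after reconstruction.

    # Toy wavelet-domain embedding with PyWavelets.
    import numpy as np
    import pywt

    LEVEL = 3
    AMPLITUDE = 60.0  # offset around mid-grey; larger survives harsher quantization

    def embed(bits, shape=(256, 256)):
        """Build a frame whose level-3 approximation coefficients carry the bits."""
        frame = np.full(shape, 128.0)
        coeffs = pywt.wavedec2(frame, "haar", level=LEVEL)
        approx = coeffs[0]                      # coarse approximation band
        flat = approx.ravel()
        for i, bit in enumerate(bits[: flat.size]):
            flat[i] += AMPLITUDE if bit else -AMPLITUDE
        coeffs[0] = flat.reshape(approx.shape)
        return np.clip(pywt.waverec2(coeffs, "haar"), 0, 255)

    def extract(frame, n_bits):
        """Re-decompose and compare against the median (approximately the unperturbed level)."""
        approx = pywt.wavedec2(frame, "haar", level=LEVEL)[0].ravel()
        mid = np.median(approx)
        return [1 if approx[i] > mid else 0 for i in range(n_bits)]

    if __name__ == "__main__":
        bits = [1, 0, 1, 1, 0, 1, 0, 0]
        print(extract(embed(bits), len(bits)) == bits)  # True (before YouTube touches it)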
That's essentially a variant of "bigger pixels". Just like that approach, your algorithm cannot guarantee that an unknown codec will still let the whole thing perform adequately.
Even if you train your model to work best for all existing codecs (I assume that's the "ML" part of the ML model), the no free lunch theorem pretty much tells us that it can't always perform well for codecs it does not know about.
(And so does entropy. Reducing this to absurd levels: if your codec results in only one pixel, and the only color that pixel can have is blue, then you'll only be able to encode any information in the length of the video itself.)
It's not guaranteed to perform well with unknown or new codecs - true. But the implicit assumption is that YouTube will use codecs that preserve what videos look like - not just arbitrary codecs. If that assumption holds, then the image recognition model will keep working even with new codecs.
That's the thing though, "looks like" breaks down pretty quickly with things that aren't real images. It even breaks down pretty quickly with things that are real, but maybe not so common, images: https://www.youtube.com/watch?v=r6Rp-uo6HmI
So one question would be: Does your image generation approach preserve a higher information density than big enough pixels?
Why would you assume that the images in my algorithm aren't real images? For example, you could use 256 categories from ImageNet as your keys: an image of a dog is 00000000, a tree is 00000001, a car is 00000010, and so on.
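A hedged sketch of that mapping; the class list and the classify() argument are placeholders rather than a real classifier or the actual ImageNet label set.

    # "256 ImageNet classes = 256 byte values" sketch. classify(frame) stands in for
    # whatever pretrained classifier you use; it should return one of the chosen
    # class names even after the frame has been through YouTube's encoder.
    CLASSES = ["dog", "tree", "car"]  # ...extend to 256 distinct, easily separable classes

    BYTE_TO_CLASS = {i: name for i, name in enumerate(CLASSES)}
    CLASS_TO_BYTE = {name: i for i, name in enumerate(CLASSES)}

    def encode_byte(value):
        """Pick the category whose reference image will represent this byte value."""
        return BYTE_TO_CLASS[value]   # e.g. 00000000 -> "dog", 00000001 -> "tree"

    def decode_frame(frame, classify):
        """Classify the (recompressed) frame and map the predicted class back to a byte."""
        return CLASS_TO_BYTE[classify(frame)]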
I'm not assuming they are not real images. I'm questioning whether, to reach an information density that even outperforms "big enough colorful pixels", you'd end up in territory where the lossy compression throws away exactly what you need to unambiguously discriminate between two symbols.
And to get to that level of density, I do wonder what kind of detailed "real image" it would still be.
If your algorithm were, for example, literally the scheme you just described, then the "big enough colorful pixels" from the example video would already massively outperform it. Of course it won't be exactly that, but you're assuming that the video compression algorithm somehow applies the same meaning of "looks like" in its preservation efforts as your machine learning model does, down to levels where the differences become so minute that they exceed what you can do with colorful pixels that are just big enough to pass through compression unscathed, maybe with some additional error correction (i.e. exactly what a very high density QR code would do).
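For reference, the "big enough colorful pixels" baseline this thread keeps comparing against can be sketched like this. The block size and the black/white-only palette are arbitrary; more colors per block would raise the density, much like a coarse QR code.

    # Toy "big enough pixels" baseline: each byte becomes a 2x4 grid of large blocks.
    import numpy as np

    BLOCK = 64  # pixels per block edge; make it bigger until compression stops hurting

    def byte_to_frame(value):
        """Render one byte as a 2x4 grid of large monochrome blocks (one block per bit)."""
        bits = [(value >> i) & 1 for i in range(8)]
        grid = np.array(bits, dtype=np.uint8).reshape(2, 4) * 255
        return np.kron(grid, np.ones((BLOCK, BLOCK), dtype=np.uint8))

    def frame_to_byte(frame):
        """Average each block and threshold at mid-grey to recover the bits."""
        h, w = frame.shape[0] // 2, frame.shape[1] // 4
        bits = [int(frame[r*h:(r+1)*h, c*w:(c+1)*w].mean() > 127)
                for r in range(2) for c in range(4)]
        return sum(b << i for i, b in enumerate(bits))

    if __name__ == "__main__":
        assert frame_to_byte(byte_to_frame(0b10110010)) == 0b10110010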