Does Xiph have an official position on objective metrics for video codecs?
I'd imagine that what you choose to measure might make a big difference at this early design stage, and from a purely practical point of view, an "outsider" codec needs to prove itself against the incumbents via benchmarks unless it is night and day better (and even then the incumbents can muddy the waters with carefully chosen benchmarks).
The online peanut-gallery debate on this seems to have degenerated into a sports-fan-style clash between SSIM aficionados and people who think PSNR is good enough. But since the two measures are highly correlated with each other, and both are based on treating video as a series of still images, it sometimes seems like two very short twins arguing about who is taller. The Xiph write-ups have a good track record of cutting through the BS and illuminating good engineering practice, so I'd be interested in hearing their take.
A recent paper suggested that all such frame-based metrics heavily underestimate the improvement of HEVC over H.264 compared with subjective MOS ratings. As an "odd" codec design that intentionally strays from the MPEG orthodoxy, Daala could have a correspondence between objective and subjective quality measures that differs radically from existing codecs.
Quick comment: don't expect Daala to be a mature codec for a while yet. To quote Monty "Writing a complete new encoder from scratch is a small task compared to the time required to then tune that new encoder into efficient operation. Incremental changes allow demonstrable, steady progress."
The downside is that incremental changes leave you stuck in a sufficiently large local maximum, which is where Daala comes in.
I'm not, but I am expecting it to be 2x as efficient (with the same quality) as VP9 and HEVC, if it's going to arrive that late in the game. I don't think anything less than that will work for them (won't be adopted), because then the switch won't be considered worth the trouble. Hopefully they pull it off.
The next generation of video codecs won't be as expensive a switch as the MPEG-2 or H.264 eras were. If a codec shows enough space efficiency, you could just implement an OpenCL or compute-shader variant of it to run GPU-side, which will outperform CPU decoding in power efficiency and time by an order of magnitude, and properly tuned would be in the ballpark of dedicated hardware decode times.
In the same way dedicated audio decoding hardware went out the window when CPUs got fast enough - dedicating die area to it just became excessive even if it was more efficient and faster - I think the same thing will happen (soon) to video decoding, where the extra dedicated decode hardware just isn't worth the hassle when well-optimized GL 4.3 compute shaders or OpenCL deliver efficiency and performance that, if not matching the dedicated hardware, is close enough to not justify having it.
It has almost already won the efficiency argument, because making dies larger to include that extra circuitry increases the power ceiling of the device. Being able to genericize die area doesn't give at-runtime power gains, but overall you can save juice by not wasting die (though power management has gotten so sophisticated it can outright shut off parts of a chip, so it might be able to eliminate that downside).
But simultaneously the GPU hardware in phones (Tegra 4 / Snapdragon 600 era) is reaching the same threshold CPU audio decoding reached - it becomes silly to waste the die when the performance is close enough.
And even if this isn't the generation where dedicated hardware goes out the window, it will be the next one, and this one will be close enough that it will be like the late '90s with audio, when the experimentation began.
Codecs aren't parallel enough to work well on GPUs; parallelism in general hampers compression efficiency.
Anyway, audio resolution hasn't increased anywhere near as much as video resolution. A 48 kHz sample rate at 16 bits per channel is still about the highest that's reasonable, and we've had basically that since CDs.
Whereas internet video has gone from CIF to 1080p, a resolution increase of over 20x. And 4k is being pushed now for another 4x increase.
The point is that audio resolution peaked, and the additional returns were negligible at best. Video hits the same effect somewhere around 300 PPI at a 6" viewing distance, 200 PPI at 12", etc. - and between 90 and 150 Hz refresh rate. Color fidelity is also near its limits on some high-end IPS panels.
Past those points, most people don't notice the difference, just like how most people don't notice the difference between 16- and 24-bit audio at a 44.1 or 48 kHz sample rate. Once the vast majority of people no longer see a difference, the technology peaks. I think video is (finally) approaching that territory in the next 5 years, at least in 2 dimensions. I feel holographic 3D video will see a boon after that, and not the eye-trick 3D crap we have now.
TL;DR: Firstly, I highly recommend that you read the entire article (starting with the first part, https://people.xiph.org/~xiphmont/demo/daala/demo1.shtml ), because it's a very good introduction to some technical details of how Daala is discarding the received wisdom of the last 20+ years of video encoding in the hope of escaping the local maximum we currently find ourselves in.
First, let me say that I am not an expert on video codecs - I've just worked with VP9 a little. However, here is what I got from that.
Video codecs overall use two tricks to reduce video sizes: first, they drop unnecessary information; second, they predict the information that's left. (I'm not going to write about why prediction reduces size here, but look up arithmetic coding in Wikipedia if you want to know.)
If you look at a raw video, dropping information might seem hard - it's a bunch of pixels, and you don't really want to drop a pixel. Instead, you transform the video into a representation where some of the information seems less important. Daala, like VP9, uses a Fourier-type transform (a DCT). That means it writes a group of pixels (in this case, either a 4x4, 8x8, 16x16, or 32x32 block) in a different basis than normal (look up "change of basis" for more information here).
The ordinary basis has a vector that corresponds to each pixel (or really, each color in each pixel, but I'm not getting into that). In the basis they're using (the Fourier basis), the first basis vector is the average of all of the pixels, and the later vectors drill down into more and more specific information until the final vector gives you information about the difference between adjacent pixels. Since the difference between two adjacent pixels is probably fuzz, you feel fine dropping it - this is the "dropping information" part of the codec.
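To make that concrete, here's a toy sketch in Python/SciPy of the "change basis, then drop the least important detail" step, using a plain (non-lapped) 8x8 DCT; this illustrates the general technique only, not Daala's actual transform or quantization code:

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    # 2D DCT: 1D DCT along the rows, then along the columns.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

block = np.random.rand(8, 8) * 255          # stand-in for an 8x8 group of pixels
coeffs = dct2(block)

# coeffs[0, 0] is proportional to the average of the pixels (the "first basis
# vector" above); coefficients further from the top-left describe finer and
# finer detail. Zeroing the small high-frequency ones is the lossy step.
coeffs[np.abs(coeffs) < 20] = 0

approx = idct2(coeffs)                       # the block the decoder would see
```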
But what do you do with the information you can't drop? You try to predict it, and that is what this blog post is about. When we're decoding a frame, we go from left to right and from top to bottom. So when we're looking at a normal 32x32 block somewhere in the middle of the frame, we already know the blocks directly above, left above, right above, and directly left. The idea is to use what we know about those blocks to predict the one that we're decoding. This is going to work because nearby pixels in videos are probably pretty similar - for instance, in the background. (Note that you can also predict a block based on a nearby region in previous frames. The blog post doesn't talk about that, but Daala will have a way to do it.)
The author's idea for prediction is both simple and clever. Each value is going to be a weighted linear combination of old values - weight_1 * old_val_1 + weight_2 * old_val_2 + ... . Now, having decided the outline of the predictors, they're not actually choosing the predictors by hand. Instead, they've narrowed the search space enough that they can have their computers search for optimal predictors within that space. It's a cool idea, and I hope it will produce good results. One notable difference between this and other codecs is that after doing a Fourier transform, they're predicting the transformed data rather than the plain data. We don't know yet what effect that will have.
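Here's a hypothetical sketch of what that looks like when the "old values" are the transform coefficients of the neighboring blocks; the block size, neighbor layout, and weight matrix are made up for illustration, since the real weights are whatever the offline search finds:

```python
import numpy as np

BLOCK = 8
N = BLOCK * BLOCK

def predict_block(up, left, up_left, up_right, weights):
    """Predict the current block's coefficients as a weighted linear
    combination of the already-decoded neighbours' coefficients.

    up, left, up_left, up_right: 8x8 coefficient blocks of decoded neighbours.
    weights: (64, 256) matrix found by an offline search (hypothetical here).
    """
    known = np.concatenate([up.ravel(), left.ravel(),
                            up_left.ravel(), up_right.ravel()])
    return (weights @ known).reshape(BLOCK, BLOCK)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(N, 4 * N))          # placeholder weights
neighbours = [rng.normal(size=(BLOCK, BLOCK)) for _ in range(4)]
prediction = predict_block(*neighbours, W)

# The encoder would then only have to code the (hopefully small) residual:
#   residual = actual_coeffs - prediction
```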
This is still early-stage work, but it does look cool. One big thing they only briefly mentioned is error. Any video codec will produce some sort of error - places where the compressed video doesn't look like the original video. The trick is to make sure that the errors aren't things that people mind, and the only real way to test that is to show videos to people and see what they think. They're not at the point where they can do that, but it will be cool to see the results when they can. And I really really hope that performance is as good as they want - a 40% reduction over VP9 would be great for Internet video.
The human eye is more sensitive to intensity ("average RGB") than to color. You can drop most of the color info from a picture without significantly degrading its viewing quality.
One way to do that is to transform from (r,g,b) to (intensity, chroma1, chroma2) and then downsize the chroma1 and chroma2 channels by half. When you then transform back into (r,g,b), humans can hardly tell the difference. Whereas if you tried to do that with the intensity channel, the picture would look awful.
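A rough sketch of that in Python (using approximate BT.601-style weights; real pipelines differ in the exact matrices and filtering):

```python
import numpy as np

def rgb_to_ycc(rgb):
    # Split an (H, W, 3) float RGB image into intensity + two chroma channels.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b      # intensity (luma)
    cb = 0.564 * (b - y)                        # chroma 1
    cr = 0.713 * (r - y)                        # chroma 2
    return y, cb, cr

def halve(chan):
    # Downsize a chroma channel by averaging 2x2 blocks.
    h, w = chan.shape
    return chan[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rgb = np.random.rand(16, 16, 3)
y, cb, cr = rgb_to_ycc(rgb)
cb_small, cr_small = halve(cb), halve(cr)
# The decoder upsamples cb/cr back to full size before converting to RGB;
# viewers barely notice, whereas halving y the same way looks visibly soft.
```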
I think the point of this article is to show off some of the highly technical insides of Daala, so there's not much a TL;DR can do for you. The goals of Daala can be found here: https://xiph.org/daala/
It worries me that On2 has said in a private conversation that this is a dead-end. They aren't stupid there, and they've almost certainly partially explored this space before.
It certainly depends on what they tried - Loren Merritt of x264 fame looked at lapped transforms years ago as well, and concluded that they were inferior due to the infeasibility of spatial prediction. I remember he also concluded that frequency-domain prediction was inferior to spatial, probably based somewhat on how useless the frequency prediction was in MPEG-4.
Personally, the amount of blur in those images worries me more than whatever On2 might have concluded, simply because we all know what happened with wavelets. But it's just the prediction we're seeing, not the result of the transform, so time will tell.
15 years ago they were the next big revolution in image coding. Then people made codecs based on them, which never beat DCT codecs perceptually. They did well in PSNR though, since they blurred instead of blocked.
Of course, their main problem was that fine detail wound up in every decomposition band, so you had to code it multiple times and no one came up with good enough prediction to offset that. Lapped DCTs shouldn't have that issue any more than traditional DCTs.
For now I'm going to give more weight to the project that keeps making interesting progress in public. There's the old saying -- if an expert says something is possible, they're probably right; if they say something is impossible, they are probably wrong.
OTOH lapped encoding is nothing new, and it may be that once you increase the complexity of the encoder a lot of the advantage disappears. Contrariwise, it may be that with the targeted decoding complexity of a decade ago, lapped transforms were a net loss, but with the higher complexity that is allowable today it's a net win.
I agree that their progress in public is quite interesting, and completely aside from that, Monty keeps on generating high-quality, free, accessible articles and videos about various topics involved in digital signal processing that I love being able to point interested students at.
I'm kind of surprised that we've not heard more from Google about machine learning applied to video codecs. After all some people claim compression is basically AI, and other people claim that Google is basically an AI company so you'd expect some fruitful collaboration is possible. Or maybe they're just keeping quiet about it?
So this is obviously way over my head, but is it a coincidence that the Daala predictors produce results that look very similar to the high-pass filters in Photoshop?
Partly; as they say, you can't do exactly what the spatial predictors from AVC/H.264 do.
Look at the figure labeled AVC/H.264 Prediction modes (the text is part of the image, so you can't Ctrl-F for it, unfortunately). Notice that these filters don't work in frequency space, but are spatial, and just extend the neighboring pixels into the block as stripes; that will tend to be smooth on the upper and/or left edges, and blocky on the lower and right edges.
The Daala predictors happen in frequency space, after the lapped transform, so you end up with something that is not necessarily a block of stripes, and you also get more smoothness at the lower and right boundaries.
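For a concrete picture of the "stripes" mentioned above, here's a toy version of a few AVC/H.264-style spatial modes in Python; real intra prediction has more modes and edge handling, and Daala's frequency-domain predictors work quite differently:

```python
import numpy as np

def predict_vertical(above):
    # Vertical mode: extend the row of pixels above straight down the block.
    return np.tile(above, (len(above), 1))

def predict_horizontal(left):
    # Horizontal mode: extend the column of pixels on the left across the block.
    return np.tile(left[:, None], (1, len(left)))

def predict_dc(above, left):
    # DC mode: fill the whole block with the average of the neighbours.
    n = len(above)
    return np.full((n, n), (above.sum() + left.sum()) / (2 * n))

above = np.array([10., 20., 30., 40.])   # pixels just above a 4x4 block
left  = np.array([10., 12., 14., 16.])   # pixels just to its left
print(predict_vertical(above))           # identical rows -> vertical stripes
```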