
Can someone give a summary or TL;DR for us less technical types? This looks very promising.



TL;DR: Firstly, I highly recommend that you read the entire article (starting with the first part https://people.xiph.org/~xiphmont/demo/daala/demo1.shtml ), because it's a very good introduction to some of the technical details of how Daala is discarding the received wisdom of the last 20+ years of video encoding in the hope of escaping the local maximum we currently find ourselves in.


First, let me say that I am not an expert on video codecs - I've just worked with VP9 a little. However, here is what I got from that.

Video codecs overall use two tricks to reduce video sizes: first, they drop unnecessary information; second, they predict the information that's left. (I'm not going to write about why prediction reduces size here, but look up arithmetic coding in Wikipedia if you want to know.)

If you look at a raw video, dropping information might seem hard - it's a bunch of pixels, and you don't really want to drop a pixel. Instead, you transform the video into a representation where some of the information seems less important. Daala, like VP9, uses a close relative of the Fourier transform, the discrete cosine transform (DCT). That means it writes a group of pixels (in this case, either a 4x4, 8x8, 16x16, or 32x32 block) in a different basis than normal (look up "change of basis" for more information here).
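To make the "change of basis" concrete, here is a toy sketch in Python/NumPy (not Daala's actual code): build an orthonormal DCT-II matrix and use it to rewrite an 8x8 block of made-up pixel values in the new basis. The block here is a smooth gradient, standing in for a patch of background.

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis: row k is the k-th basis vector.
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
        m[0, :] = np.sqrt(1.0 / n)  # DC row: a scaled average of all samples
        return m

    n = 8
    D = dct_matrix(n)
    # A smooth 8x8 block of pixel values (a gradient, like a patch of background).
    block = np.add.outer(np.arange(n, dtype=float), np.arange(n, dtype=float)) * 8.0 + 64.0

    coeffs = D @ block @ D.T                     # the same block, written in the DCT basis
    assert np.allclose(D.T @ coeffs @ D, block)  # the change of basis loses nothing by itself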

The ordinary basis has a vector that corresponds to each pixel (or really, each color in each pixel, but I'm not getting into that). In the basis they're using (the Fourier basis), the first basis vector is the average of all of the pixels, and the later vectors drill down into more and more specific information until the final vector gives you information about the difference between adjacent pixels. Since the difference between two adjacent pixels is probably fuzz, you feel fine dropping it - this is the "dropping information" part of the codec.
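Continuing the toy sketch above (again, an illustration, not literally what Daala does): zero out the high-frequency three quarters of the coefficients and invert the transform. For smooth content the reconstruction error is small compared with the 0-255 pixel range, which is why dropping that information feels safe.

    kept = coeffs.copy()
    kept[n // 2:, :] = 0.0   # drop high vertical frequencies
    kept[:, n // 2:] = 0.0   # drop high horizontal frequencies

    reconstructed = D.T @ kept @ D
    print("worst per-pixel error:", np.abs(reconstructed - block).max())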

But what do you do with the information you can't drop? You try to predict it, and that is what this blog post is about. When we're decoding a frame, we go from left to right and from top to bottom. So when we're looking at a normal 32x32 block somewhere in the middle of the frame, we already know the blocks directly above, left above, right above, and directly left. The idea is to use what we know about those blocks to predict the one that we're decoding. This is going to work because nearby pixels in videos are probably pretty similar - for instance, in the background. (Note that you can also predict a block based on a nearby region in previous frames. The blog post doesn't talk about that, but Daala will have a way to do it.)
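As a toy illustration of predicting from already-decoded neighbors (this is plain "DC prediction" in the pixel domain, the simplest scheme from older codecs, not Daala's method): predict every pixel of the current block as the average of the row above and the column to the left, so that only the residual has to be coded.

    import numpy as np

    def dc_predict(above, left):
        # Predict the whole block as the mean of the already-decoded
        # pixels just above and just left of it.
        dc = np.concatenate([above, left]).mean()
        return np.full((left.size, above.size), dc)

    rng = np.random.default_rng(1)
    # A flat-ish background around value 120, with a little noise.
    above = 120.0 + rng.normal(0, 2, size=16)
    left = 120.0 + rng.normal(0, 2, size=16)
    current = 120.0 + rng.normal(0, 2, size=(16, 16))

    residual = current - dc_predict(above, left)
    # The residual carries far less energy than the raw block, so it is cheaper to code.
    print("residual energy:", np.square(residual).sum(),
          "raw energy:", np.square(current).sum())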

The author's idea for prediction is both simple and clever. Each value is going to be a weighted linear combination of old values - weight_1 * old_val_1 + weight_2 * old_val_2 + ... . Now, having decided on the general form of the predictors, they don't actually choose the weights by hand. Instead, they've narrowed the search space enough that they can have their computers search for optimal predictors within that space. It's a cool idea, and I hope it will produce good results. One notable difference between them and other codecs is that after doing the transform, they're predicting the transformed (frequency-domain) data, rather than the plain pixel data. We don't know yet what effect that will have.
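A rough sketch of what "let the computer find the weights" could look like (this is ordinary least squares on synthetic data, only a stand-in for Daala's actual training procedure): gather many examples of (neighboring values, value to predict) and solve for the weights that minimize the squared prediction error.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic training data: each row of X holds values from already-decoded
    # neighbors, and y holds the value we want to predict from them.
    num_examples, num_neighbors = 10_000, 12
    X = rng.normal(size=(num_examples, num_neighbors))
    hidden_weights = rng.normal(size=num_neighbors)
    y = X @ hidden_weights + rng.normal(scale=0.1, size=num_examples)

    # Least squares: the weights minimizing ||X @ w - y||^2.
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Prediction is then weight_1 * old_val_1 + weight_2 * old_val_2 + ...
    print("mean prediction error:", np.abs(X @ weights - y).mean())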

This is still early-stage work, but it does look cool. One big thing they only briefly mentioned is error. Any video codec will produce some sort of error - places where the compressed video doesn't look like the original video. The trick is to make sure that the errors aren't things that people mind, and the only real way to test that is to show videos to people and see what they think. They're not at the point where they can do that, but it will be cool to see the results when they can. And I really really hope that performance is as good as they want - a 40% reduction over VP9 would be great for Internet video.


The human eye is more sensitive to intensity (roughly a weighted average of the R, G, and B values) than to color. You can drop most of the color info from a picture without significantly degrading its viewing quality.

One way to do that is to transform from (r,g,b) to (intensity, chroma1, chroma2) and then downsample the chroma1 and chroma2 channels by half. When you then transform back into (r,g,b), humans can hardly tell the difference. Whereas if you tried to do that with the intensity channel, the picture would look awful.
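A small sketch of that idea (using the standard BT.601 RGB-to-YCbCr conversion and naive 2x2 averaging; real encoders use better filters, and 4:2:0 is only one of several subsampling patterns):

    import numpy as np

    def rgb_to_ycbcr(rgb):
        # BT.601 full-range conversion: Y is intensity, Cb/Cr are the two chroma channels.
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b
        cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
        cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
        return y, cb, cr

    def subsample_by_2(channel):
        # Average each 2x2 block, keeping a quarter of the samples (4:2:0 style).
        h, w = channel.shape
        return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    rng = np.random.default_rng(3)
    image = rng.uniform(0, 255, size=(64, 64, 3))

    y, cb, cr = rgb_to_ycbcr(image)
    cb_small, cr_small = subsample_by_2(cb), subsample_by_2(cr)

    # Luma keeps full resolution; both chroma channels together now take
    # half as many samples as luma, instead of twice as many.
    print(y.size, cb_small.size + cr_small.size)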



Chroma subsampling is not the only way to achieve compression, and noahl's description applies equally to monochromatic video.


I think the point of this article is to show off some of the highly technical insides of Daala, so there's not much a TL;DR can do for you. The goals of Daala can be found here: https://xiph.org/daala/



