
I've always wondered if better multi-core performance can come from processing different keyframe segments separately.

IIUC all current encoders that support parallelism work by having multiple threads work on the same frame at the same time. Often the frame is split into regions and each thread focuses on a specific region of the frame. This approach can have a (usually small) quality/efficiency cost and requires per-encoder logic to assemble those regions into a single frame.

What if, instead or additionally, different keyframe segments were processed independently? So if keyframes are every 60 frames, ffmpeg would read 60 frames and pass them to the first thread, the next 60 to the next thread, and so on, then assemble the results basically by concatenating them. It seems like this could be used to parallelize any codec in a fairly generic way, and it should be more efficient, as there is no thread-communication overhead and no splitting of the frame into regions, which harms cross-region compression.

Off the top of my head I can only think of two issues:

1. It requires loading N * (keyframe period) frames into memory, as well as the overhead memory for encoding N frames at once.

2. Variable keyframe intervals would require special support, as the split points need to be identified before the video is handed to the encoding threads. This may require extra work upfront.

But both of these seem like they won't be an issue in many cases. Lots of the time I'd be happy to use tons of RAM and output with a fixed keyframe interval.

Probably I would combine this with intra-frame parallelization, e.g. process every frame with 4 threads and run 8 keyframe segments in parallel. This way I get really good parallelism but only the minor quality loss of 4 regions, rather than splitting each frame into 32 regions, which would harm quality more.
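For offline use you can get most of the way there today with stock ffmpeg and a bit of shell, no encoder changes needed. Rough sketch only: the filenames, chunk length and encoder settings below are made up for illustration, and audio is dropped to keep it short.

    # 1) Split at keyframes without re-encoding. When stream copying,
    #    the segment muxer only cuts on keyframes, so every chunk
    #    starts with an I-frame and can be encoded independently.
    ffmpeg -i input.mp4 -map 0:v -c copy -f segment -segment_time 10 \
        -reset_timestamps 1 chunk_%04d.mp4

    # 2) Encode the chunks in parallel, one ffmpeg process per core.
    ls chunk_*.mp4 | xargs -P "$(nproc)" -I{} \
        ffmpeg -nostdin -i {} -c:v libx264 -preset slow -crf 20 enc_{}

    # 3) Stitch the encoded chunks back together in order.
    for f in enc_chunk_*.mp4; do echo "file '$f'"; done > list.txt
    ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4

This is roughly what Av1an (mentioned below) automates, with scene detection instead of a fixed segment length.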




This definitely happens. This is how videos uploaded to Facebook or YouTube become available so quickly. The video is split into chunks based on key frame, the chunks are farmed out to a cluster of servers and encoded in parallel, and the outputs are then re-assembled into the final file.


I know next to nothing about video encoders, and in my naive mind I absolutely thought that parallelism would work just like you suggested it should. It sounds absolutely wild to me that they're splitting single frames into multiple segments. Merging work from different threads for every single frame sounds wasteful somehow. But I guess it works, if that's how everybody does it. TIL!


Most people concerned about encoding performance are doing livestreaming and so they can't accept any additional latency. Splitting a frame into independent segments (called "slices") doesn't add latency / can even reduce it, and it recovers from data corruption a bit better, so that's usually done at the cost of some compression efficiency.
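Roughly, with libx264 through ffmpeg (the option names are x264's; the values are just for illustration):

    # Sliced threading: each frame is cut into 4 slices that are
    # encoded in parallel, so no extra frames of delay are added.
    ffmpeg -i input.mp4 -c:v libx264 -tune zerolatency \
        -x264-params sliced-threads=1:slices=4 out_slices.mp4

    # Default frame threading: several frames are in flight at once.
    # Better compression, but it buffers frames and adds latency.
    ffmpeg -i input.mp4 -c:v libx264 -threads 4 out_frames.mp4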


> Most people concerned about encoding performance are doing livestreaming

What makes you think that? I very much care about encoding performance (for a fixed quality level) for offline use.


Your idea also doesn't work with live streaming, and may also not work with inter-frame filters (depending on the implementation). Nonetheless, this already exists with those limitations: Av1an and, I believe, VapourSynth work more or less the way you describe, except you don't actually need to load every chunk into memory, only the current frames. As I understand it, this isn't a major priority for mainstream encoding pipelines because GOP/chunk threading isn't massively better than intra-frame threading.


It can work with live streaming, you just need to add N keyframes of latency. With low-latency livestreaming, keyframes are often close together anyway, so adding say 4s of latency to get 4x encoding speed may be a good tradeoff.


Well, you don't add 4s of latency for 4x encoding speed though. You add 4s of latency for a very marginal quality/efficiency improvement and a significant encoder simplification, because the baseline is current frame-parallel encoders, not sequential encoders.

Plus, computers aren't quad cores any more; people with powerful streaming rigs probably have 8 or 16 cores, and key frames aren't every second. Suddenly you're in this hellish world where you have to balance latency, CPU utilization and encoding efficiency. 16 cores at a not-so-great 8 seconds of extra latency means terrible efficiency, with a key frame every 0.5 seconds. 16 cores at good efficiency (say, 4 seconds between key frames) means a terrible 64 seconds of extra latency.
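The back-of-the-envelope rule behind those numbers (my own framing, not from any encoder documentation):

    extra latency ≈ (parallel chunks) × (keyframe interval)
    16 × 0.5 s = 8 s        16 × 4 s = 64 s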


You can pry vp8 out of my cold dead hands. I'm sorry, but if it takes more than 200ms including network latency it is too slow, and video encoding is extremely CPU intensive, so exploding your cloud bill is easy.


4s of latency is not acceptable for applications like live chat


As I said, "may be". "Live" varies hugely with different use cases. Sporting events are often broadcast live with 10s of seconds of latency. But yes, if you are talking to a chat in real-time a few seconds can make a huge difference.


Actually, not only does it work with live streaming, it's not an uncommon approach in a number of live streaming implementations*. To be clear, I'm not talking about low latency stuff like interactive chat, but e.g. live sports.

It's one of several reasons why live streams of this type are often 10-30 seconds behind live.

* Of course it also depends on where in the pipeline they hook in - some take the feed directly, in which case every frame is essentially a key frame.


> except you don't actually need to load every chunk into memory, only the current frames.

That's a good point. In the general case of reading from a pipe you need to buffer it somewhere. But for file-based inputs the buffering concerns aren't relevant, just the working memory.


Video codecs often encode the delta from the previous frame, and because this delta is often small, it's efficient to do it this way. If each thread needed to process frames separately, you would need to make significant changes to the codec, and I hypothesize it would make the video stream larger.


The parent comment referred to "keyframes" rather than just "frames". Keyframes, unlike normal frames, encode the full image. That is done so that if the "delta" you mentioned gets dropped from the stream, you don't end up with strange artifacts in the video output forever. Keyframes are where the codec gets to press "reset".


> That is done so that if the "delta" you mentioned gets dropped from the stream, you don't end up with strange artifacts in the video output forever.

Also to be able to seek anywhere in the stream without decoding all previous frames.
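Those seek points are exactly the keyframes; something like this (an illustrative ffprobe invocation) lists their timestamps:

    # print the timestamp of every keyframe in the video stream;
    # these are the positions a player can jump to directly
    ffprobe -v error -select_streams v:0 -skip_frame nokey \
        -show_entries frame=pts_time -of csv=p=0 input.mp4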


Oh right. For non-realtime use, if you're not I/O bound, this is better. Though I'd wonder how portable the codec code itself would be.


The encoder has a lot of freedom in how it arrives at the encoded data.


Isn't that delta partially based on the last keyframe? I guess it would be codec dependent, but my understanding is that keyframes are like a synchronization mechanism where the decoder catches up to where it should be in time.


Yes, key frames are fully encoded, and some delta frames are based on the previous frame (which could be a keyframe or another delta frame). Some delta frames (b-frames) can be based on the next frame instead of the previous one. That's why a visual glitch can sometimes mess up the image until the next key frame.

I'd assume if each thread is working on its own key frame, it would be difficult to make b-frames work? Live content also probably makes it hard.
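You can see that structure for yourself; an illustrative ffprobe call that prints the type of each frame in decode order:

    # I = keyframe, P = predicted from earlier frames,
    # B = may also reference later (in display order) frames
    ffprobe -v error -select_streams v:0 \
        -show_entries frame=pict_type -of csv=p=0 input.mp4

As long as each chunk is a closed GOP, b-frames only reference frames inside their own chunk, so per-chunk encoding doesn't break them; live content is the harder part.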


In most codecs the entropy coder resets at each frame (or slice), so there is enough independence that you can do multithreaded decoding. ffmpeg has frame-based and slice-based threading for this.

It also has a lossless codec ffv1 where the entropy coder doesn't reset, so it truly can't be multithreaded.
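On the decode side the two modes can be selected explicitly; illustrative invocations (-f null - just throws the decoded output away):

    # frame-parallel decoding: several frames in flight at once
    ffmpeg -threads 8 -thread_type frame -i input.mp4 -f null -

    # slice-parallel decoding: threads share the slices of one frame
    ffmpeg -threads 8 -thread_type slice -i input.mp4 -f null -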


There's already software that does this: Av1an (https://github.com/master-of-zen/Av1an). Encoding this way should indeed improve quality slightly. Whether that is actually noticeable/measurable... I'm not sure.


I've messed around with Av1an. Keep in mind that the software used for chunking, L-SMASH, is only documented in Japanese [1], but it does the trick pretty well as long as you're not working with huge resolutions like HD VR, where the video dimensions can do things like crash QuickTime on a Mac.

[1] http://l-smash.github.io/l-smash/


ffmpeg and x265 let you do this too: frame-threads=1 makes x265 encode one frame at a time (addressing the issue the OP mentioned) without a big performance penalty, in contrast to the 'pools' switch, which sets the number of threads used for encoding.
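For example, via ffmpeg's libx265 wrapper (the values are just an example):

    # one frame in flight at a time, with an 8-thread pool
    # working inside that frame
    ffmpeg -i input.mp4 -c:v libx265 \
        -x265-params frame-threads=1:pools=8 out.mp4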


IIUC - International Islamic University Chittagong?


IIUC - If I understand correctly.


If I Understand Correctly



