Does anyone do that? The average developer likely would not think to do this because it is too computationally intensive to splice things into A/V streams on the fly.
A more clever developer could splice the ad into the video at an I frame, but then the ad's length needs to be a multiple of the GOP length, i.e. an I frame plus the frames that follow it before the next I frame. This also messes with the metadata on the length of the video, which would need to be adjusted in advance. It is doable, but you give up flexibility and your HTTP sessions cease to be stateless. Then there is the need to handle splicing into the audio, and I do not know offhand whether there is a cheap way of doing that at the server like you can do with video through I frame splicing.
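For what it's worth, finding the legal splice points is the cheap part: you can just list the keyframe timestamps. A minimal sketch in Python, assuming ffprobe is on PATH, the input is a typical H.264 MP4, and the frame field is named pts_time the way recent ffmpeg builds report it (the file name and break time below are made up):

    import subprocess

    def keyframe_times(path: str) -> list[float]:
        """List I-frame timestamps (seconds) by asking ffprobe to decode only keyframes."""
        out = subprocess.run(
            [
                "ffprobe", "-v", "error",
                "-select_streams", "v:0",
                "-skip_frame", "nokey",            # report keyframes only
                "-show_entries", "frame=pts_time", # older builds call this pkt_pts_time
                "-of", "csv=p=0",
                path,
            ],
            capture_output=True, text=True, check=True,
        ).stdout
        times = []
        for line in out.splitlines():
            try:
                times.append(float(line.strip().strip(",")))
            except ValueError:
                continue  # skip blank lines or "N/A"
        return times

    def nearest_splice_point(times: list[float], target: float) -> float:
        """Closest I frame to the ad-break time the server actually wants."""
        return min(times, key=lambda t: abs(t - target))

    if __name__ == "__main__":
        times = keyframe_times("input.mp4")       # hypothetical upload
        print(nearest_splice_point(times, 30.0))  # splice point near 30s

The expensive part is everything after that: re-muxing per viewer, fixing timestamps and length metadata, and keeping the audio in sync, which is the point above.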
It seems to me that they have lower server costs by doing things the current way.
SSAI (server-side ad insertion) is not uncommon for premium streaming video; Twitch and Hulu have had the technology in use for years. It's also practically just a checkbox to enable on all major ad-serving platforms, including Google's DoubleClick.
They're not using it simply because it increases server and bandwidth costs. YouTube is still positioned as part of Google's "moat": it drives video ad prices down so no one else can build an ad empire off video, rather than operating as a profit-generating division in its own right.
YouTube does its own re-encoding on upload to the different quality levels, so it could theoretically hook into that step, make sure suitable splice points exist, and record them in the metadata.
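To make that concrete, here is a rough sketch (not anything YouTube actually exposes) of how recorded splice points could be used. If splice points line up with segment boundaries, the server only has to stitch an ad into the playlist it hands out, wrapped in discontinuity tags so the player tolerates the separately encoded ad. Segment names and durations are invented for illustration:

    # Sketch: stitch ad segments into an HLS-style media playlist at a splice
    # point recorded during the upload transcode. All URIs/durations are made up.

    CONTENT_SEGMENTS = [
        ("video_000.ts", 4.0),
        ("video_001.ts", 4.0),
        ("video_002.ts", 4.0),
        ("video_003.ts", 4.0),
    ]
    AD_SEGMENTS = [("ad_000.ts", 4.0), ("ad_001.ts", 4.0)]
    SPLICE_AFTER_INDEX = 1  # "metadata": insert the ad break after segment 1

    def build_playlist() -> str:
        lines = [
            "#EXTM3U",
            "#EXT-X-VERSION:3",
            "#EXT-X-TARGETDURATION:4",
            "#EXT-X-MEDIA-SEQUENCE:0",
        ]
        for i, (uri, dur) in enumerate(CONTENT_SEGMENTS):
            lines += [f"#EXTINF:{dur:.1f},", uri]
            if i == SPLICE_AFTER_INDEX:
                # Discontinuity tags warn the player that timestamps and codec
                # parameters reset here, which is how HLS-style SSAI splices in
                # separately encoded ads without touching the content segments.
                lines.append("#EXT-X-DISCONTINUITY")
                for ad_uri, ad_dur in AD_SEGMENTS:
                    lines += [f"#EXTINF:{ad_dur:.1f},", ad_uri]
                lines.append("#EXT-X-DISCONTINUITY")
        lines.append("#EXT-X-ENDLIST")
        return "\n".join(lines)

    if __name__ == "__main__":
        print(build_playlist())

The per-viewer work then reduces to playlist generation, which is cheap compared to re-encoding the stream.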
> your HTTP sessions cease to be stateless
There's already pretty heavy magic around preventing people from simply grabbing all the HLS segments, I think? Hence all the work that yt-dlp does.
YouTube videos are a stream, not a file you download. I'm not sure what the major technical hurdle is in injecting ads directly into the stream. Also, H.264 typically has key frames only a few seconds apart anyway.
I think it should be sufficient to create content identifiers for all unitary parts of the video, e.g. the parts between keyframes, and skip over the ones that are not supposed to be there.
These identifiers could be collected automatically by plugins like SponsorBlock in a community effort and then combined to identify the parts that are common to every viewer, i.e. the ones representing the original video content.
In other words, it seems to me that even putting ads directly into a video stream would not prevent people from being able to block these ads.
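A minimal sketch of that idea, assuming the video arrives as discrete chunks and that a crowd-sourced database of "known original" segment fingerprints exists (both assumptions; a real system would probably need perceptual hashes rather than exact ones, since per-viewer re-encodes would change the bytes):

    import hashlib

    def segment_fingerprint(data: bytes) -> str:
        """Content identifier for one inter-keyframe chunk (plain SHA-256 here)."""
        return hashlib.sha256(data).hexdigest()

    def filter_ads(segments: list[bytes], known_original: set[str]) -> list[bytes]:
        """Keep only chunks whose fingerprint shows up in every viewer's copy;
        spliced-in ads would not match and get skipped."""
        return [seg for seg in segments if segment_fingerprint(seg) in known_original]

    if __name__ == "__main__":
        # Toy data standing in for downloaded media chunks.
        original = [b"chunk-A", b"chunk-B", b"chunk-C"]
        known = {segment_fingerprint(s) for s in original}
        received = [b"chunk-A", b"AD-PAYLOAD", b"chunk-B", b"chunk-C"]
        cleaned = filter_ads(received, known)
        assert cleaned == original
        print(f"kept {len(cleaned)} of {len(received)} segments")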