Now? OK, you need to screencap and upload to an LLM, but that's well-established tech by now. (Where by "well-established", I mean at least 9 months old ;)
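For the record, the whole screencap-to-LLM flow is a few lines against any vision-capable API. A minimal sketch using the OpenAI Python SDK; the model name, prompt, and file path are mine, purely illustrative:

```python
# Hedged sketch of the "screencap and upload to an LLM" flow.
# Assumes OPENAI_API_KEY is set and screenshot.png exists.
import base64
from openai import OpenAI

client = OpenAI()

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this page say?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```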
Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.
Video chat is partially there, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years out.
Yeah, all these things are possible today, but getting them well polished and integrated is another story. Imagine all this being supported by "HTML6" lol. When Apple gets around to making this part of Safari, then we know it's ready.
But kidding aside: I'm not sure people want this supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data.)
The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.
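To make "machine readability" concrete: with Microdata in place, even a dumb stdlib parser recovers structured facts deterministically, no OCR or LLM in the loop. A minimal sketch; the schema.org Event markup and the scraper are mine, purely illustrative:

```python
# With itemprop annotations present, extracting structured data is trivial.
from html.parser import HTMLParser

ANNOTATED = """
<div itemscope itemtype="https://schema.org/Event">
  <span itemprop="name">PyCon 2026</span>
  <time itemprop="startDate" datetime="2026-05-13">May 13</time>
</div>
"""

class MicrodataScraper(HTMLParser):
    """Collects (itemprop, value) pairs from Microdata-annotated HTML."""
    def __init__(self):
        super().__init__()
        self._prop = None  # itemprop we're currently inside
        self.items = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]
            # Prefer the machine-readable attribute when one is present.
            if "datetime" in attrs:
                self.items[self._prop] = attrs["datetime"]
                self._prop = None

    def handle_data(self, data):
        if self._prop and data.strip():
            self.items[self._prop] = data.strip()
            self._prop = None

scraper = MicrodataScraper()
scraper.feed(ANNOTATED)
print(scraper.items)  # {'name': 'PyCon 2026', 'startDate': '2026-05-13'}
```

Strip the annotations and you're back to guessing what "May 13" belongs to, which is exactly the OCR-hoops situation above.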
So the gating factor is probably not Apple & Safari & "HTML6" (shudder!).
If I venture my best bet on what's preventing polished integration: it's really hard to do with foundation models alone, and the number of people who want deep & well-informed conversations badly enough to pay for a polished app that delivers them is too low for this to be the hot VC space. (Yet?)
Crystal ball: some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will pick up those ideas while they're hot and polish them in a startup. So, 18-36 months for an integrated experience from here?