
If you can predict with high enough accuracy what resource is going to be requested by the client next, I don't see why pushing it would be a bad idea. Speculation is how we hide latency after all.

And if you think about it, static pushes in general have very limited usefulness, almost non-existent. Imagine a URL that becomes popular, where almost all of the requests come from people who have never visited the website before. It would make sense for the web server to learn what resources clients request along with that URL and start pushing those resources to people ahead of time.



For that it's easier to parse the pushed content. If it's HTML, then catch stylesheets, JS, and some other static <img src=.../> things. It doesn't have to be flawless; after all, it's just a speed-up. (And if you want a speed-up, write nice markup.)
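
As a rough illustration (a naive sketch, not anything Nginx actually ships; extractPushCandidates and the regexes are made up for this comment), a backend or proxy could scrape push candidates out of the markup like this:

    // Naive sketch: scan rendered HTML for same-origin stylesheets, scripts
    // and images worth pushing. The regexes are deliberately crude; missing
    // an asset only costs a push, it doesn't break the page.
    function extractPushCandidates(html: string): string[] {
      const patterns = [
        /<link[^>]+rel=["']stylesheet["'][^>]*href=["']([^"']+)["']/gi,
        /<script[^>]+src=["']([^"']+)["']/gi,
        /<img[^>]+src=["']([^"']+)["']/gi,
      ];
      const candidates = new Set<string>();
      for (const re of patterns) {
        for (const match of html.matchAll(re)) {
          const url = match[1];
          // Only consider same-origin, path-absolute assets.
          if (url.startsWith("/") && !url.startsWith("//")) candidates.add(url);
        }
      }
      return [...candidates];
    }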

Similarly, it should be the backend behind the reverse proxy that knows which page has just been rendered and knows about the user's session (is it brand new, or maybe not new but old enough that things still need to be pushed because that particular page's background has changed since, etc.).

And in the case of an Angular/React/SPA thing, the "bundler/compiler" should create a list of things to push for the various URLs. Or the Angular/React team should talk with the Nginx team to figure out how to speed things up. (In the case of SSR - server-side rendering - the NodeJS server can emit the necessary Link headers, for example.)
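
A minimal sketch of that last idea, assuming an Express-style Node server and a made-up per-route manifest (pushManifest and renderApp are hypothetical; Nginx's http2_push_preload directive can then turn the preload hints into actual pushes):

    import express from "express";

    // Hypothetical per-route manifest produced by the bundler: which static
    // assets each server-rendered page needs.
    const pushManifest: Record<string, string[]> = {
      "/": ["/static/app.css", "/static/vendor.js", "/static/app.js"],
      "/login": ["/static/app.css", "/static/login.js"],
    };

    // Stand-in for the framework's SSR renderer.
    function renderApp(path: string): string {
      return `<!doctype html><html><body>rendered ${path}</body></html>`;
    }

    const app = express();

    app.get("*", (req, res) => {
      // The reverse proxy (e.g. Nginx with "http2_push_preload on;") turns
      // these preload hints into HTTP/2 pushes.
      for (const asset of pushManifest[req.path] ?? []) {
        const kind = asset.endsWith(".css") ? "style" : "script";
        res.append("Link", `<${asset}>; rel=preload; as=${kind}`);
      }
      res.send(renderApp(req.path));
    });

    app.listen(3000);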


Come on, how is parsing things easier than gathering some very basic stats?


Gathering stats requires keeping them somewhere. Making inferences. Documenting the inference engine. Explaining the magic to users. Sounds a lot more complicated than explaining which HTML tags will be parsed.

Proxies are already complicated as is. Caching proxies more so. (Think of how Varnish has a - probably Turing complete - DSL to decide what to serve and/or cache and when, and how.)


Parsing HTML content won't get you the full benefit an inference engine would. An inference engine could easily learn that 90% of the users hitting your landing page are going to log in & end up on their home screen, so it would push the static resources for the home screen too. Similarly, it might know that it already pushed those resources earlier in the session & only push the new static resources that are unique to you once you log in (saving the round-trip of the client nacking the resource). Doing it via stateless HTML parsing is never going to work because you have no idea of the state of the session. That doesn't mean there's not a place for a mixture of approaches (& yes, you could teach the HTML parsing about historical pushes, but then you get back to the concern you raised about storing that data somewhere).
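
The session-aware part is mostly just bookkeeping. Something like this (all names hypothetical, just to show the shape of it):

    // Hypothetical bookkeeping: track what a session has already been pushed,
    // so a page further into the flow only pushes assets it hasn't seen yet.
    const pushedInSession = new Map<string, Set<string>>();

    function assetsToPush(sessionId: string, predicted: string[]): string[] {
      const seen = pushedInSession.get(sessionId) ?? new Set<string>();
      const fresh = predicted.filter((asset) => !seen.has(asset));
      fresh.forEach((asset) => seen.add(asset));
      pushedInSession.set(sessionId, seen);
      return fresh;
    }

    // The landing page can predict the home-screen bundle too, because
    // historically most visitors log in right after:
    // assetsToPush("session-abc", ["/static/app.css", "/static/home.js"]);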

The HTML parsing approach probably gets you 80% of the benefit for 20% of the effort on small-scale websites (i.e. the majority). A super accurate inference engine might use deep learning to learn what to serve on a very personalized level if you have a lot of users & the CPU/latency trade-off makes sense for your business model (i.e. more accuracy for a larger slice of your population). A less accurate one might just collect statistics in a DB & make cheap, less accurate guesses from that (or use more "classic ML" like Bayes) if you have a medium number of users, or the CPU usage makes more sense and you're OK with the maintenance burden of a DB. It's a sliding scale of trade-offs IMO, with different approaches making sense depending on your priorities.


Yes, I agree that of course a hypothetical ML/AI outperforms any naive and simple solution. But usually magic technology is required to do that; otherwise it wouldn't be magic :)

That said, a simple heuristic like "after serving a URL, the server got these requests on the same HTTP/2 connection in less than 1 second, and those were static assets served with Expires headers" could work.
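
Roughly this (a sketch under those assumptions; the structures and the 80%/50-sample thresholds are made up):

    // Remember, per page URL, which cacheable static assets were requested on
    // the same HTTP/2 connection within 1 second, and promote frequent ones
    // to that page's push list.
    interface PageVisit { pageUrl: string; pageTime: number; }

    const lastPageByConn = new Map<string, PageVisit>();
    const followUpCounts = new Map<string, Map<string, number>>(); // page -> asset -> hits
    const pageHits = new Map<string, number>();

    function recordRequest(connId: string, url: string, now: number, hasExpires: boolean) {
      const isStaticAsset = /\.(css|js|png|jpg|svg|woff2?)$/.test(url);
      const prev = lastPageByConn.get(connId);
      if (prev && isStaticAsset && hasExpires && now - prev.pageTime < 1000) {
        const counts = followUpCounts.get(prev.pageUrl) ?? new Map<string, number>();
        counts.set(url, (counts.get(url) ?? 0) + 1);
        followUpCounts.set(prev.pageUrl, counts);
      }
      if (!isStaticAsset) {
        lastPageByConn.set(connId, { pageUrl: url, pageTime: now });
        pageHits.set(url, (pageHits.get(url) ?? 0) + 1);
      }
    }

    // Push an asset once, say, 80% of visits to a page requested it right
    // after, and only once enough samples have been seen.
    function pushListFor(pageUrl: string): string[] {
      const hits = pageHits.get(pageUrl) ?? 0;
      const counts = followUpCounts.get(pageUrl);
      if (!counts || hits < 50) return [];
      return [...counts].filter(([, n]) => n / hits >= 0.8).map(([asset]) => asset);
    }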


Yes, like I said, there's a sliding scale of effort/reward & HTML parsing is at the extreme of one end.



