
It's not useful in those cases, but for JS-rendered sites you can usually replicate the AJAX requests the page makes and get well-formed JSON documents to parse instead.
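A rough sketch of what that can look like, using only the standard library. The endpoint URL and the payload shape (an "items" list with "title" fields) are made up for illustration; the real ones come from whatever shows up in the browser's network tab.

```python
import json
import urllib.request


def fetch_json(url, timeout=10.0):
    """GET `url` and decode the response body as JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Pull titles out of the payload (assumed shape: {"items": [{"title": ...}]})."""
    return [item["title"] for item in payload.get("items", [])]


# Usage, with a hypothetical endpoint copied from the network tab:
#   payload = fetch_json("https://example.com/api/items?page=1")
#   print(extract_titles(payload))
```

Keeping the fetch and the parsing separate means that when the endpoint changes, only one of the two usually needs touching.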


Or the data is stored in js objects within script tags in the html and can be extracted programmatically. It's getting common with SSG sites using SPA frameworks.

For example, the new Google Play Store website stores the data in AF_initDataCallback calls, and it can be extracted with re.findall(r"<script nonce=\"\S+\">AF_initDataCallback\((.*?)\);", html_string).
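Here is a minimal sketch of that regex in use. The sample HTML is invented, and the second step assumes the callback argument is a JS object literal with a JSON-compatible array after "data:" followed by a "sideChannel" key; check that against the actual page before relying on it. The re.DOTALL flag is an addition so that payloads spanning several lines still match.

```python
import json
import re

# Pattern from the comment above, plus re.DOTALL (an addition, see lead-in).
CALLBACK_RE = re.compile(
    r"<script nonce=\"\S+\">AF_initDataCallback\((.*?)\);", re.DOTALL
)
# Assumption: the argument looks like {key: ..., data: [...], sideChannel: {}}.
DATA_RE = re.compile(r"data:\s*(\[.*\])\s*,\s*sideChannel", re.DOTALL)


def extract_callbacks(html_string):
    """Return the raw argument text of every AF_initDataCallback call."""
    return CALLBACK_RE.findall(html_string)


def extract_data(callback_arg):
    """Parse the `data` array out of one callback argument, or None if absent."""
    match = DATA_RE.search(callback_arg)
    return json.loads(match.group(1)) if match else None


# Invented sample page, not real Play Store output.
SAMPLE_HTML = (
    '<script nonce="abc123">AF_initDataCallback('
    "{key: 'ds:1', data: [\"Example App\", [4.5, 1200]], sideChannel: {}}"
    ");</script>"
)

callbacks = extract_callbacks(SAMPLE_HTML)
parsed = [extract_data(arg) for arg in callbacks]
```

The lazy (.*?) stops at the first ");" it sees, so a payload that contains that sequence inside a string would be cut short.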


I used to do that when I was responsible for a set of web crawlers to extract public records data, but the problem is that changes happen and these sorts of things become out of date fairly quickly.

Getting this working in a headless browser driven by Selenium would probably be easier for maintainability.


Nowadays you usually have to submit HTTP headers and cookies too; that's always a fun process of elimination.
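That process of elimination can be automated, at least roughly: start from the full set of headers copied from the browser (cookies go along as the Cookie header) and drop them one at a time, keeping only the ones the server actually insists on. The function names and the fake server below are hypothetical, and a 200 status is only a stand-in for "the response looks right".

```python
import urllib.request


def minimal_headers(headers, request_ok):
    """Drop headers one by one, keeping those whose removal breaks the request."""
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if request_ok(trial):
            required = trial  # the request still works without this header
    return required


def make_checker(url, timeout=10.0):
    """Build a request_ok callback that treats HTTP 200 as success (assumption)."""
    def request_ok(headers):
        request = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.status == 200
        except OSError:  # covers URLError, HTTPError and socket timeouts
            return False
    return request_ok


def fake_server_ok(headers):
    """Stand-in for a real server: insists on a cookie and one custom header."""
    return "Cookie" in headers and "X-Requested-With" in headers


browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    "Cookie": "session=abc123",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://example.com/",
}
needed = minimal_headers(browser_headers, fake_server_ok)
```

Against a real site you would pass make_checker(url) instead of the fake, and probably add a delay between attempts so the elimination itself doesn't get you blocked.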



