
Having built a rank tracker before, and still being good friends with people who run one of the biggest rank trackers on the market, I can say that using the Google APIs simply isn't an option.

Simply put, the API never returned results in the same order as the actual search results page did.

The best, most reliable rank trackers did (and still do) simply use proxies to get around the Google captchas. I've scraped millions of pages from Google over the years, and with enough proxies, the correct proxy delays/timeouts, and other little tricks you pick up along the way, you can actually scrape Google pretty easily.

This is especially true with the dropping cost of proxies. I'm obviously an exception considering I run several scraping SaaS products and have generally specialized in it for years, but I can be hitting x.com with 30,000 different IPs within the hour.
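To make the "proxy delays/timeouts" point concrete, here is a minimal sketch of per-proxy pacing. It is not the commenter's actual setup: the `ProxyPool` class, its `min_delay` parameter, and the proxy names are all hypothetical, and it only decides which proxy to use next and how long to wait, without making any requests.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ProxyPool:
    """Hypothetical pool that spaces out reuse of each proxy by min_delay seconds."""
    proxies: list
    min_delay: float  # assumed minimum gap between two uses of the same proxy
    _heap: list = field(default_factory=list, init=False)

    def __post_init__(self):
        # Every proxy is free at time 0; the index breaks ties deterministically.
        self._heap = [(0.0, i, p) for i, p in enumerate(self.proxies)]
        heapq.heapify(self._heap)

    def acquire(self, now):
        """Return (proxy, wait_seconds) for the proxy that becomes free soonest."""
        ready_at, i, proxy = heapq.heappop(self._heap)
        start = max(now, ready_at)
        # The proxy is next usable min_delay seconds after this use starts.
        heapq.heappush(self._heap, (start + self.min_delay, i, proxy))
        return proxy, start - now


pool = ProxyPool(["proxy-a", "proxy-b"], min_delay=10.0)
first = pool.acquire(now=0.0)
second = pool.acquire(now=0.0)
third = pool.acquire(now=0.0)  # both proxies were just used, so this one waits
```

With more proxies in the pool, the wait returned by `acquire` shrinks toward zero, which is presumably why a falling proxy cost matters so much here.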



Any additional tricks you want to share and/or cheap proxy services you'd recommend? Do you make use of Selenium and/or PhantomJS?


Haven't all the web search APIs been gimped? In a past life this interested me greatly and I turned to both Google's API and to Yahoo's BOSS and they both basically sucked. You could tell you were not getting the type of results a proper search would return.


> with enough proxies, the correct proxy delays/timeouts, and other little tricks you pick up along the way, you can actually scrape Google pretty easily

Try putting a site: in the search and it becomes suspicious fast.


You're correct. However, for rank tracking these types of queries aren't needed.


That is probably why some rank trackers give different results. Some probably use the API and others scrape Google's result page.
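A rough sketch of how that difference would show up, assuming you already have two ordered lists of result URLs for the same query, one from an API and one from the scraped results page. The function names and the example URLs are made up for illustration; no real API is called.

```python
from urllib.parse import urlparse


def rank_of(domain, result_urls):
    """1-based position of the first result on `domain` or a subdomain, else None."""
    for position, url in enumerate(result_urls, start=1):
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            return position
    return None


def rank_disagreements(domains, api_results, page_results):
    """Map each domain whose rank differs to (rank_in_api, rank_on_page)."""
    diffs = {}
    for domain in domains:
        pair = (rank_of(domain, api_results), rank_of(domain, page_results))
        if pair[0] != pair[1]:
            diffs[domain] = pair
    return diffs


# Hypothetical result lists for one query.
api_results = ["https://a.example/x", "https://b.example/", "https://www.c.example/"]
page_results = ["https://b.example/", "https://a.example/x", "https://www.c.example/"]

diffs = rank_disagreements(["a.example", "b.example", "c.example"], api_results, page_results)
```

Two trackers fed these two lists would report different positions for `a.example` and `b.example` while agreeing on `c.example`.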



