What's interesting about this is that the API has been officially deprecated since 2010 (and "offline" since September 2014), but today is the first day that it's actually become unavailable.
Edit: Additionally, there are probably going to be big repercussions to the web/average person's browsing experience as a result of this. A massive number of sites (and programs, bots, etc) were using this API because of its simplicity (and absence of registration/authentication).
This is actually the second time they have cut back their web search API. Google used to offer a very full-featured REST API, but then took it offline and replaced it with a weak AJAX one:
You could actually do cool things with the original API. I used it to power a predecessor to https://friskr.com, but we had to switch web search engines after that change.
That's interesting, I had no idea.
It's a shame that the trend at Google has been towards closing down APIs like these in favor of specialized/pay-as-you-go endpoints.
I feel like that's a larger trend. APIs are introduced to drive platform growth and then pulled or cut back once that growth has been achieved. Twitter is another example of that phenomenon.
> Additionally, there are probably going to be big repercussions to the web/average person's browsing experience as a result of this. A massive number of sites (and programs, bots, etc) were using this API because of its simplicity (and absence of registration/authentication).
Yep. My harmless little IRC bot can no longer do Google searches :(
We need a search engine that allows for deep search. It should be an open and cooperative project, and users could run an instance of the spider/indexer as payment for executing searches, so it could be like a cross between BitTorrent and Tor.
A free search engine would enable API calls and also boost privacy and freedom from the likes of Google. We have built up a lot of experience with search engines since 2000, and we have access to scientific papers, cheap cloud servers, and a huge interest in freeing search, so I think the open source community could do it.
Writing an open, distributed web crawler / indexer is a nice programming exercise (a toy sketch follows after the list below).
Writing an "objective" ranking function (for any values of "objective") in an open, and distributed manner is structurally not favoured by humanity's current incentive structure. As in:
* a dev team have to agree on signals, and weights: "SERP quality" has dedicated teams of people assigned for specific verticals @ Google; replicating this in a distributed manner will be played politically
* Assuming any significant usage, the second you submit ranking code to public github repo, the algo will be played by thousand SEO scammers to their advantage
* Executing custom ranking function on other people's computer not only introduces security risks, but will have scammers setting up honeypots for collecting other people's ranking signals, and playing accordingly.
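For anyone curious about the crawler/indexer part, here is a toy single-machine sketch in Python of the crawl-then-index loop; the seed URL is a placeholder, and a real system would need robots.txt handling, rate limiting, deduplication, persistent storage, and the distribution layer this thread is actually about.

    import re
    from collections import defaultdict, deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20):
        index = defaultdict(set)            # word -> set of URLs (inverted index)
        seen, queue = {seed}, deque([seed])
        host = urlparse(seed).netloc
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue
            # Crude tokenization of the raw HTML; a real indexer would strip markup.
            for word in set(re.findall(r"[a-z]{3,}", html.lower())):
                index[word].add(url)
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return index

    if __name__ == "__main__":
        idx = crawl("https://example.com/")   # placeholder seed URL
        print(sorted(idx.get("example", set())))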
Only open source the framework for the server and client.
Then companies / communities / etc. can make their own algos and buy their own servers. The reward for providing your servers / crawlers is that more people use your algo (a higher chance of hitting your nodes).
Then allow the client to have configurable automatic node filtering, along with manual node filtering, so if a person feels that a specific node set is just full of BS, they can filter it out (and also prefer certain node sets in turn, to which they can donate if they are consistently happy with the results).
> push ranking code to a public GitHub repo, the algorithm will be gamed by thousands of SEO scammers to their advantage
Just a thought:
Ranking code could itself learn and adapt to each individual user (the learned "weights" could be synced online across your devices). Weighted signals from users could then be fed back to the mother ranking algorithm (the un-customized one). Basically millions of distributed deep minds[1], instead of a single one.
I can imagine there are a lot of holes in my theory, but we can't simply accept that open sourcing the algorithm implies that it can't be done.
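A very rough sketch of that idea, with made-up signal names, weights, and learning rule purely for illustration: each client nudges its own copy of the weights from click feedback, and the adapted weights can be folded back into the shared, un-customized model.

    # Hypothetical per-user ranking weights adapted from click feedback.
    # Signal names, initial weights, and the update rule are illustrative only.
    SIGNALS = ["text_match", "freshness", "link_score"]

    class PersonalRanker:
        def __init__(self, base_weights):
            self.weights = dict(base_weights)     # local copy of the shared model

        def score(self, doc):
            return sum(self.weights[s] * doc[s] for s in SIGNALS)

        def rank(self, docs):
            return sorted(docs, key=self.score, reverse=True)

        def record_click(self, clicked, skipped, lr=0.05):
            # Nudge weights toward the signals of the clicked result and away
            # from a higher-ranked result that was skipped.
            for s in SIGNALS:
                self.weights[s] += lr * (clicked[s] - skipped[s])

    def aggregate(user_weight_sets):
        # Fold user-adapted weights back into the "mother" model by simple
        # averaging; a real system would weight by trust, volume, spam checks, etc.
        return {s: sum(w[s] for w in user_weight_sets) / len(user_weight_sets)
                for s in SIGNALS}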
> Writing an "objective" ranking function (for any values of "objective") in an open and distributed manner is structurally not favoured by humanity's current incentive structure.
Why not just take humanity out of that picture, then?
With the current AI/deep learning hype everywhere, why not start developing an AI-driven search system?
I think for it to produce the most relevant results, it will need access to your browser (or be a browser) or better yet, work at the OS level, so it can have a better idea of the current context you're working in, and learn from your habits and preferences. Say I'm coding and have an IDE and a bunch of dev-related websites already open, so the AI gives more weight to development-related results. If I've been playing a certain game a lot then it should assume that I'll be looking for stuff related to that game. And so on.
So, the index would be globally accessible to all computers, but the ranking will be unique to each individual user.
Something like this could very well be the actual beginning of a true A.I. "butler," more so than Siri and whatnot.
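To make the context-weighting idea concrete, a small sketch; the contexts, categories, and boost values are invented for illustration, and the real signal extraction (which apps are open, what you have been doing) is the hard part left out here.

    # Illustrative only: boost result categories that match the user's
    # current working context (open IDE, running game, and so on).
    CONTEXT_BOOSTS = {
        "ide_open":     {"programming": 2.0, "documentation": 1.5},
        "game_running": {"gaming": 2.0, "walkthroughs": 1.5},
    }

    def rerank(results, active_contexts):
        """results: list of dicts with 'score' and 'category' keys."""
        def boosted(result):
            boost = 1.0
            for ctx in active_contexts:
                boost *= CONTEXT_BOOSTS.get(ctx, {}).get(result["category"], 1.0)
            return result["score"] * boost
        return sorted(results, key=boosted, reverse=True)

    # Same global index, different ordering per user context.
    results = [
        {"title": "Python asyncio docs", "category": "documentation", "score": 0.7},
        {"title": "Boss fight walkthrough", "category": "walkthroughs", "score": 0.8},
    ]
    print(rerank(results, active_contexts=["ide_open"]))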
We had some trouble when the YouTube API was disabled for non-registered users. We were using it for a shared, local web playlist based on youtube-dl/mpv.
Sadly we couldn't use DDG, because their API won't let you filter for video results only (source: asked on IRC and got a reply from DDG staff; I haven't found any documentation mentioning this either).
We ended up using searx [0], which takes a little more time to return search results (about 1-3 seconds), but we gain more video sources (such as Vimeo or Dailymotion).
I'm aware that youtube-dl has search functionality built in as well, but it didn't fit our requirement that clients get results without being prepared by our embedded computer.
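For reference, a sketch of the kind of searx query involved; the instance URL is a placeholder, and the JSON output format has to be enabled in the instance's settings, so treat the parameters as assumptions rather than guarantees.

    import requests

    # Placeholder instance URL; format=json must be enabled on the searx instance.
    SEARX = "https://searx.example.org/search"

    resp = requests.get(
        SEARX,
        params={"q": "daft punk", "categories": "videos", "format": "json"},
        timeout=10,
    )
    for result in resp.json().get("results", []):
        # Results can come from several video sources (YouTube, Vimeo, Dailymotion, ...).
        print(result.get("engine"), result.get("url"), result.get("title"))

(youtube-dl's own search is exposed via the ytsearchN: prefix, e.g. youtube-dl --get-id "ytsearch3:some query", but as the comment above notes, that didn't fit this setup.)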
What are the alternatives for a full web search API? Yahoo BOSS closed down a few weeks ago. Bing seems the last man standing. And, unfortunately, none of the ones I've tried are very good; one project I'm working on needs to prioritize newer results over older ones, and only Google had a sort by date option (which I think was also removed in the past).
Can you use Google Custom Search for a full-index web search? Last I looked, I thought you could only use it configured with a certain list of domains to include in the search.
Bing has a decent search API? Interesting; somehow I missed that. Want to provide a URL to a page for the service you're talking about? MS is really bad at documenting their services and prices. I eventually found this page, at this ridiculous URL: https://datamarket.azure.com/dataset/5BA839F1-12CE-4CCE-BF57... . I think that's what you're talking about? Note that the link on that page to "Bing API FAQ" is broken. If this weren't MS, the poor docs would make me think it was surely a terrible product.
Hi all, I am a program manager on the Bing API team. I am providing some pointers for people who are interested. We released our latest Search APIs in March via the Microsoft Cognitive Services site. You can find them here: https://www.microsoft.com/cognitive-services. All APIs are currently offered for free with limited calls per month (for most APIs, the quota is 1,000 calls per month). We will soon announce a pricing plan if you want to extend the monthly quota.
Some folks mentioned this page: https://datamarket.azure.com/dataset/bing/search. I want to clarify that that page covers the old Bing Search APIs, which are still in use but will be deprecated in the future.
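For anyone evaluating the new APIs, a request sketch along the lines of the Cognitive Services preview; the key is a placeholder, and the endpoint version and response fields shown here are assumptions that may differ from the current documentation.

    import requests

    # Placeholder subscription key; the v5.0 endpoint path and the
    # webPages/value response fields reflect the preview at the time
    # and may have changed since.
    API_KEY = "YOUR_SUBSCRIPTION_KEY"
    ENDPOINT = "https://api.cognitive.microsoft.com/bing/v5.0/search"

    resp = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
        params={"q": "hacker news", "count": 10},
        timeout=10,
    )
    resp.raise_for_status()
    for page in resp.json().get("webPages", {}).get("value", []):
        print(page["name"], page["url"])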
Hi, just a quick bit of feedback -- 1K API calls / month is so low as to be a nonstarter. For every one of the half dozen hobby (non-commercial) projects I am currently working on, this amount is tragically low. Even per day it would give me pause.
I don't mean to rain on your efforts, but I would personally never consider using an API with such limits, even for trivial hobby projects.
Thanks for the feedback. We do not have any immediate plans to change the call limit for the preview offering. We've seen from our data that 1K calls / month is sufficient for most developers who want to get started using the APIs. We will certainly consider revising this in the future if there's strong demand. I encourage you to file this as a request here: https://cognitive.uservoice.com. We review user ideas and feedback there on a regular basis to improve our services. Thanks again!
This is very interesting functionality to me, that I would use if it works well. Thanks for trying to clarify what's going on with it.
But when choosing between one fairly poorly documented API that you've told me will soon be deprecated, and another fairly poorly documented API that doesn't have pricing beyond 1K calls a month...
...my response is still "Okay, that could be good in the future, I guess I'll sit back some more and wait for pricing and better docs to show up before paying more attention to it, I hope it does soon!"
I need to know pricing before deciding to use something. And I need good (and googleable! Sorry, even if I'm using the Bing API, I'm using Google to look for info on it) docs, along with, of course, a well-designed API that works well; but I need the first two things before I even get to evaluating the third.
I keep checking out the Bing API every couple months, hoping for sort-by options (I need last changed date, most recent first). Any hope of ever seeing sort-by-date from Bing? Google has it, I don't think anyone else does.
As others pointed out, you can find information about the new Bing Image API here: https://www.microsoft.com/cognitive-services/en-us/bing-imag.... I cannot provide an official deprecation date yet, but here's a rough timeline: we will first announce the pricing plan for the new APIs in early-to-mid summer. We will then provide a notice to existing users of the old APIs to switch to the newer version within 6-12 months - note again that the dates are not final.
> Can you use Google Custom Search for a full-index web search? Last I looked, I thought you could only use it configured with a certain list of domains to include in the search.
Thanks. One of the comments on that SO answer from Aug 2015 says:
> WARNING: we did development using the free version, but to upgrade to the paid version (to do more than 100 searches), google forces you to turn off the "search the entire web but emphasize included sites"
I remain not very confident that you can really search the whole web with Google Custom Search, or, if you can, that it's not some kind of loophole that Google might close without warning.
But if someone has actually done this successfully, with a paid account for more than 100 queries a day, I would definitely be interested in hearing about it!
Should I chalk it up to conspiracy that this page is nearly impossible to find on, um, Google, searching for terms that you would reasonably use to find it?
But if you have an in with MS, it would probably help an awful lot if they removed the old pages with broken links, unhelpful content, and possibly out-of-date, inaccurate content too.
The page you link to has "coming soon" for pricing for Search (which was kind of confusing to find) -- does that mean that the pricing info on the page I found is not accurate, or soon won't be, with no indication on the page I found that that will be the case?
MS is really, really bad at documenting this service. Only because it's MS am I willing to entertain that the quality of the docs may not reflect the quality of the service; generally I'd give up on something that markets and explains itself so poorly and figure that if they can't get that right, they probably didn't get the service right either.
I'm not seeing how to make GCS perform a full web search. It won't allow creation of a search without at least one schema defined, and I need schemaless data (or rather, I need the whole web, including a variety of schemas). Am I missing something?
Hooli's search API hasn't been returning any articles critical of Gavin Belson any more, but it sure is faster now that they're using that new ranking system!
I wrote a PageRank library a while back and just got a report yesterday that it also no longer works.
(PageRank never actually had an official API, but it was exposed for the Google toolbar. They had announced that this would go away a while back but just dropped the axe in the last day or two.)
Does anyone know of an API that returns pages related to any given URL? I'm looking for an alternative to Google's "related:..." search, which was only reliably accessible through Google Web Search (Google Custom Search is very expensive and has a limit of 10K queries per day).
The related: operator returns pages related to any given page, and works at the specific-page level (not just at the domain level, like SimilarWeb's API).
I've migrated to Bing for all other web search needs (way cheaper and without volume limits like Google Custom Search), but Bing does not offer a "related:..." equivalent as far as I can tell.
While you're looking for this feature, in the meantime ping the Bing team and ask for it as an addition to their API. Microsoft is a lot more open to feedback these days.
Hi! I work on the Microsoft Cognitive Services team, which includes the Bing Web Search APIs in our family.
For feature requests, it would be awesome if you posted them to: https://cognitive.uservoice.com/ (I'm also forwarding your post to Bing).
If you want to get in touch with the team directly about a question/issue, the "Contact Us" widget at the bottom of the Cognitive Services page is the way to go: https://www.microsoft.com/cognitive-services
I don't know a great way that's not intrusive, but try bingdevcenter@outlook.com or the MSDN forum. I know that Microsoft employees frequent Hacker News, so I'm sure someone will see this as well.
I ran into this problem several years ago while writing SEO software. There are a bunch of companies that provide Google search results, but there's usually at least a couple of minutes' delay between when the data is requested and when it arrives at your callback page.
I had success with Authority Labs. You might want to check them out. They'll raise the query limit if you ask them.
Slightly OT: Does anyone know how RankScanner works? My guess was this API, but it still seems to work. Their site seems to suggest they have loads of servers, but that seems unlikely; I'd have thought Google would have captcha'd them to hell by now.
Very good product. I'm curious how it works though.
I also built a rank tracker (WhooshTraffic at the time - no longer around), and we scraped the search result pages as the user would see them, spoofed the session cookie, and used human captcha solvers. We would then cache the cookies generated from the captcha solves.
It was highly effective and very scalable, because you can trigger the captcha pages easily and, out of band, generate a very large cookie pool by captcha solving. Then, when you go to actually scrape Google, you can balance your pool of IPs and cookies, chilling them before they trigger another captcha, to handle spiking demand. Constant, sustained demand was very easy to plan for.
Anyway, we stopped using the API after the third day, once we realized that it was not accurate and that you couldn't turn many knobs (want to scrape results as they look from a different geographical region?).
This is super interesting to me, is there anything else you can share about how you approached this? In my scraping Google experience I have found roughly the same thing where once you've passed the captcha test, you can scrape a lot more.
Were you scraping with real browsers or something like Mechanize/Curl? Rate limiting at all? Proxies or real servers?
We had a pool of IPs we were leasing on our own; proxy services get abused and are poisoned.
We didn't rate limit; we would just increase the size of the cookie pool if a captcha was hit, which was rare, because we would only scrape n pages per session, up to a threshold, to keep that session from being captcha'd, so we wouldn't have to solve a captcha for it. We had two pools, the primary pool and the "chilling" pool: cookies near the end of their captcha life would cool off for a few hours before returning to the active pool, which behaves just like any other resource pool. Every page scraped would "borrow" a cookie out of the pool, customize the encrypted location key, and make the request with a common user-agent string.
Scaling it was difficult, but once we had it figured out, Erlang was invaluable to us, and our dependence on IPs dropped once we worked out the cookie methodology.
When you set the location in Google, it customizes the cookie with a named field that is a capital L, I believe. That field is encoded or encrypted, and I could never figure it out, so I just constructed a rainbow table by using PhantomJS to set a location and scrape the cookie out, pairing the known location value with the encrypted value so that we could customize "the location of the search".
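A simplified sketch of the two-pool cookie rotation described above; the thresholds, the cookie structure, and the location lookup table are placeholders, and the encrypted location field is treated as an opaque value.

    import time
    from collections import deque

    # Illustrative thresholds; the real values were tuned empirically.
    PAGES_BEFORE_CHILL = 40        # retire a cookie before it trips a captcha
    CHILL_SECONDS = 3 * 60 * 60    # how long a cookie rests before reuse

    class CookiePool:
        def __init__(self, cookies):
            self.active = deque(cookies)       # primary pool
            self.chilling = []                 # (ready_at_timestamp, cookie) pairs

        def _reactivate_ready(self):
            now = time.time()
            still_chilling = []
            for ready_at, cookie in self.chilling:
                if ready_at <= now:
                    cookie["pages"] = 0
                    self.active.append(cookie)
                else:
                    still_chilling.append((ready_at, cookie))
            self.chilling = still_chilling

        def borrow(self):
            self._reactivate_ready()
            return self.active.popleft()

        def give_back(self, cookie):
            cookie["pages"] += 1
            if cookie["pages"] >= PAGES_BEFORE_CHILL:
                self.chilling.append((time.time() + CHILL_SECONDS, cookie))
            else:
                self.active.append(cookie)

    def scrape(pool, url, location_table, location):
        cookie = pool.borrow()
        try:
            # The encrypted location value comes from a pre-built lookup table
            # (known location -> opaque cookie field), as described above.
            cookie["jar"]["L"] = location_table[location]
            # ... issue the request with cookie["jar"], a common user-agent
            # string, and an IP borrowed from the IP pool ...
        finally:
            pool.give_back(cookie)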
God, page scraping is horrible at the best of times but, as anyone who's ever viewed source on one of their pages must be thinking, scraping Google markup must be hell-on-earth.
It was indeed pretty rough. It wouldn't surprise me if Google moves to JS-generated DOM elements to combat rank trackers; at the time it was fine because they wanted to serve non-JS browsers, but that might change.
Wouldn't it be easier if they generated the DOM elements via JS? That would imply that they're getting JSON or something like it, parsing it, and creating the DOM.
No, because then you'd have to use a headless browser that can execute JS. That increases the time and cost of scraping, though it wouldn't surprise me if it ends up going that way.
Definitely not that easy with Google. It is not uncommon to see 20+ SERP variations on a given day if you are crawling at high volume, changing user agents, etc. The whole thing is fairly terrible to parse consistently.
Having built a rank tracker before, and still being good friends with people who run one of the biggest rank trackers on the market, I can say that using the Google APIs simply isn't an option.
Simply put, the API never returned the results in the exact same order as the actual search results did.
The best / most reliable rank trackers did (and still do) simply use proxies to get around the Google captchas. I've scraped millions of pages from Google over the years, and with enough proxies, the correct proxy delays/timeouts, and other little tricks you pick up along the way, you can actually scrape Google pretty easily.
This is especially true with the dropping cost of proxies. I'm obviously an exception, considering I run several scraping SaaS products and have generally specialized in this for years, but I can be hitting x.com with 30,000 different IPs within the hour.
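A bare-bones sketch of the proxy-plus-delay approach being described; the proxy addresses, delay range, and query handling are placeholders, and parsing the returned SERP HTML is left out entirely.

    import random
    import time

    import requests

    # Placeholder proxies; in practice these would be IPs leased directly,
    # since shared proxy services get abused and poisoned.
    PROXIES = ["http://10.0.0.1:3128", "http://10.0.0.2:3128"]

    def fetch_serps(query, pages=1):
        html_pages = []
        for page in range(pages):
            proxy = random.choice(PROXIES)
            resp = requests.get(
                "https://www.google.com/search",
                params={"q": query, "start": page * 10},
                proxies={"http": proxy, "https": proxy},
                headers={"User-Agent": "Mozilla/5.0"},   # common UA string
                timeout=15,
            )
            html_pages.append(resp.text)                 # parsing is the hard part
            time.sleep(random.uniform(5, 20))            # randomized delay
        return html_pages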
Haven't all the web search APIs been gimped? In a past life this interested me greatly; I tried both Google's API and Yahoo's BOSS, and they both basically sucked. You could tell you were not getting the type of results a proper search would return.
> with enough proxies, the correct proxy delays/timeouts, and other little tricks you pick up along the way; you can actually scrape google pretty easily
Try putting a site: operator in the search, though, and Google gets suspicious fast.
I'm not acquainted with RankScanner, but my educated guess is that they use one of Google's (many) authenticated APIs[1] for search/content categorization/ranking analysis.
I believe the problem was that if you did /google or !google on your bots for various chat platforms, you were able to find the top hit or two without ever looking at Google's Ads.
So the question becomes, how much potential money were they losing through the API? Does Google think that by removing the API, people are now going to go to another window, perform a google search, and intently study the advertising before reviewing the search results?
You just made a textbook strawman. It doesn't do anyone any good to theorize about what the dumbest possible reason might be for Google to do this. It is infinitely more useful to think about what rational, informed reasons they might have for doing it (regardless of whether you agree with those reasons). People reflexively jumping to strawman arguments is probably #2 or #3 on the top-ten list of Why We Can't Have Nice Things, in tech, politics, and just about anything else in the world where different groups of people have imperfect knowledge about each other's activities and motivations.
Because client coders don't always have access to the headers. Don't think of how you would write a client; think about how someone in the most messed-up programming environment that you can imagine is forced to write clients.
Probably because HTTP status codes are perceived to be at a different layer of the response than the body. A lot of developers (myself included, for trivial things) treat HTTP bodies as strings with no associated metadata.
I think it's to simplify things as much as possible for clients. If a dev wants to use the API, it is in the interest of the API maintainer to make it easy, including putting all the useful info in a single place. I was just watching an old talk about this a week ago from Stormpath (IIRC). They also recommend at least offering a flag for including other redundant data, like the path for subsequent sets of paginated data, and an additional method for adding data via a URL flag instead of using PUT, in case it isn't available.
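The practical upshot for clients is checking two layers; here is a sketch against the kind of envelope the old Google API used (the same shape quoted later in this thread), with the URL left as a placeholder since that endpoint is gone.

    import requests

    # Placeholder URL; the body shape mirrors the old Google AJAX Search
    # API envelope (responseData / responseDetails / responseStatus).
    resp = requests.get("https://api.example.com/search", params={"q": "foo"})

    # Layer 1: the HTTP transport status.
    if resp.status_code != 200:
        raise RuntimeError("transport error: %d" % resp.status_code)

    # Layer 2: the application status repeated inside the JSON body,
    # for clients that never get to see the headers.
    body = resp.json()
    if body.get("responseStatus") != 200:
        raise RuntimeError("API error: %s" % body.get("responseDetails"))

    results = body["responseData"]["results"]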
The YouTube API used to be open too - no registration or authentication. That ended years ago. I don't really need an "API" to do searches and transform the results into a more useful format than HTML. These APIs make it easier and are addictive, for sure. It's foolish to rely on them, however.
I believe Google will replace it; the Google Search Appliance is going away too, and something will pick up those customers. Maybe an announcement at I/O, which is pretty soon?
AFAIK this has been broken for me since December 8, 2015, when I first received: {"responseData": null, "responseDetails": "This API is no longer available.", "responseStatus": 403}
With more people placing value on privacy, and with the API having been used by Disconnect Search along with Tor and other privacy tools, Google has just planted the seed of what will become its biggest competitor.
This is terrible news. With PageRank also disabled recently, two features of our domain health testing service have now been crippled. PageRank was easy to replace with MozRank, but this data was irreplaceable because it was specific and unique to Google. RIP one of our main tools.
That depends on your use case. As a guide to web rankings it has been obsolete for years, as you say, but as a cheap and quick measure of a domain's potential (for passing link juice, or to compare link juice between two domains) against another, PageRank had some value.