How to use undocumented web APIs (jvns.ca)
239 points by pingiun on March 10, 2022 | 91 comments



When they have a GraphQL API with introspection enabled, it feels like discovering a pot of gold.

This happens more often than you would expect, even without any auth sometimes. At that point you're basically developing with the same DX as internal developers.

My theory is people just turn off the GraphiQL endpoint on their GraphQL server and think they have hidden the schema, not realizing any external tool can do the introspection. Either that or it's developers slipping a little something under the radar for other developers (same thing with source maps).
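
To make it concrete: introspection is just an ordinary query against the same endpoint. A rough Python sketch (the endpoint URL is made up):

    import requests

    INTROSPECTION_QUERY = "{ __schema { types { name fields { name } } } }"

    resp = requests.post(
        "https://api.example.com/graphql",   # hypothetical endpoint
        json={"query": INTROSPECTION_QUERY},
    )
    print([t["name"] for t in resp.json()["data"]["__schema"]["types"]])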

Another tip: If the service in question has a mobile app, sniffing the traffic on that with a MITM proxy can yield more interesting results than a web app.


I've always thought it's a bit silly to have a publicly accessible GraphQL API but then turn off introspection. If the only thing you're relying on to prevent someone from knowing about a certain field is that none of your web client code currently requests that field, you're already in a pretty flimsy predicament. And even then, people could trivially check for common or expected field names, or even brute force a lot of short field names.

If you really intend for your GraphQL API to be used only internally and from your official web client, and you consider any fields not currently requested in your web client to be highly sensitive, you should really turn off public access to the full GraphQL API and use something like GraphQL's persisted queries, where your web client requests queries by an opaque unique identifier rather than the full text of the query.
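
The idea, roughly sketched (names are made up, not any particular GraphQL server's API): the client sends only an identifier, and the server refuses anything it doesn't already know about.

    def run_graphql(query_text, variables):
        ...   # whatever GraphQL executor you already use

    # Hypothetical server-side allowlist for persisted queries
    PERSISTED_QUERIES = {
        "UserProfile_v3": "query { viewer { id name email } }",
        "OrderHistory_v1": "query { viewer { orders { id total } } }",
    }

    def execute_persisted(query_id, variables):
        query_text = PERSISTED_QUERIES.get(query_id)
        if query_text is None:
            raise ValueError("unknown query id")    # arbitrary query text is never accepted
        return run_graphql(query_text, variables)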


Well, the gist of the op article is kind of "they can't prevent you from using their internal API", so most services shouldn't try. I think there's still a difference between making people scan your entire frontend code/traffic to find all the edge cases and making them reverse engineer your auth/headers/cookies (hours of work) vs handing them database access after 2 minutes of work. But I appreciate it, and it might be engineers that know this (that preventing access is futile) leaving it on intentionally. I certainly have done that.


So it's not only security through obscurity, it's very weak obscurity.


Security through the obscurity of a wedding veil.

Not only is it translucent, but your audience tends to have a better idea of what it's hiding than what they can directly see at the moment.



Some GraphQL APIs do this on purpose if the API is meant to be completely public and if they want to allow self-discovery and documentation of things through introspection. It allows anyone to point their own instance of GraphiQL or GraphQL Playground at the endpoint and find things out. We even include comments in the schema to help with this as another form of documentation.


I saw a website that exposed the results of a very expensive paid LinkedIn API, plus the enrichment they did to those results, in their GraphQL endpoint. Seemed like an expensive oversight.


Which site was that??


Just be ready for mitm proxying on some mobile apps to be a bust if they use certificate pinning. I’m not aware of anything that can get you past that besides patching the app itself.


https://httptoolkit.tech/blog/frida-certificate-pinning/ has a good guide and Frida script that will disable certificate pinning automatically in most cases.
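
If you'd rather drive it from Python than the Frida CLI, the bindings make it a few lines. A sketch (the package name and script filename are placeholders):

    import sys
    import frida

    device = frida.get_usb_device()
    pid = device.spawn(["com.example.app"])   # hypothetical app package
    session = device.attach(pid)
    with open("unpinning-script.js") as f:    # e.g. an unpinning script from a guide like that one
        script = session.create_script(f.read())
    script.load()
    device.resume(pid)                        # app starts with the hooks already in place
    sys.stdin.read()                          # keep the process alive while you proxy traffic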


There are plenty of Frida scripts that can disable app certificate pinning.


To be fair I’m sure that uses patching but I didn’t know about that tool and how easy it is to use. Thanks for another thing to put in the ol’ bag o’ tricks.


I love you.


I said this on a thread complaining about SPAs a little bit ago, but I love that the SPA trend has caused all kinds of web apps to open up APIs to their users. It's not as fun as pure screen scraping, but it is very exciting when you figure out whatever weird behavior they're expecting, and it starts working.

If you get stuck, look at their javascript and see what it is doing. Double-check your network requests in developer tools; some of them might be more important than you think, plus it's so nice that we don't have to use Burp for this anymore. Some sites check referrers and user agents, or expect a field from a specific server-rendered page to be added to a header. More than one expected a javascript-style timestamp on every request.
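
What falls out of that process is usually something like this (a sketch; the header names and values are made up):

    import time
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 ...",                  # copied from the browser
        "Referer": "https://app.example.com/dashboard",   # some sites check this
        "X-CSRF-Token": "abc123",                         # pulled from the server-rendered page
    }
    params = {"_": int(time.time() * 1000)}               # the javascript-style millisecond timestamp
    resp = requests.get("https://app.example.com/api/items", headers=headers, params=params)
    print(resp.json())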

The weirdest behavior comes from older apps that started as purely server rendered, and slowly added a dynamic frontend. I always cringe when it's obvious that different developers were given tasks over the years, and completed them without bothering to learn the rest of the system.


Sometimes it is because two or three different products were acquired and then clumsily glued together in a single web UI. You can often spot these by jarring changes in conventions and behavior in different areas of the app.


As I started to try to track down government sources of Covid data a couple years ago, I soon discovered this approach was generally much more efficient than consulting any official documentation.


lol you think it's different developers but it's just me evolving over 8 years and not bothering to update the old shit


We didn't need SPAs for that. Ajax, XML-RPC, and SOAP have existed since around 1999.


Yes, SPAs aren't required but they make it more likely that the site will use AJAX to request data instead of rendering the page with server data PHP-style.


It is pretty handy.

I work on some SPAs and some server side rendered systems.

It's so nice to fire up the network tab and see some of the requests right away to troubleshoot.

Server-side rendered stuff, not so easy. It's not impossible, and you can always add some debugging, but the way SPAs just call all the things through requests that are easily seen is very nice. And I can use that elsewhere.


Including the people like myself doing security checks as part of SecDevOps workflows.

99% of the time it's not great: everyone codes their connection points as if their beloved JavaScript function is the only caller.


What is the recommended way to not code like this?

I always take care to not trust raw user input, and add rate limiting and request size limits. I'm sure there is more I can be doing, but this is not my area of expertise.


- Never trust any kind of input without validation.

- Assume that the caller might not be a browser, so don't assume anything regarding workflows between calls.

- Keep up to date with OWASP, https://owasp.org/


> I think there’s literally no way for the backend to tell that the request isn’t sent by my browser and is actually being sent by a random Python program.

Oh my sweet summer child. Unfortunately, there is a whole industry built around this. This is a great blog discussing different detection methods: https://incolumitas.com/


The point still stands. The server can use any number of heuristics to try and figure out the traffic source, but (1) it is still an approximation, since they can all be spoofed, and (2) the stricter you make the detection logic, the more regular users are affected as well.


There's a way to make your point without being rude and infantilizing.


I can't resist a rhetorical flourish :) The situation between defenders of the API and users of the API is extremely weird. If you were an alien from another planet and witnessed what was going on you would be shocked. People are being paid to prevent people from accessing an API and people are being paid to defeat these countermeasures. This is similar to people being paid to dig holes while other people are paid to fill them in. I think the aliens would conclude there is something wrong with the economic system. I don't know what the solution is, but seeing as I'm a digger I really don't want to rock the boat :)


> People are being paid to prevent people from accessing an API and people are being paid to defeat these countermeasures. This is similar to people being paid to dig holes and other people are being paid to fill in the holes

Let me tell you about this little thing called tax accountants and the IRS...

> I think the aliens would conclude there is something wrong with the economic system.

These aliens are unfamiliar with adversarial systems?

Reminds me of a SciFi short story I read once where some aliens came to earth. Everyone thought they were amazingly smart, but it turned out they had just been working on their tech for a lot longer and were very dumb. The protagonist in the story figured this out, and sold them the Brooklyn Bridge...


> there is something wrong with the economic system. ... I'm a digger I really don't want rock the boat

My advice, based on experience: when you find yourself in this situation immediately start looking for a way off the boat. Urgently. It is rare for people on the boat to notice this before it sinks, and those few who do always seem to overestimate the supply of lifeboats.


I think that interpretation is more on you than anything -- "sweet summer child" is not literally referring to a child but someone whose innocence or blissful ignorance hasn't been ruined by the can of worms they just opened.


It means those things precisely because the person is a child. In A Song of Ice and Fire winter can last over a decade hence "summer child", a young one that has never experienced the hardship of winter https://en.wiktionary.org/wiki/sweet_summer_child

Regardless, starting out with any variation of "you're blissfully ignorant" isn't needed either. I get offense usually isn't intended, but the use of that phrase has always struck me as a very condescending way to respond.


That definition misplaces the origin of the phrase; "sweet summer's child" has been in use since the mid-19th century or earlier:

https://books.google.co.uk/books?id=mGQSAAAAYAAJ&pg=PA52#v=o...


It may not be the first time that combination of words was ever written, but it's certainly the first time it caught on, and based on the timing you can see where it caught on from (the book, then the show): https://books.google.com/ngrams/graph?content=sweet+summer+c...


It is simply a common phrase for a naive and innocent person; it has nothing to do with Game of Thrones. Though I do like how he used it: it immediately makes its meaning clear.



As a counterpoint, I take the phrase as a funny way of saying "you lucky person". I definitely see how it can come off as condescending, I just find the reference obscure enough that it's more funny than offensive to me.


> I get offense usually isn't intended but the use of that phrase has always stuck me as a very condescending way to respond.

You recognize it's not meant to be offensive, but you want to treat it as offensive? Wouldn't it feel better to assume good faith and not treat it as offensive?

The same phrase spoken from one individual to another can be interpreted as offensive or not. Tom might say "Fuck you!" to Bill, but Bill is best friends with Tom and knows he's kidding, and laughs. If they didn't know each other, Bill would get offended. But either way it's just words. Tom's intended meaning doesn't change despite what Bill wants to assume the meaning was. It's up to us to decide to get offended or not.


Even if that were so, it strikes me as odd to characterise the author — one of the world's more accomplished software professionals — as blissfully ignorant.


Browser calls (and sessions) are indeed tricky to emulate - you'll generally have much better luck with reverse engineering mobile client API calls.


Totally agree. I used to work on a load testing product that spent many, many dev hours attempting to achieve a high degree of fidelity on web recordings at the HTTP and sometimes even the socket level of emulation. It was extremely tricky. We employed a lot of regex matching mechanisms and used to keep a regression test bucket of thousands of example HTTP traffic recordings to avoid messing up cookies, headers, post data, and query strings, to name a few things.

In the early days, the developer abuses around ASP view state payload were an absolute nightmare to deal with. I used to half-joke that I could speak HTTP after staring at the raw traffic and how 5 page loads could generate 100+ requests which had dependencies on one another.

Interestingly, there was also a class of client-server bugs that were only obvious in recordings (e.g. multiple repeated HTTP HEAD requests to check if a resource existed). Each object or library dev clearly had no knowledge that the function triggered just before had also wanted to check if that resource exists. This resulted in a huge number of redundant calls because nobody coordinated and optimized at this level.

Fun stroll down memory lane.


Any resources for that, like this post? I am going to google as well, but wanted something if people already had it on the top of their head.


Try the iOS Charles Proxy app; it allows you to scan and inspect network traffic for most apps.


Her example code doesn't even set a user agent header, making it trivial to distinguish these requests from ones an actual browser would make.


Yup, she is using the Requests library, which has a default User-Agent header that explicitly states the request is coming from the Python Requests library.
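
You can see it, and override it, easily enough (the URL here is hypothetical):

    import requests

    print(requests.utils.default_headers()["User-Agent"])   # something like "python-requests/2.27.1"

    resp = requests.get(
        "https://example.com/api",
        headers={"User-Agent": "Mozilla/5.0 ..."},           # copied from a real browser
    )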


Perhaps the fact that there's a whole industry built around this instead of, you know, a couple of well-cooked solutions that everyone uses, implies that it's not possible and isn't going to be for the foreseeable future.


Well, in this case the existence of a whole industry built around this is a reflection of the fact that it's impossible to implement a single static solution to the problem. It will forever be a cat and mouse game.


Yes. And I think the great thing about her observation is that the naive view is actually true in practice (to some degree...). It is very difficult to distinguish between a 'legitimate' human user and a bot, at least without resorting to captchas, and if the benefit of subverting the captchas is high enough then you can just get humans to solve them.


Really great techniques listed in this thread! I wanted to point out though that it's generally nicer to the website owner if you enable `Accept-Encoding: gzip, deflate`. The difference in the amount of bandwidth charges for the site owner is quite significant, especially should you want to do comprehensive crawls.

Yes, go ahead and disable that header when piping curl's output into `less`; however, when converting the curl request into Python, just remember to re-add that header. Pretty much every Python library I've used to handle web requests will automatically unzip the response from the server, so you don't need to futz about with the zipping/unzipping logic yourself.
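
With Requests it really is just a header, and the decompression is handled for you. A sketch (the URL is hypothetical):

    import requests

    resp = requests.get(
        "https://example.com/big-listing.json",
        headers={"Accept-Encoding": "gzip, deflate"},   # Requests actually sends this by default anyway
    )
    data = resp.json()                                  # body arrives compressed, is decompressed for you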


Your HTTP client library is likely to set that by itself to a value it can understand. Setting it manually risks setting it to something your library can’t actually decode when it gets the response.


No, some HTTP clients actually require you to set it - you wouldn't set the header directly, sure, but you would enable gzip/etc. Their point is super valid.


There have been some very popular websites that ignore Accept-Encoding and only send compressed data. Sometimes I want uncompressed responses. I always have the urge to complain about these websites on HN but I sense that HN commenters/voters would be unsympathetic. (I do not use curl nor python.)


I love doing this, especially to liberate content that is otherwise locked away in an app-only world. That's one important use case that I'd love more people to work on - it is a great way to start with reverse-engineering and building simple websites.

Pro-tip: If the undocumented API sends a wildcard CORS header (`Access-Control-Allow-Origin: *`), you can call it directly from the browser on your own domain, without having to proxy it or use curl.

As an example, I published https://captnemo.in/plugo/ this week that calls the Plugo.io private API (the ones used by the mobile app) to fetch the data, and publish it using GitHub Pages. The data is just a list of places where Plugo provides powerbanks on rent (500+ locations, mostly concentrated across 3 Indian cities, and 2 places in Germany somehow). I'm running a simple curl command on a scheduled GitHub Action that commits back to itself so the data remains updated.

I similarly did this to make a nocode frontend for another "clubhouse-alternative" which would keep recordings, but only provide them in-app. A friend wanted to listen to his prior recordings, but the app was too cumbersome, so I made an alternative frontend that would call the private API and render a simple table with MP4 links for all recordings.

I even use this as a "nocode testing ground"[1] for many of the new nocode apps in the market - seeing if they are feasible enough to build fully functional frontends on top of existing APIs (which would be great for someone like me).

As a bonus, this works as an alternative-data stream for i) Plugo's growth metrics, if you were an investor or interested in the "rent-powerbank" space, as well as ii) finding out cool new places to visit around you.

[1]: https://news.ycombinator.com/item?id=29243536


They can still prevent you from sending requests from another domain by looking at the origin header. AFAIK, origin inspection is actually more secure since no OPTIONS request is sent for GET requests. If CORS doesn't allow a GET request, what typically happens is the request is still made, but the browser tells the requestor that the request failed. Therefore, you could get timing attacks or something and you have to deal with additional load. Just inspecting the origin header can be done with a lot less resources than looking up a bunch of data in the database to service some request.
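
A minimal sketch of that kind of check (Flask-style here; the allowlist is made up):

    from flask import Flask, request, abort

    app = Flask(__name__)
    ALLOWED_ORIGINS = {"https://app.example.com"}   # hypothetical allowlist

    @app.before_request
    def check_origin():
        origin = request.headers.get("Origin")
        # Cheap rejection before any database work happens
        if origin is not None and origin not in ALLOWED_ORIGINS:
            abort(403)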


"The answer is sort of yes - browsers aren't magic! All the information browsers send to your backend is just HTTP requests. So if I copy all of the HTTP headers that my browser is sending, I think there's literally no way for the backend to tell that the request isn't sent by my browser and is actually being sent by a random Python program."

There is a way.^1 One might need to copy the static elements of the TLS Client Hello in addition to certain HTTP headers.

1. https://blog.squarelemon.com/tls-fingerprinting/

See, e.g., https://github.com/refraction-networking/utls

"problem 1: expiring session cookies

One big problem here is that I'm using my Google session cookie for authentication, so this script will stop working whenever my browser session expires.

That means that this approach wouldn't work for a long running program (I'd want to use a real API), but if I just need to quickly grab a little bit of data as a 1-time thing, it can work great!"

Sometimes Google keeps users logged in. For example, session cookies in Gmail will last for months or more. This makes it easy to check Gmail from the command line without a browser. It also means if someone steals a session cookie and the user never logs out, e.g., she closes the browser without logging out first,^2 then the thief can access the account for months, or longer.

2. Of course, it is also possible to logout and disable specific session cookies from the command line, without a browser.


The first problem can be solved with curl-impersonate: https://github.com/lwthiker/curl-impersonate

"A special compilation of curl that makes it impersonate Chrome & Firefox", and it now can also impersonate Edge and Safari.

Previously discussed: https://news.ycombinator.com/item?id=30378562 _Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake_ (21 days ago, 58 comments)


I would also add that any search boxes are typically keys to the kingdom if you're scraping shops/job boards or similar things. They are often not hardened, so you can send e.g. an empty query (even if the frontend doesn't allow it), or effectively disable pagination by requesting 1000000 results per page.
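
Something like this often just works (the parameter names are made up):

    import requests

    resp = requests.get(
        "https://shop.example.com/api/search",   # hypothetical endpoint
        params={"q": "", "page": 1, "per_page": 1000000},
    )
    print(len(resp.json()["results"]))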


I need to point something out to people doing that kind of thing to other people's websites/webapps/whatever:

Having done this multiple times, be aware that you can break other people's stuff by messing up requests. Most web APIs suck, and some won't behave nicely on unexpected failures.

1. When trying to automate a process on an energy management platform, I ended up creating resources under some kind of master account; some things broke and they had to manually clean the DB.

2. When trying to access an operation I couldn't do via the provided API, I reverse engineered the API of their admin dashboard. It sucked really badly, with a lot of strange sync tokens that felt like going back 20 years. Anyway, my implementation wasn't perfect, and it ground their platform to a halt.

I could go on, so please only do stuff like that if you're in contact with the people on the other side. If you're not, limit yourself to GETs.


Bold of you to assume GET requests do not have side effects


It’s amusing to think the same devs who cobbled together a pile of otherwise fragile excrement were somehow careful to make sure that GETs were side-effect free.


> there’s literally no way for the backend to tell that the request isn’t sent by my browser and is actually being sent by a random Python program.

This is wrong, and the fact that somebody clearly experienced in web development is totally unaware that it is wrong should be a clear sign of the danger.

For starters: TLS fingerprinting, ETAG fingerprinting (including subtle browser-to-browser changes in how ETAGs are cached and evicted), JS VM fingerprinting, timing side channels, there is a massive list here. And then there's wasm...


Undocumented APIs are great when you only need to use them for a short amount of time, but if you try to build anything long term on top of them you should keep in mind that there could be changes that completely break your stuff, unannounced, at any time.


You should, at least, build your own shim between your app and the API. That way, if there are changes, hopefully the fixes (if they are possible) are at least confined to one place.
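
A sketch of what I mean (everything here is hypothetical):

    import requests

    class InventoryClient:
        """Thin shim around the undocumented API; the rest of the app only talks to this."""

        BASE = "https://app.example.com/internal/api"

        def list_items(self):
            resp = requests.get(f"{self.BASE}/items", headers=self._headers())
            resp.raise_for_status()
            # If the API changes shape, only this mapping has to change.
            return [{"id": i["id"], "name": i["title"]} for i in resp.json()["items"]]

        def _headers(self):
            return {"User-Agent": "Mozilla/5.0 ...", "X-Requested-With": "XMLHttpRequest"}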


It's more robust not to remove the extra headers IMO. Otherwise you give an unnecessary signal to the backend that the traffic's not coming from the expected sources.

It also makes the process of writing your code more mechanical, which is useful since you'll likely have to redo the process when the API changes.


"I usually just figure out which headers I can delete with trial and error - I keep removing headers until the request starts failing. In general you probably don't need Accept, Referer, Sec-, DNT, User-Agent, and caching headers though."

IME, this "header minimisation" works for almost any website, or "endpoint". IOW, it is useful outside of "APIs". As a matter of practice, I minimise headers automatically with a forward proxy.^1
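
Scripted, the trial-and-error version looks roughly like this (a sketch; the URL and headers are made up):

    import requests

    URL = "https://example.com/api/data"
    headers = {                               # the full set copied from the browser
        "User-Agent": "Mozilla/5.0 ...",
        "Accept": "application/json",
        "Referer": "https://example.com/page",
        "Cookie": "session=abc123",
    }

    for name in list(headers):
        trial = {k: v for k, v in headers.items() if k != name}
        if requests.get(URL, headers=trial).status_code == 200:
            headers.pop(name)                 # the server didn't need it; drop it for good
    print("minimal header set:", headers)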

Thus, one can send less data to "tech" companies and still receive the same results. We know that data received by "tech" companies is used at every opportunity to support surveillance and online advertising. The most well-known example is perhaps "fingerprinting". Given a choice between sending more data or less data to "tech" companies, what is the choice that, in the aggregate,^2 lends itself better to increased surveillance and online advertising?

If the author here can send fewer headers and still get the desired result, then it stands to reason sending those extra headers benefits someone else besides the user. Send more data, not less, to make surveillance and online advertising easier. "Tech" companies will often defend data collection by suggesting that data supplied in headers are being used to "improve the user experience" or some such, and this may well be true for many cases, but the "fingerprinting" example exemplifies how there can also be another purpose. Data can be multi-purpose.

1. An added benefit is one does not need to fiddle with the browser to copy HTTP headers^3 as they are all easily accessible in the proxy logs.

2. Here, "in the aggregate" means "if every user makes the same choice".

3. The online advertising company or its business partner (e.g., Mozilla) could change the browser, without notice, at any time.


I do a bit of scraping for hobby projects, and much of that comes down to basically this (but I do it in node instead of python). Sometimes you need to use jsdom or puppeteer, but the second step (after checking if there are official data dumps made available or some official API) is always checking the full data flow in devtools if there is some undocumented way to more quickly get the raw data I want.


A trick that works great for me: filter the browser network pane by XHR, then sort by size - this usually ends up with the most interesting JSON responses listed at the top.


I found puppeteer very nice to script against if you need a real headless browser:

https://github.com/puppeteer/puppeteer


I've just started switching from Puppeteer to Playwright - pretty much the exact same functionality, but in a more actively maintained, tighter package (and with great language bindings for JavaScript, Python, .NET, and Java).

I wrote a bit about that here: https://simonwillison.net/2022/Mar/10/shot-scraper/#how-it-w...
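
For the undocumented-API angle specifically, the response hooks are handy for listing the JSON calls a page makes. A rough sketch with the Python bindings (the URL is hypothetical):

    from playwright.sync_api import sync_playwright

    def log_json(response):
        if "application/json" in response.headers.get("content-type", ""):
            print(response.url)    # candidate undocumented API endpoints

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", log_json)
        page.goto("https://app.example.com/")
        page.wait_for_load_state("networkidle")
        browser.close()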


[Playwright](https://playwright.dev/) (Node / Python) is my current preferred - mainly because I seem to have fewer reliability issues with the browser starting/stopping cleanly (although it's never perfect with any of the tools I've tried).



I used selenium. Really like it and very well maintained.


You would be surprised to find out that some web servers are capable of detecting browser emulation through curl or Python’s requests lib. Try programmatically scrolling through Instagram photos. It will work if you use curl, but it will not work using Python’s requests lib. Not sure how they detect it - maybe related to timing of packets.


The most trivial check a website owner can do is checking the user-agent, which Python Requests automatically sets to show its name unless you configure your own. A trivial way to work around it is to set your own user-agent to one that looks like a browser.

Specifically regarding Instagram, you can take a look at the implementation of https://github.com/dilame/instagram-private-api to understand more workarounds, as Instagram is getting better and better at working against the workarounds.


In the particular case of Instagram as GP mentions, I'm guessing the devs don't go off of user agent, since curl's default user agent is "curl/<installed version num>". Even if they are going off user agent, it seems strange to block Requests but not curl. GP doesn't mention if they tried changing the user agent; I'd be interested to know if Instagram can guess whether the client is curl or Requests based off other heuristics.


I wonder if they have some internal tooling or monitoring that uses curl, and thus blocking it would break things.


This tool claims to replicate Firefox/Chrome's TLS handshake signature:

https://github.com/lwthiker/curl-impersonate

I haven't tried it, haven't really come across a service that blocks curl, but I'll be keeping an eye on it in case I need it.


The clowns who run the Seattle Times's website block all non-browser user-agent requests to their RSS feeds.

Except curl.

You can "curl" their RSS feed. You can open it in a browser. Anything else that doesn't lie about its User-Agent will fail.

W T F.

Somebody please go strangle those people. I had to set my RSS feed reader to impersonate curl's User-Agent.


If you still use the website via browser, I find https://github.com/richardpenman/browsercookie/ is great for working around the expiring cookie problem
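
A rough sketch of how that looks (the URL is hypothetical; assumes the library's firefox()/chrome() loaders):

    import browsercookie
    import requests

    cookies = browsercookie.firefox()   # load the live session cookies straight from the browser
    resp = requests.get("https://example.com/api/me", cookies=cookies)
    print(resp.json())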


While I've successfully used this method for public APIs, I ran into an interesting one not long ago: where authentication is performed _by IP address_.

I have a switch (I think a TP-Link TL-SG1016PE) with PoE - and a finicky PoE device that periodically needs a reboot, so I figured I'd replay turning the port on and off in the web interface. Notably, logging in does not issue me any authentication token, but I can still turn the port on and off - and can still do it via `curl`, too. But as soon as I try it on another machine? Access denied!

(Yes, I could just fake the login process the same way, but that was more work than I had time for.)


>>If I’m using a small website, there’s a chance that my little Python script could take down their service because it’s doing way more requests than they’re able to handle. So when I’m doing this I try to be respectful and not make too many requests too quickly.

What is a reasonable rate to send requests? I've done a little scraping and I wanted to do the same thing but I realized I had no idea what would be considered acceptable use and what would be unacceptable. If anyone has a heuristic they like to use I'm all ears.


If you're going to be doing it manually regardless (if not automated), then as far as I'm concerned, you could definitely just use a "normal clicking speed" rate - so a second or two between requests, non-parallel, is probably just fine. Usually if it's likely to overload the server, it's probably slower to return, too, so the server itself will slow the requests down naturally if you're not using parallel requests.

Usually you're automating these things not to get the job done that much faster, but instead just to do it without all the tedium, so a slow but asynchronous scrape is fine.
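
In code it's about as simple as it sounds (a sketch; the endpoint is made up):

    import time
    import requests

    for item_id in range(1, 50):   # whatever IDs you're after
        resp = requests.get(f"https://example.com/api/items/{item_id}")
        print(resp.status_code, len(resp.content))
        time.sleep(1.5)            # roughly "normal clicking speed"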


So I just checked the WhatsApp web app. No network activity whatsoever on a fully loaded page that has incoming messages, and then a bunch of error messages in the console about web sockets and source maps. How did they pull that off? Does Chrome not show web socket activity or service worker activity on the network tab?


You can definitely see the incoming/outgoing messages in the DevTools! Since it's a WebSocket connection, you must have the DevTools open and reload the page. Filter for the WebSocket network activity (you can quickly do this by selecting "WS"), and you'll find the WebSocket connection. Clicking on it and selecting the "Messages" sub-tab will let you see the live list of binary messages sent and received by the connection. Not too meaningful though, unfortunately.


ah ur right! thanks.


It's always a joy when you start to reverse engineer an undocumented API and find out it is cleaner/nicer than some paid APIs you've used. Paprika (cloud sync for the recipes/other data) was an example of that for me. Their API is (was; it's been a minute since I last looked at it) super RESTful and really easy to reason about, more or less just simple CRUD.


The "Copy as cURL" trick is a great idea! That makes it easy to get a succinct summary of the components of the request, including how they are doing auth. If the client in question is a desktop app, Fiddler can be a great alternative. Obviously Wireshark can see more, but Fiddler is a lot easier to use and set up in my experience.


Small nitpick on the comments about removing the headers that the browser request had made.

You probably don't need to keep Accept: */*, but if the value of Accept is anything other than */*, then you probably do want to keep it.


Julia is really an excellent teacher.


I used this approach last year to run a Twitter bot that would report when local pharmacies had 'rona vaccine appointments open up. I scraped the APIs of CVS, Rite-Aid, Walgreens, and a few other chains this way. Although I didn't get fancy and try to distill the API down to the bare minimum headers, I just called into cURL from Python with that giant command as a string.
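
i.e. nothing fancier than this kind of thing (a sketch; the real command was the giant copied-as-cURL string, and the URL here is made up):

    import json
    import subprocess

    # The pasted "Copy as cURL" command, kept as one big string
    CURL_CMD = "curl -s 'https://www.pharmacy.example.com/api/appointments' -H 'User-Agent: ...' -H 'Cookie: ...'"

    out = subprocess.run(CURL_CMD, shell=True, capture_output=True, text=True, check=True)
    appointments = json.loads(out.stdout)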


gobuster is an effective way to enumerate subdomains and their directories quickly.

https://github.com/OJ/gobuster



