I recently wrote about the limits of these kinds of fingerprinting tests. They tend to focus too heavily on uniqueness without taking stability into account. Moreover, the sample size is often quite small, which tends to make a lot of users look artificially unique.
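To make the sample-size point concrete, here is a toy Python simulation (the fingerprint distribution and numbers are invented; it only illustrates the shape of the effect): the share of users whose fingerprint appears only once shrinks a lot as the sample grows.

    import random
    from collections import Counter

    # Toy model: fingerprints drawn from a heavy-tailed distribution
    # (a few very common configs plus a long tail of rarer ones).
    POPULATION = [f"fp_{i}" for i in range(5000)]
    WEIGHTS = [1 / (i + 1) for i in range(5000)]   # Zipf-like tail

    def share_unique(sample_size):
        """Fraction of sampled users whose fingerprint appears only once."""
        sample = random.choices(POPULATION, weights=WEIGHTS, k=sample_size)
        counts = Counter(sample)
        return sum(1 for fp in sample if counts[fp] == 1) / sample_size

    for n in (500, 5_000, 50_000, 500_000):
        print(f"n={n:>7}: {share_unique(n):.1%} of users look unique")

With a small sample, most of the long tail shows up only once, so far more users look "unique" than would against the real population a site actually sees.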
This is great, and exactly the kind of nuance I almost never see when this topic comes up. Thanks for posting this. Far too often, the pro-privacy crowd is much more _upset_ than they are precise, and, to the point of your article, they end up spending extra effort without really accomplishing much.
Interesting article. I’ve been curious for a while about how residential proxy IPs are collected too. Many come from shady browser extensions or mobile apps, especially free VPNs (wink wink Hola VPN). People often don’t realize they are turning their device into an exit node.
Some time ago I started to track this as a side project (I work in bot detection and was always surprised by how many residential proxies show up in attacks). It started just out of curiosity. Now I collect proxy IPs, which provider they belong to, and how often they are seen. I also publish stats here:
https://deviceandbrowserinfo.com/proxy-api/stats/proxy-db-30...
For example, in the last 30 days I saw more than 120K IPs from Comcast and nearly 100K from AT&T.
I also maintain an open blocklist of IPs and IP ranges, mostly effective against data center and ISP proxies. Residential IPs are harder, since they are often shared with legit users:
https://github.com/antoinevastel/avastel-bot-ips-lists
Even if you can’t block all of them, tracking volume and reuse gives useful signal.
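If you want to consume a list like this programmatically, here is a minimal sketch. It assumes a plain-text file with one IP or CIDR range per line and "#" comments; check the repo for the actual file names and layout.

    import ipaddress

    def load_blocklist(path):
        """Load one IP or CIDR range per line; "#" lines are comments."""
        networks = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    networks.append(ipaddress.ip_network(line, strict=False))
        return networks

    def is_listed(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    # Example (the file name is hypothetical):
    # nets = load_blocklist("avastel-blocklist.txt")
    # is_listed("203.0.113.7", nets)

As noted above, treat a hit as one signal among others rather than an instant block.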
Hola/Luminati rebranded as “Bright Data” and now pays mobile developers to embed their proxy SDK into mobile apps. Apple and Google should put a stop to this practice.
Hola VPN is such an interesting case of a money printer: host a simple VPN and present it as free, give the users datacenter IPs that are easy to detect. Meanwhile you get their precious residential IPs and print millions a month.
Thanks for the great read; there is so much to unpack from that article. The click fraud stuff is to be expected, and keeping track of everything that goes through their proxy is also expected, but copying files is crazy and could turn into a class action.
That said, if you are doing something shady or grey-area to get ahead, you had best give everyone a cut of the pie, especially your blood brother.
I would add that your chances of hosting a proxy node increase by 1% with each free app you install these days. We catch them easily at visitorquery.com, but the residential proxy business is rampant, and probably half of the nodes are infected devices: Android TVs, routers and, of course, mobile apps.
Author here: I work in bot detection, and wrote this post to explain why privacy-conscious users (VPNs, Brave, LibreWolf, etc.) often get flagged or blocked by anti-bot systems.
I’ve seen a lot of frustration in threads here, so I wanted to offer a technical perspective on why these false positives happen, and how detection systems interpret signals from non-mainstream setups.
Author here: There’ve been a lot of HN threads lately about scraping, especially in the context of AI, and with them, a fair amount of confusion about what actually works to stop bots on high-profile websites.
This post uses TikTok’s obfuscated JavaScript VM (recently discussed on HN) as a case study to walk through what modern bot defenses look like in practice. It’s not spyware; it’s an anti-bot measure designed to make life harder for HTTP clients and non-browser automation.
Key points:
- HTTP-based bots skip JS, so TikTok hides detection logic inside a JavaScript VM interpreter
- The VM computes signals like webdriver checks and canvas-based fingerprints
- Obfuscating this logic in a custom VM makes it significantly harder to reimplement outside the browser (and so to scale an attack)
The goal isn’t to stop all bots; it’s to push attackers into full browser environments, where detection is more feasible.
The post covers why simple solutions like "just require JS" don’t hold up, and why defenders use techniques like VM-based obfuscation to increase attacker cost and reduce replayability.
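To give a rough idea of the shape of the technique, here is a toy sketch, in Python for brevity rather than the JS the real interpreter ships as, and with invented opcodes and signals, nothing TikTok-specific: the detection checks only exist as opaque bytecode that a tiny interpreter executes.

    # Toy stack VM: the detection logic lives in opaque bytecode, not in
    # readable source. Opcodes, bytecode and signals are all invented for
    # illustration; the real thing runs as obfuscated JS in the browser.
    PUSH_SIGNAL, PUSH_CONST, XOR, EMIT = 0, 1, 2, 3

    def run(bytecode, signals, out):
        stack, pc = [], 0
        while pc < len(bytecode):
            op, arg = bytecode[pc], bytecode[pc + 1]
            pc += 2
            if op == PUSH_SIGNAL:        # read an environment signal by index
                stack.append(signals[arg])
            elif op == PUSH_CONST:       # push an obfuscation constant
                stack.append(arg)
            elif op == XOR:              # mix values so raw signals never
                b, a = stack.pop(), stack.pop()  # appear in the payload
                stack.append(a ^ b)
            elif op == EMIT:             # append to the telemetry payload
                out.append(stack.pop())

    # signals[0] could encode a webdriver-style flag, signals[1] a bucketed
    # canvas hash, etc.
    signals = [1, 0x5A]
    payload = []
    run([PUSH_SIGNAL, 0, PUSH_CONST, 0x2F, XOR, EMIT,
         PUSH_SIGNAL, 1, PUSH_CONST, 0x11, XOR, EMIT], signals, payload)
    print(payload)   # the opaque values the server decodes back into signals

To forge the payload without a browser, an attacker has to reimplement the interpreter and keep up with every bytecode change, which is where the extra cost comes from.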
The attacker had fully reverse engineered the signal collection and solved-state flow, including obfuscated parts. They could forge all the expected telemetry.
This kind of setup is pretty standard in bot-heavy environments like ticketing or sneaker drops. Scrapers often do the same to cut costs. CAPTCHA and PoW mostly become signal collection protocols; if those signals aren’t tightly coupled to the actual runtime, they get spoofed.
Yeah, not (too) surprising after a few years in the anti-bot industry. Last week I looked into a Binance CAPTCHA solver that didn’t use a browser at all, just a basic HTTP client. The attacker had reverse engineered the entire signal collection and response flow, including how the CAPTCHA was marked as solved. They were able to forge the expected telemetry despite some obfuscation.
https://blog.castle.io/what-a-binance-captcha-solver-tells-u...
This is pretty standard now in bot-heavy spaces like ticketing or sneaker drops. CAPTCHA often just ends up being a protocol to collect signals, and if those aren’t tightly bound to the browser/runtime, they get spoofed.
Also not surprised PoW isn’t holding up. Someone reverse engineered the PerimeterX PoW and converted it to CUDA to accelerate solving:
https://github.com/re-jevi/PerimiterXCudaSolver/blob/main/po...
At some point, it’s hard to make PoW slow enough for bots without also killing UX for humans on low-end devices.
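For context on that last point, this is roughly what a hash-based PoW costs to solve (a generic hashcash-style sketch, not PerimeterX's actual scheme; the difficulty values are arbitrary). Every extra bit doubles the expected work, and a CUDA solver does that work orders of magnitude faster than a phone's JS engine, so the knob that hurts bots also hurts low-end devices.

    import hashlib
    import itertools
    import time

    def solve(challenge: bytes, difficulty_bits: int) -> int:
        """Find a nonce whose SHA-256 has `difficulty_bits` leading zero bits.
        Generic hashcash-style PoW, not any vendor's actual scheme."""
        target = 1 << (256 - difficulty_bits)
        for nonce in itertools.count():
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    for bits in (14, 17, 20):
        start = time.perf_counter()
        solve(b"example-challenge", bits)
        print(f"{bits} bits: {time.perf_counter() - start:.2f}s in pure Python")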
Author here. A few weeks ago, someone posted a link on Reddit to an open-source CAPTCHA solver made for Binance’s slider challenge. It’s written in Python and works without using a browser. Just a custom HTTP client, some image matching, and basic reverse engineering.
I was curious and decided to dig into it. I wrote a long breakdown of how it works, how it solves the challenge, and what this says about how bots are built today. Many bots use headless browsers, but this one doesn’t, and it still gets through.
One of the main takeaways is how effective this kind of non-browser approach can be when CAPTCHA is deployed in isolation, without other layers like continuous behavioral checks.
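The post goes through the real implementation; as a generic illustration of the image-matching step (OpenCV template matching, not the solver's exact code), finding the slider offset mostly comes down to locating the puzzle piece in the background image:

    import cv2

    def find_gap_offset(background_path: str, piece_path: str) -> int:
        """Locate the puzzle piece in the background image and return its
        x offset. Generic template-matching sketch, not the solver's code."""
        background = cv2.imread(background_path, cv2.IMREAD_GRAYSCALE)
        piece = cv2.imread(piece_path, cv2.IMREAD_GRAYSCALE)
        # Edges tend to be more robust than raw pixels for these puzzles.
        bg_edges = cv2.Canny(background, 100, 200)
        piece_edges = cv2.Canny(piece, 100, 200)
        result = cv2.matchTemplate(bg_edges, piece_edges, cv2.TM_CCOEFF_NORMED)
        _, _, _, max_loc = cv2.minMaxLoc(result)
        return max_loc[0]  # x coordinate of the best match

The computed offset is then replayed over plain HTTP along with the forged telemetry, no browser involved.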
Hi, author here.
I wrote a blog post where I analyze Hidemium, a popular anti-detect browser. I break down the techniques it uses to spoof fingerprints and show how JavaScript feature inconsistencies can reveal its presence.
Of course, JS feature detection isn’t a silver bullet; attackers can adapt. I also discuss the limitations of this approach and what it takes to build more reliable, environment-aware detection systems that work even against unfamiliar tools.
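As a deliberately simplified example of the inconsistency idea (the attribute names are standard navigator properties, but this particular rule set is invented for illustration; real checks are far more numerous and subtle), the server-side logic boils down to comparing what the User-Agent claims with what the collected JS signals actually report:

    def fingerprint_inconsistencies(fp: dict) -> list:
        """Flag contradictions between what the UA claims and what the
        collected JS signals report. Illustrative rules only."""
        issues = []
        ua = fp.get("userAgent", "")
        if "Chrome" in ua and not fp.get("hasWindowChrome"):
            issues.append("UA claims Chrome but window.chrome is missing")
        if fp.get("webdriver"):
            issues.append("navigator.webdriver is true")
        if "Windows" in ua and not fp.get("platform", "").startswith("Win"):
            issues.append("UA OS does not match navigator.platform")
        if "Chrome" in ua and fp.get("pluginsLength") == 0:
            issues.append("Chrome UA with an empty plugins list")
        return issues

    # A spoofed profile that forgot to patch everything consistently:
    print(fingerprint_inconsistencies({
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
        "hasWindowChrome": False,
        "webdriver": False,
        "platform": "Linux x86_64",
        "pluginsLength": 0,
    }))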
So far I see ~3M distinct IP addresses per 30 days, 1.7M of which are fresh proxy IPs. The DB contains only verified IP addresses through which I've been able to route traffic; it DOESN'T rely on third-party/open-source data sources.
The DB contains different types of proxies:
- Residential
- ISP
- Data center
I don't include mobile proxies since they're heavily shared, so knowing that an IP address was used as a proxy at some point is basically useless.
Regarding your remark: indeed, many residential IPs are shared, including IPs of legitimate users who may have a shady app routing traffic through their device. That's why I don't recommend blocking on IP addresses as-is; it's meant more as a datapoint/signal to enrich your anti-fraud/anti-bot system.
For the blocklist, however, I analyze the IPs over bigger time frames, look at the percentage of IPs in each range that were used as proxies, and generate a confidence score to indicate whether or not the range is safe to block.
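The exact scoring isn't something I've published, but the shape is roughly this (a sketch: the /24 aggregation, the 16-IP floor and the coverage formula are simplifications, not the production logic):

    import ipaddress
    from collections import defaultdict

    def range_block_scores(proxy_ips, prefix=24):
        """Aggregate verified proxy IPs per prefix and derive a confidence
        score per range. `proxy_ips` is an iterable of IP strings seen as
        proxies over the analysis window; thresholds are illustrative."""
        per_range = defaultdict(set)
        for ip in proxy_ips:
            net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            per_range[net].add(ip)
        scores = {}
        for net, ips in per_range.items():
            coverage = len(ips) / net.num_addresses  # share of the range seen as proxies
            # Require enough distinct sightings before trusting the coverage
            # number for a block decision.
            scores[net] = coverage if len(ips) >= 16 else 0.0
        return scores

A range where most addresses were verified as proxies over the window (typical for data center and ISP ranges) scores close to 1.0; a range with a handful of sightings stays a weak signal only.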
I’m working on a scraping project at the moment, so I’m looking at this too, but from the other end. Super low volume though, so pretty tame: the emphasis is on success rate more than throughput.
I bought a 4G dongle to use as a last resort if nothing else gets through, and I’m also investigating IPv6.
Using a 4G dongle makes it easier to hide in the crowd indeed. Since your traffic will go through heavily shared mobile IPs, probably with thousands of users behind them, anti-bot vendors won't/shouldn't block per IP, but per fingerprint/session cookie instead.
https://blog.castle.io/what-browser-fingerprinting-tests-lik...
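Concretely, that means keying whatever counters or blocks you maintain on the fingerprint/session rather than on the IP; a minimal sketch (the fingerprint hash and the 100-requests-per-minute threshold are placeholders):

    import hashlib
    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 100          # placeholder threshold
    _buckets = defaultdict(list)

    def should_block(fingerprint: str, session_id: str) -> bool:
        """Rate-limit on fingerprint + session, not on the (shared) IP."""
        key = hashlib.sha256(f"{fingerprint}:{session_id}".encode()).hexdigest()
        now = time.time()
        bucket = _buckets[key]
        bucket[:] = [t for t in bucket if now - t < WINDOW_SECONDS]  # drop old hits
        bucket.append(now)
        return len(bucket) > MAX_REQUESTS

A CGNAT or mobile IP with thousands of legit users behind it never enters the decision; only the individual client's fingerprint and session do.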