I recently wrote a large article on IP geolocation and it turns out that the main source for IP geolocation is still WHOIS data: https://ipapi.is/geolocation.html
There are tools such as https://ipapi.is/ that can be used to geolocate any IPv4 and IPv6 address in the world. Sometimes those tools are not extremely accurate, but usually they are accurate to the city level.
I know about geolocation by IP address. But as I mentioned, geolocation based on my previous ISPs' IPs was never this fine-grained. What could my current ISP be doing differently from those other ISPs?
In case anyone needs an all-around good IP API with excellent threat intelligence that is not as expensive as IPinfo, I suggest using https://ipapi.is/
https://ipapi.is/ is not as good with regard to geolocation, but our hosting detection is more advanced and https://ipapi.is/ is much cheaper...
Hey, I am happy to see you here! What do you think of the event? I think you might find it interesting how these IPs were submitted. I would love to hear your thoughts.
In our previous interaction, I mentioned that I learned about CGNAT-based web scraping from your article on your personal blog. In this event, people took it to a whole other level. People were churning carrier IPs by the hundreds to win a pair of socks.
I am not sure I would say that your hosting detection is more advanced than ours, because we have already productized every aspect discussed in your steps.
Step 1: "Download a List of Top 1 Million Domain Names" → 10 million domain names available for free download here: https://host.io/rankings
Step 6: Crawl the Website and Classify the Website Text → https://host.io
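Sketched end to end, the two steps look roughly like this (a toy illustration in Python, not host.io's actual pipeline; the keyword list is invented for the example):

```python
import socket

# Naive keyword classifier standing in for Step 6's "classify the website
# text"; the keyword list is a made-up placeholder.
HOSTING_KEYWORDS = ("vps", "dedicated server", "cloud hosting", "colocation")

def classify_text(page_text: str) -> str:
    """Label a site 'hosting' if its crawled text mentions hosting products."""
    text = page_text.lower()
    return "hosting" if any(k in text for k in HOSTING_KEYWORDS) else "other"

def resolve_ipv4(domain: str) -> list[str]:
    """Step 1 follow-up: map a ranked domain to its IPv4 addresses."""
    try:
        infos = socket.getaddrinfo(domain, 80, socket.AF_INET)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []  # dead or unresolvable domain

# In the full pipeline you would iterate over the downloaded ranking,
# crawl each site, classify its text, and attribute the label to the
# IPs (and ultimately the ASNs) each domain resolves to.
print(classify_text("Order a VPS or dedicated server today"))  # hosting
print(classify_text("Daily news and weather"))                 # other
```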
This is just scratching the surface; a lot is happening internally. We have a team of talented engineers and a decade of experience. I am not trying to be disrespectful; I just wanted to share the information, that's all.
Our pricing philosophy is an uncompromising approach to IP data quality, and organizations pay for that quality. However, for developers, small businesses, and students, we provide a generous yet highly accurate database and API for geolocation, free to access: 50k reqs/month, tokenless API access, and a free IP to Country ASN database.
In general ipinfo.io has the best data and the best product out there in the IP API niche. It is what it is. Your data is the most accurate and most updated.
Hosting detection is a finite process, meaning that there is a finite number of hosting providers out there to detect, after all. The challenge lies in staying up to date.
So maybe we can make a compromise and state that ipinfo.io is likely a bit better than ipapi.is in hosting detection.
But ipapi.is for sure detects some hosting providers that you don't. Examples:
Thank you. I hope you don't think I am pushing back or being sarcastic in any way because I personally like your work and really appreciate your comment. This is a typical conversation in a very "HN" way. Sometimes, it's hard to determine tone in written communication.
I appreciate that you make a wide variety of data accessible, and that has a positive impact. Our goal is to be the most accurate IP data provider, with geolocation as our foremost priority. It requires a "significant" amount of effort simply to be better than the rest of the industry. If customers want that bump in quality, they will pay for our services. The ProbeNet has about 700 servers now, and nobody else is investing in that level of infrastructure for IP data accuracy.
The whole game for us is accuracy. Consider our free IP to Country database. It is so accurate that it goes down to individual IP address (`/32`) levels for country-level locations. This high level of accuracy, however, results in a bigger file size. Some developers trade accuracy for a smaller file size. While rounding the ranges up to `/24` and updating them twice a week would reduce cost, it does not align with our philosophy.
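The /32-vs-/24 trade-off is easy to demonstrate with Python's stdlib `ipaddress` module (toy data, not our actual database):

```python
import ipaddress

# Toy slice of a country-level database at /32 granularity. Two German
# /32s sit next to an IP mapped to France inside the same /24 -- a
# database rounded to /24 would have to mislabel one of them.
entries = {
    "203.0.113.6/32": "DE",
    "203.0.113.7/32": "DE",
    "203.0.113.8/32": "FR",
}

# Collapse adjacent/overlapping networks per country: this shrinks the
# file without losing accuracy, unlike blanket rounding to /24.
by_country = {}
for cidr, country in entries.items():
    by_country.setdefault(country, []).append(ipaddress.ip_network(cidr))

collapsed = {
    country: [str(net) for net in ipaddress.collapse_addresses(nets)]
    for country, nets in by_country.items()
}
print(collapsed)  # the two DE /32s merge into 203.0.113.6/31
```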
> ipapi.is for sure detects some hosting providers that you don't
For the last one, it is not associated with an ASN, so the company field can be used as a proxy. Since that company's IP data is a bit vague, labeling it as hosting or business is difficult.
We have four "type" values: business, education, hosting, and ISP. This applies to both ASNs and companies/organizations. For us, it is not simply a binary classification of hosting versus non-hosting. Instead, we categorize IP ranges based on a statistical model with assigned weights. These ranges are then aggregated to determine AS types. When customers ask why something is classified the way it is, we can show them the underlying reasoning.
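A toy sketch of what that looks like (the signal names and weights below are invented for illustration; the real model is far richer):

```python
from collections import defaultdict

# A minimal sketch of weighted, multi-signal range classification and
# AS-level aggregation. Signals and weights are hypothetical examples.
TYPES = ("business", "education", "hosting", "isp")

SIGNAL_WEIGHTS = {
    "reverse_dns_mentions_vps": {"hosting": 2.0},
    "domain_density_high":      {"hosting": 1.5, "business": 0.5},
    "whois_university":         {"education": 3.0},
    "cgnat_pool":               {"isp": 2.5},
}

def classify_range(signals: list[str]) -> dict[str, float]:
    """Sum signal weights into a score per type for one IP range."""
    scores = dict.fromkeys(TYPES, 0.0)
    for signal in signals:
        for t, w in SIGNAL_WEIGHTS.get(signal, {}).items():
            scores[t] += w
    return scores

def aggregate_asn(range_scores: list[dict[str, float]]) -> str:
    """Aggregate per-range scores into a single AS type (argmax of sums)."""
    totals = defaultdict(float)
    for scores in range_scores:
        for t, w in scores.items():
            totals[t] += w
    return max(totals, key=totals.get)

ranges = [
    classify_range(["reverse_dns_mentions_vps", "domain_density_high"]),
    classify_range(["domain_density_high"]),
]
print(aggregate_asn(ranges))  # hosting
```

The point of the weighted scores (rather than a binary flag) is exactly the auditability mentioned above: for any classification, the per-signal contributions can be shown as the underlying reasoning.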
But still, I grok and understand you. I really appreciate your feedback.
But I picked just one faulty classification of ipinfo.io, and that's not fair, I know. I only wanted to point out that what you are doing is exactly what https://ipapi.is/ is doing, and that we both make mistakes.
----
You are using the 700 measuring servers to interpolate geolocations of IP addresses, right?
That works sometimes, but more often than not it does not. It does not scale either.
Active latency triangulation of every IPv4 address (let's not even speak of IPv6) is simply not possible. The reasons are manifold:
- Most hosts don't reply to ICMP
- Many routers block ICMP traffic, or they throttle / downgrade it, thus skewing measurements
- Traffic from your probing servers is probably not handled the same way as normal residential ISP traffic
- You would have to constantly measure the entire IPv4 space, since IPs are constantly reassigned, which is simply not possible with only 700 servers
Latency triangulation works in theory, but in practice it is just not applicable to the full IP space.
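For the curious, the geometry that makes (and limits) latency triangulation is just a distance bound per probe (illustrative numbers below, not real measurements):

```python
import math

# An RTT puts an upper bound on the great-circle distance between probe
# and target, because the signal cannot travel faster than light in
# fiber (~2/3 of c). Each probe therefore yields a disc the target must
# lie in; the target is somewhere in the intersection of the discs.
C_FIBER_KM_PER_MS = 299_792.458 / 1000 * (2 / 3)  # ~200 km per ms, one way

def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on probe-to-target distance implied by a round-trip time."""
    return (rtt_ms / 2) * C_FIBER_KM_PER_MS

# Hypothetical RTTs from three probes, in milliseconds.
probes = {"frankfurt": 4.0, "paris": 9.0, "warsaw": 14.0}
bounds = {city: round(max_distance_km(rtt)) for city, rtt in probes.items()}
print(bounds)  # radius of each probe's disc, in km
```

Every problem in the list above (ICMP throttling, asymmetric routing, unrepresentative paths) inflates the measured RTT, which only widens the discs; the bound stays valid but the intersection gets uselessly large.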
Having said that, active geolocation with probing servers is still better than not doing it :D
Latency triangulation works much better in a passive way, meaning that a client visits a server under your control and you triangulate the client, with JS for example (WebSockets).
But I doubt that ipinfo.io has a significant share of the Internet's traffic...
You're not missing anything - those are all real problems! We've done a lot of work to overcome many of them, and others are active areas of research and development for us.
We do scan and traceroute all IPv4 and (known) IPv6 space, once a week, so our measurement data can be out of date by at most a week. We have other signals that an IP might have moved within that timeframe though.
We definitely don't have a significant share of total internet traffic - but we do get 6BN API requests a day.
Great idea with latency triangulation. I have used latency information for a lot of things, especially VPN and proxy detection.
But I didn't assume you could obtain locations that accurate. I am honestly impressed. Latency triangulation with 600 servers gives a very good approximation. Nice, man!
Some questions:
- ICMP traffic is penalised/degraded by some ISPs. How do you deal with that?
- In order to geolocate every IPv4 address, you need to constantly ping billions of IPv4 addresses. How do you do that? Do you only ping an arbitrary IP from each allocated inetnum/NetRange?
- Most IP addresses do not respond to ICMP packets; only some servers do. How do you deal with that? Do you find the router in front of the target IP and geolocate the closest responding router to the target (traceroute)?
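To illustrate the traceroute fallback in the last question (hop data here is made up):

```python
from typing import Optional

# When the target won't answer probes, a common fallback is to use the
# last traceroute hop that *did* respond as a location proxy for the
# target. Tuples are (hop_number, responding_ip_or_None, rtt_ms_or_None).
trace = [
    (1, "198.51.100.1", 0.8),
    (2, "203.0.113.45", 3.1),
    (3, None, None),          # router drops or deprioritizes ICMP
    (4, "192.0.2.254", 7.9),
    (5, None, None),          # the target itself never replies
]

def closest_responding_hop(hops) -> Optional[tuple]:
    """Return the last hop that answered; geolocate that instead of the target."""
    responders = [h for h in hops if h[1] is not None]
    return responders[-1] if responders else None

print(closest_responding_hop(trace))  # (4, '192.0.2.254', 7.9)
```

The obvious caveat: the last responding router can sit a long way from the target (e.g. the far end of a long-haul link), so the proxy location carries extra uncertainty.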
I used to do freelance web scraping, and that article felt like some kind of forbidden knowledge. After reading the article, I went down the rabbit hole and actually found a Discord server that provided carrier-grade traffic relay from a van which contained dozens of phones.
As for the questions... we'll have to wait a bit; someone from our engineering team might come by and reply.
By the way, while I have you here: have you considered converting the CSV files to MMDB format? I was planning to do that with our mmdbctl tool later today.
I'm very curious why you'd do VPN/proxy detection...
But at a previous company I worked at, which ran a very large chunk of the internet, we indexed nearly the entire internet (even large portions of the dark web) approximately every two weeks, with about 500 servers doing that non-stop. So I think it is quite feasible with 600 servers.
In the media streaming business, rightsholders will require that you check for VPNs and proxies, in addition to countries, when deciding whether a given viewer can stream a given title.
Does that actually work? That could explain an issue with a particular streaming service I use. There are currently some ongoing BGP routing issues affecting my ISP. When trying to stream, it says I'm using a proxy, so given the incredible route my packets are taking, that might be it. What's funny is that the only way to watch this service right now is to use a VPN.
Routing should not impact the detection; it's usually based on MaxMind's anonymous-IP/datacenter database, keyed on your IP. Accuracy won't be 100%, of course, but you have to show compliance.
I doubt it. According to that database, my IP is in a totally different country, but I'm served the correct content. Despite my efforts to fix this for years...
Why is this getting downvoted? It seems to me that a lot of the media-focused anti-piracy tooling is essentially a performance of toughness to make rightsholder execs comfortable. Everybody accepts you can't stop piracy entirely, and nobody's willing to say, "Fuck it, we'll compete on convenience and strong consumer relationships," so we all put up with this weird middle ground of performative DRM and the like. With only the rare occasional bit of honesty, as from Weird Al: https://sfba.social/@williampietri/110906012997848549
This is correct. Imagine the days of yore, some two decades and change ago, when I was charged with putting some music reserves "online" for streaming ...
[Harp music, progressive diagonal wave distortions through the viewport ...]
We had two layers of passwords (one to get to the webpage for the class, one when actually streaming via the client, which was RealPlayer) as well as an IP range restriction to campus (you live off campus? So sorry) because our lawyers were worried about what the RIAA's lawyers would find sufficient in the wake of a bunch of Napster-baited lawsuits launched at universities. The material itself was largely limited to snippets.
I wanted to say, "Calm down, have a martini or something. College students are just not going to go wild to download 128 kbps segments of old classical music," but alas I was not in charge.
I’m duplicating my comment elsewhere in this thread, so each serves as a direct reply to the different geolocation providers in this thread, in the hope that it will be recognized as a problem with data that implies that it’s more precise than it really is:
> On one hand, I love that there’s some good alternatives in the geolocation space, but misleading geolocation precision can lead to very undesirable side effects[0].