These things need hours or days of uptime to do the job properly and to stay up to date. There are millions of nodes and torrents, and to be non-abusive you have to issue requests at a somewhat sedate pace. Activity also moves with the sun, since a lot of the nodes are torrent clients running on people's home machines. And there are lots of buggy or malicious implementations out there that you have to deal with. So you'd want to run it as a daemon, with the CLI as a frontend to the daemon or its database.
The UI could be simple. I'm skeptical that an implementation could be both good and simple.
That's if you're imagining a single node discovering the whole DHT. What if you instead fire off a map-reduce of limited-run DHT explorations starting from different DHT ring positions, where each agent just crawls its slice and emits what it finds on stdout as it finds it?
(In a sense, I suppose this would still be a "daemon", but that daemon would be the map-reduce infrastructure.)
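To make the shape concrete, something like the sketch below, where `dht-slice-crawl` is a made-up name for the per-slice agent rather than an existing tool, and the slice count is arbitrary:

```python
# Hypothetical fan-out: one short-lived crawler per keyspace slice, each
# printing infohashes on stdout as it finds them. "dht-slice-crawl" does
# not exist; it stands in for the per-slice agent being proposed here.
import subprocess
import sys

N = 16  # number of keyspace slices / agents (arbitrary)
workers = [
    subprocess.Popen(
        ["dht-slice-crawl", "--start", format(i * (2**160 // N), "040x")],
        stdout=subprocess.PIPE,
        text=True,
    )
    for i in range(N)
]
# Crude "reduce" step: drain each worker in turn. A real merge would
# interleave the streams (selectors, asyncio, ...) rather than read serially.
for w in workers:
    for line in w.stdout:
        sys.stdout.write(line)
    w.wait()
```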
I don't quite understand what you're proposing here. Generally you only control and operate ~1 node per IPv4 address or per IPv6 /64.
All other nodes are operated by someone else, so they don't cooperate on anything beyond what the protocol specifies. Which means everyone is their own little silo. If you want a list of all currently active torrents (millions) then you have to do it with 1 or a handful of nodes, depending on how many IPs you have. DHTs are not arbitrary-distributed-compute frameworks, they're a quite restrictive get/put service.
BEP51[0] does let you query other nodes for a sample of their keys (infohashes), but what they can offer is limited by their vantage point on the network, so you need to go around and ask all those millions of nodes. And since it's all random you can't really "search" for anything, you can only sample. And that just gives you 20-byte keys. Afterwards you need to do a lot of additional work to turn those into human-readable metadata.
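For the curious, a single BEP51 probe is just one bencoded UDP request/response pair. A minimal Python sketch follows; the bootstrap host and timeout are placeholders, a real crawler would ask nodes it has discovered itself, and plenty of nodes simply don't implement `sample_infohashes`:

```python
import os
import socket

# Minimal bencode encode/decode, just enough for KRPC messages.
def benc(x):
    if isinstance(x, bytes):
        return str(len(x)).encode() + b":" + x
    if isinstance(x, int):
        return b"i" + str(x).encode() + b"e"
    if isinstance(x, list):
        return b"l" + b"".join(benc(v) for v in x) + b"e"
    if isinstance(x, dict):
        return b"d" + b"".join(benc(k) + benc(v) for k, v in sorted(x.items())) + b"e"
    raise TypeError(type(x))

def bdec(buf, i=0):
    if buf[i:i+1] == b"i":                      # integer
        end = buf.index(b"e", i)
        return int(buf[i+1:end]), end + 1
    if buf[i:i+1] == b"l":                      # list
        out, i = [], i + 1
        while buf[i:i+1] != b"e":
            v, i = bdec(buf, i)
            out.append(v)
        return out, i + 1
    if buf[i:i+1] == b"d":                      # dict
        out, i = {}, i + 1
        while buf[i:i+1] != b"e":
            k, i = bdec(buf, i)
            out[k], i = bdec(buf, i)
        return out, i + 1
    colon = buf.index(b":", i)                  # byte string
    n = int(buf[i:colon])
    return buf[colon+1:colon+1+n], colon + 1 + n

def sample_infohashes(addr, target, timeout=5.0):
    """Send one BEP51 sample_infohashes query and return the decoded reply.

    Raises a timeout error if the node never answers; many won't.
    """
    query = benc({
        b"t": b"si",
        b"y": b"q",
        b"q": b"sample_infohashes",
        b"a": {b"id": os.urandom(20),    # our (arbitrary) querying node ID
               b"target": target},       # keyspace position we're asking about
    })
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(query, addr)
        data, _ = s.recvfrom(65536)
    reply, _ = bdec(data)
    return reply

if __name__ == "__main__":
    # Placeholder target node; a real crawler asks nodes it has discovered
    # itself, and handles error replies from nodes that lack BEP51 support.
    reply = sample_infohashes(("router.bittorrent.com", 6881), os.urandom(20))
    r = reply.get(b"r", {})
    samples = r.get(b"samples", b"")
    print("num:", r.get(b"num"), "interval:", r.get(b"interval"))
    for i in range(0, len(samples), 20):
        print(samples[i:i+20].hex())            # raw 20-byte infohashes
```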
I mean, what I'm describing is the same thing that BEP51 mentions as a motivation:
> DHT indexing already is possible and done in practice by passively observing get_peers queries. But that approach is inefficient, favoring indexers with lots of unique IP addresses at their disposal. It also incentivizes bad behavior such as spoofing node IDs and attempting to pollute other nodes' routing tables.
If you have a lot of IP addresses (from e.g. AWS Lambda) then you can partition the DHT keyspace across a large number of nodes and very quickly discover everything in the keyspace.
The trick is that, since BEP51 exists, you don't need to have all these nodes register themselves into the hash-ring (at arbitrary spoofed positions) to listen. You can just have them all independently probe the hash-ring "from the outside": each one makes a short-lived exchange with a registered node (without registering itself), presents a spoofed node ID, fires off one `sample_infohashes` request, reads the response, and moves on. The lack of registration shouldn't make any difference, as long as they never need anyone to connect back to them.
Which is why I say that these are just "crawler agents", not "nodes" per se. They don't start up P2P at all — to them, this is a one-shot client/server RPC conversation, like a regular web crawler making HTTP requests!
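So the fan-out is basically: give each agent an evenly spaced target in the 160-bit keyspace, let it walk toward that target with `find_node`, and have it fire one `sample_infohashes` at the nodes it meets (e.g. with a probe like the one sketched earlier). A rough sketch of the partitioning, with a made-up agent count:

```python
# Sketch only: evenly spaced starting targets across the 160-bit keyspace,
# one per crawler agent / IP address. Each agent would bootstrap, walk
# toward its target, and issue sample_infohashes requests along the way.
N = 256  # hypothetical number of agents / addresses at your disposal

def agent_targets(n: int) -> list[bytes]:
    step = 2**160 // n
    return [(i * step).to_bytes(20, "big") for i in range(n)]

for i, target in enumerate(agent_targets(N)):
    print(f"agent {i:3d} -> target {target.hex()}")
```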
Oh, I've already implemented something[0] like that. It doesn't need Lambdas or anything "cloud scale" like that. You "just" need a few dozen to a hundred IP addresses assigned to one machine, and then you run a multi-homed DHT node on it to passively observe traffic from multiple points in the keyspace.
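Without claiming that's how [0] does it internally, the multi-homed shape is roughly: bind one DHT socket per local address and give each one its own node ID, so every socket sits at a different point in the keyspace. The addresses below are placeholders:

```python
# Sketch of a multi-homed listener: one UDP socket per local address, each
# with its own (random) node ID, i.e. its own vantage point in the keyspace.
# The addresses are placeholders; binding only works for IPs that are
# actually assigned to this machine.
import os
import socket

LOCAL_ADDRS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]  # placeholders

listeners = []
for ip in LOCAL_ADDRS:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((ip, 6881))        # one DHT endpoint per address
    node_id = os.urandom(20)     # a distinct position in the keyspace
    listeners.append((sock, node_id))
# Each (socket, node_id) pair would then keep its own routing table and
# record the get_peers / announce_peer traffic arriving at its position.
```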
But neither of these approaches is the "super simple DHT crawler CLI tool" that the initial comment was asking about. BEP51 is intended to make crawling simple enough that it can run on a single home internet connection, but a proper implementation still isn't trivial.