I had an idea for a file transfer system based on digits of pi. You'd give everybody DVDs of digits of pi (or they could calculate them themselves), and then transfer files faster by just sending them the offset into pi.
At the time I thought it could work with a big enough bank of digits of pi on both sides. If transfer was expensive and calculating digits was cheap, then you could give everyone an infinite supply of digits of pi and have a nearly infinite compression system.
I discovered that often the offset into pi is much larger than the data you are sending. Turns out it's an expensive way to send things.
Also, it turns out that this area was already well understood. There are no free lunches with entropy.
I think a lot of people had similar ideas; I know I did.
The idea of using pi is essentially the same as having a shared dictionary, as used both in regular compression and in lossy compression of waveforms (in layman's terms; technically it's a best match): both ends have the dictionary and you simply index into it.
In fact, whole network protocols use this concept too, such as Protocol Buffers, which are part of gRPC.
The difference here is that the dictionary is infinite, and until we get quantum computers on the desktop, indexing into pi will always be slow. Some of the address space may not be feasible either; e.g. your file may require a billion-bit address/index.
It may still be feasible to do partial matches for a file (50% at one index, 50% at another) as a performance improvement.
I guess the trade-off is a balance between performance and the number of indexes vs. the original file length, and where in pi the address space becomes infeasible.
zlib also supports dictionaries, as does the deflate algorithm. The problem is that zlib doesn't provide any tooling to generate an optimal dictionary from a large sample, whereas zstd does provide such a utility.
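For the curious, here's roughly what that looks like with the python-zstandard bindings (a sketch; the package, sample data, and dictionary size are my assumptions, and training wants a reasonably large, varied sample set):

```python
import zstandard as zstd

# Train a shared dictionary from small samples, then compress with it: the
# practical version of "both ends have the dictionary and you simply index it".
samples = [
    f'{{"id": {i}, "name": "user{i}", "email": "user{i}@example.com"}}'.encode()
    for i in range(1000)
]
dict_data = zstd.train_dictionary(4096, samples)

cctx = zstd.ZstdCompressor(dict_data=dict_data)
dctx = zstd.ZstdDecompressor(dict_data=dict_data)

msg = b'{"id": 31415, "name": "user31415", "email": "user31415@example.com"}'
frame = cctx.compress(msg)
assert dctx.decompress(frame) == msg
print(len(msg), "->", len(frame))  # tiny records shrink far more with a shared dictionary
```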
> I discovered that often the offset into pi is much larger than the data you are sending.
Shouldn't the offset be roughly the same size? If you look for a 1000-digit sequence, you need to try about 10^1000 starting points, which makes your offset about 1000 digits long.
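A quick simulation backs this up, using a uniform random digit stream as a stand-in for pi (that stand-in is the assumption here, since normality is unproven):

```python
import random

# Monte Carlo sanity check: in a uniform random digit stream, the first
# occurrence of a k-digit pattern lands around offset 10**k, so writing
# the offset down costs about k digits -- no savings.
def first_offset(pattern: str, rng: random.Random) -> int:
    k, window, pos = len(pattern), "", 0
    while True:
        window = (window + str(rng.randrange(10)))[-k:]
        pos += 1
        if window == pattern:
            return pos - k  # 0-based offset of the match

rng = random.Random(0)
k = 4
trials = [first_offset(f"{rng.randrange(10**k):0{k}d}", rng) for _ in range(200)]
print(f"average offset for {k}-digit patterns: {sum(trials)/len(trials):,.0f}")
```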
> Also, it turns out that this area was already well understood. There are no free lunches with entropy.
If I ever go mad and breach the 'crackpot line', this will be it. There is something about entropy that I just can't fully grasp, even though I understand (or can convince myself that I do) Shannon, Kolmogorov, etc.
I love the idea of this, it has certainly boggled my mind a handful of times when thinking of content-addressed storage. Obviously it can't work, there is no loophole to infinite space or compression. Entropy is a very real thing. But it did lead me down an interesting train of thought.
> maximise performance, we consider each individual byte of the file separately, and look it up in π.
Ok, so we store a byte and look it up in π. Now we get an offset. The exact offset will depend on the byte, of course. But to simplify, let's assume that π is "optimal": we will assume that the first 256 offsets contain all 256 possible byte values.
So our offset will be in the range 0-255. Storing our offset will then take 1 byte of storage.
Oh, I have found the problem.
So yes, you can find any data in π. But storing the location of that data will on average take the same amount of space as the data itself.
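A toy sketch of that best case, with a shuffled 256-byte table standing in for an "optimal" π:

```python
import random

# Under the "optimal pi" assumption: a 256-byte dictionary in which every
# byte value appears exactly once (a shuffled table stands in for pi).
# Each input byte maps to a 1-byte offset, so even the best case
# compresses by exactly nothing.
rng = random.Random(314159)
table = list(range(256))
rng.shuffle(table)
DICT = bytes(table)

def encode(data: bytes) -> bytes:
    return bytes(DICT.index(b) for b in data)   # one offset per input byte

def decode(offsets: bytes) -> bytes:
    return bytes(DICT[o] for o in offsets)

msg = b"no free lunch"
assert decode(encode(msg)) == msg
assert len(encode(msg)) == len(msg)             # output exactly as long as input
```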
But I suspect that the analysis would look very similar for the entire file: just calculate the average offset for a file of a given size. It won't be smaller than the file itself.
I guess that is mostly true. In theory you could think of a more efficient way to store the indexes, such as run-length encoding. So you could store 1 MB of whatever the first byte of pi is, and then run-length encode the million zero addresses. You could also imagine a scheme such as 2-bit varint encoding of the indexes, or a tally system that uses only 1 bit per offset to store a zero.
...of course you are still better off to just compress the file consisting of a single repeated byte.
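For scale, a minimal comparison with off-the-shelf deflate, no pi involved:

```python
import zlib

# 1 MB of a single repeated byte: the pi scheme would emit a million
# identical offsets to run-length encode, but plain deflate on the file
# itself already collapses it to about a kilobyte.
data = bytes([0x42]) * 1_000_000
print(len(zlib.compress(data)))  # roughly a thousand bytes
```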
> So yes, you can find any data in π. But storing the location of that data will on average take the same amount of space as the data itself.
Not on average. Not even at minimum. Since the leading values of pi are definitely _not_ so compact (and we have yet to identify any contiguous segment that is), the absolute minimum is N plus the overhead to support the optimization of sometimes using single-byte offsets by declaring the offset size in a header.
I don't think this is true. If I coincidentally want to store the first million digits of pi, then my offset is 0, which is certainly smaller than the data. So you can get lucky.
Not here, because this mechanism stores offsets to individual bytes not large blobs. Storing the first N bytes of Pi (it makes more sense to talk about bytes than digits) requires N offsets plus either a header indicating how big each offset needed to be or variable-length delimited offsets, which necessarily likewise increases the offset size. In the absolute smallest case that's N+1 bytes.
You can encode every book and all information in a small rod. Just take a 1-metre-long rod and encode your book into binary, e.g. 01011110111. Now take your binary string and turn it into a number smaller than 1 by prepending "0." and reading it in base 2, e.g. x = 0.01011110111. As this number is finite and smaller than 1, you can just take your rod and cut it at length x.
Now when you want to retrieve your information, you just measure x from your rod with a ruler, write the fractional part out in binary, and voila.
You can encode an almost infinite amount of data into a simple rod, assuming you can measure and cut very precisely.
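A sketch of the rod trick in exact arithmetic; the only (fatal) assumption is a ruler of unbounded precision:

```python
from fractions import Fraction

# The bit string becomes a binary fraction in [0, 1), i.e. the cut position
# in metres, and reading the bits back is just re-expanding that fraction.
# The catch: n bits of payload demand measuring to within 2**-n metres,
# which goes subatomic almost immediately.
def bits_to_cut(bits: str) -> Fraction:
    return Fraction(int(bits, 2), 2 ** len(bits))

def cut_to_bits(x: Fraction, n: int) -> str:
    return format(int(x * 2 ** n), f"0{n}b")    # zero-padded to keep leading bits

bits = "01011110111"
x = bits_to_cut(bits)
print(f"cut the rod at {x} m ({float(x):.10f})")
assert cut_to_bits(x, len(bits)) == bits
```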
Small problem of how to make cuts at a subatomic level ;)
(edit for pedantry and type of atom) At an optimistic 100 billion carbon atoms in length for a metre stick, the maximum number of distinct cuts you can make is 100 billion, or 100 gigabytes of info. We do much better these days with thumb drives.
maybe you're measuring by counting atoms with a scanning electron microscope? 'course at that point why not just encode the information in the atom layer directly :)
That's only if you can make multiple cuts (or slits), and it's actually 8x less because we're talking about bits, not bytes.
With only one cut, as the parent comment describes, you can only store log base 2 of 100 billion, which is about 36 bits; so about 4 bytes of info, or one long integer.
"That's right! Every file you've ever created, or anyone else has created or will create! Copyright infringement? It's just a few digits of π! They were always there!"
I didn't think that was actually mathematically proven yet. Was some proof accepted recently that makes that quoted sentence true?
There was an examination of this sort of thing in the webcomic Freefall --- shortest story is 6 words, 100,000 words for English vocabulary, order matters --- it worked out to a data storage unit the size of a small moon.
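The back-of-envelope behind that conclusion, taking the comic's numbers at face value:

```python
# Six-word stories drawn from a 100,000-word vocabulary, order mattering.
vocab, length = 100_000, 6
stories = vocab ** length
print(f"{stories:.1e} distinct stories")   # 1.0e+30
# even at a single bit per story, that's ~1.25e29 bytes of raw storage
```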
No one has proved that pi is normal, true. But pi is normal. I mean come on! Like just between us, two friends talking, we both know that pi is normal. It'll be a nightmare to prove but I'd put extremely good odds on it. Probably better odds than RSA being secure, which is also not strictly proven but pretty likely.
Look we know that pi is normal. You know that pi is normal, I know that pi is normal. I'll give you 10,000:1 odds that we find any string you want with exactly the probability you'd expect from pi being normal, and before long you'll have forked over your life savings. It's annoying to prove, we barely know if any numbers are normal, but the idea of a normal number was invented to discuss what numbers like pi or e seem to be like. In math that's not enough, but this is CS! All encryption is built on weaker grounds than pi being normal.
Okay, so I just have to find a sequence that hasn't yet been discovered to exist in pi? Or do I have to wait for a proof that pi isn't normal (if one exists) to get my money?
My sequence is a trillion consecutive 0s, followed by a trillion consecutive 1s, etc, all the way up through a trillion consecutive 9s. Who wins the bet?
> with exactly the probability you'd expect from pi being normal
In this case you win when you've checked enough digits of pi that, if pi were normal, you'd have expected to find your string by then. Probably not gonna happen, but if you ever do, hit me up! If you do that work, I'll have the opportunity to win lots of smaller bets.
edit: Wait, that loses me catastrophic amounts of money. Keep checking until P < 1/11,000; I think that's fair.
As a fairly simple example, consider the number written in decimal notation as:
0.101001000100001000001... (with the run of zeroes between successive 1s growing by one each time). It doesn't contain all sequences of digits, or even all digits, yet it cannot be written as a fraction.
Yes. It can quite easily (depending on your knowledge of basic number theory) be proven that a number is rational if and only if its digits (in any base) eventually become periodic.
My understanding is that it is a trivial (or at least well-known) result that a rational will always have a finite, infinitely-repeating terminal sequence in any base.
This is a fallacy. Infinity does not contain all possibilities.
There are an infinite number of integers. You can start at 10 and count up forever, never running out of integers. But no matter how high you count, you’ll never count to “orange” - “orange” is not contained in the sequence of infinite integers.
You’ll need to first prove that every sequence of integers is contained somewhere in pi, since the number of possible integer sequences grows faster than the “space” for sequences in pi. In other words, I can always pick a digit that creates a valid, non-repeating integer sequence from the pool of possible sequences while never creating the integer sequence “123456789123456789123456789.” You’d need to prove that pi doesn’t do this.
Even if pi does contain every sequence of integers and you could map that to bytes which, in turn, maps to a file, this would not compress.
Your metadata directory would be larger than the raw files unless you get very lucky and your file is very early in the sequence of pi.
A byte can represent 256 unique values, and 256 unique values cannot compress to less than a byte. So if your index is the digit of pi where your file starts, your file starts after some number of other files. Your index is going to be an index into the address space of “all possible files.” This will get large very quickly.
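The counting argument behind that, made concrete (nothing pi-specific here):

```python
# There are 256**0 + 256**1 + ... + 256**n distinct files of at most n bytes,
# so an index able to name any of them must itself carry about 8n bits --
# as long as the file it points at.
n = 64
total = sum(256 ** k for k in range(n + 1))
print(total.bit_length(), "bits of index vs", 8 * n, "bits of file")  # 513 vs 512
```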
> This is a fallacy. Infinity does not contain all possibilities.
From the link: One of the properties that π is conjectured to have is that it is normal, which is to say that its digits are all distributed evenly, with the implication that it is a disjunctive sequence, meaning that all possible finite sequences of digits will be present somewhere in it. If we consider π in base 16 (hexadecimal), it is trivial to see that if this conjecture is true, then all possible finite files must exist within π. The first record of this observation dates back to 2001.
> Your metadata directory would be larger than the raw files unless you get very lucky and your file is very early in the sequence of pi.
This is very, very clearly a tongue-in-cheek project and not intended as a practical way to store files.
Being normal requires not only the correct frequencies of single digit substrings, but of k-digit substrings as well, for all k. One number that is normal to base 10 is Champernowne's number [1].
.(0123456789) is rational, though. I think the implication is that if pi is normal in addition to its other properties — specifically irrationality — that it has to contain all possible digit sequences.
Edit: oops, I forgot about n-length sequences of digits, .(0123456789) is definitely not normal, this is why I’m not a mathematician.
Pi may contain any given sequence of digits, but it's not proven that it contains all sequences of digits. IMHO, it's possible to construct a sequence of digits which will not be in pi. For the first N digits of pi, it's easy to construct such a sequence for any N (for example, N zeroes).
Constructing things for N digit subsets can be misleadingly easy.
It's easy to construct a sequence of digits that is larger than any N digit number. But it's obviously impossible for any such number to be the largest number.
This reasoning is true of every real number, yet it's been proven that almost all real numbers are absolutely normal and therefore contain every finite sequence of digits.
It's a slight misstatement of what a normal number is that changes it significantly. It isn't that each digit is evenly distributed, but that for every n, every string of digits n long is evenly distributed.
Ok, but I don't see how that relates to the idea of storing sequences of numbers (the basic function of a filesystem) in pi. "Orange" is not a number, so it doesn't make sense to look for it in any sequence of numbers. "Orange" can, however, be represented as a sequence of ASCII characters, and if every sequence of numbers is contained in pi, then an ASCII representation of any string can also be found.
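For instance, one hedged way to go looking (zero-padded decimal ASCII is just an arbitrary encoding choice, and mpmath is assumed available):

```python
from mpmath import mp, nstr

# Encode "orange" as zero-padded decimal ASCII codes and search for that
# digit string in the first 100,000 digits of pi. An 18-digit needle should
# need ~10**18 digits to show up, so expect a miss this early.
mp.dps = 100_010
digits = nstr(mp.pi, 100_000)[2:]                 # "14159..." with the "3." stripped

needle = "".join(f"{c:03d}" for c in b"orange")   # "111114097110103101"
print(f"searching for {needle}: offset {digits.find(needle)}")  # very likely -1
```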
> This will get large very quickly.
Right, but here's a thought...
Suppose the place in Pi where your sequence is located is huge. You could maybe (probably) find a sequence earlier in Pi that is a pointer to your actual location :) So the metadata would need three things to make this work: 1) the length of your data, 2) a checksum of your data, and 3) a pointer into Pi. If the hash of the data at (pointer, length) doesn't match the checksum, it's a pointer (which may point to a different pointer, etc.).
That might make it compress a bit better, although you may need a huge amount of RAM just to dereference the pointers.
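A toy sketch of that decode loop; everything here is made up for illustration, with a pseudorandom byte stream standing in for pi:

```python
import hashlib
import random

# The metadata is (offset, length, checksum); any chunk that doesn't match
# the checksum is reinterpreted as a pointer to chase next.
rng = random.Random(3141592)
FAKE_PI = bytes(rng.randrange(256) for _ in range(1 << 20))  # stand-in for pi's bytes

def resolve(offset: int, length: int, checksum: bytes) -> bytes:
    while True:
        chunk = FAKE_PI[offset:offset + length]
        if hashlib.sha256(chunk).digest() == checksum:
            return chunk                                      # found the actual data
        offset = int.from_bytes(chunk, "big") % len(FAKE_PI)  # follow the pointer

data = FAKE_PI[123_456:123_464]                   # pretend this 8-byte run is our file
print(resolve(123_456, 8, hashlib.sha256(data).digest()))
```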
> Suppose the place in Pi where your sequence is located is huge. You could maybe (probably) find a sequence earlier in Pi that is a pointer to your actual location
If the sequence of bytes necessary to represent the file encodes to an integer index into pi, and that integer index encodes to a sequence of bytes larger than the original file, I have no reason to believe trying to repeat that process would result in anything other than an even larger integer index.
> But no matter how high you count, you’ll never count to “orange” - “orange” is not contained in the sequence of infinite integers.
Well, certainly not with that attitude!
Meanwhile, in my highly efficient, proprietary fruit string compression scheme, "orange" is represented by the integer 8. (Don't confuse it with "blood orange" though; that's 26.)
You can encode orange in RGB and HSL too, but the set of all integers still does not contain the concept of an orange. You’ve just assigned meaning to an integer.
In the same way you can’t count to orange, you can never start at 1 and count up to -1. There are an infinite number of different integers greater than 1, and that infinity does not contain -1.
SciFi likes abusing this. Just because there are an infinite number of universes doesn’t mean there is a universe that contains anything you can dream up Rick and Morty style.
> You can encode orange in RGB and HSL too, but the set of all integers still does not contain the concept of an orange.
Nobody claimed "the concept of an orange" can be found. The claim that the word "orange" can't be found, however, is provably false.
If we're going by this logic, then nothing but 0 and 1 can ever be stored on any medium, because a hard drive can't possibly store "the concept of an orange" either.
> Just because there are an infinite number of universes doesn’t mean there is a universe that contains anything you can dream up Rick and Morty style.
And you know this with certainty.... how? With enough branches in a truly infinite timeline, the likelihood of there being a combination which produces a specific outcome is quite high.
> And you know this with certainty.... how? With enough branches in a truly infinite timeline, the likelihood of there being a combination which produces a specific outcome is quite high.
It is not. It is quite low. The number of possible states grows at a faster rate than the rate necessary to maintain “infinity.” It’s only true if you assert that all possible states at a branch are added to the set, which is a tautology.
A fun puzzle. Let’s say you have an empty set whose members are sets. You add a set to it of infinite size (like the set of all positive integers). Next, you add another set to it whose size is also infinite but whose members are not contained in the first set. Now you repeat this process an infinite number of times.
You have an infinite set of infinite sets. The question is: is such a set possible and, if so, does your infinite set of infinite sets necessarily contain all possible infinite sets?
Most pitches for Everett branches (which are a specific enough concept that it's silly to get all technical, but whatever) hold that there is an amplitude distribution over the entire configuration space of the universe. While the proposed mechanism of decoherence implies a sort of discreteness, every single possible configuration would have non-zero amplitude. Of course there's no specific evidence for this over, say, collapse theories (or something stranger, like Bohmian mechanics), but there's certainly no philosophical issue.
Not sure what you think ZFC has to do with physics, but an empty set with members is a tad nonexistent.
Did you reply to the wrong person? I actually said just as much in my comment.
> If we're going by this logic, then nothing but 0 and 1 can ever be stored on any medium, because a hard drive can't possibly store "the concept of an orange" either.
What if I created a new system of encoding words, where the digit "1" represents the word "orange"? Does that disprove the claim, just as encoding it with bytes does?
> the set of all integers still does not contain the concept of an orange. You’ve just assigned meaning to an integer.
We all do, all day, every day. Isn’t that exactly what computer science and mathematics are?
> Just because there are an infinite number of universes doesn’t mean there is a universe that contains anything you can dream up Rick and Morty style.
Sure, there are many different kinds of infinities, but what does that have to do with pi and file systems?
What if a digit (say, 9) only appears in pi up to a certain decimal point? If your sequence contains that digit and doesn't appear by that point, pi won't contain your sequence.
If pi is indeed normal (whether proven or not), that would never happen. A normal number, by definition, has an even distribution of digits. There could be no "last 9", because there would have to be more nines afterward to satisfy the distribution. Basically, the count of nines must match the count of each of the other nine digits, as the number of digits goes to infinity.
Even if the first million digits of some unnamed number were all nines (with no more afterward), it would not be normal, because the frequency of nines would tend to zero as the number of digits goes to infinity.
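You can at least check a finite prefix for the right digit frequencies (mpmath assumed available); of course that's evidence, not proof:

```python
from collections import Counter
from mpmath import mp, nstr

# The first 100,000 digits of pi come out near 10,000 per digit, which is
# consistent with normality but says nothing about the full expansion.
mp.dps = 100_010
counts = Counter(nstr(mp.pi, 100_000)[2:])
print(sorted(counts.items()))
```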
Come on, the length of the index into pi is anywhere between 0 and infinity. I don't think anyone feels defrauded. If anything, this is a gambler's dream!
If π is a normal number, then it contains all possible sequences. However, it is currently unknown whether π is a normal number. I agree with you regarding the fact that this might not be a great compression algorithm, but it is certainly fun.
> Even if pi does contain every sequence of integers and you could map that to bytes which, in turn, maps to a file, this would not compress. [...] This will get large very quickly.
If I were to dumb this down (so I can understand it), is this a fair analogy to the adage "give 1 million monkeys 1 million typewriters and they'll eventually type the entire works of William Shakespeare"?
Somewhere in pi (at some insane offset) is the entire work of William Shakespeare.
I don't think an infinite amount of monkeys with an infinite amount of time are guaranteed to write a work of Shakespeare. If they acted completely randomly it would be, but they don't. They're monkeys. They follow certain behavioural patterns. Just because they have an infinite amount of time doesn't mean they're going to try everything there is to try.
It's a theorem. And yes, it requires the monkeys to be random. And yes, it was tested with monkeys and the results were mainly the letter 's' repeated.
Really all this proves is that if you had infinite Shakespeares and infinite time, one of them would write his complete works using only the letter 's'.
Technically 1 million monkeys with 1 million typewriters could type for a million years and yet still not produce a work of William Shakespeare. In theory (as I understand it), pi at some index contains all permutations of sets of values. I'm not a mathematician and only know this by hearsay.
> Technically 1 million monkeys with 1 million typewriters could type for a million years and yet still not produce a work of William Shakespeare
You have to calculate the probability of a monkey coming up with a work of Shakespeare, and then subtract that from 1 to get the probability of a monkey coming up with anything other than a work of Shakespeare, and then figure out what power N to raise that to (N attempts) for it to cross below some confidence value, like 0.05 ... then subtract that from 1 in order to get the probability of a monkey coming up with a work of Shakespeare after that many attempts.
(no matter how high N is, you will never get 0, so no matter how many times you attempt, you always "could" still not produce a work of Shakespeare, because there's always a chance that none of those attempts worked)
Then calculate how much time that might take (actually, you might even have to account for failed attempts that are shorter or longer than a work of Shakespeare, so you might even have to start with time rather than number of attempts, but I digress) and you have a possible figure for "how much time it would take to have a certain probability of a monkey having come up with a work of shakespeare" and oh dear I may have missed the joke
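For anyone who wants the arithmetic, a sketch of that exact calculation with toy assumptions for the alphabet size and target length:

```python
from math import log, log1p

# Toy numbers: alphabet of 27 symbols (letters plus space), a 20-character
# target rather than all of Shakespeare. p is the chance one random attempt
# of the right length matches exactly.
A, L = 27, 20
p = A ** -L

# Solve (1 - p)**N <= 0.05 for N: the number of attempts needed before
# "still no Shakespeare" drops below the 5% confidence line.
N = log(0.05) / log1p(-p)
print(f"p = {p:.2e}; N ~ {N:.2e} attempts for a 95% chance of success")
```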
Wake me up when it's self-bootstrapped. This repo should contain the metadata for the source code of itself in pi. Or at the very least the metadata for "Everything can be stored in pie" in pi.
I personally thought of this during my college days. I never pursued the idea, but I kept it in mind as a way to compress files: share just an index to the first piece of metadata, which would then link to the next, and the next, and so on.
But it was a fun idea to kick around.