15 years ago, when I built a deduplication file storage system, rolling hash was on the table during design, but there were some patents on it. Ended up using fixed-size chunking, which worked less well but still gave incredible storage savings.
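For anyone curious, fixed-size chunking with content addressing boils down to something like this sketch (Python; the 4 KiB chunk size and SHA-256 are arbitrary choices here, not what the original system used):

```python
import hashlib

CHUNK_SIZE = 4096  # fixed chunk size; illustrative, not the original system's value

def store_file(path, chunk_store):
    """Split a file into fixed-size chunks and store each unique chunk once.

    chunk_store maps sha256 digest -> chunk bytes; the returned recipe is the
    ordered list of digests needed to reassemble the file.
    """
    recipe = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:   # dedup: only new content is stored
                chunk_store[digest] = chunk
            recipe.append(digest)
    return recipe

def restore_file(recipe, chunk_store, out_path):
    with open(out_path, "wb") as f:
        for digest in recipe:
            f.write(chunk_store[digest])
```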
Hah! I also built a similar storage system, optimized for whole disk images, for work, around 2007!! I used fixed-size chunks as well. I called it "data block coalescion"; having never heard of anyone else doing it, we figured I'd invented it, and we were granted the patent(!). I used it to cram disk images for I think 6 different fresh-install configurations onto a single DVD. :D
Later on I used it and VMware to build a proof of concept for a future model of computing where data was never lost. (It would snapshot the VM and add it to the storage system daily or hourly. Look ma, infinite undo at a system level!)
The next version of the algorithm was going to use what I now know is essentially a rolling hash (I called it "self delimiting data blocks"), but then the company went under.
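The "self delimiting" idea is essentially what's now called content-defined chunking: a rolling hash over a small window, with a chunk boundary declared wherever the hash's low bits hit a fixed pattern, so boundaries follow the content rather than byte offsets. A rough sketch (WINDOW/MASK/BASE/MOD are illustrative values, not anyone's actual parameters):

```python
# Content-defined chunking with a simple polynomial rolling hash (Rabin-Karp
# style). A boundary is cut wherever the low bits of the window hash are zero,
# so boundaries survive inserts or deletes earlier in the stream.

WINDOW = 48             # bytes in the rolling window
MASK = (1 << 13) - 1    # boundary test mask -> ~8 KiB average chunk
BASE = 257
MOD = (1 << 61) - 1

def chunk_boundaries(data: bytes):
    pow_out = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    boundaries = []
    h = 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_out) % MOD   # drop the oldest byte
        h = (h * BASE + b) % MOD                         # add the new byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)                     # cut after byte i
    return boundaries
```

Hashing and deduplicating the pieces between those boundaries, instead of fixed offsets, is what keeps duplicate data aligned even when the surrounding bytes shift.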
> but there were some patents on it.
The patent system is quite silly and internally inconsistent. I'm older now, and suspect someone thought of saving disk space through coalescion of identical blocks before I did in '06/'07, but not according to the USPTO!
EMC had a disk based deduplication storage at the time. NetAppliance had a competing product. They had patents in the area. I believe that was in the early 2000s. One of the household-name big techs had an internal product with a similar design. ZFS has a similar design.
Mine was at the block device level. The advantage is that you can format it with whatever file system you choose, with read/write support, and deduplication just works.
Same! :) Originally I wrote it with an interface kinda similar to `tar` -- you add or extract huge blobs to/from what I called a coalesced archive. I could re-image a machine about 8x faster than Norton Ghost.
After $WORK went under, I kept the code and toyed around with it, making it speak NBD so instead of extracting a huge blob from the archive to a destination block device you could also access it directly. I feel like I never Properly solved write support though.
I'm curious, did you think of anything better than refcounting the data blocks, keeping a list of blocks whose count drops to zero, and then writing the next unique block into a slot from that zero list? That's all I could think of, and it adds at _least_ one additional layer of indirection which I didn't like bc it would have a performance impact.
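To make the scheme in the question concrete, a refcount-plus-free-list block store looks roughly like this (a sketch of the idea being described, not either poster's actual code):

```python
class BlockStore:
    """Slots whose refcount drops to zero go on a free list and are reused
    for the next unique block."""

    def __init__(self):
        self.slots = []        # slot index -> block bytes
        self.refcount = []     # slot index -> number of references
        self.by_hash = {}      # content hash -> slot index
        self.free = []         # slot indexes whose refcount hit zero

    def put(self, block, digest):
        slot = self.by_hash.get(digest)
        if slot is not None:           # duplicate: just bump the refcount
            self.refcount[slot] += 1
            return slot
        if self.free:                  # reuse a reclaimed slot first
            slot = self.free.pop()
            self.slots[slot] = block
            self.refcount[slot] = 1
        else:
            slot = len(self.slots)
            self.slots.append(block)
            self.refcount.append(1)
        self.by_hash[digest] = slot
        return slot

    def release(self, digest):
        slot = self.by_hash[digest]
        self.refcount[slot] -= 1
        if self.refcount[slot] == 0:   # the extra indirection mentioned above
            del self.by_hash[digest]
            self.slots[slot] = None
            self.free.append(slot)
```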
> EMC had a disk based deduplication storage at the time. NetAppliance had a competing product. They had patents in the area.
I know this _NOW_ but certainly didn't back then. :) And it still doesn't take away the fact that, according to the USPTO, "compression via coalescion" is miiiiine. ;-)
Again, I interpret this NOT as evidence of "how clever I am", but as evidence of how silly and broken the patent system is.
Yes. Depending on how the claim language is phrased, patents on the same idea can be granted.
For reclaiming deleted blocks, I just had a garbage collection phase that ran from time to time. Like the refcounting you mentioned, I had considered it, but it amplified writes 2x~3x, and worse, they were random-access writes. Garbage collection was not so bad since it only went through the virtual file control blocks containing the content-address hashes.
The storage layout was: file block -> virtual file control block -> dedup-data-block.
The virtual file control block contained the dedup-block hash entries, with one control block hosting N file blocks. GC only needed to scan the control blocks to find out which dedup-data-blocks were in use.
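In other words, the GC pass is a mark-and-sweep over metadata only. Roughly (the names and structures here are illustrative, not the original implementation):

```python
def collect_garbage(control_blocks, dedup_index, free_block):
    """control_blocks: iterable of lists of dedup-block hashes, one list per
    virtual file control block. dedup_index: dict of hash -> block address.
    free_block: callback that puts an address on the free list."""
    live = set()
    for cb in control_blocks:          # mark phase: only metadata is read
        live.update(cb)

    for digest in list(dedup_index):   # sweep phase: free unreferenced blocks
        if digest not in live:
            free_block(dedup_index.pop(digest))
```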
Freed dedup-data-blocks remained in place and were linked into the free list; the first couple of bytes of the free block were co-opted to store the pointer to the next free block.
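That embedded-pointer trick is an intrusive free list; a sketch of the idea, assuming an 8-byte next pointer (an assumption, not stated above):

```python
import struct

NIL = 0xFFFFFFFFFFFFFFFF  # sentinel meaning "end of free list"

class FreeList:
    """Freed blocks stay in place; the first 8 bytes of each freed block
    hold the address of the next free block."""

    def __init__(self, device):
        self.device = device   # bytearray standing in for the block device
        self.head = NIL

    def free(self, addr):
        # overwrite the start of the freed block with the old list head
        self.device[addr:addr + 8] = struct.pack("<Q", self.head)
        self.head = addr

    def allocate(self):
        if self.head == NIL:
            return None        # caller must grow the device instead
        addr = self.head
        self.head = struct.unpack("<Q", self.device[addr:addr + 8])[0]
        return addr
```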
At the end, brand-new file write performance degraded about 10% compared to a normal file write, which I considered acceptable: M block writes -> M dedup-block writes + M/N control-block writes + M/K db index updates, where N was the number of hash entries hosted in a control block and K was the number of hashes stored in one db index page. Repeated file writes were much faster due to deduplication.
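Plugging made-up numbers into that cost model shows how small the metadata overhead is in terms of write count (N and K below are assumptions, not the real values):

```python
M = 1_000_000   # file blocks written
N = 128         # hash entries per virtual file control block (assumed)
K = 256         # hash entries per db index page (assumed)

dedup_block_writes = M
control_block_writes = M // N      # 7812
index_page_updates = M // K        # 3906

total = dedup_block_writes + control_block_writes + index_page_updates
overhead = (total - M) / M         # ~1.2% extra writes under these assumptions
print(total, f"{overhead:.2%}")
```

Under those assumptions the extra writes are only a percent or so; the measured ~10% presumably also reflects hashing and the random-access nature of the control-block and index updates.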