I use the SingleFile extension to archive every page I visit. It's easy to set u...

internetter · 2025-05-28T21:34:28 1748468068

storage is cheap, but if you wanted to improve this:

1. find a way to dedup media

2. ensure content blockers are doing well

3. for news articles, put it through readability and store the markdown instead. if you wanted to be really fancy, you could instead attempt to programmatically create a "template" of sites you've visited with multiple endpoints so the style is retained but you're not storing the content. alternatively a good compression algo could do this, if your directory looked like /home/andrew/archive/boehs.org.tar.gz and inside the tar all the boehs.org pages you visited were saved

4. add fts and embeddings over the pages
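
For point 1, a minimal sketch of file-level dedup by content hash, assuming the media already lives in separate files on disk (SingleFile output inlines resources into the HTML, so they would have to be extracted first). The names `file_digest` and `find_duplicates` are made up for illustration:

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep groups with 2+ files."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups.setdefault(file_digest(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each returned group is a set of byte-identical files; what to do with them (hardlink, delete, leave to the filesystem) is a separate decision.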

ashirviskas · 2025-05-28T22:34:08 1748471648

1 and partly 3 - I use btrfs with compression and deduping for games and other stuff. Works really well and is "invisible" to you.

bombela · 2025-05-29T03:29:46 1748489386

dedup on btrfs requires setting up a cronjob. And you need to pick one of the dedup tools too. It's not completely invisible in my mind because of this ;)

windward · 2025-05-29T08:38:27 1748507907

>storage is cheap

It is. 1.1TB is both:

- objectively an incredibly huge amount of information

- something that can be stored for the cost of less than a day of this industry's work

Half my reluctance to store big files is just an irrational fear of the effort of managing it.

IanCal · 2025-05-29T09:28:34 1748510914

> - something that can be stored for the cost of less than a day of this industry's work

Far, far less even. You can grab a 1TB external SSD from a good name for less than a day's work at minimum wage in the UK.

I keep getting surprised at just how cheap large storage is every time I need to update stuff.

davidcollantes · 2025-05-28T19:54:46 1748462086

How do you manage those? Do you have a way to search them, or a specific way to catalogue them, which will make it easy to find exactly what you need from them?

nirav72 · 2025-05-29T02:36:57 1748486217

KaraKeep is a decent self-hostable app that supports receiving SingleFile pages: you point the SingleFile browser extension at the Karakeep API. This allows me to search the archived pages. (Plus auto summarization and tagging via LLM.)

dotancohen · 2025-05-29T05:45:05 1748497505

Very naive question, surely. What does KaraKeep provide that grep doesn't?

nirav72 · 2025-05-29T22:26:36 1748557596

jokes aside. It has a mobile app

dotancohen · 2025-05-30T04:02:26 1748577746

I don't get it aside. How does that help him search files on his local file system? Or is he syncing an index of his entire web history to his mobile device?

nirav72 · 2025-05-30T22:00:39 1748642439

GP is using the SingleFile browser extension, which allows him to download the entire page as a single .html file. But SingleFile also allows sending that page to Karakeep directly instead of downloading it to his local file system (if he's hosting Karakeep on a NAS on his network). He can then use the mobile app or Karakeep web UI to search and view that archived page. Karakeep does the indexing (including auto-tagging via LLM).

dotancohen · 2025-05-31T19:44:56 1748720696

I see now, thank you.

snthpy · 2025-05-29T05:20:19 1748496019

Thanks. I didn't know about this and it looks great.

A couple of questions:

- do you store them compressed or plain?

- what about private info like bank accounts or health insurance?

I guess for privacy one could train oneself to use private browsing mode.

Regarding compression, for thousands of files don't all those self-extraction headers add up? Wouldn't there be space savings by having a global compression dictionary and only storing the encoded data?
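
To make the shared-dictionary idea concrete, here is a rough sketch using the preset dictionary support (`zdict`) in Python's standard `zlib` module. The boilerplate string and the helper names are invented for illustration; a real setup would need to build the dictionary from the actual pages (zstd's dictionary training is another option):

```python
import zlib


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Deflate data using a preset dictionary shared by all pages."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Inflate data; the same dictionary must be supplied again."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


# Hypothetical boilerplate that every page of one site repeats.
SHARED = (
    b"<!DOCTYPE html><html><head><meta charset='utf-8'>"
    b"<link rel='stylesheet' href='/style.css'><title>Example site</title>"
    b"</head><body><nav>Home | About | Archive</nav>"
)
page = SHARED + b"<p>Unique text for page one.</p></body></html>"

plain = zlib.compress(page, 9)
with_dict = compress_with_dict(page, SHARED)
print(len(page), len(plain), len(with_dict))
```

The dictionary is stored once and each page only pays for what differs from it. The catch is that losing or changing the dictionary makes every page compressed against it unreadable.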

d4mi3n · 2025-05-29T14:03:01 1748527381

> do you store them compressed or plain?

Can’t speak to your other issues but I would think the right file system will save you here. Hopefully someone with more insight can provide color here, but my understanding is that file systems like ZFS were specifically built for use cases like this where you have a large set of data you want to store in a space efficient manner. Rather than a compression dictionary, I believe tech like ZFS simply looks at bytes on disk and compresses those.

genewitch · 2025-05-29T07:18:02 1748503082

By default, singlefile only saves when you tell it to, so there's no worry about leaking personal information.

I haven't put the effort in to make a "bookmark server" that will accomplish what singlefile does but on the internet because of how well singlefile works.

shwouchk · 2025-05-28T23:34:22 1748475262

i was considering a similar setup, but i don’t really trust extensions. i'm curious:

- Do you also archive logged in pages, infinite scrollers, banking sites, fb etc?

- How many entries is that?

- How often do you go back to the archive? is stuff easy to find?

- do you have any organization or additional process (eg bookmarks)?

did you try integrating it with llms/rag etc yet?

eddd-ddde · 2025-05-29T13:12:25 1748524345

You can just fork it, audit the code, add your own changes, and self host / publish.

shwouchk · 2025-06-03T07:30:55 1748935855

yes, you're right. i'm not helpless and all the new ai tools make this even easier.

nyarlathotep_ · 2025-05-29T14:13:59 1748528039

Are you automating this in some fashion? Is there another extension you've authored, or something similar, that invokes SingleFile on each new page load?

dataflow · 2025-05-29T14:39:57 1748529597

Have you tried MHTML?

RiverCrochet · 2025-05-29T15:28:56 1748532536

SingleFile is way more convenient as it saves to a standard HTML file. The only thing I know that easily reads MHTML/.mht files is Internet Explorer.

dataflow · 2025-05-29T15:31:55 1748532715

Chrome and Edge read them just fine? The format is actually the same as .eml AFAIK.
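
Since MHTML is a MIME multipart message, one way to check the .eml comparison is to feed a file to Python's standard `email` parser. A small sketch with a hand-written sample (the sample content and the `list_parts` helper are made up; real browser output is larger and usually uses quoted-printable or base64 encoding):

```python
from email import message_from_string, policy

# Hand-written, minimal MHTML-style document for illustration only.
SAMPLE = """\
MIME-Version: 1.0
Subject: Example page
Content-Type: multipart/related; type="text/html"; boundary="----boundary"

------boundary
Content-Type: text/html; charset="utf-8"
Content-Location: https://example.com/

<html><body><p>Hello</p></body></html>
------boundary
Content-Type: text/css
Content-Location: https://example.com/style.css

p { color: red; }
------boundary--
"""


def list_parts(mhtml_text: str) -> list[tuple[str, str]]:
    """Return (content type, Content-Location) for each part of the message."""
    msg = message_from_string(mhtml_text, policy=policy.default)
    return [
        (part.get_content_type(), str(part["Content-Location"] or ""))
        for part in msg.iter_parts()
    ]


for content_type, location in list_parts(SAMPLE):
    print(content_type, location)
```

For a real .mht file, opening it in binary mode and using `email.message_from_binary_file` is probably the safer route, since the encoding of the parts is not guaranteed.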

RiverCrochet · 2025-05-29T18:31:03 1748543463

I remember having issues but it could be because the .mht's I had were so old I think I used Internet Explorer's Save As... function to generate them.

dataflow · 2025-05-29T21:28:28 1748554108

I've had such issues with them in the past too, yeah. I never figured out the root cause. But in recent times I haven't had issues, for whatever that's worth. (I also haven't really tried to open many of the old files either.)

90s_dev · 2025-05-28T20:14:23 1748463263

You must have several TB of the internet on disk by now...