unlog's comments

Yes! I was actually considering this over the past couple of days and was looking into how to construct an `mhtml` file for serving all the files at once. Unrelated to this project, I had a client who wanted to keep an offline version of one of my projects.

> Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and also avoiding running out of inodes in some file systems (maybe less of a problem nowadays in general, not sure as I haven't generated millions of tiny files recently).

It's pretty rare for a website to have many files, as sites are optimized to have as few files as possible (fewer network requests, since many requests can be slower than just shipping one big file). As a test, I crawled the React docs, and the result is a 147 MB zip file with 3,803 files (including external resources).

https://docs.solidjs.com/ is 12 MB (including external resources) with 646 files.


Trying to use this for mirroring a document site. Disappointed that 1. it runs quite slowly, and 2. it kept outputting error messages like "ProtocolError: Protocol error (Page.bringToFront): Not attached to an active page". Not sure what the reason is.


If the URL is public, you may post it here or in a GitHub issue so I can take a look at what's wrong with it.


Couldn't reproduce it, but 'wget -m --page-requisites --convert-links <url>' did a good job for me. Never mind.


Big fan of HTTrack! It reminds me of the old days and makes me sad about the current state of the web.

I am not sure whether HTTrack has progressed beyond fetching resources, as it has been a long time since I last used it. What my project does is spin up a real web browser (Chrome in headless mode, which means it's hidden) and let the JavaScript on the website execute, so the page displays/generates its fancy HTML, which you can then save as is into an index.html. It saves all kinds of files; it doesn't care about the extension or MIME type, it tries to save them all.


> It saves all kinds of files; it doesn't care about the extension or MIME type, it tries to save them all.

That’s awesome to know, I will give it a try. One website I remember trying to download has all sorts of animations with the .riv extension, and it didn’t work well with HTTrack. I will try it with this soon, thanks for sharing it!


Let me know how that goes, I am interested!


Status codes: I am displaying the list because, in a JavaScript-driven application, you mostly don't want codes other than 200 (besides media).

I thought about robots.txt, but as this is software that you are supposed to run against your own website, I didn't consider it worthwhile. You have a point on speed requirements and prohibited resources (but it's not like skipping over them will add any security).

I haven't put much time/effort into an update step. Currently, it resumes via checkpoints if the process exited (it saves the current state every 250 URLs; if any URL is missing it continues from there, otherwise the crawl is done).
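For illustration only, the checkpoint idea could be sketched in Python roughly like this. This is not the project's code: the JSON state layout, the function names, and the `fetch_links` callback (which would save a page and return the URLs it links to) are all assumptions made up for the example.

```python
import json
import os
from collections import deque

CHECKPOINT_EVERY = 250  # the interval mentioned above


def save_checkpoint(path, pending, done):
    """Write the crawl state via a temp file so a crash mid-write cannot corrupt it."""
    state = {"pending": list(pending), "done": sorted(done)}
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, start_urls):
    """Return (pending, done), resuming from disk when a checkpoint exists."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return deque(state["pending"]), set(state["done"])
    return deque(start_urls), set()


def crawl(start_urls, fetch_links, path, every=CHECKPOINT_EVERY):
    """Crawl breadth-first, saving a checkpoint after every `every` URLs.

    fetch_links(url) is a hypothetical callback supplied by the caller.
    """
    pending, done = load_checkpoint(path, start_urls)
    since_save = 0
    while pending:
        url = pending.popleft()
        if url in done:
            continue
        for link in fetch_links(url):
            if link not in done:
                pending.append(link)
        done.add(url)
        since_save += 1
        if since_save >= every:
            save_checkpoint(path, pending, done)
            since_save = 0
    save_checkpoint(path, pending, done)  # final state: nothing pending
    return done
```

With this layout, a run that dies between checkpoints re-fetches at most the URLs handled since the last save, and a run started after a completed crawl finds nothing pending and returns immediately.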

Thanks, btw what's your project!? Share!


I agree with your points.

You might be interested in the Reddit web scraping community: https://www.reddit.com/r/webscraping/

My passion project is https://github.com/rumca-js/Django-link-archive

Currently I use only one thread for scraping; I do not require more. It gets the job done. Also, I know too little to play more with Python "celery" threads.

My project can be used for various things, depending on needs. Recently I have been playing with using it as a 'search engine'. I am scraping the Internet to find cool stuff. Results are in https://github.com/rumca-js/Internet-Places-Database. Not all domains are interesting, though.


> Status codes: I am displaying the list because, in a JavaScript-driven application, you mostly don't want codes other than 200 (besides media).

What? Why? Regardless of the programming language used to generate content, the standard, well-known HTTP status codes should be returned as expected. If your JS-served site gives me a 200 code when it should be a 404, you're wrong.


I think you are misunderstanding. Your application is expected to return mostly 200 codes; if you get a 404, then a link is broken or a page is misbehaving, which is exactly why that page's URL is displayed on the console with a warning.
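A minimal Python sketch of that kind of report, assuming the crawler has already collected (url, status) pairs. The function name and the sample URLs are made up for illustration and are not the project's code.

```python
from collections import defaultdict


def group_unexpected(results, expected=(200,)):
    """Group URLs by status code, keeping only codes outside `expected`."""
    grouped = defaultdict(list)
    for url, status in results:
        if status not in expected:
            grouped[status].append(url)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical crawl results.
    results = [("/", 200), ("/missing", 404), ("/old", 301), ("/gone", 404)]
    for status, urls in sorted(group_unexpected(results).items()):
        for url in urls:
            print(f"WARNING {status} {url}")
```

Anything that shows up in the output is a page or link worth checking by hand; a clean crawl prints nothing.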


That's something I haven't explored, sounds interesting. Right now, the zip file contains a mirror of the files found on the website when loaded in a browser. I ended up with a zip file by luck, as mirroring to the file system gives predictable problems with file/folder names.
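A rough Python sketch of why a zip sidesteps this, assuming the problems meant are things like a path that is both a page and a folder (`/docs` and `/docs/intro`) or characters such as `?` that some file systems reject. The naming scheme in `entry_name` is hypothetical, not the project's actual layout.

```python
import io
import zipfile
from urllib.parse import urlsplit


def entry_name(url):
    """Map a URL to a zip entry name (hypothetical scheme: host + path + query)."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"
    name = f"{parts.netloc}/{path}"
    if parts.query:
        name += "?" + parts.query
    return name


def write_mirror(responses):
    """Pack a {url: bytes} mapping into an in-memory zip and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for url, body in responses.items():
            # Entry names are plain strings, so "docs" and "docs/intro"
            # can coexist even though a directory tree could not hold both.
            zf.writestr(entry_name(url), body)
    return buf.getvalue()
```

The trade-off is that such an archive cannot always be extracted as is; whatever serves the files has to read entries by name or rename them on the way out.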



Sure, I forgot about that detail, what license do you suggest?


MIT and BSD seem to be by far the most common these days (I generally do MIT personally)


added


AGPL


So that the page HTML is indexable by search engines without having to render on the server, for example by unzipping to a directory served by nginx. You may also use it for archiving purposes, or for having backups.


I'm a big fan of modern JavaScript frameworks, but I don't fancy SSR, so I have been experimenting with doing the crawling myself and uploading the result to hosts without having to do SSR. This is the result.


For a long crawling task, if it exits or breaks for any reason, does it save and restore on the next run?


The README says:

> Can resume if process exit, save checkpoint every 250 urls


Nice. Better to make it a command-line option with a default value; 250 is too many for large files and a slow connection.
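As a sketch of what that suggestion could look like (in Python for brevity; the flag name `--checkpoint-every` is made up here and is not an option the tool is known to have):

```python
import argparse


def parse_args(argv=None):
    """Parse crawler options; the checkpoint interval defaults to 250 URLs."""
    parser = argparse.ArgumentParser(description="Illustrative crawler options.")
    parser.add_argument("url", help="site to crawl")
    parser.add_argument(
        "--checkpoint-every",
        type=int,
        default=250,
        metavar="N",
        help="save a checkpoint after every N URLs (default: 250)",
    )
    args = parser.parse_args(argv)
    if args.checkpoint_every < 1:
        parser.error("--checkpoint-every must be at least 1")
    return args


if __name__ == "__main__":
    options = parse_args()
    print(f"Crawling {options.url}, checkpoint every {options.checkpoint_every} URLs")
```

On a slow connection one might then pass something like `--checkpoint-every 25` and keep the default everywhere else.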


I ported Multiple Tab Handler from piro to SeaMonkey back in the day. I miss XUL so much; the browser used to be a very powerful tool.


The first link is broken, can you please fix it, thanks!


Oh, thank you for pointing that out! I shortened the commit hash to 7 characters (as I often do), but that was not enough to disambiguate it. The new link should work.
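To illustrate the ambiguity, here is a small Python sketch that finds the shortest prefix of a hash that no other hash in the list shares. The hashes are invented for the example and the helper name is hypothetical.

```python
def shortest_unique_prefix(target, all_hashes, minimum=7):
    """Return the shortest prefix of `target` (at least `minimum` characters)
    that no other hash in `all_hashes` starts with."""
    others = [h for h in all_hashes if h != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        if not any(h.startswith(prefix) for h in others):
            return prefix
    return target


if __name__ == "__main__":
    # Two made-up hashes that share their first 7 characters.
    hashes = ["1a2b3c4d5e6f", "1a2b3c4f0000", "9f8e7d6c5b4a"]
    print(shortest_unique_prefix("1a2b3c4d5e6f", hashes))
```

In this invented case 7 characters match two commits, so the link needs at least 8.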


A hidden gem of WinRAR is that the internal file viewer can open multi-gigabyte text files faster than many editors.


And back in the Win98 and maybe XP days, WinRAR was a great way to escape file system access restrictions imposed by the admin: you could use its internal browser to access just about anything on the machine, even though Explorer and many other Windows components wouldn't :)

