
I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.



Which LLM do you use? Because I can't see a scraper running daily without being very expensive.


Llama-3 70B on my local MacBook works wonderfully for these tasks.


How's the pipeline? Do you pass all the HTML to the LLM? Isn't the context window a problem?


There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
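Something like this rough BeautifulSoup sketch, for example (the tag list and the ~4-characters-per-token estimate are just illustrative assumptions, not anything from this library):

    from bs4 import BeautifulSoup

    def strip_html_fluff(html: str) -> str:
        """Crudely remove boilerplate markup and keep only the visible text."""
        soup = BeautifulSoup(html, "html.parser")
        # Drop tags that rarely contain the data you actually want to extract.
        for tag in soup(["script", "style", "noscript", "svg", "nav", "header", "footer", "iframe"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)

    if __name__ == "__main__":
        raw_html = "<html><head><style>p{}</style></head><body><nav>menu</nav><p>Price: $19.99</p></body></html>"
        text = strip_html_fluff(raw_html)
        # Very rough budget check: ~4 characters per token.
        print(text, len(text) // 4)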


At a very generous 50 tokens per second, doesn't that still leave you with more than two and a half minutes (160s) of processing time per document?


GPT-3.5/GPT-4 ain't the only LLMs available. A Flan-T5/T5 or Llama 2/3 8B model can be fine-tuned for this use case and used much more cheaply.


How do you handle the context window limit? If you push the entire DOM to the LLM it will exceed the context window by far in most cases, won't it?


My guess is you do some preprocessing on the DOM to get it down to text that still retains some structure.

Something like https://github.com/Alir3z4/html2text.

I'm sure there are other (better?) options as well.
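For example, an untested html2text sketch (the option names are html2text's; which ones to flip is just a guess at what saves tokens):

    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True   # drop link URLs, usually not needed for extraction
    converter.ignore_images = True  # same for image tags
    converter.body_width = 0        # no hard wrapping, keeps lines closer to the source

    html = "<h1>Listing</h1><ul><li>2 bed / 1 bath</li><li>$1,800/mo</li></ul>"
    print(converter.handle(html))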


I wrote https://markdown.download as a general helper for this


Trim unwanted HTML elements + convert to Markdown. Significantly reduces token counts while retaining structure.
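Roughly this idea, as a sketch (not markdown.download's actual code; BeautifulSoup + markdownify here, and the tag list is an assumption):

    from bs4 import BeautifulSoup
    from markdownify import markdownify as md

    def html_to_lean_markdown(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Trim elements that add tokens but rarely carry the data of interest.
        for tag in soup(["script", "style", "nav", "aside", "footer", "form"]):
            tag.decompose()
        # Convert what's left to Markdown; headings, lists and tables keep their structure.
        return md(str(soup), heading_style="ATX")

    sample = "<article><h2>Specs</h2><table><tr><td>RAM</td><td>16 GB</td></tr></table></article>"
    print(html_to_lean_markdown(sample))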


Again, it depends on the volume of the scraping and the value of the data within it. Even GPT-3.5 can be cost-effective for certain workflows and data values.



