
I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.



Which LLM do you use? Because I can't see a scraper running daily without being very expensive.


Llama-3 70B on my local MacBook works wonderfully for these tasks.


How's the pipeline? Do you pass all the HTML to the LLM? Isn't the context window a problem?


There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
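Something like this rough BeautifulSoup sketch, for example (the tag list and the ~4-characters-per-token estimate are just illustrative assumptions, not anything from this library):

    from bs4 import BeautifulSoup

    def strip_html_fluff(html: str) -> str:
        """Crudely remove boilerplate markup and keep only the visible text."""
        soup = BeautifulSoup(html, "html.parser")
        # Drop tags that rarely contain the data you actually want to extract.
        for tag in soup(["script", "style", "noscript", "svg", "nav", "header", "footer", "iframe"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)

    if __name__ == "__main__":
        raw_html = "<html><head><style>p{}</style></head><body><nav>menu</nav><p>Price: $19.99</p></body></html>"
        text = strip_html_fluff(raw_html)
        # Very rough budget check: ~4 characters per token.
        print(text, len(text) // 4)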


At a very generous 50 tokens per second, doesn't that still leave you with more than two and a half minutes (160s) of processing time per document?


GPT-3.5/GPT-4 ain't the only LLMs available. A Flan-T5/T5 or Llama 2/3 8B model can be fine-tuned for this use case and used much more cheaply.


How do you handle the context window limit? If you push the entire DOM to the LLM it will exceed the context window by far in most cases, won't it?


My guess is you do some preprocessing on the DOM to get it down to text that still retains some structure.

Something like https://github.com/Alir3z4/html2text.

I'm sure there are other (better?) options as well.
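For example, an untested html2text sketch (the option names are html2text's; which ones to flip is just a guess at what saves tokens):

    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True   # drop link URLs, usually not needed for extraction
    converter.ignore_images = True  # same for image tags
    converter.body_width = 0        # no hard wrapping, keeps lines closer to the source

    html = "<h1>Listing</h1><ul><li>2 bed / 1 bath</li><li>$1,800/mo</li></ul>"
    print(converter.handle(html))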


I wrote https://markdown.download as a general helper for this


Trim unwanted HTML elements + convert to Markdown. Significantly reduces token counts while retaining structure.
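Roughly this idea, as a sketch (not markdown.download's actual code; BeautifulSoup + markdownify here, and the tag list is an assumption):

    from bs4 import BeautifulSoup
    from markdownify import markdownify as md

    def html_to_lean_markdown(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Trim elements that add tokens but rarely carry the data of interest.
        for tag in soup(["script", "style", "nav", "aside", "footer", "form"]):
            tag.decompose()
        # Convert what's left to Markdown; headings, lists and tables keep their structure.
        return md(str(soup), heading_style="ATX")

    sample = "<article><h2>Specs</h2><table><tr><td>RAM</td><td>16 GB</td></tr></table></article>"
    print(html_to_lean_markdown(sample))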


Again, it depends on the volume of the scraping and the value of the data within it. Even GPT-3.5 can be cost-effective for certain workflows and data values.



