Converting websites to markdown comes with 3 distinct problems: 1. Thoroughly scr...

scary-size · on April 15, 2024

Thoroughly scraping is challenging, especially in an environment where you don’t have (or want) a JavaScript runtime.

For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes). It then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.

[1] https://github.com/postlight/parser
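
A rough sketch of that kind of node scoring, for illustration only. This is not Postlight's actual code: the class-name patterns, the weight of 25, and the rule of crediting each paragraph to its parent (fully) and grandparent (half) are all assumptions, loosely in the style of Readability-type extractors. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser

# Assumed heuristics: these patterns and weights are illustrative, not from any library.
VOID = {"br", "img", "hr", "input", "meta", "link", "source", "wbr", "area", "base", "col", "embed", "track"}
POSITIVE = re.compile(r"article|content|post|entry|main", re.I)
NEGATIVE = re.compile(r"nav|footer|sidebar|comment|menu|promo", re.I)


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.hints = f"{attrs.get('class') or ''} {attrs.get('id') or ''}"
        self.parent = parent
        self.text_len = 0   # characters of text in this node and its descendants
        self.link_len = 0   # the subset of those characters that sit inside <a>
        self.score = 0.0

    def link_density(self):
        return self.link_len / self.text_len if self.text_len else 0.0


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get a closing tag, so do not push them
        node = Node(tag, dict(attrs), self.stack[-1] if self.stack else None)
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or any(x.tag in ("script", "style") for x in self.stack):
            return
        in_link = any(x.tag == "a" for x in self.stack)
        for x in self.stack:
            x.text_len += n
            if in_link:
                x.link_len += n


def best_candidate(html):
    """Return the highest-scoring container node, or None if no paragraph has text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    for node in builder.nodes:
        if node.tag != "p" or node.parent is None:
            continue
        # Long paragraphs with few links are worth the most.
        base = node.text_len * (1 - node.link_density())
        node.parent.score += base
        if node.parent.parent is not None:
            node.parent.parent.score += base / 2

    def final(node):
        weight = 25 * bool(POSITIVE.search(node.hints)) - 25 * bool(NEGATIVE.search(node.hints))
        return node.score * (1 - node.link_density()) + weight

    candidates = [n for n in builder.nodes if n.score > 0]
    return max(candidates, key=final, default=None)
```

On a page with a link-heavy `nav` block and a `div` holding a few long paragraphs, `best_candidate(html).hints` should point at the content `div`. Real pages need more care (unclosed `<p>` tags, tables, comment sections) than this sketch handles.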

Kikobeats · on April 20, 2024

For getting the HTML, you can use microlink, just passing the URL to https://html.microlink.io/{url}, like https://html.microlink.io/https://example.com
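
As a minimal sketch of that URL pattern in Python (standard library only): the only thing taken from the comment above is the `https://html.microlink.io/{url}` shape. The function names, the `User-Agent` string, and the timeout are made up for illustration, and whether the target URL needs encoding or the service has rate limits is not covered here.

```python
from urllib.request import Request, urlopen

MICROLINK_HTML_BASE = "https://html.microlink.io/"


def microlink_html_url(page_url: str) -> str:
    """Build the request URL by appending the target URL to the base, as in the comment."""
    return MICROLINK_HTML_BASE + page_url


def fetch_html_via_microlink(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the HTML for page_url through the endpoint (requires network access)."""
    request = Request(
        microlink_html_url(page_url),
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would be something like `html = fetch_html_via_microlink("https://example.com")`.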

justech · on April 15, 2024

This is pretty cool. Care to share your Swift port?

scary-size · on April 15, 2024

Not planning to. It’s my first Swift/iOS project. I neither want to polish it nor maintain it publicly. Happy to share it privately, email is in the bio. I’m planning on a blog post describing the general approach though!

rismay · on April 15, 2024

Care to share the Swift port?

msp26 · on April 15, 2024

Thanks for the links, I had no idea those existed.

For my article web scraper (WIP), the current steps are:

- Navigate with Playwright + adblocker
- Run Mozilla's Readability on the page
- LLM checks the Readability output

If the check fails:

- Trim the whole-page HTML context
- Convert to markdown with pandoc
- LLM extracts from the markdown
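
The control flow of those steps could be sketched roughly as below. This is an assumption-heavy outline, not the actual scraper: `extract_article` and all of its callable parameters (`readability`, `llm_check`, `trim_html`, `to_markdown`, `llm_extract`) are hypothetical names, and the browser navigation step is left out. The one concrete piece is the pandoc helper, which shells out to the `pandoc` CLI and therefore needs pandoc installed.

```python
import subprocess
from typing import Callable, Optional


def pandoc_html_to_markdown(html: str) -> str:
    """Convert HTML to markdown by piping it through the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from", "html", "--to", "markdown"],
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_article(
    html: str,
    readability: Callable[[str], Optional[str]],  # hypothetical: Readability-style extractor
    llm_check: Callable[[str], bool],             # hypothetical: LLM judges the extraction
    trim_html: Callable[[str], str],              # hypothetical: shrink HTML to fit the context
    to_markdown: Callable[[str], str],            # e.g. pandoc_html_to_markdown
    llm_extract: Callable[[str], str],            # hypothetical: LLM pulls the article out
) -> str:
    # Fast path: accept the extractor's output if the LLM check passes.
    candidate = readability(html)
    if candidate and llm_check(candidate):
        return candidate
    # Fallback: trim the full page, convert to markdown, let the LLM extract.
    markdown = to_markdown(trim_html(html))
    return llm_extract(markdown)
```

Passing the steps in as callables keeps the fallback logic testable with stubs, without a browser, pandoc, or an LLM on hand.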

privatenumber · on April 15, 2024

Mozilla has released Readability as a standalone package so you can avoid spinning up a browser entirely: https://github.com/mozilla/readability

msp26 · on April 15, 2024

I still wanted the browser for uBlock Origin and for handling sites with heavy JS. I was already using the standalone Readability script, but today I ended up dropping it for Trafilatura. It works a lot better.

The inefficiency of using a browser rather than just taking the HTML doesn't really matter, because the limiting factor here is the LLM.

And yes, the LLM is essential for getting clean data. None of the existing methods are flexible enough for all cases, even if people say "you don't need AI to do this".

asadalt · on April 15, 2024

You would still need to run a browser for JS-based websites.

cchance · on April 15, 2024

Those are pretty damn cool. I didn't know those even existed.