> Web scraping is usually pretty tedious, but I found that I could send the minimised HTML to GPT-3 and get (almost) perfect JSON back: the prompt includes the TypeScript definition.
Could you share the prompt? Or, if OP can't share, does anyone have ideas for a prompt to do something like this?
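The actual prompt isn't shown in the thread, so this is only a guess at its general shape: paste the TypeScript definition into the prompt, append the page content, and ask for JSON only. The `Article` interface, the prompt wording, and both helper names are made up for illustration, and the call to the model itself is left out on purpose.

```python
import json

# Hypothetical target type; swap in whatever shape you actually want back.
EXAMPLE_TYPE = """interface Article {
  title: string;
  author: string | null;
}"""


def build_extraction_prompt(type_definition: str, content: str) -> str:
    """Assemble a prompt that shows the model the type it should fill in."""
    return (
        "Extract the data described by this TypeScript definition "
        "from the content below.\n"
        "Reply with a single JSON object that conforms to it and nothing else.\n\n"
        f"{type_definition}\n\n"
        f"Content:\n{content}\n\n"
        "JSON:"
    )


def parse_json_reply(reply: str) -> dict:
    """Parse the reply, tolerating stray text around the JSON object."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    return json.loads(reply[start : end + 1])
```

Since the JSON is only "(almost) perfect", it's probably worth validating the parsed object against the expected keys and retrying or logging when it doesn't match.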
I extract the main content div, which includes various other divs and assorted HTML cruft from 25 years of content management systems.
I then convert that to Markdown, which GPT groks happily and which strikes the right balance between discarding meaningless structure and preserving some semantics (italics, headings, etc.).
The best tool I've found for that process is aaronsw's html2text; it's amazing that it's still so valuable after all these years.
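For anyone who wants to try the same two steps, here's a rough sketch, not the exact code described above. It assumes the main content lives in an element with a known `id` (the `"content"` id in the usage line is hypothetical) and that the markup is reasonably well-formed; the extractor uses only the standard library and will miscount on unclosed tags, so a real parser such as BeautifulSoup or lxml is the safer choice for messy pages. `extract_element_html` and `main_content_to_markdown` are made-up helper names; `html2text.HTML2Text` and its `handle` method are the library's own.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _ElementExtractor(HTMLParser):
    """Collects the inner HTML of the first element whose id matches."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if dict(attrs).get("id") == self.target_id:
                self.depth = 1
            return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied without touching depth.
        if self.depth > 0 and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True    # closing tag of the target itself
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0 and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0 and not self.done:
            self.parts.append(f"&#{name};")


def extract_element_html(page_html, element_id):
    """Return the inner HTML of the element with the given id."""
    parser = _ElementExtractor(element_id)
    parser.feed(page_html)
    parser.close()
    if parser.depth == 0 and not parser.done:
        raise ValueError(f"no element with id {element_id!r}")
    return "".join(parser.parts)


def main_content_to_markdown(page_html, element_id, convert=None):
    """Extract the main content element and convert it to Markdown."""
    fragment = extract_element_html(page_html, element_id)
    if convert is None:
        import html2text  # third-party: pip install html2text
        converter = html2text.HTML2Text()
        converter.body_width = 0  # don't hard-wrap lines
        convert = converter.handle
    return convert(fragment)
```

Usage would be something like `markdown = main_content_to_markdown(page_html, "content")`, with the result going into the prompt in place of the raw HTML. The `convert` parameter is only there so a different converter can be swapped in.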