
> Web scraping is usually pretty tedious, but I found that I could send the minimised HTML to GPT-3 and get (almost) perfect JSON back: the prompt includes the Typescript definition.

Could you share the prompt? Or, if OP can't share, does anyone have ideas for a prompt to do something like this?
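One idea, as a minimal sketch rather than OP's actual prompt: paste a TypeScript definition of the shape you want into the prompt, followed by the page text, and ask for JSON only. The `Episode` interface, its fields, and the prompt wording below are all made up for illustration.

```python
# Hypothetical TypeScript definition describing the JSON we want back.
# The interface name and fields are illustrative, not taken from OP.
EPISODE_TYPE = """\
interface Episode {
  title: string;
  guests: string[];
}"""


def build_prompt(page_text: str, type_definition: str = EPISODE_TYPE) -> str:
    """Assemble a prompt asking for JSON that matches a TypeScript definition."""
    return (
        "Extract data from the page below.\n"
        "Reply with JSON only, matching this TypeScript definition:\n\n"
        f"{type_definition}\n\n"
        "Page:\n"
        f"{page_text}\n\n"
        "JSON:"
    )


prompt = build_prompt("# Some episode\n\nWith Alice and Bob.")
```

Ending the prompt with `JSON:` is just a nudge for a completion-style model to start with the object itself; whether it helps will depend on the model.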



+1, and the OP mentions a wrapper to handle invalid JSON from GPT-3. I'd be interested in that too.

OP's other write-up:

https://interconnected.org/home/2023/02/07/braggoscope
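For the invalid-JSON wrapper, here is a guess at what something like that could look like; it is not OP's code. It assumes the most common failure is the model adding chatter before or after the JSON, so it tries a strict parse first and then scans for the first position where a JSON value decodes cleanly. The function name `parse_model_json` is hypothetical.

```python
import json
from typing import Any


def parse_model_json(reply: str) -> Any:
    """Parse JSON from a model reply, tolerating extra text around it."""
    # Fast path: the reply is already valid JSON.
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass

    decoder = json.JSONDecoder()
    # Try each '{' or '[' as a possible start of a JSON value and
    # return the first one that decodes without error.
    for start, char in enumerate(reply):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(reply, start)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in model reply")
```

This only handles surrounding text. Truncated output or trailing commas would still fail, and retrying the request is probably the simplest fallback there.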



The prompt is probably simple. The bigger challenge is that even the minified HTML of a typical web page would exceed GPT-3's 4k-token limit.
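A quick way to check whether a page is likely to fit, sketched under a loud assumption: roughly 4 characters per token, which is a rule of thumb for English prose and tends to underestimate for HTML. A real tokenizer would give exact counts. The constants and function names here are illustrative.

```python
CONTEXT_LIMIT = 4000   # the "4k" limit mentioned above, treated as approximate
CHARS_PER_TOKEN = 4    # rough rule of thumb for English text; an assumption


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(prompt: str, reserved_for_reply: int = 500) -> bool:
    """Check that the prompt leaves room for the model's reply."""
    return estimate_tokens(prompt) + reserved_for_reply <= CONTEXT_LIMIT
```

The `reserved_for_reply` budget matters because the prompt and the completion share the same limit.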


I extract the main content div, which includes various other divs and assorted HTML cruft from 25 years of content management systems.

Then I convert that to Markdown, which GPT groks happily. Markdown strikes the right balance between discarding meaningless structure and preserving some semantics (italics, headings, etc).

The best tool I've found for that process is aaronsw's html2text. It's amazing that it's still so valuable after all these years.
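The two steps above could be sketched roughly like this. It is not OP's code: the `id="main"` selector is hypothetical (a real page will need its own), the extractor is a bare-bones standard-library one, and `html2text` is the third-party package (`pip install html2text`), imported lazily so the extraction part runs without it.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the inner HTML of the first <div> with a given id."""

    def __init__(self, div_id):
        super().__init__(convert_charrefs=False)
        self.div_id = div_id
        self.depth = 0      # <div> nesting depth, counting the target itself
        self.done = False   # set once the target <div> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_div(html, div_id="main"):
    """Return the inner HTML of the div with the given (hypothetical) id."""
    parser = DivExtractor(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def to_markdown(html_fragment):
    """Convert an HTML fragment to Markdown using the html2text package."""
    import html2text  # third-party; imported here so extract_div works alone
    return html2text.html2text(html_fragment)
```

Usage would be along the lines of `to_markdown(extract_div(page_html, "main"))`, with the result pasted into the prompt. For messy real-world markup, a proper HTML library is probably a safer choice than this hand-rolled extractor.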


Thanks for explaining — very helpful!



