Great idea to offer image downloads and filtering with GPT!
I built a similar tool last year that doesn't have those features:
https://url2text.com/
Apologies if the UI is slow - you can see some example output on the homepage.
It's built on Urlbox's website screenshot API, which performs far better when used directly. You can request markdown along with JS-rendered HTML, metadata and a screenshot all in one go:
https://urlbox.com/extracting-text
You can even have it all saved directly to your S3-compatible storage:
https://urlbox.com/s3
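For anyone who wants to try the direct route, here is a minimal Python sketch of building a render request URL. The endpoint shape (`https://api.urlbox.io/v1/<key>/<format>?url=...`), the `md` format name and the `build_render_url` helper are assumptions for illustration, so check the Urlbox docs before relying on them. It only covers asking for one format per URL, not the all-in-one or S3 options.

```python
from urllib.parse import urlencode

# Assumed endpoint shape, not confirmed here; see the Urlbox docs for the real one.
API_BASE = "https://api.urlbox.io/v1"


def build_render_url(api_key: str, target_url: str, fmt: str = "md", **options) -> str:
    """Return a GET URL asking for `target_url` rendered in format `fmt`.

    Extra keyword arguments are passed through as query parameters unchanged.
    """
    query = urlencode({"url": target_url, **options})
    return f"{API_BASE}/{api_key}/{fmt}?{query}"


if __name__ == "__main__":
    # The target URL is percent-encoded so its own query string survives intact.
    print(build_render_url("YOUR_API_KEY", "https://example.com/post?id=1"))
```

Fetching the resulting URL with any HTTP client would then return the rendered output, assuming the endpoint behaves as described above.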
I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.
If you want to scrape whole websites like this, you might also want to check out this new tool by dctanner:
https://usescraper.com/
Looks nice, but url2text doesn't seem to have an API, and Urlbox doesn't seem to have an option to skip the screenshot if you only want the text. And for just the text, it looks to be really expensive.
Sorry the pricing isn't a good fit for you. Urlbox has been running for over 11 years. We're bootstrapped and profitable with a team of 3 (plus a few contractors). We're priced to be sustainable so our customers can depend on us in the long term. We automatically give volume discounts as your usage grows.
It sounds like this is as advanced as DocRaptor[1]. They have what I consider to be the best PDF generation API, giving complete control over the documents you need to create. The pricing is similar.
If you'd rather do it for free, WeasyPrint[2] is the best open source alternative.
Another more affordable option you might want to consider is Urlbox[3]. (Disclosure: I work on this)
Urlbox's rendering engine is based on Chrome. It's been refined over the last 11 years to render pages as images or PDFs[4] that look great. I was a customer for 5 years before I joined the team. Everything we'd tried before Urlbox was a disappointment.
Urlbox probably can't match the power of either Onedoc or DocRaptor, but pricing starts at less than $0.01 per document and drops significantly with scale. If your page looks great when saved as a PDF in Chrome, it should look just as good with Urlbox.
I send the URLs I want scraped to Urlbox[0]. It renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when it's ready for me to process.
I prefer to use Ruby, so Nokogiri[3] is the tool I use for the scraping step.
This has been particularly useful when I've wanted to scrape some pages live from a web app and didn't want to manage running Puppeteer or Playwright in production.
Disclosure: I work on Urlbox now but I also did this in the five years I was a customer before joining the team.
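To make the scraping step concrete, here is a rough Python sketch of what runs once the webhook says the HTML is ready. It uses Python's standard `html.parser` as a stand-in for Nokogiri, and the `PageSummary` class and `summarize` function are made-up names for illustration. Fetching the saved HTML from the bucket and parsing the webhook payload are left out, since their exact shape isn't covered here.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text and every link href from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # Skip anchors that have no href attribute.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def summarize(html: str) -> dict:
    """Parse already-rendered HTML and return its title and link hrefs."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}
```

Because the HTML was already rendered by a real browser, a plain parser like this is enough and nothing in the app needs to run a headless browser itself.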
Does it save the whole page or just the viewport? I just checked the landing page and it looks targeted at the specific case of saving “screenshots”. That's also obvious from the limits on the pricing page, so wouldn't it be unfeasible for larger projects?
Its primary purpose is to render screenshots: full-page, limited to the viewport, or of a single element. To do that as well as it does, the HTML has to be rendered perfectly first.
It's not as cheap as other solutions but we have customers who render millions of pages per month with us. They value the accuracy and reliability that's come from over a decade of refinements to the service.
Larger projects can request preferential pricing based on the specifics of the kinds of pages they are rendering.
Couldn't agree more. Also look into Alan Weiss's books. Avoiding the hourly billing trap via fixed-fee pricing (preferably with value-based pricing) will be the best decision you make for both your quality of life and your wallet.
Urlbox helps web developers render the web with precision. We've been focused on generating screenshots, images and PDFs from HTML or URLs for over a decade. Our customers include over 500 design- or compliance-led organisations. They depend on us to get the intricacies of browser rendering right so they can focus on their core products and services.
We're bootstrapped, profitable and ready to add a third full-time engineer to our team. Our stack is primarily TypeScript. It's a bonus if you're also interested in learning how to orchestrate and scale headless browsers on our Kubernetes clusters. There are also opportunities to create and maintain libraries and SDKs in a range of other languages.
We're excited to hear from people early in their tech career as well as more experienced folk.
A Martello tower in Howth, Dublin, has also become a technology museum, known as the Hurdy Gurdy Museum of Vintage Radio. It was the passion project of one individual, Pat Herbert, who sadly passed away in 2020.
Not 3D, but my 7-year-old and I have been having loads of fun with DragonRuby[0].
He also wanted 3D, but once we added some great-looking dinosaur sprites (generated with DALL-E) he was fully engaged. I'm a Ruby developer and it's been a joy learning the differences between web and game dev.
Knowing that we can easily distribute on mobile platforms, web, Steam and Switch once we're ready has kept us coming back.
The challenge there is that the content is in an iframe.
If you get the URL used for the iframe you can get the content: https://url2text.com/u/kJWaZY
But that's frustrating as it requires two steps.
We might be able to help you get the content from URLs like these in one step. We have quite a bit of power in the Urlbox API that url2text isn't using.
Drop us an email: support@urlbox.com and we'll see what we can do.
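In the meantime, the first of those two steps can be scripted. This is a small Python sketch, assuming you already have the outer page's HTML. It pulls out each iframe's `src` and resolves it against the page URL, giving you the URLs to feed into the second step. `IframeFinder` and `iframe_urls` are hypothetical names, not part of url2text or Urlbox.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Record the src attribute of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            # Ignore iframes with no src (e.g. ones filled in by script later).
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(html: str, page_url: str) -> list:
    """Return absolute URLs for all iframes found in `html`."""
    finder = IframeFinder()
    finder.feed(html)
    finder.close()
    # Relative src values are resolved against the page they appeared on.
    return [urljoin(page_url, src) for src in finder.sources]
```

It only sees iframes present in the HTML you pass in, so iframes injected by JavaScript need the rendered HTML rather than the raw page source.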