jot · on April 14, 2024

Our tool sadly also fails on this: https://url2text.com/u/KYkpBj

The challenge there is that the content is in an iframe.

If you get the URL used for the iframe you can get the content: https://url2text.com/u/kJWaZY

But that's frustrating as it requires two steps.
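
A rough sketch of how the two steps could be chained, assuming the page's HTML has already been fetched. The `IframeFinder` class and `iframe_urls` function are hypothetical names for illustration, not part of url2text or Urlbox:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collects the src attribute of every <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:  # skip iframes that have no src attribute
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for the iframes found in html."""
    finder = IframeFinder()
    finder.feed(html)
    # Relative src values are resolved against the page's own URL.
    return [urljoin(page_url, src) for src in finder.sources]
```

Each URL returned could then be submitted on its own as the second step.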

We might be able to help you get the content from URLs like these in one step. We have quite a bit of power in the Urlbox API that url2text isn't using.

Drop us an email: support@urlbox.com and we'll see what we can do.

jot · on April 14, 2024

Last time I tried readability it worked well with articles but struggled with other kinds of pages. Took away far more content than I wanted it to.

jot · on April 14, 2024

It's not easy working around things like that. But here's how it could work: https://url2text.com/u/wYVake

We were lucky to build this on a mature API that already solves loads of the edge cases around rendering different kinds of pages.

jot · on April 14, 2024

Great idea to offer image downloads and filtering with GPT!

I built a similar tool last year that doesn't have those features: https://url2text.com/

Apologies if the UI is slow - you can see some example output on the homepage.

The API it's built on is Urlbox's website screenshot API, which performs far better when used directly. You can request markdown along with JS-rendered HTML, metadata and a screenshot all in one go: https://urlbox.com/extracting-text

You can even have it all saved directly to your S3-compatible storage: https://urlbox.com/s3

And/or delivered by webhook: https://urlbox.com/webhooks

I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.

If you want to scrape whole websites like this you might also want to check out this new tool by dctanner: https://usescraper.com/

dctanner · on April 16, 2024

Founder of https://usescraper.com here. Thanks for the mention jot. We've also got a single URL scraping option https://docs.usescraper.com/api-reference/scraper/scrape now which is $0.001 per page, and uses a headless chrome browser. Results are snappy and you only pay for what you use.

jph00 · on April 14, 2024

Looks nice, but url2text doesn't seem to have an API, and urlbox doesn't seem to have an option to skip the screenshot if you only want the text. And for just the text, it looks to be really expensive.

jot · on April 14, 2024

Thanks!

Sorry it's not clearer but you can skip the screenshot in the Urlbox API if you want to with:

  curl -X POST \
    https://api.urlbox.io/v1/render/sync \
    -H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
    -H 'Content-Type: application/json' \
    -d '
  {
    "url": "example.com",
    "format": "md"
  }
  '

Here's the result of that: https://renders.urlbox.io/urlbox1/renders/5799274d37a8b4e604...
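
For anyone calling this from Python rather than curl, the same request could be built roughly like this. The endpoint, headers and body are taken from the curl command above; `build_render_request` is a hypothetical helper name, and since the response format isn't shown here the sketch stops at building the request:

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
API_URL = "https://api.urlbox.io/v1/render/sync"


def build_render_request(secret, page_url, fmt="md"):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"url": page_url, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {secret}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it.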

Sorry the pricing isn't a good fit for you. Urlbox has been running for over 11 years. We're bootstrapped and profitable with a team of 3 (plus a few contractors). We're priced to be sustainable so our customers can depend on us in the long term. We automatically give volume discounts as your usage grows.

jot · on March 11, 2024

It sounds like this is as advanced as DocRaptor[1]. They have what I consider to be the best PDF generation API, giving complete control over the documents you need to create. The pricing is similar.

If you'd rather do it for free weasyprint[2] is the best open source alternative.

Another more affordable option you might want to consider is Urlbox[3]. (Disclosure: I work on this)

Urlbox's rendering engine is based on Chrome. It's been refined over the last 11 years to render pages as images or PDFs[4] that look great. I was a customer for 5 years before I joined the team. Everything we'd tried before Urlbox was a disappointment.

Urlbox probably can't match the power of either Onedoc or DocRaptor, but pricing starts at less than $0.01 per document and drops significantly with scale. If your PDF looks great when saving as PDF in Chrome it should look identically brilliant with Urlbox.

[1]: https://docraptor.com
[2]: https://weasyprint.org
[3]: https://urlbox.com
[4]: https://urlbox.com/html-to-pdf

jot · on Feb 20, 2024

This is how I do it.

I send the URLs I want scraped to Urlbox[0]. It renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when it's ready for me to process.

I prefer to use Ruby, so Nokogiri[3] is the tool I use for the scraping step.

This has been particularly useful when I've wanted to scrape some pages live from a web app and don't want to manage running Puppeteer or Playwright in production.
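
As an illustration of the scraping step only, here is a minimal sketch in Python. The standard library's `html.parser` stands in for Nokogiri, the rendered HTML is assumed to have already been read from the bucket, and the fields pulled out (title and links) are just examples. `LinkAndTitleScraper` and `scrape` are hypothetical names:

```python
from html.parser import HTMLParser


class LinkAndTitleScraper(HTMLParser):
    """Pulls the <title> text and all link hrefs out of rendered HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors without an href
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


def scrape(html):
    """Return the page title and link hrefs found in html."""
    scraper = LinkAndTitleScraper()
    scraper.feed(html)
    scraper.close()
    return {"title": scraper.title.strip(), "links": scraper.links}
```

Because the page was already rendered in a browser before being saved, a plain HTML parser like this sees the content that JavaScript added.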

Disclosure: I work on Urlbox now but I also did this in the five years I was a customer before joining the team.

[0]: https://urlbox.com
[1]: https://urlbox.com/s3
[2]: https://urlbox.com/webhooks
[3]: https://nokogiri.org

nkko · on Feb 21, 2024

Does it save the whole page or just the viewport? I just checked the landing page and it looks targeted at the specific case of saving “screenshots”. That's also obvious from the limitations on the pricing page, so would it be unfeasible for larger projects?

jot · on Feb 21, 2024

Urlbox will save the whole page.

Its primary purpose is to render screenshots, whether full-page, limited to the viewport, or of a single element. To do that as well as it does, the HTML has to be rendered perfectly first.

It's not as cheap as other solutions but we have customers who render millions of pages per month with us. They value the accuracy and reliability that's come from over a decade of refinements to the service.

Larger projects can request preferential pricing based on the specifics of the kinds of pages they are rendering.

jot · on Feb 9, 2024

I recommend Jonathan Stark's writing on this: https://jonathanstark.com/

His daily email list gives me regular reminders of how to improve in sales and pricing.

His books are brilliant too. Start with Hourly Billing is Nuts: https://jonathanstark.com/hbin

jeffbrl75 · on Feb 16, 2024

Couldn't agree more. Also look into Alan Weiss's books. Avoiding the hourly billing trap via fixed fee pricing (preferably with value-based pricing) will be the best decision you make for both your quality of life and your wallet.

jot · on Feb 1, 2024

Urlbox helps web developers render the web with precision. We've been focused on generating screenshots, images and PDFs from HTML or URLs for over a decade. Our customers include over 500 design or compliance led organisations. They depend on us to get the intricacies of browser rendering right so they can focus on their core products and services.

We're bootstrapped, profitable and ready to add a third full-time engineer to our team. Our stack is primarily TypeScript. It's a bonus if you're also interested in learning how to orchestrate and scale headless browsers on our Kubernetes clusters. There are also opportunities to create and maintain libraries and SDKs in a range of other languages.

We're excited to hear from people early in their tech career as well as more experienced folk.

Read more: https://urlbox.com/jobs/typescript-developer

jot · on Jan 28, 2024

The one nearest me has a fantastic museum [0] inside.

A large part of it is dedicated to old tech donated by locals over the years. Highly recommended if you’re in the area.

[0]: https://seafordmuseum.co.uk/

icosian · on Jan 28, 2024

A Martello tower in Howth, Dublin, has also become a technology museum, known as the Hurdy Gurdy Museum of Vintage Radio. It was the passion project of one individual, Pat Herbert, who sadly passed away in 2020.

https://sites.google.com/site/hurdygurdymuseum/home

mttch · on Jan 28, 2024

I grew up around there. Definitely second the museum recommendation, highly eclectic.

jot · on Jan 18, 2024

Not 3D but my 7 year old and I have been having loads of fun with DragonRuby[0]

He also wanted 3D, but once we added some great looking dinosaur sprites (generated with DALL·E) he was fully engaged. I'm a Ruby developer and it's been a joy learning the differences between web and game dev.

Knowing that we can easily distribute on mobile platforms, web, Steam and Switch once we're ready has kept us coming back.

[0]: https://dragonruby.org/