Thanks, I definitely might once I tidy things up... though there isn't much to share. The model is BART, pre-trained on the CNN/DailyMail dataset and hosted on AWS SageMaker (I used the BART tutorial as a reference), then fine-tuned on my own article summaries using the transformers library. The backend is Django/Bootstrap. Really, nothing special except for the fine-tuning dataset, which took ages to clean.
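If it's useful, the fine-tuning step was roughly this shape - a minimal sketch with the transformers Seq2SeqTrainer, where the checkpoint, column names, and hyperparameters are placeholders rather than my exact setup:

    from datasets import Dataset
    from transformers import (
        BartForConditionalGeneration,
        BartTokenizerFast,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model_name = "facebook/bart-large-cnn"  # BART checkpoint pre-trained on CNN/DailyMail
    tokenizer = BartTokenizerFast.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Stand-in for the cleaned article/summary pairs
    raw = Dataset.from_dict({
        "article": ["Full article text goes here..."],
        "summary": ["A short reference summary."],
    })

    def preprocess(batch):
        inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    train_set = raw.map(preprocess, batched=True, remove_columns=["article", "summary"])

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="bart-finetuned", num_train_epochs=3),
        train_dataset=train_set,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()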
Big version bump for Extractor API - just added the ability to extract boilerplate-free text and translate it to and from 50+ languages, in a single GET request. https://extractorapi.com/features/
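For the curious, a call looks roughly like this - the endpoint path and parameter names below are illustrative shorthand, not the documented schema (the features page has the real ones):

    import requests

    # Hypothetical endpoint/parameters - check the docs for the exact names
    resp = requests.get(
        "https://extractorapi.com/api/v1/extractor",
        params={
            "apikey": "YOUR_API_KEY",
            "url": "https://www.example.com/some-article",
            "translate_to": "de",  # assumed name for the target-language parameter
        },
    )
    print(resp.json()["text"])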
Hey HN - I'm coming at this solo. It's my first SaaS, after years of being a freelance copywriter and content manager.
A few years back I discovered Python, and very soon after, NLP. As a writer with a love for sci-fi/tech, I was enamored, and I spent ungodly hours in my employer's GCP-hosted Jupyter notebooks, coming up with all sorts of impractical experiments with spaCy, Facebook's StarSpace, Gensim, and the like.
For one, I needed a lot of training data, so I'd crawl thousands of pages of text from news sites using Scrapy, storing the data directly on the server. For text extraction and boilerplate removal I used newspaper3k, and eventually a custom extractor that used a random forest model to select the right element "candidates".
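newspaper3k handles the extraction side with very little code - the standard usage is just:

    from newspaper import Article  # newspaper3k

    article = Article("https://www.example.com/some-story")
    article.download()
    article.parse()

    print(article.title)
    print(article.text)  # boilerplate-free body text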
I wanted a simpler way to aggregate text for a dataset, query it, create subsets based on keywords, and so on. The paid options out there - Diffbot, Aylien, Ujeebu, Scrapinghub's news API, etc. - weren't exactly what I was looking for.
After learning the minimum amount of JS required, I built a shitty local app where you could paste a bunch of URLs and get back a JSON with the extracted text. I posted it here on HN, and a few hundred visits absolutely demolished the $5 DO instance. I figured others might want something like this.
So I built extractorapi.com - a text extraction API and UI that revolves around the idea of "Jobs". Let's say you've gathered a list of URLs from the NY Times, or The Economist, or Bloomberg. You then provide that list of URLs to a job called "my_articles", something like this (field names below are illustrative - the docs have the exact schema):
    import requests

    # Create a job from a list of URLs (endpoint and field names are
    # illustrative - check the API docs for the exact schema)
    endpoint = "https://extractorapi.com/api/v1/jobs/"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    data = {
        "name": "my_articles",
        "urls": [
            "https://www.nytimes.com/...",
            "https://www.economist.com/...",
        ],
    }

    r = requests.post(endpoint, headers=headers, data=data)
The job then processes your input URLs server-side, and once it's complete you can query the extracted text or titles within the job. All jobs and extracted text are saved to your account - you can use the API or the web app to explore the jobs you've started, download them in .csv or .json format, and check their progress.
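Checking on a job and pulling results programmatically looks roughly like this (same caveat - the paths and response fields are illustrative):

    # Poll the job, then fetch its results (illustrative paths/fields)
    status = requests.get(endpoint + "my_articles/", headers=headers).json()
    print(status["progress"])

    results = requests.get(
        endpoint + "my_articles/results/",
        headers=headers,
        params={"format": "json"},
    )
    print(results.json())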
I get that it's hard to market to devs like myself (why buy when you can build?), so I'm looking for any feedback/criticism/suggestions on your experience with Extractor API. As Faulkner said, you must kill your darlings - let me know if all this shit doesn't make any sense, or if it's actually helpful. Or somewhere in between.
A collection of over 1,000 AMA questions and answers on COVID-19 from various experts, professionals, and journalists. The GitHub repo with all the questions and answers: https://github.com/aleksandr-smechov/covid-19-ama-db
Author of Wordcab-Transcribe here. We use faster-whisper for transcription plus NeMo for diarization, if you want to take a look.
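The faster-whisper half is easy to try standalone - this is the library's standard usage, not our exact pipeline:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # transcribe() returns a generator of segments plus audio info
    segments, info = model.transcribe("meeting.wav")
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")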