Thanks, I definitely might once I tidy things up... though there isn't much to share. The model is BART, pre-trained on the CNN/DailyMail dataset and hosted on AWS SageMaker (I used the BART tutorial as a reference), then fine-tuned on my own article summaries using the transformers library. The backend is Django/Bootstrap. Really, nothing special except for the fine-tuning dataset, which took ages to clean.
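If it's useful, the fine-tuning step was roughly this shape - a minimal sketch with the transformers Seq2SeqTrainer, where the checkpoint, column names, and hyperparameters are placeholders rather than my exact setup:

    from datasets import Dataset
    from transformers import (
        BartForConditionalGeneration,
        BartTokenizerFast,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model_name = "facebook/bart-large-cnn"  # BART checkpoint pre-trained on CNN/DailyMail
    tokenizer = BartTokenizerFast.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Stand-in for the cleaned article/summary pairs
    raw = Dataset.from_dict({
        "article": ["Full article text goes here..."],
        "summary": ["A short reference summary."],
    })

    def preprocess(batch):
        inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    train_set = raw.map(preprocess, batched=True, remove_columns=["article", "summary"])

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="bart-finetuned", num_train_epochs=3),
        train_dataset=train_set,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()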
Big version bump for Extractor API - just added the ability to extract boilerplate-free text and translate it to and from 50+ languages, in a single GET request. https://extractorapi.com/features/
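For the curious, a call looks roughly like this - the endpoint path and parameter names below are illustrative shorthand, not the documented schema (the features page has the real ones):

    import requests

    # Hypothetical endpoint/parameters - check the docs for the exact names
    resp = requests.get(
        "https://extractorapi.com/api/v1/extractor",
        params={
            "apikey": "YOUR_API_KEY",
            "url": "https://www.example.com/some-article",
            "translate_to": "de",  # assumed name for the target-language parameter
        },
    )
    print(resp.json()["text"])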
Hey HN - I'm coming at this solo. It's my first SaaS, after years of being a freelance copywriter and content manager.
A few years back I discovered Python, and very soon after, NLP. As a writer with a love for sci-fi/tech, I was enamored, and I spent ungodly hours in my employer's GCP-hosted Jupyter notebooks, coming up with all sorts of impractical experiments with spaCy, Facebook's StarSpace, Gensim, and the like.
For one, I needed a lot of training data, so I'd crawl thousands of pages of text from news sites using Scrapy, storing the data directly on the server. For text extraction and boilerplate removal I used newspaper3k, and eventually a custom extractor that used a random forest model to select the right element "candidates".
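newspaper3k handles the extraction side with very little code - the standard usage is just:

    from newspaper import Article  # newspaper3k

    article = Article("https://www.example.com/some-story")
    article.download()
    article.parse()

    print(article.title)
    print(article.text)  # boilerplate-free body text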
I wanted a simpler way to aggregate text for a dataset, query it, create subsets based on keywords, and so on. The paid options out there - Diffbot, Aylien, Ujeebu, Scrapinghub's news API, etc. - weren't exactly what I was looking for.
After learning the minimum amount of JS required, I built a shitty local app where you could paste a bunch of URLs and get back a JSON with the extracted text. I posted it here on HN, and a few hundred visits absolutely demolished the $5 DO instance. I figured others might want something like this.
So I built extractorapi.com - a text extraction API and UI that revolves around the idea of "Jobs". Let's say you've gathered a list of URLs from the NY Times, or The Economist, or Bloomberg. You then provide that list of URLs to a job called "my_articles", something like this (field names below are illustrative - the docs have the exact schema):
    import requests

    # Create a job from a list of URLs (endpoint and field names are
    # illustrative - check the API docs for the exact schema)
    endpoint = "https://extractorapi.com/api/v1/jobs/"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    data = {
        "name": "my_articles",
        "urls": [
            "https://www.nytimes.com/...",
            "https://www.economist.com/...",
        ],
    }

    r = requests.post(endpoint, headers=headers, data=data)
The job then processes your input URLs server-side, and once it's complete you can query the extracted text or titles within the job. All jobs and extracted text are saved to your account - you can use the API or the web app to explore the jobs you've started, download them in .csv or .json format, and check their progress.
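Checking on a job and pulling results programmatically looks roughly like this (same caveat - the paths and response fields are illustrative):

    # Poll the job, then fetch its results (illustrative paths/fields)
    status = requests.get(endpoint + "my_articles/", headers=headers).json()
    print(status["progress"])

    results = requests.get(
        endpoint + "my_articles/results/",
        headers=headers,
        params={"format": "json"},
    )
    print(results.json())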
I get that it's hard to market to devs like myself (why buy when you can build?), so I'm looking for any feedback/criticism/suggestions on your experience with Extractor API. As Faulkner said, you must kill your darlings - let me know if all this shit doesn't make any sense, or if it's actually helpful. Or somewhere in between.
A collection of over 1,000 AMA questions and answers on COVID-19 from various experts, professionals, and journalists. The GitHub repo with all the questions and answers: https://github.com/aleksandr-smechov/covid-19-ama-db
Author of Wordcab-Transcribe here. We use faster-whisper for transcription plus NeMo for diarization, if you want to take a look.
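The faster-whisper half is easy to try standalone - this is the library's standard usage, not our exact pipeline:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # transcribe() returns a generator of segments plus audio info
    segments, info = model.transcribe("meeting.wav")
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")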