copypirate's comments

https://github.com/Wordcab/wordcab-transcribe

Author of Wordcab-Transcribe here. We use faster-whisper + NeMo for diarization, if you want to take a look.
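For anyone curious about the transcription half of that stack, here's a minimal sketch of calling faster-whisper directly. The model size, device settings, and file path are placeholders, not Wordcab-Transcribe's actual configuration:

```python
def transcribe(audio_path: str, model_size: str = "large-v2"):
    """Transcribe an audio file with faster-whisper (CTranslate2 backend).

    The import is deferred so the sketch can be read (and the function
    defined) without the dependency installed.
    """
    from faster_whisper import WhisperModel

    # compute_type="int8" keeps memory low on CPU; use "float16" on GPU
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path, word_timestamps=True)
    # segments is a lazy generator of Segment(start, end, text, ...)
    return [(s.start, s.end, s.text) for s in segments], info.language
```

Diarization with NeMo then assigns speaker labels to these timestamped segments; see the repo for how the two are stitched together.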


Transcription is a commodity at this point - great to see libraries like this make it simple to deploy on a K8s cluster


Just use test@gmail.com or something similar, there is no confirmation necessary.


An updated version of Skriber - fine-tuned on a dataset of my article summaries.


Thanks, I definitely might once I tidy things up... though there isn't much to share. The model is pre-trained on the CNN/DailyMail dataset and hosted on AWS SageMaker (I used the BART tutorial as a reference), then fine-tuned on my own article summaries using the transformers library. The backend is Django/Bootstrap. Really, nothing special except for the fine-tuning dataset, which took ages to clean.
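For reference, inference with a CNN/DailyMail-pretrained BART checkpoint via transformers looks roughly like this. The checkpoint name below is the public facebook/bart-large-cnn model, not my fine-tuned weights:

```python
def summarize(text: str, max_length: int = 130, min_length: int = 30) -> str:
    """Summarize an article with a CNN/DailyMail-pretrained BART model."""
    from transformers import pipeline  # deferred: heavy dependency

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    result = summarizer(
        text, max_length=max_length, min_length=min_length, do_sample=False
    )
    # pipeline returns a list of dicts: [{"summary_text": "..."}]
    return result[0]["summary_text"]
```

Fine-tuning is the same model with your own (article, summary) pairs swapped in as training data; the hard part, as noted, is cleaning that dataset.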


It works well, which is special. Open sourcing it would be good whether it's clean or not.


Summarizer is at skriber.io

Powered by BART and trained on my own dataset of a few thousand summaries. Let me know how it works for you :)

login: test

password: test123


Tried copy/paste but still got: "Please enter a correct username and password. Note that both fields may be case-sensitive."


There's an asterisk at the end of 123


Oops

login: test pass: test123


StoryChief but for niche sites like HN, IH, and reddit


Big version bump for Extractor API - just added the ability to extract boilerplate-free text and translate it to and from 50+ languages, in a single GET request. https://extractorapi.com/features/
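A sketch of what that single GET might look like from Python. The parameter names here (`apikey`, `url`, and especially `translate_to`) are illustrative assumptions; check the features page for the documented ones:

```python
def extract_and_translate(api_key: str, article_url: str, target_lang: str = "en"):
    """Fetch boilerplate-free article text, translated, in one GET request.

    Parameter names are assumptions for illustration, not the
    documented API surface.
    """
    import requests  # deferred so the sketch imports cleanly

    endpoint = "https://extractorapi.com/api/v1/extractor"
    params = {
        "apikey": api_key,
        "url": article_url,
        "translate_to": target_lang,  # hypothetical parameter name
    }
    r = requests.get(endpoint, params=params, timeout=30)
    r.raise_for_status()
    return r.json()
```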


Hey HN - coming at this solo, my first SaaS, after years of being a freelance copywriter and content manager.

A few years back I discovered Python, and very soon after, NLP. As a writer with a love for sci-fi/tech, I was enamored, and spent ungodly hours in my employer's GCP-hosted Jupyter Notebooks, coming up with all sorts of impractical experiments with spaCy, Facebook's StarSpace, Gensim, and the like.

For one, I needed a lot of training data. I'd go crawl thousands of pages of text from news sites, using Scrapy and storing data directly on the server. For text extraction and boilerplate removal, I used newspaper3k, and eventually a custom extractor that used a random forest model to select proper element "candidates".
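The off-the-shelf part of that pipeline is short. newspaper3k's `Article` handles download, parsing, and boilerplate removal in a few lines:

```python
def get_article_text(url: str) -> dict:
    """Download and parse one article with newspaper3k."""
    from newspaper import Article  # deferred import

    article = Article(url)
    article.download()
    article.parse()
    return {"title": article.title, "text": article.text}
```

The custom random-forest extractor replaces the parse step: featurize candidate DOM elements (e.g. text density, link density, tag depth) and classify which ones hold article text. The exact feature set above is my guess, not the author's.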

I wanted a simpler way to aggregate text for a dataset, query it, create subsets based on keywords, and so on. The paid options out there - Diffbot, Aylien, Ujeebu, Scrapinghub's news API, etc. weren't exactly what I was looking for.

After learning the minimum amount of JS required, I built a shitty local app where you could paste a bunch of URLs and get back a JSON with the extracted text. I posted it up here, on HN, and there were a few hundred visits, absolutely demolishing the $5 DO instance. I figured others might want something like this.

So I built extractorapi.com - a text extraction API and UI that revolves around the idea of "Jobs". Say you've gathered a list of URLs from the NY Times, The Economist, or Bloomberg. You then provide that list to a job called "my_articles". In Python:

  import requests

  api_key = "YOUR_API_KEY"
  endpoint = "https://extractorapi.com/api/v1/jobs"

  headers = {"Authorization": f"Bearer {api_key}"}
  data = {
      "job_name": "my_articles",
      "url_list": [
          "example.com/article1",
          "example.com/article2",
          # ...
      ],
  }

  # send the nested payload as JSON, not form-encoded
  r = requests.post(endpoint, headers=headers, json=data)

This job will then process your input URLs server-side, and once complete, you can query the extracted text or title within the job. All jobs and extracted text are saved on your account - you can use the API or the web app to explore the jobs you started programmatically, download them in .csv or .json formats, and check their progress.
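Checking on a job is another authenticated request. A rough sketch, noting that the `/jobs/{job_id}` path shape is an assumption modeled on the POST endpoint above, not documented API:

```python
def get_job(api_key: str, job_id: str) -> dict:
    """Poll a job's status and extracted documents.

    The /jobs/{job_id} path is an assumption for illustration.
    """
    import requests  # deferred so the sketch imports cleanly

    endpoint = f"https://extractorapi.com/api/v1/jobs/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    r = requests.get(endpoint, headers=headers, timeout=30)
    r.raise_for_status()
    return r.json()
```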

I go into more detail in this Medium piece, "Creating an Automated Text Extraction Workflow": https://medium.com/@aleks_82234/creating-an-automated-text-e...

I get that it's hard to market to devs like myself (why buy when you can build?), so I'm looking for any feedback/criticism/suggestions on your experience with Extractor API. As Faulkner points out, you must kill your darlings - let me know if all this shit doesn't make any sense, or if it's actually helpful. Or somewhere in-between.


A collection of over 1,000 AMA questions and answers on COVID-19 from various experts, professionals, and journalists. The GitHub repo with all the questions and answers: https://github.com/aleksandr-smechov/covid-19-ama-db

