Not that I don't think the technology behind this is both awesome and completely applicable -- because it is, and it is -- but I hope Medium doesn't take this approach.
The biggest part of Medium's value proposition, to me, is manual curation. While the included summarizations are technically accurate, they're stilted, as if they're being spat out by an algorithm (because they are!): the exact opposite of the feeling you want when a site is gently telling you that it's editing and proofing and including all of these human flourishes.
Those summaries were quite fantastic. I agree that it is perhaps unnecessary to use them for the selected entries, but I feel like they must be useful somewhere else.
Perhaps it would make the most sense to have an RSS reader or similar with the summarizer built in. Or a page with links to Medium articles (plus others) with summaries.
I searched for TextTeaser and only found a boilerplate login site and some unrelated results. Have you posted, or do you plan to post, the code anywhere? The results look very good, and though I'm sure you can think of some great uses for it (incl. ways to monetize it), it would be fantastic to have a good summarizer for personal notes and the like. I suppose if Readborg were a full-blown RSS reader (it doesn't appear to be) you could pump all your writing through an RSS feed, but that's a sticky way of doing things.
Edit: Saw the comment below asking if we could see your thesis, which I would be very interested in, especially if you don't plan on sharing all of your code.
A practical use might be a behind-the-scenes installation of this technology, so that the curators can more quickly create their own summaries of articles for Medium.
Basically, tools to help the Medium staff scale the white-glove nature of the site.
- There is no list/bill of demands on the table that can be negotiated.
- This is great for some of the effects the protests will bring to government leaders. (this one really didn't make much sense)
- In less than a week a movement was formed that led hundreds of thousands to the streets, without any kind of leadership, warning or prediction.
- There is an explanation for all that.
- Despite all the anarchy, there is logic behind all this that we are experiencing.
This is quite impressive. It also worked fairly well for that article in Portuguese (it's just missing the closing thought, but the original article is vague anyway; there are 3-4 paragraphs trying to conclude, but no single final thought).
I'd love to see this plugged into tldr.io, so articles on HN could be automatically extracted + summarized -- and later improved by real humans, as needed.
I feel that, just like movie trailers, these 'previews' ruin the articles by giving away too much.
Why would you read the article if you've just read the cliff notes? Previews should entice, not give away too much.
Movie trailers look awesome because they're the best the film has to offer, making the full-length feature look pale in comparison, which is why I try to avoid them at all costs.
I think that technology is amazing. It was a really cool way to demonstrate it, because I had already read some of those posts, so I got a sense of how accurate the summaries were. I would love to know more. Great job.
Interesting! Nice work. I have also worked on this problem (not exactly the same, though!), so I want to ask a few things. The only description you have given is of the features you have used. These features are very well established and have been used almost since the inception of this field (the '60s onward). I would suggest using more advanced features like the HITS score. What are the baseline techniques you have compared against? Some recent work, like Shen et al. (automatic document summarization using CRFs), has used CRF-based methods. Is your method based on a bag of words, or does it have Markovian structure? Also, how do you decide how many sentences to select? Please explain a little about the sentence-ranking technique and also about the evaluation techniques. Without these explanations it is very difficult to make any constructive comment. You may also want to talk about the training process (if supervised) and scaling issues!
We can also talk offline if you wish to :) .
-Rahul
Well-written English is remarkably well structured[1]: the first or second paragraph usually contains its main point and the rest fills in details; the first paragraph or two of a section provides an overview and the rest of the section provides support, etc. News articles are even more structured: the first two paragraphs tell the whole story in summary[2].
For writing as short as articles on Medium, it might be useful to compare the algorithm to a naive version that pulls the first two sentences from each paragraph.
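For example, a minimal version of that naive baseline might look like this (the sentence splitting here is deliberately crude; a real comparison would use a proper sentence tokenizer such as NLTK's):

    import re

    def naive_summary(text, sentences_per_paragraph=2):
        """Naive baseline: keep the first two sentences of each paragraph.

        Exploits the overview-then-details structure of well-edited English.
        """
        summary = []
        for paragraph in text.split("\n\n"):
            # Crude sentence split on ., !, or ? followed by whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
            summary.extend(s for s in sentences[:sentences_per_paragraph] if s)
        return " ".join(summary)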
If anyone wants to play, I'm the author of Classifier4J, a very old Java text classification library which also includes a text summarization engine. I believe the algorithm in c4j has been ported to Python and is available in NLTK (https://groups.google.com/forum/m/#!topic/nltk-dev/qV9e5TsCB...).
I did a bit of testing of the Java version, and it was pretty competitive with commercially available summarizers at the time.
It's always interesting to see new people approach the summarization problem, but I find these summaries have the defects common to automatic keyphrase-extraction summaries: they feel very artificial, and are usually not accurate. The summary of "four steps to Google" is a good example.
I hope this kind of technology sees the light of day, but I'm very skeptical about it working on general-purpose content and not just structured content such as news, as it does today.
As the author of TextTeaser noted, there are two approaches to automatic summarization: abstraction and extraction.
Abstraction combines huge portions of two young research fields: NLP and NLG (Natural Language Processing and Natural Language Generation). NLG is even harder than NLP, and less researched. Without a good NLG algorithm for presenting the summary, you can't have more human-sounding summaries.
Extraction simply takes sentences (or portions of them), ranks them, and presents the few best results.
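As a rough illustration of the extraction idea (a toy frequency-based ranker, not any particular published system):

    import re
    from collections import Counter

    def extract_summary(text, n_sentences=3):
        """Toy extractive summarizer: rank sentences by average word frequency."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        freq = Counter(re.findall(r"\w+", text.lower()))

        def score(sentence):
            tokens = re.findall(r"\w+", sentence.lower())
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        # Rank, take the best few, then restore original document order.
        best = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        return [s for s in sentences if s in best]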
Two years ago, I was at a PhD thesis presentation on text summarization. There I figured out that you can build a fair summarization algorithm in a few hours. Here is a prototype:
https://bitbucket.org/ivan444/textsum/src/1d09b0f4f72a60903d...
It's dirty prototype code; it took me only about 10 hours of work to prepare the dataset, think through the algorithm, write the program, and tune it. (It works only for Croatian; if you want another language, you'll need a list of function words for that language -- http://en.wikipedia.org/wiki/Function_word.)
There is also a Java version of the text summarizer (somewhere in the repository) and a simple tool to get clean, article-only text from any page containing longer texts (it isn't tuned well; I didn't spend more than an hour of work on it, so I don't expect it to work well).
The algorithm is simple: (1) break the text into sentences, (2) extract features, (3) compute the feature scores and sum them, (4) present the ranked sentences (and, later, choose the few best).
The features used: normalized number of words; type of sentence (declarative, interrogative, exclamatory); an order score (the first sentence gets a boost, since it is usually the most important one); the ratio between the number of function words and all words (function words are words without semantic content; there is an fwords.txt in the repository containing ~700 Croatian function words); and the normalized sum of the three minimum TF-IDF scores (document = sentence).
I don't know the state of the code (it is more than a year old), but anyone is free to use it for anything they like.
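For illustration, here is a minimal Python sketch of that four-step pipeline (this is not the Bitbucket code; the equal feature weights and the direction of the function-word feature are my assumptions, and function_words would come from a list like fwords.txt):

    import math
    import re
    from collections import Counter

    def summarize(text, function_words, n_best=3):
        """Sketch of the pipeline: split, featurize, score and sum, rank."""
        # (1) Break the text into sentences.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        docs = [re.findall(r"\w+", s.lower()) for s in sentences]
        n_docs = len(docs)
        df = Counter(w for d in docs for w in set(d))  # document frequency (document = sentence)
        max_len = max(len(d) for d in docs) or 1

        def score(i, sentence, tokens):
            # (2) Extract the features described above (all weights 1.0, a guess).
            length = len(tokens) / max_len                 # normalized number of words
            kind = 1.0 if sentence.endswith(".") else 0.5  # sentence type: favor declarative
            order = 1.0 / (i + 1)                          # boost the first sentence
            fw = sum(t in function_words for t in tokens) / (len(tokens) or 1)
            tf = Counter(tokens)
            tfidf = sorted(tf[w] / (len(tokens) or 1) * math.log(n_docs / df[w]) for w in tf)
            low3 = sum(tfidf[:3]) / 3                      # normalized sum of 3 lowest TF-IDF
            # (3) Sum the feature scores (penalizing function-word-heavy sentences is a guess).
            return length + kind + order + (1.0 - fw) + low3

        ranked = sorted(
            ((score(i, s, d), s) for i, (s, d) in enumerate(zip(sentences, docs))),
            reverse=True,
        )
        # (4) Present the ranked sentences; choose the few best.
        return [s for _, s in ranked[:n_best]]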
As a PhD graduate in NLG, I wouldn't say NLG is a "young" research field. For example, the oldest NLG book I have is Eduard Hovy's PhD work on the PAULINE system ("Generating Natural Language Under Pragmatic Constraints"), which was published back in 1988. The seminal reference book for NLG ("Building Natural Language Generation Systems") was published back in 2000. What's made NLG more interesting recently is that the computing environment has changed considerably: we have considerably larger pools of time-series data than were available in the past, and we now also have a standardised data-to-text pipeline architecture for creating NLG applications for such data.
Nevertheless, I do agree there are still considerable challenges in trying to perform text-to-text generation, which involves combining NLP and NLG to abstract, interpret, and then summarise unstructured free text.
Is text summarization really mature now? It can be very handy; I don't always have time to read entire news articles. Found one more text summarizer; not sure how good it is -- http://pravin.paratey.com/nlp/summarization
I'm using 4 features of the article: title, sentence length, sentence position, and modified keyword frequency. The first three features are standard ones that you can see in most automatic summarization research. Modified keyword frequency considers not just the frequency, but also the distance of each keyword. :) There. That's just a brief explanation of how TextTeaser works.
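To make that more concrete, here's a simplified sketch of how those four signals could be combined (illustrative only: the weights, the "ideal" sentence length, and the plain keyword score are placeholders, and the keyword-distance modification is not shown):

    import re
    from collections import Counter

    def score_sentences(title, text, keyword_count=10, ideal_length=20):
        """Illustrative combination of the four features (placeholder weights)."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        title_words = set(re.findall(r"\w+", title.lower()))
        words = re.findall(r"\w+", text.lower())
        keywords = {w for w, _ in Counter(words).most_common(keyword_count)}

        scored = []
        for i, s in enumerate(sentences):
            tokens = re.findall(r"\w+", s.lower())
            if not tokens:
                continue
            # Feature 1: overlap with the title words.
            title_score = len(title_words & set(tokens)) / (len(title_words) or 1)
            # Feature 2: sentence length, peaking at an assumed "ideal" length.
            length_score = max(0.0, 1 - abs(ideal_length - len(tokens)) / ideal_length)
            # Feature 3: sentence position; earlier sentences score higher.
            position_score = 1 - i / len(sentences)
            # Feature 4: plain keyword frequency; the "modified" version would
            # also factor in the distance of each keyword, omitted here.
            keyword_score = sum(t in keywords for t in tokens) / len(tokens)
            total = (title_score + length_score + position_score + keyword_score) / 4
            scored.append((total, s))
        return sorted(scored, reverse=True)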
You mention in the article that you developed the algorithm as a part of writing your MSc thesis. Is your thesis available on the Internet (as in pdf or in any other format)?
I was reading about various topics related to summarization recently, and I'm just wondering if your "modified keyword frequency" is some type of ConcGram.