Not that I don't think the technology behind this is both awesome and completely applicable -- because it is, and it is -- but I hope Medium doesn't take this approach.
The biggest part of Medium's value proposition, to me, is manual curation. While the included summarizations are technically accurate, they're stilted, as if they're being spat out by an algorithm (because they are!): the exact opposite of the feeling you want when a site is gently telling you that it's editing and proofing and including all of these human flourishes.
Those summaries were quite fantastic. I agree that it is perhaps unnecessary to use them for the selected entries, but I feel like they must be useful somewhere else.
Perhaps it would make the most sense to have an RSS reader or similar with the summarizer built in. Or a page with links to Medium articles (plus others) with summaries.
I searched for TextTeaser and only found a boilerplate login site and some unrelated results. Have you posted, or do you plan to post, the code anywhere? The results look very good, and though I'm sure you can think of some great uses for it (incl. ways to monetize it), it would be fantastic to have a good summarizer for personal notes and the like. I suppose if Readborg were a full-blown RSS reader (it doesn't appear to be) you could pump all your writing through an RSS feed, but that's a sticky way of doing things.
Edit: Saw the comment below asking if we could see your thesis, which I would be very interested in, especially if you don't plan on sharing all of your code.
A practical use might be a behind-the-scenes installation of this technology, so that the curators can more quickly create their own summaries of articles for Medium.
Basically, tools to help the Medium staff scale the white-glove nature of the site.
- There is no list/bill of demands on the table that can be negotiated.
- This is great for some of the effects the protests will bring to government leaders. (this one really didn't make much sense)
- In less than a week a movement was formed that led hundreds of thousands to the streets, without any kind of leadership, warning or prediction.
- There is an explanation for all that.
- Despite all the anarchy, there is logic behind all this that we are experiencing.
This is quite impressive. It also worked fairly well for that article in Portuguese (it's just missing the closing thought, but the original article is vague anyway; there are 3-4 paragraphs trying to conclude, but no single final thought).
I'd love to see this plugged into tldr.io, so articles on HN could be automatically extracted + summarized -- and later improved by real humans, as needed.
I feel that, just like movie trailers, these 'previews' ruin the articles by giving away too much.
Why would you read the article if you've just read the cliff notes? Previews should entice, not give away too much.
Movie trailers look awesome because they're the best the film has to offer, making the full-length feature look pale in comparison, which is why I try to avoid them at all costs.
I think that technology is amazing. It was a really cool way to demonstrate it, because I had already read some of those posts, so I got a sense of how accurate the summaries were. I would love to know more. Great job.
Interesting! Nice work. I have also worked on this problem (not exactly the same, though!), so I want to ask a few things. The only description you have given is of the features you have used. These features are very well established and have been used almost since the inception of this field (the '60s onward). I would suggest using more advanced features like the HITS score. What are the baseline techniques you have compared against? Some recent work, like Shen et al. (automatic document summarization using CRFs), has used CRF-based methods. Is your method based on a bag of words, or does it have Markovian structure? Also, how do you decide how many sentences to select? Please explain a little about the sentence-ranking technique and also about the evaluation techniques. Without these explanations it is very difficult to make any constructive comment. You may also want to talk about the training process (if supervised) and scaling issues!
We can also talk offline if you wish to :) .
-Rahul
Well-written English is remarkably well structured[1]: the first or second paragraph usually contains its main point and the rest fills in details; the first paragraph or two of a section provides an overview and the rest of the section provides support, etc. News articles are even more structured: the first two paragraphs tell the whole story in summary[2].
For writing as short as articles on Medium, it might be useful to compare the algorithm to a naive version that pulls the first two sentences from each paragraph.
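For example, a minimal version of that naive baseline might look like this (the sentence splitting here is deliberately crude; a real comparison would use a proper sentence tokenizer such as NLTK's):

    import re

    def naive_summary(text, sentences_per_paragraph=2):
        """Naive baseline: keep the first two sentences of each paragraph.

        Exploits the overview-then-details structure of well-edited English.
        """
        summary = []
        for paragraph in text.split("\n\n"):
            # Crude sentence split on ., !, or ? followed by whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
            summary.extend(s for s in sentences[:sentences_per_paragraph] if s)
        return " ".join(summary)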
If anyone wants to play, I'm the author of Classifier4J, a very old Java text classification library which also includes a text summarization engine. I believe the algorithm in c4j has been ported to Python and is available in NLTK (https://groups.google.com/forum/m/#!topic/nltk-dev/qV9e5TsCB...).
I did a bit of testing of the Java version, and it was pretty competitive with commercially available summarizers at the time.
It's always interesting to see new people approach the summarization problem, but I find these summaries have the defects common to automatic keyphrase-extraction summaries: they feel very artificial, and are usually not accurate. The summary of "four steps to Google" is a good example.
I hope this kind of technology sees the light of day, but I'm very skeptical about it working on general-purpose content and not just structured content such as news, as it does today.
As the author of TextTeaser noted, there are two approaches to automatic summarization: abstraction and extraction.
Abstraction combines huge portions of two young research fields: NLP and NLG (Natural Language Processing and Natural Language Generation). NLG is even harder than NLP, and less researched. Without a good NLG algorithm for presenting the summary, you can't have more human-sounding summaries.
Extraction simply takes sentences (or portions of them), ranks them, and presents the few best results.
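As a rough illustration of the extraction idea (a toy frequency-based ranker, not any particular published system):

    import re
    from collections import Counter

    def extract_summary(text, n_sentences=3):
        """Toy extractive summarizer: rank sentences by average word frequency."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        freq = Counter(re.findall(r"\w+", text.lower()))

        def score(sentence):
            tokens = re.findall(r"\w+", sentence.lower())
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        # Rank, take the best few, then restore original document order.
        best = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        return [s for s in sentences if s in best]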
Two years ago, I was at a PhD thesis presentation on text summarization. There I figured out that you can build a fair summarization algorithm in a few hours. Here is a prototype:
https://bitbucket.org/ivan444/textsum/src/1d09b0f4f72a60903d...
It's dirty prototype code; it took me only about 10 hours of work to prepare the dataset, think through the algorithm, write the program, and tune it. (It works only for Croatian; if you want another language, you'll need a list of function words for that language -- http://en.wikipedia.org/wiki/Function_word.)
There is also a Java version of the text summarizer (somewhere in the repository) and a simple tool to get clean, article-only text from any page containing longer texts (it isn't tuned well; I didn't spend more than an hour of work on it, so I don't expect it to work well).
The algorithm is simple: (1) break the text into sentences, (2) extract features, (3) compute the feature scores and sum them, (4) present the ranked sentences (and, later, choose the few best).
The features used: normalized number of words; type of sentence (declarative, interrogative, exclamatory); an order score (the first sentence gets a boost, since it is usually the most important one); the ratio between the number of function words and all words (function words are words without semantic content; there is an fwords.txt in the repository containing ~700 Croatian function words); and the normalized sum of the three minimum TF-IDF scores (document = sentence).
I don't know the state of the code (it is more than a year old), but anyone is free to use it for anything they like.
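For illustration, here is a minimal Python sketch of that four-step pipeline (this is not the Bitbucket code; the equal feature weights and the direction of the function-word feature are my assumptions, and function_words would come from a list like fwords.txt):

    import math
    import re
    from collections import Counter

    def summarize(text, function_words, n_best=3):
        """Sketch of the pipeline: split, featurize, score and sum, rank."""
        # (1) Break the text into sentences.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        docs = [re.findall(r"\w+", s.lower()) for s in sentences]
        n_docs = len(docs)
        df = Counter(w for d in docs for w in set(d))  # document frequency (document = sentence)
        max_len = max(len(d) for d in docs) or 1

        def score(i, sentence, tokens):
            # (2) Extract the features described above (all weights 1.0, a guess).
            length = len(tokens) / max_len                 # normalized number of words
            kind = 1.0 if sentence.endswith(".") else 0.5  # sentence type: favor declarative
            order = 1.0 / (i + 1)                          # boost the first sentence
            fw = sum(t in function_words for t in tokens) / (len(tokens) or 1)
            tf = Counter(tokens)
            tfidf = sorted(tf[w] / (len(tokens) or 1) * math.log(n_docs / df[w]) for w in tf)
            low3 = sum(tfidf[:3]) / 3                      # normalized sum of 3 lowest TF-IDF
            # (3) Sum the feature scores (penalizing function-word-heavy sentences is a guess).
            return length + kind + order + (1.0 - fw) + low3

        ranked = sorted(
            ((score(i, s, d), s) for i, (s, d) in enumerate(zip(sentences, docs))),
            reverse=True,
        )
        # (4) Present the ranked sentences; choose the few best.
        return [s for _, s in ranked[:n_best]]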
As a PhD graduate in NLG, I wouldn't say NLG is a "young" research field. For example, the oldest NLG book I have is Eduard Hovy's PhD work on the PAULINE system ("Generating Natural Language Under Pragmatic Constraints"), which was published back in 1988. The seminal reference book for NLG ("Building Natural Language Generation Systems") was published back in 2000. What's made NLG more interesting recently is that the computing environment has changed considerably: we have considerably larger pools of time-series data than were available in the past, and we now also have a standardised data-to-text pipeline architecture for creating NLG applications for such data.
Nevertheless, I do agree there are still considerable challenges in trying to perform text-to-text generation, which involves combining NLP and NLG to abstract, interpret, and then summarise unstructured free text.
Is text summarization really mature now? It can be very handy; I don't always have time to read entire news articles. Found one more text summarizer; not sure how good it is -- http://pravin.paratey.com/nlp/summarization
I'm using 4 features of the article: title, sentence length, sentence position, and modified keyword frequency. The first three features are standard ones that you can see in most automatic summarization research. Modified keyword frequency considers not just the frequency, but also the distance of each keyword. :) There. That's just a brief explanation of how TextTeaser works.
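To make that more concrete, here's a simplified sketch of how those four signals could be combined (illustrative only: the weights, the "ideal" sentence length, and the plain keyword score are placeholders, and the keyword-distance modification is not shown):

    import re
    from collections import Counter

    def score_sentences(title, text, keyword_count=10, ideal_length=20):
        """Illustrative combination of the four features (placeholder weights)."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        title_words = set(re.findall(r"\w+", title.lower()))
        words = re.findall(r"\w+", text.lower())
        keywords = {w for w, _ in Counter(words).most_common(keyword_count)}

        scored = []
        for i, s in enumerate(sentences):
            tokens = re.findall(r"\w+", s.lower())
            if not tokens:
                continue
            # Feature 1: overlap with the title words.
            title_score = len(title_words & set(tokens)) / (len(title_words) or 1)
            # Feature 2: sentence length, peaking at an assumed "ideal" length.
            length_score = max(0.0, 1 - abs(ideal_length - len(tokens)) / ideal_length)
            # Feature 3: sentence position; earlier sentences score higher.
            position_score = 1 - i / len(sentences)
            # Feature 4: plain keyword frequency; the "modified" version would
            # also factor in the distance of each keyword, omitted here.
            keyword_score = sum(t in keywords for t in tokens) / len(tokens)
            total = (title_score + length_score + position_score + keyword_score) / 4
            scored.append((total, s))
        return sorted(scored, reverse=True)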
You mention in the article that you developed the algorithm as a part of writing your MSc thesis. Is your thesis available on the Internet (as in pdf or in any other format)?
I was reading about various topics related to summarization recently, and I'm just wondering if your "modified keyword frequency" is some type of ConcGram.