Deep Learning on Title and Content Features to Tackle Clickbait

hbex5 · on March 13, 2017

The problem is, we don't have a clear definition of what clickbait is.

nouninformal (on the Internet) content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page.

But that's basically everything on the web.

BooglyWoo · on March 13, 2017

I'm not sure that quite addresses the problem here.

After all there is no clear definition of 'what dogs look like' (in the sense of a collection of logical rules), but deep learning models excel at detecting them, when provided with enough positive examples.

If it's possible for humans to agree on whether a given article is clickbait or not, we should be able to put together an adequate dataset for training a system to classify them too. From the linked article I am unable to discern how the training dataset was labelled.

In other words, the fact that 'clickbait' is a nebulous concept shouldn't preclude machine learning from being able to detect it.

astrodust · on March 13, 2017

Just as "dogness" is a factor, so is "clickbaityness". You're right, this is all about thresholds.

kristianc · on March 13, 2017

I often wonder what Wittgenstein would have made of today's models of machine learning / deep learning https://en.m.wikipedia.org/wiki/Family_resemblance

BooglyWoo · on March 14, 2017

Me too. My reading of the Blue and Brown books led me to believe Wittgenstein's conception of meaning is inextricably tied up with the notion of "learning" and exposure to language and its use. Rather than meaning being contingent on 'hard' logico-mathematical derivations of formal semantics.

This contrast seems somewhat reminiscent of the complementary approaches of hard-coded rule based AI vs machine learning.

trevyn · on March 13, 2017

The core of the definition is subtly wrapped in "main purpose" -- once the attention is attracted and the link is clicked, the clickbait's job is done. So the content of the article will be lower quality and less intellectually satisfying than non-clickbait articles.

For example, if you charted "interest on clicking this link text" vs "satisfaction with article after reading", I think clickbait would be clearly in the high interest vs low satisfaction quadrant.

narrator · on March 13, 2017

One problem with that definition is that intellectual satisfaction is disproportionately affected by confirmation bias. For instance, an article entitled "10 reasons why <politician> getting elected means the end of America" might be clickbait to some, but not others depending on what <politician> contains.

imanewsman · on March 13, 2017

In 2014, Jon Stewart offered an interesting definition of clickbait:

"I scroll around, but when I look at the internet, I feel the same as when I’m walking through Coney Island. It’s like carnival barkers, and they all sit out there and go, 'Come on in here and see a three-legged man!' So you walk in and it’s a guy with a crutch."

The thing is, he was talking about BuzzFeed when he said that, and that is not what BuzzFeed does at all. BuzzFeed's editor wrote about the distinction here, and it's the most insightful article I've read on the topic:

https://www.buzzfeed.com/bensmith/why-buzzfeed-doesnt-do-cli...

People tend to consider things like lists clickbait, even though those articles usually deliver exactly what the headline suggests. (If you click on "23 photos of kittens that are just too adorable," that is what you will get.) But because it's an article that was made specifically to get traffic, people incorrectly call it clickbait.

And it often goes even further than that. On Reddit and Hacker News, commenters constantly call articles clickbait. Sometimes it's true, and there's a sensational headline that leads to a bullshit story. But just as often, the story delivers on what the headline promises, but commenters call it clickbait because the headline is slightly hyperbolic, snappy, or just plain well-written.

hyperpape · on March 13, 2017

I would define clickbait as articles which intentionally try to disguise what you'll get out of reading them. The information is banal, but the headline makes it out to be revolutionary or shocking.

You might disagree with the details of the formulation, but I think there's pretty broad agreement that something similar is going on with clickbait.

richdougherty · on March 13, 2017

I guess it's an inherently fuzzy concept, so quite a good fit for machine learning.

But my definition of clickbait is any link I follow where I feel like I've been tricked into the click. The link looked interesting, but I feel regret once I see the actual content.

Cthulhu_ · on March 13, 2017

A definition would need a bit more fleshing out, mostly about the (lack of) actual content; a long-winded page (not just text) that eventually leads to the core, which could be summarised in one line, even the article title itself. (like 'peanut butter is made out of peanuts' instead of "you'll NEVER guess this ONE SECRET peanut butter ingredient!"

astrodust · on March 13, 2017

Is that a deep buried lede with a teaser headline?

The thing that irritates me the most is these absurd ads that make bold claims and never deliver on what they're advertising. Even if you're interested in what they're offering, you're willing to take the bait, it's a lost cause, they never fulfill their promise. It's just carousels of bullshit jammed full of more ads.

visarga · on March 13, 2017

I'd like a system to filter out fluff threads on reddit. It would reject easy-consumption content such as images, gifs and short vids, or anything shorter than 60 seconds; also, low quality comments (short, aggressive, memes, etc).

Reddit is a gold-mine of interesting content, but it is flooded with fluff and garbage to the point where it becomes a problem to find the good parts.

I'm wondering why they don't use more machine learning magic on the site. There are multiple machine learning papers based off the reddit comment corpus.

arkitaip · on March 13, 2017

Vanilla Reddit is almost garbage because of how default subreddit posts take over your front page.

What you need to do is to unsub from all default subreddits, subscribe to niche ones you like and use Reddit Enhancement Suite (RES) [1] to contain the default subreddits to what RES calles the Dashboard (basically a page where you can add lots of subreddits as individual widgets).

[1] https://redditenhancementsuite.com/

make3 · on March 13, 2017

your best bet is to filter out meme subs

visarga · on March 13, 2017

There is also a need to find interesting threads outside a known list of subs, or to filter out some parts of otherwise good subs.

baxuz · on March 13, 2017

You could add any title that's formulated as an imperative. "You won't believe..." "Guess which..." "You should..."

Also titles that are formulated as a simple subject - predicate - object sentence: "XY considered anti-pattern" "Trump is right" "Hitler did nothing wrong" "Drunk girl shows tits" "Homeopathy is the future of medicine"

Same works if formulated as a question: "Is Trump right?" "Has Hitler done nothing wrong?" "Is homeopathy the future of medicine?"

Bonus points for exclamation marks, pound signs and uppercase words.

minimaxir · on March 13, 2017

I wrote the original article visualizing clickbait from scraped Facebook data: http://minimaxir.com/2016/08/clickbait-cluster/

Yes, there are obvious tropes of clickbait. Facebook, however, is cracking down on them, so there's been a slight brinksmanship between "how do I get people to click articles without following the tropes?"

From the visualization in my article, you can see there is a spatial blend between sources like the NYT and BuzzFeed when subjects like kids and Pokemon are brought up.

abhisvnit · on March 13, 2017

The point that the article is trying to explain is clickbaits cant just be classified only by using these titles. The content of the webpages also plays a big role :)

mtgx · on March 13, 2017

Not all clickbait headlines are written like that.

For instance: "Russia hacked US power grid" doesn't have any of those, and yet it was a completely clickbait/sensationalist/borderline fake news headline from WashPost. How is AI going to deal with those?

https://theintercept.com/2016/12/31/russia-hysteria-infects-...

hyperpape · on March 13, 2017

That wasn't clickbait. Arguably it was worse. "You'll never guess what happens when she starts to sing!" isn't likely to contribute to increased military tensions between nuclear powers.

To put it as a triviality: just because two things are bad doesn't mean they have to be bad in the same way.

I also wouldn't classify that story as "fake news"[0]. Those were things like "Revealed: Obama says Clinton would be terrible president", or "Revealed: Trump under investigation by European Court for Human Rights". Those were straightforward false claims, with zero actual sourcing, by people who knew they were lying. This Washington Post article was shitty reporting, using thin sources, that fit a currently popular hysteria. And it was completely inaccurate. But the authors didn't sit down and say "what can we make up." They got some sources and didn't do any due diligence, because it was too hard to pass up on such a juicy story.

I'm not wedded to the idea that these articles aren't fake news, but I'm confident it doesn't make sense to call them clickbait.

[0] Of course, this relies on the idea that fake news doesn't just mean "news that is wrong", which has been with us forever, but more about a social media driven trend within the past year or two.

joosters · on March 13, 2017

Simple filter for tech articles:

$clickbait = /Deep/;

jj12345 · on March 13, 2017

Thanks for the nice, condensed article. Generating features from BeautifulSoup isn't something that I've considered before.

I'm still going through Yoshua Bengio's new book on DL, but if anyone is free to comment: what are the justifications for the general architecture? Why use LSTMs with the Glove embeddings?

volker48 · on March 13, 2017

Seems like everyone uses the glove embeddings for any text based DL project.

hikkigaya · on March 13, 2017

All I see is that the author uses deep learning to distinguish post published by Buzzfeed, clickhole, upworthy and stopclickbaitofficial v.s. the other pages?

samirahmed · on March 13, 2017

yes - i am skeptical on how generalizable the final model is - given lots of features (numerical and text) are closely linked to same domain.

abhisvnit · on March 13, 2017

code available here: https://github.com/abhishekkrthakur/clickbaits_revisited

empath75 · on March 13, 2017

I'm not sure how he can define what clickbait is and what's not.

The NY Times isn't immune to publishing clickbait, and buzzfeed sometimes posts really solid journalism.

matrix2596 · on March 13, 2017

Crowdsource is an amazing thing to do