Monday, June 5, 2017

Detects Clickbait Headlines Using Deep Learning: Clickbait Detector

1:34 PM

People continually fall for clickbait. As Wired put it in its article: "Whether you think clickbait is on the rise, obscurant and self-negating, not such a big deal, or the root of all evil, one thing is clear about it: It’s increasingly hard to pin down."

A lot of editors use clickbait in an effort to manipulate you or grab your attention. The difference with clickbait is you’re often aware of this manipulation, and yet helpless to resist it. It’s at once obvious in its bait-iness, and somehow still effective bait. This small deep learning toolkit makes it easy to check whether a headline is clickbait.









Requirements

Python 2.7.12
Keras 1.2.1
Tensorflow 0.12.1
Numpy 1.11.1
NLTK 3.2.1


Getting Started

Install a virtualenv in the project directory:
    virtualenv venv
Activate the virtualenv.
On Windows:
    cd venv/Scripts
    activate
On Linux:
    source venv/bin/activate
Install the requirements:
    pip install -r requirements.txt
Try it out! Try running one of the examples.


Accuracy

Training Accuracy after 25 epochs = 93.8 % (loss = 0.1484)

Validation Accuracy after 25 epochs = 90.15 % (loss = 0.2670)
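
For reference, the two kinds of numbers reported above can be computed as in the sketch below. It assumes the loss is binary cross-entropy and that a headline counts as clickbait when its predicted probability is 0.5 or more; the source states neither, and the labels and probabilities here are made up for illustration.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy (assumed loss; not confirmed by the source)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(per_example))


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    y_true = np.asarray(y_true, dtype=float)
    predicted = np.asarray(y_prob, dtype=float) >= threshold
    return float(np.mean(predicted == (y_true == 1.0)))


# Made-up labels (1 = clickbait) and predicted probabilities.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.1, 0.4, 0.2]

loss_value = binary_crossentropy(labels, probabilities)
accuracy_value = accuracy(labels, probabilities)  # 3 of 4 correct -> 0.75
```

A lower loss with the same accuracy means the model is more confident on the examples it already gets right, which is why both numbers are worth tracking.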



Examples

$ python src/detect.py "Novak Djokovic stunned as Australian Open title defence ends against Denis Istomin"
Using TensorFlow backend.
headline is 0.33 % clickbaity
$ python src/detect.py "Just 22 Cute Animal Pictures You Need Right Now"
Using TensorFlow backend.
headline is 85.38 % clickbaity
$ python src/detect.py " 15 Beautifully Created Doors You Need To See Before You Die. The One In Soho Blew Me Away"
Using TensorFlow backend.
headline is 52.29 % clickbaity
$ python src/detect.py "French presidential candidate Emmanuel Macron's anti-system angle is a sham | Philippe Marlière"
Using TensorFlow backend.
headline is 0.05 % clickbaity
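
When the script is called from another program, its output line has to be turned back into a number. The sketch below parses the format shown in the runs above and applies a cut-off. Both `parse_score` and `is_clickbait` are hypothetical helpers, not part of the project, and the 50 % threshold is an assumption.

```python
import re

# Matches the output format shown in the examples above.
_SCORE_PATTERN = re.compile(r"headline is ([0-9]+(?:\.[0-9]+)?) % clickbaity")


def parse_score(output_line):
    """Extract the percentage from a line such as 'headline is 85.38 % clickbaity'."""
    match = _SCORE_PATTERN.search(output_line)
    if match is None:
        raise ValueError("unexpected output: %r" % (output_line,))
    return float(match.group(1))


def is_clickbait(score_percent, threshold=50.0):
    """Label a headline as clickbait at or above the (assumed) threshold."""
    return score_percent >= threshold


# Output lines copied from the example runs above.
outputs = [
    "headline is 0.33 % clickbaity",
    "headline is 85.38 % clickbaity",
    "headline is 52.29 % clickbaity",
    "headline is 0.05 % clickbaity",
]
verdicts = [is_clickbait(parse_score(line)) for line in outputs]
```

The third example (52.29 %) sits close to the cut-off, so a stricter threshold would flip its label.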


Data

The dataset consists of about 12,000 headlines, half of which are clickbait. The clickbait headlines were fetched from BuzzFeed, Newsweek, The Times of India, and The Huffington Post. The genuine (non-clickbait) headlines were fetched from The Hindu, The Guardian, The Economist, TechCrunch, The Wall Street Journal, National Geographic, and The Indian Express.

Some of the data came from peterldowns’s clickbait-classifier repository.



Pretrained Embeddings

The author used Stanford’s GloVe pretrained embeddings, reduced to 30 dimensions with PCA. This sped up the training.
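
The reduction step can be sketched roughly as follows. This is not the project's src/preprocess_embeddings.py: the function name `pca_reduce` is hypothetical, the PCA is done with a plain NumPy SVD, and random vectors stand in for the real GloVe matrix (whose original dimension the source does not state; 50 is assumed here).

```python
import numpy as np


def pca_reduce(embeddings, n_components=30):
    """Project row vectors onto their top principal components."""
    # PCA requires centered data.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(vt[:n_components].T)


# Stand-in for a (vocabulary_size, embedding_dim) GloVe matrix.
rng = np.random.RandomState(0)
fake_glove = rng.normal(size=(1000, 50))

reduced = pca_reduce(fake_glove, n_components=30)  # shape (1000, 30)
```

Smaller vectors mean a smaller embedding layer and fewer weights downstream, which is consistent with the speed-up noted above, at the cost of discarding the low-variance directions.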



Improving accuracy

To improve accuracy:

Increase the embedding layer dimension (currently 30) – src/preprocess_embeddings.py
Use more data
Increase the vocabulary size – src/preprocess_text.py
Increase the maximum sequence length – src/train.py
Clean the data more thoroughly
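
To make two of the knobs above concrete, here is a minimal sketch of how a vocabulary size and a maximum sequence length shape the model's input. It is not the project's preprocessing code: `build_vocabulary` and `encode` are hypothetical names, tokenization is a plain whitespace split, and reserving index 0 for padding and 1 for unknown words is an assumption.

```python
from collections import Counter

import numpy as np

PAD_ID = 0      # assumed padding index
UNKNOWN_ID = 1  # assumed index for out-of-vocabulary words


def build_vocabulary(headlines, vocabulary_size):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for h in headlines for word in h.lower().split())
    kept = [word for word, _ in counts.most_common(vocabulary_size - 2)]
    return dict((word, index + 2) for index, word in enumerate(kept))


def encode(headline, vocabulary, max_sequence_length):
    """Turn a headline into a fixed-length id array, left-padded with PAD_ID."""
    ids = [vocabulary.get(word, UNKNOWN_ID) for word in headline.lower().split()]
    ids = ids[:max_sequence_length]  # words past the limit are dropped
    padding = [PAD_ID] * (max_sequence_length - len(ids))
    return np.array(padding + ids, dtype="int32")


# Made-up headlines for illustration.
headlines = [
    "you need to see this",
    "you will not believe this",
    "markets fall",
]
vocabulary = build_vocabulary(headlines, vocabulary_size=4)
encoded = encode("you need this now", vocabulary, max_sequence_length=6)
```

With a vocabulary of 4, only two real words survive and every other word collapses to the unknown id; with a short maximum length, the tail of a long headline is lost. Raising either limit lets the model see more of the text.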
