Fake news predictor

Dec 8, 2021

Project Overview

We now receive a large volume of news through many different channels, and a good share of it is fake or misleading. In this project I built a machine learning model that predicts whether a news item is real or fake. I used a dataset from Kaggle to train and evaluate it.

Introduction

Before social media, the volume of news was limited, and so was the volume of fake news. Today, with the sheer amount of news generated on social media every day, there is a real danger of distorting reality and misleading social media users. In this project I try to classify news as real or fake from the news title, using a machine learning approach.

Problem Statement

As mentioned above, there is now so much news that people have a hard time knowing whether what they read is real or fake, and whether a given source is trustworthy. This project tackles that problem by building a machine learning model that classifies a news item as real or fake based on its title.

Metrics

Because this is a classification problem, I chose accuracy as the metric. Accuracy works well when the classes are roughly balanced, and this dataset is fairly balanced: 21,417 rows of true news and 23,481 rows of fake news.

Accuracy: a common metric for binary classifiers. It gives equal weight to true positives and true negatives, and its formula is:

Accuracy = (true positives + true negatives) / dataset size
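To make the formula concrete, here is a minimal sketch in plain Python. The toy labels are made up for illustration; the baseline at the end uses the row counts quoted above and shows what a model that always answers "fake" would score, which is the number any real model has to beat.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (made up): 1 = true news, 0 = fake news
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
toy_accuracy = accuracy(y_true, y_pred)  # 4 of the 6 predictions are correct

# Majority-class baseline from the row counts in the text:
# always predicting "fake" is right for 23481 of the 44898 rows.
majority_baseline = 23481 / (21417 + 23481)
```

With classes this close to balanced, the baseline sits only a little above 0.5, so a high accuracy is a meaningful result here.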

Business Understanding

Which news subjects contribute most to fake news?
Is there a particular time frame in which fake news increases, or is the increase random?
Build a machine learning model that predicts whether news is fake or real based on the news title, then deploy it as a Flask web application.

Data Understanding

Glance at the first five rows of (true & fake) dataframes

The columns in our dataframe

Shapes of our dataframes

Data Preparation

I noticed duplicate entries in both dataframes (fake and true), so I removed them.

The dataset came as two CSV files, one for fake news and one for true news. I combined them and added a column called label to indicate the class (label = 1 for true news, label = 0 for fake news).

A function that tokenizes the text in our dataframe, for use in the NLP model
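The two preparation steps above can be sketched with pandas as follows. The tiny dataframes are made-up stand-ins for the two Kaggle CSV files, and the variable names (fake, true, df) are my own choice rather than names taken from the original notebook.

```python
import pandas as pd

# Made-up stand-ins for the fake and true news files (illustration only)
fake = pd.DataFrame(
    {"title": ["Aliens run city hall", "Aliens run city hall", "Moon is cheese"]}
)
true = pd.DataFrame({"title": ["Senate passes budget", "Court hears appeal"]})

# Remove exact duplicate rows from each dataframe
fake = fake.drop_duplicates()
true = true.drop_duplicates()

# Label the classes: 0 = fake news, 1 = true news
fake["label"] = 0
true["label"] = 1

# Stack the two dataframes into one, with a fresh 0..n-1 index
df = pd.concat([fake, true], ignore_index=True)
```

With the real data, the two dataframes would come from pd.read_csv on the downloaded files instead of being built by hand.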
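The original tokenize function is not shown in the text, so the following is only a minimal sketch of what such a function could look like. It assumes a simple scheme (lowercase the text, keep runs of letters and digits) and uses only the standard library; the author's version may well do more, such as removing stop words or lemmatizing.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("BREAKING: Senate passes the 2021 budget!")
# tokens == ['breaking', 'senate', 'passes', 'the', '2021', 'budget']
```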

Data Modeling

Split the data into features (news title) and labels (label), then split it into training and testing sets

Create a pipeline that contains a transformer (CountVectorizer) and a classifier (LogisticRegression), then fit the pipeline to the training set

Calculate the accuracy score (our chosen metric) on the training and testing sets

The classifier did well on both the training and testing sets, but I still tried some refinement using grid search.
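The three modeling steps above (split, fit a pipeline, score) can be sketched with scikit-learn like this. The titles are made up, the split ratio and random_state are my assumptions, and the sketch uses CountVectorizer's default tokenizer for brevity where the original passes a custom tokenize function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up titles for illustration: 1 = true news, 0 = fake news
titles = [
    "Senate passes annual budget bill",
    "Court hears appeal in trade case",
    "Central bank holds interest rates steady",
    "Minister announces new transport plan",
    "Shocking secret they do not want you to know",
    "You will not believe what this celebrity did",
    "Miracle cure hidden by doctors revealed",
    "Aliens secretly control the world government",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the rows for testing, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, random_state=42, stratify=labels
)

# Bag-of-words counts feed a logistic regression classifier
pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("clf", LogisticRegression(max_iter=500)),
    ]
)
pipeline.fit(X_train, y_train)

train_acc = accuracy_score(y_train, pipeline.predict(X_train))
test_acc = accuracy_score(y_test, pipeline.predict(X_test))
```

On a toy set this small the scores mean nothing; the point is the shape of the workflow, where the vectorizer is fitted on the training titles only and then reused unchanged on the test titles.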

Refinement

I used GridSearchCV to refine the model, searching over two parameters of the classifier (LogisticRegression).

The scores were similar to those of the previous model, so the model appears to be close to the best this setup can achieve.
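A grid search over the two parameters discussed later in this post (max_iter and penalty) might look like the sketch below. The data is made up, cv=2 is chosen only because the toy set is tiny, and solver="liblinear" is my assumption: the post does not say which solver was used, but scikit-learn's default lbfgs solver does not support the 'l1' penalty, so a solver such as liblinear or saga is needed for both grid values to fit.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up titles for illustration: 1 = true news, 0 = fake news
titles = [
    "Senate passes annual budget bill",
    "Court hears appeal in trade case",
    "Central bank holds interest rates steady",
    "Minister announces new transport plan",
    "Shocking secret they do not want you to know",
    "You will not believe what this celebrity did",
    "Miracle cure hidden by doctors revealed",
    "Aliens secretly control the world government",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        # liblinear is an assumption; it supports both 'l1' and 'l2'
        ("clf", LogisticRegression(solver="liblinear")),
    ]
)

# "clf__" routes each parameter to the pipeline step named "clf"
param_grid = {
    "clf__max_iter": [500, 1000],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(titles, labels)

best_params = search.best_params_
best_model = search.best_estimator_
```

If a model already converges within 500 iterations, raising max_iter to 1000 cannot change the result, which is one plausible reason a grid like this yields scores close to the untuned model.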

And our best estimator is:

Data Visualization

Question One: Which news subjects contribute most to fake news?

General news, politics, and left-news appear to be the most frequently faked subjects, accounting for a larger share than the other subjects.

Question Two: Is there a particular time frame in which fake news increases, or is the increase random?

Fake news seems to spread more in May: the ten most frequent dates in our date column all fall in May.
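Counts like these can be obtained with a one-line pandas aggregation. The sketch below assumes the fake news dataframe has a column named subject; the rows are made up and do not reflect the real proportions.

```python
import pandas as pd

# Made-up subject values for illustration only
fake = pd.DataFrame(
    {
        "subject": [
            "News", "politics", "News", "left-news",
            "politics", "News", "US_News",
        ]
    }
)

# Number of fake articles per subject, most frequent first
subject_counts = fake["subject"].value_counts()

# The same information as a share of all fake articles
subject_share = fake["subject"].value_counts(normalize=True)
```

Plotting subject_counts as a bar chart gives the kind of figure this question calls for.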
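One way to check a pattern like this is to parse the date column and count articles both per day and per month. The sketch below uses made-up dates and assumes they are written like "May 10, 2017"; rows that do not match that format become missing values and are dropped. Counting per month over all rows is a more direct test of the seasonal claim than looking only at the ten busiest days.

```python
import pandas as pd

# Made-up dates for illustration, including one unparseable entry
fake = pd.DataFrame(
    {
        "date": [
            "May 10, 2017",
            "May 10, 2017",
            "May 26, 2016",
            "June 1, 2017",
            "not a date",
        ]
    }
)

# Parse the strings; anything that does not match the format becomes NaT
parsed = pd.to_datetime(fake["date"], format="%B %d, %Y", errors="coerce")
parsed = parsed.dropna()

# The most frequent individual dates (up to ten)
top_dates = parsed.value_counts().head(10)

# Article counts aggregated by month name
by_month = parsed.dt.month_name().value_counts()
```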

Model Deployment

I used pickle to export the model, then used the Flask framework to deploy it as a web app.

Save the model as a pickle (pkl) file to deploy it in the web app

Our Flask app contains two files:

1-app.py: loads the model from the (pkl) file and makes the prediction

2-index.html: displays the result returned by the predictive model to the user as the user interface (UI)
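Exporting a fitted pipeline with pickle and loading it back can be sketched as follows. The file name model.pkl and the toy training data are my assumptions; the sketch writes to a temporary directory so it does not touch the project folder.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up titles for illustration: 1 = true news, 0 = fake news
titles = [
    "Senate passes annual budget bill",
    "Court hears appeal in trade case",
    "Shocking secret they do not want you to know",
    "Miracle cure hidden by doctors revealed",
]
labels = [1, 1, 0, 0]

model = Pipeline(
    [("vect", CountVectorizer()), ("clf", LogisticRegression(max_iter=500))]
)
model.fit(titles, labels)

# Write the whole fitted pipeline (vectorizer and classifier) to disk
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical name
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as the web app would do at startup
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

Pickling the whole pipeline rather than the classifier alone means the web app can pass raw titles straight to predict. Two caveats apply: a pipeline built with a custom tokenizer function needs that function to be importable where the pickle is loaded, and pickle files should only be loaded from sources you trust.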
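The original app.py and index.html are not reproduced in the text, so the following is only a rough sketch of how such an app could be wired together. To stay self-contained it trains a tiny stand-in model on made-up titles instead of loading the (pkl) file, and it inlines the page with render_template_string instead of using a separate index.html. The route names, the form field name title, and the result wording are all my assumptions.

```python
from flask import Flask, render_template_string, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in model trained on made-up titles (1 = true news, 0 = fake news).
# The real app would instead load the pickled pipeline at startup.
model = Pipeline(
    [("vect", CountVectorizer()), ("clf", LogisticRegression(max_iter=500))]
)
model.fit(
    [
        "Senate passes annual budget bill",
        "Court hears appeal in trade case",
        "Shocking secret they do not want you to know",
        "Miracle cure hidden by doctors revealed",
    ],
    [1, 1, 0, 0],
)

# Inline stand-in for index.html: a form plus an optional result line
PAGE = """
<form method="post" action="/predict">
  <input type="text" name="title" placeholder="Enter a headline">
  <button type="submit">Check</button>
</form>
{% if result %}<p>Prediction: {{ result }}</p>{% endif %}
"""

app = Flask(__name__)


@app.route("/")
def index():
    # Show the empty form
    return render_template_string(PAGE, result=None)


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted title and classify it with the model
    title = request.form.get("title", "")
    label = int(model.predict([title])[0])
    result = "Real" if label == 1 else "Fake"
    return render_template_string(PAGE, result=result)
```

Saved as app.py, this can be started locally with the command flask --app app run, after which the form is served at the root URL.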

Screenshot from the web app

Model Evaluation and Validation

I used CountVectorizer as the transformer because we are dealing with an NLP model, and Logistic Regression as the predictive model because it works well for binary classification, is fairly simple, and has a low computational cost compared with other models (e.g., Random Forest). It also performed well in the prediction task.

I tuned two parameters of Logistic Regression:

1-max_iter: with values of 500 and 1000. max_iter is the “Maximum number of iterations taken for the solvers to converge.” It caps how many iterations the solver may run while fitting; a larger value gives the solver more room to converge.

2-penalty: with values of ‘l1’ and ‘l2’. penalty specifies which regularization term is added to the loss. I used the L1 and L2 regularization penalties to keep the model from overfitting.

I set one parameter in CountVectorizer, tokenizer, and passed it the tokenize function we created earlier.

Result Evaluation & Conclusion

We now receive a large volume of news through many different channels, and a good share of it is fake or misleading.
In this project I built a machine learning model that predicts whether news is real or fake, to help keep people from being misled by fake news.
I used accuracy as the scoring metric for the model because the classes are fairly balanced, and accuracy works well on balanced datasets.
We explored the dataset, then prepared it in several ways so that it could be fed to the model.
We were able to build a machine learning model that predicts whether news is fake or true based on its title. It scored almost 98% accuracy on the training set and 95% accuracy on the testing set. We then refined the model using GridSearchCV and did not notice any difference.
Our first question was “Which news subjects contribute most to fake news?”, and the answer was that general news, politics, and left-news appear to be the most frequently faked subjects, accounting for a larger share than the other subjects.
Our second question was “Is there a particular time frame in which fake news increases, or is the increase random?”, and the answer was that fake news seems to spread more in May, since the ten most frequent dates in our date column all fall in May.
Finally, I deployed the model as a web application, using pickle to export the model and the Flask framework to serve it.

Improvements

No solution is ever perfect, and there is always room for improvement. Two points could be improved in the future:

1-We could acquire more news data to make the model more accurate and better at generalizing

2-We could use news in multiple languages (e.g., Arabic, French, etc.) instead of English only, to make the model more usable internationally