Fake news predictor
Project Overview
We now receive a large volume of news through many different channels, and a lot of it is fake or unreliable. In this project I built a machine learning model that predicts whether a news item is real or fake. I used a dataset from Kaggle to train and evaluate it.
Introduction
Before social media, the amount of news was limited, and so was the amount of fake news. Today, with the large volume of news generated on social media every day, there is a real danger of distorting reality and misleading social media users. In this project I try to classify news as real or fake from the news title, using a machine learning approach.
Problem Statement
As mentioned above, people face a large volume of news and find it hard to tell whether what they read is real or fake, or whether a source is trustworthy. This project addresses that problem by building a machine learning model that classifies a news item as real or fake based on its title.
Metrics
This is a classification problem, so I chose accuracy as the metric. Accuracy works well when the classes are balanced, and our dataset is fairly balanced: 21,417 rows of true news and 23,481 rows of fake news.
Accuracy: a common metric for binary classifiers. It gives equal weight to true positives and true negatives, and its formula is:
Accuracy = (true positives + true negatives) / dataset size
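As a small illustration of the formula above, here is a minimal sketch in plain Python. The `accuracy` helper and the toy label lists are made up for this example. The last lines use the row counts quoted above to show what a classifier that always answers "fake" would score, which is a useful floor to compare the model against.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (hypothetical): 1 = true news, 0 = fake news.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75

# Majority-class baseline from the row counts quoted above.
n_true, n_fake = 21417, 23481
baseline = n_fake / (n_true + n_fake)
print(round(baseline, 3))  # 0.523
```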
Business Understanding
- Which news subjects contribute most to fake news?
- Does fake news increase in a particular time frame, or does it increase randomly?
- Build a machine learning model that predicts whether a news item is fake or real based on its title, then deploy it in a Flask web application
Data Understanding
Glance at the first five rows of (true & fake) dataframes
The columns in our dataframe
Shapes of our dataframes
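The three inspection steps above might look roughly like the sketch below. The CSV content is a tiny in-memory stand-in so the snippet runs on its own, and the column names (`title`, `text`, `subject`, `date`) are an assumption about the Kaggle files. In the real project `pd.read_csv` would be pointed at the downloaded files instead.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (not the real data).
fake_csv = io.StringIO(
    'title,text,subject,date\n'
    'Aliens endorse candidate,Body A,News,"May 10, 2017"\n'
    'Moon sold to investor,Body B,politics,"May 11, 2017"\n'
)
true_csv = io.StringIO(
    'title,text,subject,date\n'
    'Senate passes budget bill,Body C,politicsNews,"May 12, 2017"\n'
)

fake_df = pd.read_csv(fake_csv)
true_df = pd.read_csv(true_csv)

print(fake_df.head())                # first five rows (fewer here)
print(true_df.head())
print(list(fake_df.columns))         # column names
print(fake_df.shape, true_df.shape)  # (rows, columns) of each dataframe
```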
Data Preparation
I noticed duplicate entries in both dataframes (fake and true), so I removed them
The dataset came as two CSV files, one for fake news and one for true news, so I combined them and added a column called (label) to indicate whether a row is fake or true (label -> 1 = true news, label -> 0 = fake news)
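A minimal sketch of these two preparation steps with pandas, assuming each dataframe has a `title` column. The toy rows and the name `news_df` are invented for illustration.

```python
import pandas as pd

# Toy dataframes (hypothetical rows); note the repeated fake title.
fake_df = pd.DataFrame(
    {"title": ["Aliens endorse candidate",
               "Aliens endorse candidate",
               "Moon sold to investor"]}
)
true_df = pd.DataFrame(
    {"title": ["Senate passes budget bill",
               "Central bank holds rates"]}
)

# Drop exact duplicate rows, then tag each source with its label:
# 0 = fake news, 1 = true news.
fake_df = fake_df.drop_duplicates().assign(label=0)
true_df = true_df.drop_duplicates().assign(label=1)

# Stack the two dataframes into one and renumber the index.
news_df = pd.concat([fake_df, true_df], ignore_index=True)
print(news_df)
print(news_df["label"].value_counts())
```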
A function that tokenizes the text in our dataframe, for use in the NLP model
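The project's own function is not reproduced here, so the following is only a sketch of what a simple tokenizer could look like. It uses a regular expression from the standard library. The actual implementation may well differ, for example by using an NLP library and lemmatization.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("BREAKING: Senate passes the 2017 budget bill!"))
# ['breaking', 'senate', 'passes', 'the', '2017', 'budget', 'bill']
```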
Data Modeling
Split the data into features (news title) and labels (label), then into training and testing sets
Create the pipeline, which contains a transformer (Count Vectorizer) and a classifier (Logistic Regression), then fit it to the training set
Calculate the accuracy score (our chosen metric) on the training and testing sets
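The three steps above can be sketched with scikit-learn as follows. The titles and labels are a handful of invented examples, so the printed scores say nothing about the real dataset. The split ratio, `random_state`, and step names (`vect`, `clf`) are assumptions, and the default tokenization is used here for brevity instead of the custom tokenize function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = fake news, 1 = true news.
titles = [
    "Aliens endorse candidate in shocking video",
    "You won't believe what this senator did",
    "Secret cure doctors hide from you",
    "Celebrity says moon landing was staged",
    "Shocking video exposes hidden truth",
    "Watch: politician caught in bizarre rant",
    "Senate passes budget bill after debate",
    "Central bank holds interest rates steady",
    "Court upholds ruling on trade tariffs",
    "Ministers meet to discuss climate accord",
    "Parliament votes on new tax bill",
    "Officials confirm date for trade talks",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Features are the titles, targets are the labels.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, random_state=42, stratify=labels
)

# Bag-of-words transformer followed by the classifier.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

train_acc = accuracy_score(y_train, pipeline.predict(X_train))
test_acc = accuracy_score(y_test, pipeline.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```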
The classifier achieved high accuracy on both the training and testing sets, but I will still try some refinement using grid search
Refinement
I used GridSearchCV to refine the model, searching over two parameters of the classifier (Logistic Regression)
The score was similar to that of the previous model, so the model appears to be close to its best within the searched parameters
And our best estimator is:
Data Visualization
Question One: Which news subjects contribute most to fake news?
General news, politics news, and left-news appear to be the most frequently faked subjects, making up a larger portion than the other subjects
Question Two: Does fake news increase in a particular time frame, or does it increase randomly?
Fake news seems to spread more in May: the ten most frequent dates in our date column all fall in May
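The counts behind both answers can be computed along these lines. This is a sketch on invented rows, assuming the fake-news dataframe has `subject` and `date` columns and that dates are written like "May 10, 2017".

```python
import pandas as pd

# Toy fake-news rows (hypothetical values).
fake_df = pd.DataFrame({
    "subject": ["News", "politics", "News", "left-news", "News", "politics"],
    "date": ["May 10, 2017", "May 10, 2017", "May 11, 2017",
             "June 12, 2016", "May 10, 2017", "April 13, 2016"],
})

# Question one: how many fake articles fall under each subject.
subject_counts = fake_df["subject"].value_counts()
print(subject_counts)

# Question two: the most frequent individual dates ...
date_counts = fake_df["date"].value_counts().head(10)
print(date_counts)

# ... and the same question asked per month over the whole column.
# Unparseable dates become NaT and are ignored by value_counts.
months = pd.to_datetime(
    fake_df["date"], format="%B %d, %Y", errors="coerce"
).dt.month_name()
month_counts = months.value_counts()
print(month_counts)
```

Counting by month over the whole column is a somewhat firmer check than looking only at the ten most frequent dates. With matplotlib installed, `subject_counts.plot(kind="bar")` would draw the corresponding bar chart.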
Model Deployment
I used pickle to export the model, then used the Flask framework to deploy it in a web app
Save the model as a pickle (pkl) file to deploy it in the web app
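Exporting and reloading a fitted pipeline with pickle could look like the sketch below. The tiny model and the file name `model.pkl` are placeholders, and the file is written to a temporary directory so the snippet has no side effects.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in model (hypothetical data): 0 = fake news, 1 = true news.
titles = [
    "Shocking video exposes hidden truth",
    "Secret cure doctors hide from you",
    "Senate passes budget bill after debate",
    "Central bank holds interest rates steady",
]
labels = [0, 0, 1, 1]
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(titles, labels)

# Write the fitted pipeline to disk ...
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)

# ... and read it back, as the web app would at start-up.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = list(restored.predict(titles)) == list(pipeline.predict(titles))
print(same)  # True
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.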
Our Flask app contains two files
1-app.py: Loads the model from the (pkl) file and makes the prediction
2-index.html: The user interface (UI), which displays the result returned by the predictive model to the user
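The project's app.py and index.html are not reproduced here, so the following is only a rough sketch of how the two pieces could fit together. To stay self-contained it trains a tiny stand-in model instead of loading the pickle file and uses an inline template in place of index.html. The route, form field name `title`, and result wording are all assumptions.

```python
from flask import Flask, render_template_string, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in for loading the pickled model (hypothetical toy data).
model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(
    ["Shocking video exposes hidden truth",
     "Secret cure doctors hide from you",
     "Senate passes budget bill after debate",
     "Central bank holds interest rates steady"],
    [0, 0, 1, 1],  # 0 = fake news, 1 = true news
)

# Stand-in for index.html: a form plus a slot for the result.
PAGE = """
<form method="post">
  <input name="title" placeholder="News title">
  <button type="submit">Check</button>
</form>
{% if result %}<p>Prediction: {{ result }}</p>{% endif %}
"""

app = Flask(__name__)


@app.route("/", methods=["GET", "POST"])
def index():
    result = None
    if request.method == "POST":
        title = request.form.get("title", "").strip()
        if title:
            label = int(model.predict([title])[0])
            result = "real" if label == 1 else "fake"
    return render_template_string(PAGE, result=result)


# Exercise the route without starting a server.
with app.test_client() as client:
    response = client.post("/", data={"title": "Senate passes budget bill"})
    print(response.status_code)  # 200
```

To serve the page locally, `app.run()` would be called instead of using the test client.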
Screenshot from the web app
Model Evaluation and Validation
I used Count Vectorizer as the transformer, because we are dealing with an NLP model, and Logistic Regression as the predictive model, because it works well for binary classification, is fairly simple, and has a low computational cost compared with other models (e.g., Random Forest). It also performed well in the prediction process
I tuned two parameters of Logistic Regression:
1-max_iter: with values of 500 and 1000. max_iter is the “Maximum number of iterations taken for the solvers to converge.”, so it caps how many iterations the solver may run while fitting the model
2-penalty: with values of ‘l1’ and ‘l2’. penalty specifies the type of regularization applied to the model's coefficients; I used the L1 and L2 regularization penalties to help prevent the model from overfitting
I set one parameter in the Count Vectorizer, tokenizer, and passed it the function we created earlier (tokenize)
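The parameter choices described above could be expressed as a grid search roughly like this. It is a sketch on invented titles, with assumed step names (`vect`, `clf`) and an assumed 3-fold cross-validation. One further assumption is `solver="liblinear"`: scikit-learn's default `lbfgs` solver does not support the L1 penalty, so a solver that handles both penalties is needed for the grid to fit every combination. The default tokenization is used here instead of the custom tokenize function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = fake news, 1 = true news.
titles = [
    "Aliens endorse candidate in shocking video",
    "You won't believe what this senator did",
    "Secret cure doctors hide from you",
    "Celebrity says moon landing was staged",
    "Shocking video exposes hidden truth",
    "Watch: politician caught in bizarre rant",
    "Senate passes budget bill after debate",
    "Central bank holds interest rates steady",
    "Court upholds ruling on trade tariffs",
    "Ministers meet to discuss climate accord",
    "Parliament votes on new tax bill",
    "Officials confirm date for trade talks",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {
    "clf__max_iter": [500, 1000],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3)
search.fit(titles, labels)

print(search.best_params_)
print(search.best_score_)      # mean cross-validated accuracy
print(search.best_estimator_)  # pipeline refit with the best parameters
```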
Result Evaluation & Conclusion
- We now receive a large volume of news through many different channels, and a lot of it is fake or unreliable
- In this project I built a machine learning model that predicts whether a news item is real or fake, to help keep people from being misled by fake news
- I used accuracy as the scoring metric for the model because the classes are fairly balanced, and accuracy works well on balanced datasets
- We explored the dataset, then prepared it in several steps so that it works with the model
- We built a machine learning model that predicts whether a news item is fake or true based on its title. It scored almost 98% accuracy on the training set and 95% on the testing set. We then refined the model using GridSearchCV and did not notice any difference
- Our first question was “Which news subjects contribute most to fake news?”, and the answer was “General news, politics news, and left-news appear to be the most frequently faked subjects, making up a larger portion than the other subjects”
- Our second question was “Does fake news increase in a particular time frame, or does it increase randomly?”, and the answer was “Fake news seems to spread more in May: the ten most frequent dates in our date column all fall in May”
- Finally, I deployed the model in a web application, using pickle to export the model and the Flask framework to serve it
Improvements
No solution is ever perfect, and there is always room for improvement, so here are two points that could be improved in the future
1-We could acquire more news data to make the model more accurate
2-We could include news in multiple languages (e.g., Arabic, French, etc.) instead of English only, to make the model more widely usable and international