The Truth Behind Fake News: Tools and Techniques for Detection
Authors: Linsong, Shlok Nangia, Jialiang Guo (3_datamen)
This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.
Motivation and Background:
Recent years have witnessed fake news becoming a major problem, particularly on social media platforms like Twitter and Facebook. Our project aims to identify and classify fake news articles in order to improve media literacy, protect the public from the effects of fake news, and support easier fact-checking.
Problem Statement:
The challenge of fake news detection lies in distinguishing between genuine and false information, particularly as fake news becomes increasingly sophisticated and difficult to detect. Our project aims to answer the following questions:
- Can we accurately identify fake news using machine learning techniques?
- How can we improve the accuracy and reliability of fake news detection?
- What features or characteristics of news articles are most indicative of fake news?
Our Data Science Pipeline:
Our data science pipeline involves several components, including data collection, preprocessing, feature extraction, model training, and evaluation. We collected a dataset of news articles from various sources, both genuine and fake, and preprocessed the data by removing irrelevant information and cleaning the text.
1. Data Collection:
The initial stage of our data science pipeline involves gathering raw data from various sources. In this project, we have selected the PHEME, Liar, and Buzzfeed News datasets as the primary sources for fake news detection. These datasets provide diverse and rich information, making them suitable for our analysis. Additionally, we collect live data from Twitter using the Tweepy library as an up-to-date and dynamic source of information to enhance our model’s performance and adaptability.
Downloading the datasets:
PHEME: Access to the PHEME dataset can be requested through the PHEME project page. Once access is granted, download the dataset files.
Liar: The Liar dataset can be downloaded from the UC Santa Barbara website. Download and unzip the dataset files.
Buzzfeed News: The Buzzfeed News dataset is available on the BuzzFeed News GitHub repository.
2. Data Preprocessing:
In this stage, the raw data is cleaned and preprocessed to make it suitable for analysis. This process involves several steps, including handling missing values, removing duplicates, correcting inconsistencies, and converting data types. Data preprocessing also includes feature engineering, which entails creating new features from existing ones to better represent the problem domain. We have performed the following preprocessing steps:
- Load the BuzzFeed, Liar, and PHEME datasets from their respective files.
- For each dataset, perform the following preprocessing steps:
  - Remove any unnecessary columns and keep only the relevant ones (e.g., label and text).
  - Convert the text to lowercase.
  - Remove any non-alphabetic characters.
- For the BuzzFeed dataset, convert the “type” column to a binary label.
- For the Liar dataset, convert the “label” column to a binary label.
- Combine the preprocessed datasets into a single DataFrame.
- Save the combined preprocessed data to a CSV file for further analysis and modeling.
These preprocessing steps ensure that the resulting dataset is clean, consistent, and ready for analysis, making it suitable for use in machine learning algorithms and other data analysis methods.
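As a sketch, the steps above could look like the following with pandas. The column names (`text`, `type`, `label`), the binary label mapping, and the toy records are assumptions for illustration; the real pipeline reads the actual dataset files.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Lowercase, strip non-alphabetic characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Toy stand-ins for the real dataset files (column names are assumptions).
buzzfeed = pd.DataFrame({"text": ["Shocking CLAIM!!!"], "type": ["fake"]})
liar = pd.DataFrame({"text": ["Mostly true statement."], "label": ["mostly-true"]})

# Binary labels: 1 = fake, 0 = real (the actual mapping may differ).
buzzfeed["label"] = (buzzfeed["type"] == "fake").astype(int)
liar["label"] = liar["label"].isin(["false", "pants-fire", "barely-true"]).astype(int)

# Combine into a single DataFrame, clean the text, and save for later stages.
combined = pd.concat(
    [buzzfeed[["text", "label"]], liar[["text", "label"]]], ignore_index=True
)
combined["text"] = combined["text"].map(clean_text)
combined.to_csv("combined_preprocessed.csv", index=False)
```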
3. Exploratory Data Analysis (EDA):
EDA is the process of visually and analytically exploring the data to understand its structure, relationships, and patterns. This helps in generating hypotheses, identifying outliers, and determining the most relevant features for the analysis. In this case, we will analyze two datasets:
- the combined dataset containing rumor and non-rumor threads for various events, and
- the BuzzFeed dataset containing real and fake news.
Common EDA techniques include descriptive statistics, histograms, box plots, and scatter plots.
We will perform the following tasks using the provided code as a reference for both datasets:
- Compare the number of different types of threads (rumor and non-rumor) and news (real and fake) for each event and source, respectively. Analyze the dynamics of information spread over time, such as retweets of source tweets in rumor and non-rumor threads.
- Examine the depth of reaction structures for both rumor and non-rumor threads and the relationship between news type and the presence of movies (video links) and images in the news articles.
- Visualize the network structures of rumor and non-rumor threads using graph representations and identify common sources that publish both real and fake news.
Here, we’ll summarize the insights obtained from the Exploratory Data Analysis:
- The comparison of rumor and non-rumor threads for each event and the comparison of real and fake news among various sources help us understand the distribution of rumors, non-rumors, real news, and fake news in different scenarios.
- Analyzing the dynamics of information spread, such as the number of retweets of source tweets in rumor and non-rumor threads over time, helps us understand the impact of rumors on social media and the spread of real and fake news.
- Examining the depth of reaction structures for both rumor and non-rumor threads and the relationship between news type and the presence of movies (video links) and images in the news articles provides insights into the complexity of discussions and the prevalence of visual content in different types of threads and news articles. This can help us identify the level of engagement and how different types of threads and news articles evolve over time.
- Visualizing the network structures of rumor and non-rumor threads using graph representations and identifying common sources that publish both real and fake news allows us to explore the connectivity and interaction patterns among users participating in these threads and assess the credibility and reliability of various sources. This can help us identify influential nodes or clusters in the network that may contribute to the spread of information, rumors, or fake news.
Overall, the Exploratory Data Analysis helps us understand the structure and the relationships among the features in both datasets. This understanding can guide us in developing more accurate predictive models for detecting rumors and non-rumors in social media threads and real and fake news in the BuzzFeed dataset.
Top 5 sources that publish real news (BuzzFeed dataset)
Top 5 sources that publish fake news (BuzzFeed dataset)
Common sources of real and fake news in the BuzzFeed dataset
In addition to analyzing the BuzzFeed dataset, we also explored the connections between speakers in the “liar” dataset based on their shared contexts. We created a directed graph using NetworkX, where nodes represent speakers and edges represent pairs of speakers who appear in the same context. This allowed us to compute various network statistics, such as the number of nodes, the number of edges, and the average degree. Moreover, we visualized the graph using two different layouts: the spring layout (force-directed) and the Circos plot. This analysis helped us understand the relationships between speakers and provided insights into the structure of their interactions. By examining the connections between speakers, we can potentially identify patterns or clusters that may contribute to the spread of misinformation.
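The speaker graph described above can be sketched with NetworkX. The (speaker, context) records below are made up for illustration; the real analysis reads them from the Liar dataset.

```python
from itertools import combinations
import networkx as nx

# Hypothetical (speaker, context) records standing in for the Liar dataset.
records = [
    ("alice", "a speech"), ("bob", "a speech"),
    ("bob", "an interview"), ("carol", "an interview"),
]

# Group speakers by shared context.
by_context = {}
for speaker, context in records:
    by_context.setdefault(context, set()).add(speaker)

# Connect every pair of speakers that appear in the same context.
G = nx.DiGraph()
for speakers in by_context.values():
    for u, v in combinations(sorted(speakers), 2):
        G.add_edge(u, v)

# Network statistics like those reported in the analysis.
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
avg_degree = sum(d for _, d in G.degree()) / num_nodes
```

For the visualizations, `nx.spring_layout(G)` gives the force-directed layout, while the Circos plot would come from a companion library such as nxviz.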
4. Model Monitoring and Maintenance: In this part, we integrated live Twitter data collection and used OpenAI to simulate labels generated by human annotators. To enhance the user experience, we can also incorporate user feedback based on the predicted scores from our model. Here’s a high-level overview of how to include user feedback in the pipeline:
- After deploying the model, users can interact with it by submitting text or links that they want to evaluate for credibility.
- The model provides a credibility score or prediction for the submitted content.
- Users can then provide feedback on the prediction, indicating whether they agree or disagree with the model’s evaluation.
- Collect user feedback and store it in a database, along with the original content and the model’s prediction.
- Periodically update the training data with the new feedback and retrain the model to improve its performance. This step can be automated using techniques such as active learning or by manually incorporating the user feedback into the training data.
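A minimal sketch of the feedback store described above, using SQLite; the schema and column names are assumptions, not the actual implementation.

```python
import sqlite3

# In-memory database for illustration; a file path would be used in production.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE feedback (
           content TEXT,          -- submitted article text or link
           prediction REAL,       -- model credibility score
           user_agrees INTEGER,   -- 1 = agree, 0 = disagree
           created_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def record_feedback(content: str, prediction: float, user_agrees: bool) -> None:
    """Store one piece of user feedback alongside the model's prediction."""
    conn.execute(
        "INSERT INTO feedback (content, prediction, user_agrees) VALUES (?, ?, ?)",
        (content, prediction, int(user_agrees)),
    )
    conn.commit()

record_feedback("Example headline", 0.82, True)
```

Rows accumulated this way can later be exported and merged into the training data for retraining.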
By incorporating user feedback into the model, we ensure that the model continuously adapts to the changing patterns of fake news and misinformation. This approach can lead to better overall performance and improved accuracy in detecting fake news.
Methodology
Data Splitting:
To prepare the data for model training and evaluation, we first integrated four datasets from different sources into a single dataset. The resulting dataset consists of 64,296 pieces of text with their corresponding labels (true or fake).
To ensure that the dataset was split in a way that allows proper model evaluation, we used the train_test_split function from the scikit-learn library to divide it into training, validation, and test sets in a 70:15:15 ratio.
We also set the random_state parameter to make the split reproducible, so that we can compare different models with each other later, and used the stratify parameter to ensure that each set contains the same proportion of samples from each class.
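Because train_test_split only produces two partitions, the 70:15:15 split can be done in two stages, as sketched below on toy data:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 64,296 texts and their binary labels.
texts = ["sample %d" % i for i in range(100)]
labels = [i % 2 for i in range(100)]

# Stage 1: carve off the 70% training set.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels
)
# Stage 2: split the remaining 30% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest
)
```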
Model Selection and Training:
We applied transfer learning to our dataset with three popular natural language processing models, BERT, GPT-2, and LSTM, and explored their performance. BERT and GPT-2 are transformer-based models, while LSTM is a recurrent neural network (RNN) model.
1. BERT and GPT-2:
To utilize the power of these pre-trained transformer models, we first used HuggingFace’s transformers library to tokenize our training data and encapsulate the tokenized data with corresponding labels into Dataloaders, for batch training.
Then we loaded the pre-trained BERT/GPT-2 model, froze its parameters, and added a trainable binary classification head, which consists of two dense layers with a ReLU activation and a Dropout (to prevent overfitting) between them, and a LogSoftmax function at the end. We took the [CLS] token from the output of each transformer’s last hidden layer and used it as input to the classification head.
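The classification head described above might look like the following PyTorch module; the hidden size (768), intermediate dimension, and dropout rate are assumptions, not the exact values used.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Two dense layers with ReLU and Dropout between them, ending in LogSoftmax."""

    def __init__(self, hidden_size: int = 768, mid_size: int = 128, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, mid_size),
            nn.ReLU(),
            nn.Dropout(0.3),  # prevents overfitting of the trainable head
            nn.Linear(mid_size, n_classes),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: the [CLS] token from the transformer's last hidden
        # layer, shape (batch_size, hidden_size)
        return self.net(cls_embedding)

head = ClassificationHead()
log_probs = head(torch.randn(4, 768))  # shape: (4, 2)
```

With the transformer parameters frozen, only this head receives gradient updates during training.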
For the training stage, we adopted the PyTorch Lightning framework to conduct batch training for both models. During training, we logged training and validation performance metrics, including accuracy and loss, which can be visualized in TensorBoard.
2. LSTM:
For the LSTM model, we first defined a text transform pipeline to preprocess the text data before training. The pipeline consists of a SentencePieceTokenizer, a VocabTransform, truncation, and the addition of any extra tokens. Similar to the previous section, the transformed data was encapsulated with the corresponding labels into Dataloaders.
As LSTM is not a transformer-based architecture, it needs word embeddings to obtain continuous, dense vector representations of words to work on. We applied pre-trained GloVe-6B word embeddings for this purpose. The neural network architecture consists of the word embedding layer, the LSTM, which takes the embedding output as its input, and a dense layer that translates the LSTM output into log probabilities over the two classes.
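A sketch of that architecture is below. The vocabulary size, hidden dimension, and the 100-dimensional embedding (matching one GloVe-6B variant) are assumptions, and randomly initialized weights stand in for the pre-trained GloVe vectors.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=64, n_classes=2):
        super().__init__()
        # In the real model, these weights are loaded from GloVe-6B embeddings.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # final hidden state: (1, batch, hidden_dim)
        # Dense layer maps the last hidden state to log probabilities of two classes.
        return torch.log_softmax(self.fc(h_n[-1]), dim=-1)

model = LSTMClassifier()
out = model(torch.randint(0, 5000, (8, 20)))  # 8 sequences of 20 token ids
```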
Again, we used the PyTorch Lightning framework to conduct batch training, and logged the training and validation performance of the LSTM model, which can be visualized in TensorBoard.
Model Explanation:
We built our fake news detector data product using BERT, GPT-2, and LSTM models. To explain these black-box models, we employed a Global Surrogate approach: training a logistic regression model on the labels predicted by our models on the training dataset. We further enhanced interpretability using LIME and SHAP techniques.
We used the dataset with pred_labels generated by BERT, GPT-2, and LSTM models to train and evaluate a logistic regression model for fake news detection. We then applied LIME and SHAP techniques to interpret the model. Here’s a brief explanation of each step:
Logistic Regression Model:
- Load the dataset and preprocess it to create binary labels.
- Split the dataset into train and test sets, and vectorize the text using TfidfVectorizer.
- Train a logistic regression model and evaluate its performance using the classification report.
- Use the logistic regression model as a surrogate model for the black-box models (BERT, GPT-2, and LSTM) to improve interpretability.
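The surrogate steps above can be sketched with scikit-learn. The toy texts and labels below stand in for the real data; the key point is that the surrogate is fit on the labels *predicted* by the black-box models, not the ground truth.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins: article texts paired with the labels the black-box model
# predicted for them (1 = fake, 0 = real).
texts = ["free money click now", "scientists publish study",
         "you will not believe this", "official report released"] * 5
pred_labels = [1, 0, 1, 0] * 5

# TF-IDF vectorization followed by logistic regression as the surrogate.
surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
surrogate.fit(texts, pred_labels)
```

Because the surrogate is interpretable, its coefficients approximate how the black-box models weigh individual words.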
Interpretability Techniques (LIME and SHAP):
LIME:
- Define a function to predict probabilities for the LIME explainer.
- Create a LimeTextExplainer object and choose an instance from the test set to explain.
- Use the LIME explainer to generate an explanation for the chosen instance.
- Visualize the LIME explanation to understand the impact of each feature (word) on the prediction.
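The probability function in the first step maps a list of raw strings to an (n_samples, n_classes) array, which is the shape LimeTextExplainer expects. A sketch, with a small TF-IDF pipeline standing in for our actual model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy model standing in for the surrogate trained on the real data.
texts = ["free money click now", "scientists publish study"] * 10
labels = [1, 0] * 10
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def predict_proba(raw_texts):
    """Return an (n_samples, n_classes) probability array for LIME."""
    return model.predict_proba(raw_texts)

# The explainer would then be used roughly as follows:
# explainer = LimeTextExplainer(class_names=["real", "fake"])
# exp = explainer.explain_instance(some_text, predict_proba, num_features=10)
```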
SHAP:
- Create a SHAP explainer using the logistic regression model and calculate SHAP values for the test set.
- Visualize the SHAP values using a beeswarm plot to understand the contribution of each feature (word) to the predictions.
Evaluation
Model Validation and Evaluation: The validation was conducted on the validation set during training, as stated above. We experimented with different hyperparameters, such as learning rate, dropout rate, layer dimensions, optimizer, and word embedding/transformer choices. By observing the validation performance, we selected the best-performing models and their corresponding hyperparameters, and decided when to stop early to prevent overfitting.
Once the model has been trained and validated, we can evaluate its performance by using it to make predictions on the test set, which contains unseen samples. We used scikit-learn’s classification_report method to generate metrics describing accuracy, precision, recall, and F1-score of each model.
The GPT-2-based model generally achieves the best performance of the three. The LSTM-based model can also achieve relatively high accuracy, but its performance is not stable: in some training rounds it only reaches 55–60% accuracy, which may be due to weight initialization. The BERT-based model is more stable than the LSTM, but only achieves an accuracy of about 85%. The confusion_matrix and ConfusionMatrixDisplay methods can also be used to visualize the performance of each model and identify where it makes errors.
From Left To Right: Confusion Matrix of BERT-Based, GPT-2 Based, LSTM-Based Model
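Matrices like these come from scikit-learn's confusion_matrix; the toy labels below are for illustration only, not our actual results.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy ground-truth and predicted labels (1 = fake, 0 = real).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are the true class, columns the predicted class.
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred, target_names=["real", "fake"]))
# ConfusionMatrixDisplay(cm, display_labels=["real", "fake"]).plot() would
# render a matrix like the ones shown above.
```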
Interpretability Results: LIME and SHAP results provide insights into the most important features contributing to the predictions. LIME produces instance-by-instance explanations, which may seem random and harder to interpret. In contrast, SHAP offers a comprehensive evaluation, revealing features like “video” positively affect predictions, possibly suggesting that visual evidence makes news more reliable. Features like “http” and “says” negatively affect predictions, indicating online rumors and hearsay might be prevalent in these cases. Analyzing these explanations helps us better understand the factors considered by the models when detecting fake news.
Data Product
Our data product is a tool that allows users to enter a news article's details, such as its title, source URL, and description, and receive a prediction of whether it is genuine or fake. This tool can be integrated into different news-sharing platforms to validate the integrity of an article before it is posted or shared.
The application uses our best-performing machine learning model(s) to make predictions based on the text content of the article, and the output probabilities are then adjusted according to the credibility of the news source. We demonstrated the functionality of the application by testing it on several news articles and comparing the predictions to known ground-truth labels.
This Screenshot was taken in April 2023
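The source-credibility adjustment mentioned above can be sketched as a simple weighted blend; the weight table, the neutral default, and the blending formula below are assumptions for illustration, not our exact scheme.

```python
# Hypothetical per-source credibility weights in [0, 1].
SOURCE_CREDIBILITY = {"reuters.com": 0.95, "unknown-blog.example": 0.30}

def adjust_probability(p_genuine: float, source: str, alpha: float = 0.3) -> float:
    """Blend the model's P(genuine) with the source's credibility score.

    alpha controls how strongly the source prior shifts the model output;
    unseen sources get a neutral credibility of 0.5.
    """
    credibility = SOURCE_CREDIBILITY.get(source, 0.5)
    return (1 - alpha) * p_genuine + alpha * credibility

adjusted = adjust_probability(0.60, "reuters.com")  # 0.7*0.6 + 0.3*0.95 = 0.705
```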
Using the “thumbs-up” and “thumbs-down” buttons, users can also give feedback on whether the model's prediction is correct. This feedback is sent back to our system and used to retrain the model or extend its training on live data.
Lessons Learnt
Through this project, we learned the importance of carefully selecting and preprocessing data, as well as the benefits and limitations of various machine learning techniques for text classification tasks. We also gained experience in developing web applications and presenting our findings in a clear and concise manner.
Future Work
As part of our future work, we plan to implement the following enhancements to improve the accuracy and reliability of our fake news detection system:
- Feedback Processing: Introduce a user identification system (user ID) to differentiate the credibility of feedback. This will help us better evaluate and weigh the input from users when refining the model, ensuring that the feedback provided is reliable and trustworthy.
- Maintain Information Sources and Weights: Develop a list of information sources and their respective weights to further enhance the accuracy of fake news detection. By assigning weights to the credibility and reliability of each information source, the model can better assess the likelihood of a news item being fake or genuine. Regularly updating this list will ensure that the model stays up-to-date with the evolving landscape of news sources.
By implementing these improvements, we aim to develop a more accurate and robust fake news detection system that can adapt to the changing nature of online information dissemination.
Summary
Our project, “The Truth Behind Fake News: Tools and Techniques for Detection”, aimed to develop tools and techniques for identifying and detecting fake news articles. Through a data science pipeline involving data collection, preprocessing, feature extraction, model training, and evaluation, we were able to train machine learning models that identified fake news with high accuracy.
Our data product, a web application for predicting the authenticity of news articles, demonstrated the effectiveness of our models in a real-world application. Our findings contribute to ongoing efforts to combat the spread of fake news and improve the accuracy and reliability of information available to the public.
Final Note
In the future, we plan to explore more advanced deep learning techniques to improve the performance of our fake news detection model. We also aim to integrate our tool with social media platforms to reach a larger audience and combat the spread of fake news. But that might be the topic of our future posts.
Till then, keep learning 😊