Detecting Real and Fake News with Python

程序员学长 2023-03-08

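Hi everyone, I'm Xiaohan.

The spread of fake news has become a major problem in today's society, so being able to identify news articles that have no factual basis or are deliberately misleading matters a great deal.

In this project we will use machine learning to classify news articles as real or fake based on their content.

By identifying fake news articles, we can help limit the spread of misinformation and help people make better-informed decisions.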

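Loading the dataset

This article uses the "Fake and real news dataset" available on Kaggle, which contains roughly 45,000 news articles labeled as real or fake, split into two files, True.csv and Fake.csv. The articles were collected from various news websites and preprocessed to remove irrelevant content such as HTML tags, advertisements, and boilerplate text. For each article the dataset provides the title, text, subject, and publication date. It can be downloaded from the following link: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset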

import pandas as pd
real_news=pd.read_csv('True.csv')
fake_news=pd.read_csv('Fake.csv')
real_news

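As we can see, the data contains four columns: the article title, the article body, the article subject, and the publication date. The model in this article is trained on the text column; the title column is available as an additional feature but is not used in the code that follows.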
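If you also want the title as model input, one option is to label both tables, stack them, and concatenate the title and text into a single string per article. The following is a minimal sketch on made-up rows; the names real_demo, fake_demo, news_demo, and the "content" and "label" columns are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for True.csv and Fake.csv (hypothetical rows, same column names).
real_demo = pd.DataFrame({
    "title": ["Senate passes budget bill"],
    "text": ["The Senate voted on Tuesday to approve the budget."],
})
fake_demo = pd.DataFrame({
    "title": ["You won't believe what happened"],
    "text": ["Sources say something shocking took place."],
})

# Label each source: 1 = real, 0 = fake.
real_demo["label"] = 1
fake_demo["label"] = 0

# Stack both frames and build one input string per article.
news_demo = pd.concat([real_demo, fake_demo], ignore_index=True)
news_demo["content"] = news_demo["title"] + " " + news_demo["text"]

print(news_demo[["content", "label"]])
```

Keeping the label in the same frame as the text also makes it easier to shuffle and split the data later without the two getting out of step.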

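Before training a model, we need some exploratory data analysis to understand the data.

For example, the following code plots the distribution of article lengths in each dataset: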

import matplotlib.pyplot as plt
real_lengths = real_news['text'].apply(len)
fake_lengths = fake_news['text'].apply(len)

plt.hist(real_lengths, bins=50, alpha=0.5, label='Real')
plt.hist(fake_lengths, bins=50, alpha=0.5, label='Fake')
plt.title('Article Lengths')
plt.xlabel('Length')
plt.ylabel('Count')
plt.legend()
plt.show()

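As the plot shows, article length varies widely: some articles are very short (under 1,000 characters), while others are very long (over 40,000 characters). We need to keep this in mind when preprocessing the text.

We can also look at the most common words in each dataset with the following code: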

from collections import Counter
import nltk
# Download the stop-word list and the Punkt tokenizer models
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

def get_most_common_words(texts, num_words=10):
    all_words = []
    for text in texts:
        all_words.extend(nltk.word_tokenize(text.lower()))
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [word for word in all_words if word.isalpha() and word not in stop_words]
    word_counts = Counter(words)
    return word_counts.most_common(num_words)

real_words = get_most_common_words(real_news['text'])
fake_words = get_most_common_words(fake_news['text'])

print('Real News:', real_words)
print('Fake News:', fake_words)

Real News: [('said', 99036), ('trump', 54249), ('would', 31526), ('reuters', 28412), ('president', 26397), ('state', 19728), ('government', 18288), ('new', 16784), ('house', 16519), ('states', 16515)]
Fake News: [('trump', 74240), ('said', 31149), ('people', 26012), ('president', 25770), ('would', 23461), ('one', 22994), ('clinton', 18085), ('obama', 17920), ('like', 17660), ('donald', 17235)]

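As we can see, several of the most common words in both datasets relate to politics and to former US President Donald Trump. There are differences, though: the fake-news set refers more often to Hillary Clinton and Barack Obama ("clinton", "obama") and makes heavier use of informal words such as "like", while the real-news set is dominated by reporting vocabulary such as "said" and "reuters".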

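Text preprocessing

1. Lowercasing the text

Lowercasing means converting every letter in a piece of text to lowercase. It is a common preprocessing step that can improve the accuracy of a text classification model. For example, a case-sensitive model treats "Hello" and "hello" as two different words, whereas after lowercasing they are treated as the same word.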

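2. Removing punctuation and digits

This step removes non-alphabetic characters from the text, which reduces its complexity and makes it easier for a model to analyze. For example, if punctuation is left in, a text analysis model may treat "hello" and "hello!" as different words.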
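To make the effect of these first two steps concrete, here is a small sketch that uses only the Python standard library. The sample string is made up, and splitting on whitespace is a simplification of real tokenization.

```python
import string
from collections import Counter

raw_tokens = "Hello hello hello! HELLO".split()

# Without normalization every surface form is its own vocabulary entry.
raw_counts = Counter(raw_tokens)

# Lowercase, then strip ASCII punctuation and digits from each token.
table = str.maketrans("", "", string.punctuation + string.digits)
clean_counts = Counter(tok.lower().translate(table) for tok in raw_tokens)

print(len(raw_counts), "distinct tokens before:", dict(raw_counts))
print(len(clean_counts), "distinct token after:", dict(clean_counts))
```

Four surface forms collapse into a single vocabulary entry, which is exactly the dimensionality reduction these steps are meant to provide.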

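3. Removing stop words

Stop words are words that are very common in a language and carry little meaning, such as "the", "and", and "in". Removing them reduces the dimensionality of the data and focuses attention on the most informative words. It can also improve classification accuracy by reducing noise in the data.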

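4. Stemming or lemmatizing the text

Stemming and lemmatization are common techniques for reducing words to a base form.

Stemming strips suffixes from a word to produce a stem. For example, "jumping" is stemmed to "jump". This reduces the dimensionality of the data, but the resulting stem is sometimes not a real word.

Lemmatization, by contrast, uses a dictionary or morphological analysis to reduce a word to its base form. For example, "jumping" is lemmatized to "jump", which is a real word. Lemmatization is more accurate than stemming but computationally more expensive.

Both techniques reduce the dimensionality of text data and make it easier for a model to analyze.

Note, however, that both can discard information, so it is worth experimenting with each to see which works better for a given text classification problem.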
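The point about stems that are not real words is easy to see with NLTK's Porter stemmer. The sketch below only exercises the stemmer, which needs no corpus download; the word list is an arbitrary choice for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming strips suffixes by rule, so the result is not always a dictionary word.
for word in ["jumping", "jumps", "studies", "happily"]:
    print(word, "->", stemmer.stem(word))
```

Here "jumping" and "jumps" both reduce to "jump", while "studies" becomes "studi", which is not an English word. For comparison, NLTK's WordNetLemmatizer returns "jump" for lemmatize("jumping", pos="v"); without the pos argument it treats the word as a noun and returns it unchanged, so the part of speech matters when lemmatizing.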

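We will carry out these steps with the NLTK library, which provides a range of text processing tools. The function below applies stemming; a lemmatizer is also created but is not used.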

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation and digits
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))

    # Tokenize the text
    words = word_tokenize(text)

    # Remove stop words
    words = [word for word in words if word not in stop_words]

    # Stem the words (swap in lemmatizer.lemmatize(word) to lemmatize instead)
    words = [stemmer.stem(word) for word in words]

    # Join the words back into a string
    text = ' '.join(words)

    return text

We can now apply this preprocessing function to every article in our dataset:

real_news['text'] = real_news['text'].apply(preprocess_text)
fake_news['text'] = fake_news['text'].apply(preprocess_text)

Model training

Now that the text data has been preprocessed, we can train our model.

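We will use a simple bag-of-words approach that represents each article as a vector of word counts, using the CountVectorizer class from the sklearn library to convert the preprocessed text into feature vectors.

CountVectorizer is a widely used text feature extraction tool in natural language processing. It converts a collection of text documents into a matrix of token counts in which each row represents a document and each column represents a word from the collection. It works by first tokenizing the text into words and then counting how often each word occurs in each document. The resulting matrix can be used as input to machine learning algorithms for tasks such as text classification.

CountVectorizer has several parameters for customizing this step. For example, the "stop_words" parameter specifies a list of words to remove before counting, and the "max_df" parameter sets a maximum document frequency: words that appear in more documents than this threshold are treated as corpus-specific stop words and ignored.

One advantage of CountVectorizer is that it is simple to use and works for many kinds of text classification problems. It is also memory-efficient, because it returns a sparse matrix that stores only the nonzero counts. Another advantage is interpretability: each column corresponds to a known word, so the matrix, together with the weights a model learns for each column, shows which words matter for classification.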

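Other ways to convert text into numeric features include TF-IDF (term frequency-inverse document frequency), Word2Vec, Doc2Vec, and GloVe (Global Vectors for Word Representation).

TF-IDF is similar to CountVectorizer, but instead of using raw counts alone it also considers how many documents in the corpus contain each word: words that appear in most documents are down-weighted, while words concentrated in a few documents receive higher weights.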

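Word2Vec and Doc2Vec learn low-dimensional vector representations of words and documents that capture latent semantic relationships between them.

GloVe is another method for learning word vectors; it combines global co-occurrence statistics, as in matrix factorization methods, with the local context-window approach used by Word2Vec.

Each method has its strengths and weaknesses, and the right choice depends on the problem and dataset at hand.

For this dataset, we use CountVectorizer, as shown below.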

from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse as sp
import numpy as np

vectorizer = CountVectorizer()
X_real = vectorizer.fit_transform(real_news['text'])
X_fake = vectorizer.transform(fake_news['text'])

X = sp.vstack([X_real, X_fake])
y = np.concatenate([np.ones(X_real.shape[0]), np.zeros(X_fake.shape[0])])

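Here we first create a CountVectorizer object and fit it to the preprocessed text of the real-news dataset, then use the same vectorizer to transform the preprocessed text of the fake-news dataset. We then stack the two feature matrices vertically and create the corresponding label vector y, with 1 for real and 0 for fake. Note that because the vocabulary is learned from real news only, words that occur only in fake articles get no column and are ignored; fitting the vectorizer on the training portion of both classes would avoid this.

Now that we have the features and the label vector, we can split the data into training and test sets: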
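An alternative arrangement, not the one used in this article, is to split the raw text first and let a scikit-learn Pipeline fit the vectorizer on the training articles of both classes. The sketch below assumes a tiny made-up corpus of already-stemmed text; the names texts, labels, and pipeline are illustrative only, and the results reported in this article come from the code above, not from this variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the preprocessed articles (1 = real, 0 = fake).
texts = [
    "senat vote budget bill tuesday",
    "minist said govern approv plan",
    "court rule state law case",
    "offici said talk continu week",
    "shock video expos secret plot",
    "wow believ crazi thing happen",
    "hilari meme destroy politician",
    "watch insan rant go viral",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split first, so the vocabulary is learned only from training articles of both classes.
train_texts, test_texts, y_train_demo, y_test_demo = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer and the classifier together on the training data.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(train_texts, y_train_demo)

print("Vocabulary size:", len(pipeline.named_steps["countvectorizer"].vocabulary_))
print("Predictions:", pipeline.predict(test_texts))
```

A side benefit of this design is that the pipeline is a single object, so the vectorizer and classifier cannot drift apart when the model is saved and reloaded.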

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We can now train our model using a logistic regression classifier:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

Model evaluation

Now that the model is trained, we can evaluate its performance on the test set.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

Output:

Accuracy: 0.994988864142539
Precision: 0.9935498733010827
Recall: 0.9960739030023095
F1 Score: 0.9948102871641102

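As we can see, the model performs very well, with accuracy, precision, recall, and F1 all above 99% on the test set, so on this dataset it separates real from fake articles almost perfectly. Part of this score may come from source-specific cues rather than from signs of fabrication: for instance, "reuters" is among the most frequent words in the real-news set above, so the result may not carry over to articles from other outlets.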

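Improving the model

Although our logistic regression model achieves high accuracy on the test set, there are several things we can try to improve it:

Feature engineering: instead of bag-of-words, we can use richer text representations such as word embeddings or topic models, which may capture subtler relationships between words.
Hyperparameter tuning: we can tune the hyperparameters of the logistic regression model with grid search or random search to find the best settings for this dataset.

We can also compare other classifiers. The following code trains and evaluates a Multinomial Naive Bayes model and a support vector machine on the same split: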

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Define a function to train and evaluate a model
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test):
    # Train the model on the training data
    model.fit(X_train, y_train)
    
    # Predict the labels for the testing data
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Print the evaluation metrics
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")


# Train and evaluate a Multinomial Naive Bayes model
print("Training and evaluating Multinomial Naive Bayes model...")
nb = MultinomialNB()
train_and_evaluate_model(nb, X_train, y_train, X_test, y_test)
print()

# Train and evaluate a Support Vector Machine model
print("Training and evaluating Support Vector Machine model...")
svm = SVC()
train_and_evaluate_model(svm,  X_train, y_train, X_test, y_test)

Output:

Training and evaluating Multinomial Naive Bayes model...
Accuracy: 0.9422
Precision: 0.9422
Recall: 0.9422
F1-score: 0.9422

Training and evaluating Support Vector Machine model...
Accuracy: 0.9919
Precision: 0.9919
Recall: 0.9919
F1-score: 0.9919

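Next, we tune the hyperparameters with GridSearchCV.

from sklearn.model_selection import GridSearchCV

# Define a list of hyperparameters to search over
hyperparameters = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 1]
}

# Perform grid search to find the best hyperparameters.
# Note: the default 'lbfgs' solver does not support the 'l1' penalty, so those
# candidates fail to fit and are scored as NaN; pass solver='liblinear' or
# solver='saga' to LogisticRegression to actually evaluate them.
grid_search = GridSearchCV(LogisticRegression(), hyperparameters, cv=5)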
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and test accuracy
print('Best hyperparameters:', grid_search.best_params_)
print('Test accuracy:', grid_search.score(X_test, y_test))

Output:

Best hyperparameters: {'C': 1, 'penalty': 'l2'}
Test accuracy: 0.994988864142539
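For the grid to compare l1 and l2 penalties in practice, the estimator needs a solver that supports both. The sketch below shows one such setup on a made-up toy corpus with solver="liblinear"; the data and the names param_grid and search are assumptions for illustration, and its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus (1 = real, 0 = fake), large enough for 2-fold CV.
texts = [
    "senat vote budget bill", "court rule state case",
    "minist said govern plan", "offici said talk continu",
    "shock video secret plot", "crazi thing happen watch",
    "hilari meme viral rant", "insan secret video expos",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X_demo = CountVectorizer().fit_transform(texts)

param_grid = {"penalty": ["l1", "l2"], "C": [0.1, 1]}

# liblinear supports both penalties, so all four combinations are actually fitted.
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=2)
search.fit(X_demo, labels)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

Checking cv_results_ for NaN scores is a quick way to confirm that every candidate in the grid was really evaluated.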

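In this run none of these alternatives beat the baseline: Naive Bayes and the SVM scored lower, and the grid search selected C=1 with the l2 penalty, which are the LogisticRegression defaults, so the test accuracy is unchanged. Richer features or a wider search may still improve the model.

Saving our model: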

from joblib import dump
dump(clf, 'model.joblib')
dump(vectorizer, 'vectorizer.joblib')

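The dump function from the joblib library saves the clf model to the file model.joblib and the fitted vectorizer to vectorizer.joblib. Once saved, both can be loaded into other Python scripts with joblib's load function.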
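As a quick sanity check, it can be worth confirming that a saved model and vectorizer give the same predictions after reloading. The sketch below does this round trip with a made-up toy model in a temporary directory; the data and the names model, vec, loaded_model, and loaded_vec are assumptions for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model; in the article these would be the fitted clf and vectorizer.
texts = ["senat vote bill", "court rule case", "shock secret video", "crazi viral rant"]
labels = [1, 1, 0, 0]
vec = CountVectorizer()
model = LogisticRegression().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.joblib")
    vec_path = os.path.join(tmp, "vectorizer.joblib")
    dump(model, model_path)
    dump(vec, vec_path)

    # Both objects are needed at prediction time: the vectorizer maps text to
    # the same columns the classifier was trained on.
    loaded_model = load(model_path)
    loaded_vec = load(vec_path)

sample = ["senat vote rule"]
print(loaded_model.predict(loaded_vec.transform(sample)))
```

Loading only the classifier is not enough, because a freshly fitted vectorizer would assign words to different columns.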

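Model deployment

Finally, we can deploy our model as a web application using the Flask framework.

We will create a simple web form where users can enter text, and the model will output whether the text is likely to be real or fake news.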

from flask import Flask, request, render_template
from joblib import load
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

clf = load('model.joblib')
vectorizer = load('vectorizer.joblib')

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation and digits
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))

    # Tokenize the text
    words = word_tokenize(text)

    # Remove stop words
    words = [word for word in words if word not in stop_words]

    # Stem the words (this must match the preprocessing used at training time)
    words = [stemmer.stem(word) for word in words]

    # Join the words back into a string
    text = ' '.join(words)

    return text


app = Flask(__name__)

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    text = request.form['text']
    preprocessed_text = preprocess_text(text)
    X = vectorizer.transform([preprocessed_text])
    y_pred = clf.predict(X)
    if y_pred[0] == 1:
        result = 'real'
    else:
        result = 'fake'
    return render_template('result.html', result=result, text=text)

if __name__ == '__main__':
    app.run(debug=True)

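We can save the code above in a file named "app.py". We also need to create two HTML templates, "home.html" and "result.html", and place them in a "templates" folder next to app.py, which is where Flask's render_template looks for them by default.

home.html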

<!DOCTYPE html>
<html>
<head>
    <title>Real or Fake News</title>
</head>
<body>
    <h1>Real or Fake News</h1>
    <form action="/predict" method="post">
        <label for="text">Enter text:</label><br>
        <textarea id="text" name="text" rows="10" cols="50"></textarea><br>
        <input type="submit" value="Submit">
    </form>
</body>
</html>

result.html

<!DOCTYPE html>
<html>
<head>
    <title>Real or Fake News</title>
</head>
<body>
    <h1>Real or Fake News</h1>
    <p>The text you entered:</p>
    <p>{{ text }}</p>
    <p>The model predicts that this text is:</p>
    <p>{{ result }}</p>
</body>
</html>

We can now run the Flask application from the command line with the command python app.py.

The application should then be accessible at http://127.0.0.1:5000/.