Text Classification Examples in Python
As artificial intelligence and natural language processing have matured, text classification has become one of the most widely used NLP techniques. Python is a popular choice for this task because its natural language processing and machine learning libraries, such as NLTK, Scikit-learn and TensorFlow, make text classification straightforward to implement.
This article will introduce examples of Python text classification and demonstrate how to use Python for text classification through examples.
Before classifying text, the data must be collected, cleaned and preprocessed. Here we use a dataset from a sentiment analysis task as an example. It contains movie reviews in two categories, positive and negative. The dataset comes from the movie review website IMDb and can be downloaded at http://ai.stanford.edu/~amaas/data/sentiment/.
Each review in the dataset is a separate text file. Its label is given by the subdirectory that holds the file, pos or neg; the file name itself only encodes an ID and a rating. We can use Python's os library to read the files, and then store the text and labels in a Pandas DataFrame to facilitate subsequent processing.

    import os
    import pandas as pd

    # Read every review under folder/pos and folder/neg
    def read_data(folder):
        data = {'text': [], 'sentiment': []}
        for label in ('pos', 'neg'):
            label_folder = os.path.join(folder, label)
            for file in sorted(os.listdir(label_folder)):
                # The reviews are UTF-8 text files
                with open(os.path.join(label_folder, file), 'r', encoding='utf-8') as f:
                    data['text'].append(f.read())
                # The label is the name of the subdirectory
                data['sentiment'].append(label)
        return pd.DataFrame.from_dict(data)

    # Load the dataset
    train_folder = 'aclImdb/train'
    test_folder = 'aclImdb/test'
    train_data = read_data(train_folder)
    test_data = read_data(test_folder)
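The loading step can be tried without downloading the full archive. The following is a minimal sketch, assuming the layout the downloaded archive uses, with reviews stored under pos and neg subdirectories. The function name read_reviews, the temporary directory and the three sample reviews are made up for illustration and are not part of the dataset.

```python
import tempfile
from pathlib import Path


def read_reviews(split_dir, labels=("pos", "neg")):
    """Return (texts, sentiments) for every .txt file under split_dir/<label>/."""
    texts, sentiments = [], []
    for label in labels:
        # sorted() keeps the order reproducible across runs and platforms
        for path in sorted(Path(split_dir, label).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            sentiments.append(label)
    return texts, sentiments


# Build a tiny stand-in for a split directory (hypothetical sample reviews).
demo_root = Path(tempfile.mkdtemp())
samples = {
    "pos": ["A wonderful film.", "Loved every minute."],
    "neg": ["A dull, lifeless film."],
}
for label, reviews in samples.items():
    (demo_root / label).mkdir()
    for i, review in enumerate(reviews):
        (demo_root / label / f"{i}_7.txt").write_text(review, encoding="utf-8")

texts, sentiments = read_reviews(demo_root)
print(sentiments)  # ['pos', 'pos', 'neg']
```

Restricting the glob to *.txt inside the two label directories skips any other files that sit next to the reviews.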
Next, we can compute the length of each text and use Pandas's groupby method to find the proportion of each sentiment label in the dataset.
    # Compute the length of each text
    train_data['text_len'] = train_data['text'].apply(len)
    test_data['text_len'] = test_data['text'].apply(len)

    # Compute the proportion of each sentiment label
    train_sentiment_pct = train_data.groupby('sentiment').size() / len(train_data)
    test_sentiment_pct = test_data.groupby('sentiment').size() / len(test_data)
    print('Train Sentiment Distribution:\n{}\n'.format(train_sentiment_pct))
    print('Test Sentiment Distribution:\n{}\n'.format(test_sentiment_pct))
Running the above code shows that the numbers of positive and negative reviews are roughly the same in both the training set and the test set, so the sentiment labels are evenly distributed.
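The same kind of summary can be checked on a small hand-made table. This is only a sketch: the four rows below are invented for illustration, and value_counts(normalize=True) is used as an alternative to dividing the group sizes by the total.

```python
import pandas as pd

# Hypothetical toy data standing in for the review DataFrame
toy = pd.DataFrame({
    "text": ["great", "terrible movie", "fine", "awful"],
    "sentiment": ["pos", "neg", "pos", "neg"],
})

# Character length of each text
toy["text_len"] = toy["text"].str.len()

# Share of each label, and mean text length per label
label_share = toy["sentiment"].value_counts(normalize=True).sort_index()
mean_len = toy.groupby("sentiment")["text_len"].mean()

print(label_share.to_dict())  # {'neg': 0.5, 'pos': 0.5}
print(mean_len.to_dict())     # {'neg': 9.5, 'pos': 4.5}
```

Grouping the length column by label is a quick way to see whether one class tends to have longer texts than the other.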
Before text classification, the text needs to be converted into a form that the computer can understand. Here we will use the bag-of-words model for feature extraction.
The bag-of-words model rests on a simplifying assumption: word order is ignored and every word in the text is treated as equally important. All the words that occur in the texts are collected into a vocabulary, and each text is then represented as a vector in which each element is the number of times the corresponding word appears in that text.
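To make the representation concrete, here is a minimal hand-rolled sketch using only the standard library. The two sentences and the helper name to_count_vector are made up for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

# Two hypothetical documents
docs = ["the movie was good", "the movie was bad bad"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]
print(vocabulary)  # ['bad', 'good', 'movie', 'the', 'was']
print(vectors)     # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]
```

Both vectors have one entry per vocabulary word, so documents of different lengths end up with the same dimension.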
In Scikit-learn, you can use CountVectorizer for feature extraction.
    from sklearn.feature_extraction.text import CountVectorizer

    # Create the CountVectorizer, dropping common English stop words
    vectorizer = CountVectorizer(stop_words='english')

    # Learn the vocabulary on the training set, then convert both sets to vectors
    train_features = vectorizer.fit_transform(train_data['text'])
    test_features = vectorizer.transform(test_data['text'])

    # Print the feature dimensions
    print('Train Feature Dimension: {}'.format(train_features.shape))
    print('Test Feature Dimension: {}'.format(test_features.shape))
The above code converts the texts into vectors. Each text becomes a sparse vector whose dimension equals the size of the vocabulary. The printed shapes show how many features that is; the run described here reported a total of about 250,000 features, which is a very high dimension.
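A toy corpus shows what the vectorizer produces. This is a sketch with made-up sentences; it uses CountVectorizer's default settings (no stop-word list), under which text is lowercased and single-character tokens such as "a" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature training and test sets
train_texts = ["A good movie, a really good one", "A bad movie"]
test_texts = ["An unexpectedly good ending"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)        # reuses that vocabulary

# List the vocabulary terms in column order
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)                            # ['bad', 'good', 'movie', 'one', 'really']
print(train_matrix.shape)               # (2, 5)
print(train_matrix.toarray().tolist())  # [[0, 2, 1, 1, 1], [1, 0, 1, 0, 0]]
print(test_matrix.toarray().tolist())   # [[0, 1, 0, 0, 0]]
```

Words in the test text that were never seen during fitting ("unexpectedly", "ending") are ignored, which is why the test set must go through transform rather than fit_transform.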
Next, we train and evaluate several classifiers from Scikit-learn. Here we use a logistic regression classifier, a naive Bayes classifier, a support vector machine classifier and a random forest classifier to see which one performs best.
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, accuracy_score

    # Train one classifier and report its scores
    def train_and_evaluate(classifier, train_features, train_labels, test_features, test_labels):
        # Train the classifier
        classifier.fit(train_features, train_labels)

        # Compute the F1 score and accuracy on the training and test sets
        train_predictions = classifier.predict(train_features)
        test_predictions = classifier.predict(test_features)
        train_f1 = f1_score(train_labels, train_predictions, pos_label='pos')
        test_f1 = f1_score(test_labels, test_predictions, pos_label='pos')
        train_accuracy = accuracy_score(train_labels, train_predictions)
        test_accuracy = accuracy_score(test_labels, test_predictions)

        # Print the evaluation results
        print('Train F1 Score: {0:.3f}'.format(train_f1))
        print('Test F1 Score: {0:.3f}'.format(test_f1))
        print('Train Accuracy: {0:.3f}'.format(train_accuracy))
        print('Test Accuracy: {0:.3f}'.format(test_accuracy))

    # Train and evaluate each classifier
    classifiers = [
        ('Logistic Regression', LogisticRegression(max_iter=1000)),
        ('Multinomial Naive Bayes', MultinomialNB()),
        ('Support Vector Machine', SVC(kernel='linear')),
        ('Random Forest', RandomForestClassifier(n_estimators=100))
    ]
    for classifier_name, classifier in classifiers:
        print('\n{}'.format(classifier_name))
        train_and_evaluate(classifier, train_features, train_data['sentiment'], test_features, test_data['sentiment'])
The above code trains each classifier on the training set and evaluates it on both the training set and the test set. In the run described here, the naive Bayes classifier performed very well on both sets, achieving an F1 score of 0.87 and an accuracy of 0.85. The other classifiers scored slightly lower but still performed well.
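To see how the two reported metrics relate, they can be computed by hand from a list of predictions. This is an illustrative sketch: the helper binary_metrics and the six labels are made up, and the formulas are the standard definitions of accuracy, precision, recall and F1 for a chosen positive label.

```python
def binary_metrics(y_true, y_pred, pos_label="pos"):
    """Compute accuracy, precision, recall and F1 for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels and predictions
y_true = ["pos", "pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "pos", "pos"]

metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.5, 'precision': 0.6, 'recall': 0.75, 'f1': 0.667}
```

In this toy case the F1 score is higher than the accuracy, because F1 only looks at how the positive class is handled while accuracy counts every prediction.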
This article introduces examples of text classification in Python, including data collection and preprocessing, feature extraction, and model training and evaluation. Through examples, we learned how to use Python for text classification, and learned about text classification algorithms based on logistic regression, naive Bayes, support vector machines, and random forests.
In real applications, we may need more in-depth processing and analysis of the text data, such as stop-word removal, stemming and word vector representations, to improve classification performance. We can also try deep learning models, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), for text classification.