Text classification examples in Python-Python Tutorial-php.cn

PHPz

Jun 09, 2023 pm 08:22 PM

Text Classification Examples in Python

As artificial intelligence and natural language processing have matured, text classification has become one of the most widely used techniques in the field and a building block for many natural language processing tasks. Python is a popular choice for this work: its natural language processing and machine learning libraries, such as NLTK, Scikit-learn and TensorFlow, make text classification straightforward to implement.

This article walks through a complete example of text classification in Python, from loading the data to training and evaluating several classifiers.

Data collection and preprocessing

Before text classification, data needs to be collected, cleaned and preprocessed. Here we will use a dataset from a sentiment analysis task as an example. This dataset contains two categories of movie reviews, representing positive and negative sentiments respectively. The data set comes from the movie review website IMDb and can be downloaded at http://ai.stanford.edu/~amaas/data/sentiment/.

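Each review in the dataset is a separate text file. Within aclImdb/train and aclImdb/test, positive reviews are stored in a pos subfolder and negative reviews in a neg subfolder, so the label comes from the subfolder name (the file names themselves look like 0_9.txt, that is, an ID and a rating). We can use Python's os library to read the files, and then store the text and labels in a Pandas DataFrame to facilitate subsequent processing.

import os
import pandas as pd

# Read every review under folder/pos and folder/neg into a DataFrame.
# The sentiment label is taken from the subfolder name.
def read_data(folder):
    data = {'text': [], 'sentiment': []}
    for label in ('pos', 'neg'):
        label_folder = os.path.join(folder, label)
        for file in sorted(os.listdir(label_folder)):
            with open(os.path.join(label_folder, file), 'r', encoding='utf-8') as f:
                data['text'].append(f.read())
                data['sentiment'].append(label)
    return pd.DataFrame.from_dict(data)

# Load the training and test sets
train_folder = 'aclImdb/train'
test_folder = 'aclImdb/test'
train_data = read_data(train_folder)
test_data = read_data(test_folder)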

Next, we compute the length of each review and use Pandas's groupby method to find the proportion of each sentiment label in the dataset.

# Compute the length (in characters) of each review
train_data['text_len'] = train_data['text'].apply(len)
test_data['text_len'] = test_data['text'].apply(len)

# Compute the proportion of each sentiment label
train_sentiment_pct = train_data.groupby('sentiment').size() / len(train_data)
test_sentiment_pct = test_data.groupby('sentiment').size() / len(test_data)
print('Train Sentiment Distribution:\n{}\n'.format(train_sentiment_pct))
print('Test Sentiment Distribution:\n{}\n'.format(test_sentiment_pct))

Running the above code shows that the dataset contains about the same number of positive and negative reviews, so the sentiment labels are evenly distributed.
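
As a quick illustration of the same statistics on a small scale, here is a minimal sketch using a hypothetical four-row DataFrame (the rows are made up, not taken from IMDb). It uses value_counts(normalize=True), which gives the same label proportions as dividing the groupby sizes by the total, and also shows how groupby can summarize text length per label.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews.
toy = pd.DataFrame({
    'text': ['loved it', 'hated it', 'great fun', 'so dull'],
    'sentiment': ['pos', 'neg', 'pos', 'neg'],
})

# Share of each label; equivalent to groupby('sentiment').size() / len(toy).
label_share = toy['sentiment'].value_counts(normalize=True)

# Character length of each text, then the mean length per label.
toy['text_len'] = toy['text'].str.len()
mean_len_by_label = toy.groupby('sentiment')['text_len'].mean()

print(label_share)
print(mean_len_by_label)
```

On a balanced dataset, both labels should come out at a share of 0.5.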

Feature extraction

Before text classification, the text needs to be converted into a form that the computer can understand. Here we will use the bag-of-words model for feature extraction.

The bag-of-words model rests on a simplifying assumption: every word in a text is treated as equally important, and word order is ignored. All the words in the corpus are collected to form a vocabulary, and each text is then represented as a vector in which each element is the number of times the corresponding vocabulary word appears in that text.
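
To make the idea concrete, the following is a minimal sketch of a bag-of-words encoding written by hand with the standard library only. The two sentences are hypothetical, and splitting on whitespace is an assumption made for brevity; real tokenizers are more careful.

```python
from collections import Counter

# Toy corpus (hypothetical sentences, not taken from the IMDb data).
docs = [
    "good movie good plot",
    "bad movie",
]

# Step 1: build the vocabulary from every word seen in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Step 2: represent a document as a vector of word counts,
# with one element per vocabulary entry.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(vectors)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Note that "good movie good plot" and "plot good movie good" would produce the same vector, which is exactly the word-order information the model discards.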

In Scikit-learn, you can use CountVectorizer for feature extraction.

from sklearn.feature_extraction.text import CountVectorizer

# Create the CountVectorizer, dropping common English stop words
vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary from the training text only, then convert both sets to count vectors
train_features = vectorizer.fit_transform(train_data['text'])
test_features = vectorizer.transform(test_data['text'])

# Print the feature dimensions as (number of texts, vocabulary size)
print('Train Feature Dimension: {}'.format(train_features.shape))
print('Test  Feature Dimension: {}'.format(test_features.shape))

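The above code converts each text into a sparse vector whose dimension equals the size of the vocabulary learned from the training set. The printed shapes show how large that vocabulary is: in our run it came to about 250,000 features, although the exact number depends on the vectorizer settings and on the data that was read. Either way, the feature space is very high-dimensional.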
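
The sketch below shows what the vectorizer produces on a tiny scale. The two texts are hypothetical, and the default CountVectorizer settings are used here (no stop-word list) so that the result is easy to check by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts, small enough to verify manually.
toy_texts = [
    "good plot good acting",
    "bad plot",
]

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each term to its column index (terms are sorted alphabetically).
terms = sorted(toy_vectorizer.vocabulary_)
print(terms)                   # ['acting', 'bad', 'good', 'plot']

# The result is a SciPy sparse matrix: only non-zero counts are stored.
print(toy_features.shape)      # (2, 4)
print(toy_features.nnz)        # 5 stored entries out of 8 cells
print(toy_features.toarray())  # dense view, only sensible for small data
```

With the full review dataset most cells are zero, which is why the sparse format matters: calling toarray() on the real feature matrix could use a very large amount of memory.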

Model training and evaluation

Next, we train and evaluate several Scikit-learn classifiers on these features: logistic regression, naive Bayes, a support vector machine and a random forest, to see which one performs best.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score

# Train a classifier, then report F1 score and accuracy on both sets
def train_and_evaluate(classifier, train_features, train_labels, test_features, test_labels):
    # Fit the classifier on the training data
    classifier.fit(train_features, train_labels)

    # Compute F1 score and accuracy on the training set and the test set
    train_predictions = classifier.predict(train_features)
    test_predictions = classifier.predict(test_features)
    train_f1 = f1_score(train_labels, train_predictions, pos_label='pos')
    test_f1 = f1_score(test_labels, test_predictions, pos_label='pos')
    train_accuracy = accuracy_score(train_labels, train_predictions)
    test_accuracy = accuracy_score(test_labels, test_predictions)

    # Print the evaluation results
    print('Train F1 Score: {0:.3f}'.format(train_f1))
    print('Test  F1 Score: {0:.3f}'.format(test_f1))
    print('Train Accuracy: {0:.3f}'.format(train_accuracy))
    print('Test  Accuracy: {0:.3f}'.format(test_accuracy))

# Train and evaluate each classifier in turn
classifiers = [
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
    ('Multinomial Naive Bayes', MultinomialNB()),
    # Note: SVC can be very slow on data of this size;
    # sklearn.svm.LinearSVC is usually a faster choice for sparse text features.
    ('Support Vector Machine', SVC(kernel='linear')),
    ('Random Forest', RandomForestClassifier(n_estimators=100))
]
for classifier_name, classifier in classifiers:
    print('\n{}'.format(classifier_name))
    train_and_evaluate(classifier, train_features, train_data['sentiment'], test_features, test_data['sentiment'])
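
To show what the two metrics in the evaluation function above measure, here is a small worked sketch on hypothetical labels and predictions (five made-up examples, not model output). It computes accuracy and F1 with Scikit-learn and then repeats the F1 calculation by hand from precision and recall.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true labels and predictions for five reviews.
y_true = ['pos', 'pos', 'neg', 'neg', 'pos']
y_pred = ['pos', 'neg', 'neg', 'pos', 'pos']

# Accuracy: fraction of predictions that match the true label (3 of 5).
accuracy = accuracy_score(y_true, y_pred)

# F1 for the 'pos' class: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred, pos_label='pos')

# The same F1, computed by hand.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 'pos' and p == 'pos' for t, p in pairs)  # true positives: 2
fp = sum(t == 'neg' and p == 'pos' for t, p in pairs)  # false positives: 1
fn = sum(t == 'pos' and p == 'neg' for t, p in pairs)  # false negatives: 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

print('Accuracy: {0:.3f}'.format(accuracy))  # Accuracy: 0.600
print('F1 score: {0:.3f}'.format(f1))        # F1 score: 0.667
```

Because the metrics weigh errors differently, the two numbers generally do not coincide, which is why the evaluation reports both.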

The above code trains each classifier on the training set and then evaluates it on both the training set and the test set. In our run, the naive Bayes classifier performed very well on both sets, achieving an F1 score of 0.87 and an accuracy of 0.85. The other classifiers scored slightly lower but still performed well.
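
Once a classifier has been chosen, the vectorizer and the model can be bundled so that raw text goes in and a label comes out. The following is a minimal sketch using Scikit-learn's make_pipeline; the six training sentences and the two new sentences are hypothetical, and naive Bayes is used only because it did well in the comparison above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set (the real one has thousands of reviews).
texts = [
    "a great and moving film",
    "wonderful acting and a great story",
    "I loved this movie",
    "a dull and boring film",
    "terrible acting and a weak story",
    "I hated this movie",
]
labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

# The pipeline vectorizes the text and then classifies it in one step.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Classify unseen text without calling the vectorizer by hand.
new_texts = ["great story, wonderful", "boring and terrible"]
predictions = model.predict(new_texts)
print(list(predictions))  # ['pos', 'neg']
```

Keeping both steps in one object also ensures that new text is always transformed with the vocabulary learned during training.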

Conclusion

This article walked through an example of text classification in Python, covering data collection and preprocessing, feature extraction, and model training and evaluation. Along the way we compared classifiers based on logistic regression, naive Bayes, support vector machines and random forests.

In practice, text data often needs more thorough processing and analysis to improve classification performance, for example more careful stop-word removal than the built-in list used here, stemming, or word vector representations. You can also try deep learning models for text classification, such as convolutional neural networks (CNN) and recurrent neural networks (RNN).
