Detailed explanation of Python text feature extraction and vectorization algorithm learning examples

小云云

Dec 23, 2017 pm 05:05 PM

Suppose we have just watched Nolan's blockbuster "Interstellar". How can a machine automatically determine whether the audience's reviews of the movie are "positive" or "negative"? This is a sentiment analysis problem, and the first step in solving it is to convert text into features. This article explains text feature extraction and vectorization in Python in detail.

Therefore, in this article we cover only that first step: how to extract features from text and vectorize them.

Because processing Chinese text also involves word segmentation, this article uses a simple English example to show how to extract features with Python's machine learning library, sklearn.

1. Data preparation

Python's sklearn.datasets can read all categorized texts from a directory. However, the directory must follow one rule: one folder per label, with the folder name used as the label name. For example, the data set used in this article has 2 labels, "neg" and "pos", and there are 6 text files in each folder. The directory looks like this:

neg
  1.txt
  2.txt
  ...
pos
  1.txt
  2.txt
  ...
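
To make the "one folder per label" rule concrete, here is a minimal sketch that reads such a directory using only the standard library. The function name load_labeled_dir and its return shape are assumptions made for illustration; this is not how sklearn implements its loader, only a small imitation of the layout rule.

```python
import tempfile
from pathlib import Path


def load_labeled_dir(root):
    """Read every .txt file under root/<label>/ and return (texts, labels, label_names)."""
    root = Path(root)
    # Each sub-folder name is a label; sorting makes the numbering deterministic.
    label_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for index, name in enumerate(label_names):
        for path in sorted((root / name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="ignore"))
            labels.append(index)
    return texts, labels, label_names


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for label, text in [("neg", "waste of time."), ("pos", "worth it!")]:
        folder = Path(tmp) / label
        folder.mkdir()
        (folder / "1.txt").write_text(text, encoding="utf-8")
    texts, labels, label_names = load_labeled_dir(tmp)

print(label_names)               # ['neg', 'pos']
print(list(zip(texts, labels)))  # [('waste of time.', 0), ('worth it!', 1)]
```

In this sketch the numeric label is simply the position of the folder name in the sorted list, so "neg" becomes 0 and "pos" becomes 1.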

The contents of the 12 files are summarized as follows:

neg: 
  shit. 
  waste my money. 
  waste of money. 
  sb movie. 
  waste of time. 
  a shit movie. 
pos: 
  nb! nb movie! 
  nb! 
  worth my money. 
  I love this movie! 
  a nb movie. 
  worth it!

2. Text features

How can we extract sentiment from these English texts and classify them?

The most intuitive way is to extract words. It is generally believed that many keywords reflect the speaker's attitude. For example, in the simple data set above, it is easy to see that any review containing "shit" belongs to the neg category.

Of course, the data set above was deliberately kept simple for ease of explanation. In reality, the attitude a word expresses is often ambiguous. But there is still reason to believe that the more often a word appears in the neg category, the more likely it is to express a negative attitude.

We also notice that some words are meaningless for sentiment classification, for example "of" and "I" in the data above. Words of this type are called stop words. They can be ignored entirely and left out of the counts. Ignoring them reduces the storage needed for the word frequency records and speeds up construction.

Using the raw frequency of each word as a feature has another problem. For example, "movie" appears 5 times in the 12 samples above, but it occurs almost equally often in positive and negative reviews (3 and 2 times), so it does not help tell the classes apart. "worth", on the other hand, appears only twice, but only in the pos category. It clearly carries a strong sentiment, that is, it is highly discriminative.
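
The counts mentioned above can be checked with a short sketch. The tiny STOP_WORDS set and the regular-expression tokenizer below are simplifications assumed for illustration; they are not sklearn's built-in stop word list or tokenizer.

```python
import re
from collections import Counter

neg = ["shit.", "waste my money.", "waste of money.", "sb movie.",
       "waste of time.", "a shit movie."]
pos = ["nb! nb movie!", "nb!", "worth my money.", "I love this movie!",
       "a nb movie.", "worth it!"]

# Hypothetical, hand-picked stop words covering only this toy data set.
STOP_WORDS = {"a", "i", "it", "my", "of", "this"}


def class_counts(docs):
    """Count how often each non-stop word occurs across all documents of one class."""
    counts = Counter()
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


neg_counts = class_counts(neg)
pos_counts = class_counts(pos)
for word in ("movie", "worth", "shit"):
    print(word, neg_counts[word], pos_counts[word])
# movie 2 3
# worth 0 2
# shit 2 0
```

"movie" is spread across both classes, while "worth" and "shit" each occur in only one, which is the distinction that raw frequency alone fails to capture.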

Therefore, we need to introduce TF-IDF (Term Frequency-Inverse Document Frequency) to weigh each word more carefully.

TF (Term Frequency) is simple to calculate: for a document d, it is the number of times Nt that a word t appears in d, divided by the total number of words in d. For example, in the document "I love this movie", the TF of the word "love" is 1/4. If you remove the stop words "I" and "this", it is 1/2.

IDF (Inverse Document Frequency) is defined for a word t as the total number of documents D divided by the number of documents Dt in which the word appears, followed by the natural logarithm: IDF(t) = ln(D / Dt). For example, the word "movie" appears in 5 documents and the total number of documents is 12, so its IDF is ln(12/5).
The purpose of IDF is to highlight words that appear rarely but carry strong sentiment. For example, the IDF of a word like "movie" is ln(12/5) = 0.88, which is much smaller than the IDF of "love", ln(12/1) = 2.48.

TF-IDF is simply the product of the two. Computing the TF-IDF of each word in each document gives the text feature values we extract.
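
The textbook definitions above can be written out in a few lines. This is a minimal sketch of those formulas only; the function names and the simple tokenizer are assumptions for illustration, and sklearn's TfidfVectorizer uses a somewhat different weighting by default.

```python
import math
import re

docs = [
    "shit.", "waste my money.", "waste of money.", "sb movie.",
    "waste of time.", "a shit movie.",
    "nb! nb movie!", "nb!", "worth my money.", "I love this movie!",
    "a nb movie.", "worth it!",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def tf(term, tokens):
    """Term frequency: occurrences of term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, tokenized_docs):
    """Inverse document frequency: ln(D / Dt)."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / containing)


tokenized = [tokenize(d) for d in docs]
love_doc = tokenize("I love this movie!")

print(tf("love", love_doc))               # 0.25
print(round(idf("movie", tokenized), 2))  # 0.88
print(round(idf("love", tokenized), 2))   # 2.48
print(round(tf("love", love_doc) * idf("love", tokenized), 2))  # 0.62
```

The last line is the TF-IDF of "love" in "I love this movie!" before any stop words are removed.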

3. Vectorization

With the above foundation, the document can be vectorized. Let’s look at the code first, and then analyze the meaning of vectorization:

# -*- coding: utf-8 -*-
from sklearn.datasets import load_files
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the data set and split it: 70% for training, 30% for testing
movie_reviews = load_files('endata')
doc_terms_train, doc_terms_test, y_train, y_test = train_test_split(
    movie_reviews.data, movie_reviews.target, test_size=0.3)

# Vector space model with TF-IDF weights (binary=False keeps the real term counts).
# Note: the test samples go through transform(), not fit_transform()
count_vec = TfidfVectorizer(binary=False, decode_error='ignore',
                            stop_words='english')
x_train = count_vec.fit_transform(doc_terms_train)
x_test = count_vec.transform(doc_terms_test)
x = count_vec.transform(movie_reviews.data)
y = movie_reviews.target
print(doc_terms_train)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_vec.get_feature_names_out().tolist())
print(x_train.toarray())
print(movie_reviews.target)

The output is as follows:
[b'waste of time.', b'a shit movie.', b'a nb movie.', b'I love this movie!', b'shit.', b'worth my money.', b'sb movie.', b'worth it!']
['love', 'money', 'movie', 'nb', 'sb', 'shit', 'time', 'waste', 'worth']
[[ 0.          0.          0.          0.          0.          0.          0.70710678  0.70710678  0.        ]
 [ 0.          0.          0.60335753  0.          0.          0.79747081  0.          0.          0.        ]
 [ 0.          0.          0.53550237  0.84453372  0.          0.          0.          0.          0.        ]
 [ 0.84453372  0.          0.53550237  0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          1.          0.          0.          0.        ]
 [ 0.          0.76642984  0.          0.          0.          0.          0.          0.          0.64232803]
 [ 0.          0.          0.53550237  0.          0.84453372  0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.          0.          1.        ]]
[1 1 0 1 0 1 0 1 1 0 0 0]

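Python's raw output is hard to read, so I have arranged it into a table (values rounded to three decimals):

document             love   money  movie  nb     sb     shit   time   waste  worth
waste of time.       0      0      0      0      0      0      0.707  0.707  0
a shit movie.        0      0      0.603  0      0      0.797  0      0      0
a nb movie.          0      0      0.536  0.845  0      0      0      0      0
I love this movie!   0.845  0      0.536  0      0      0      0      0      0
shit.                0      0      0      0      0      1      0      0      0
worth my money.      0      0.766  0      0      0      0      0      0      0.642
sb movie.            0      0      0.536  0      0.845  0      0      0      0
worth it!            0      0      0      0      0      0      0      0      1

From the table above, we can observe the following points: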

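1. Stop word filtering.

When initializing count_vec, we passed stop_words = 'english' to its constructor, which means the default English stop word list is used. You can call count_vec.get_stop_words() to see all the stop words built into TfidfVectorizer. You can also pass your own list of stop words here (for example, one that includes "movie").

2. TF-IDF calculation.

The weights here are computed by sklearn's TfidfVectorizer. This class inherits from CountVectorizer and adds features such as TF-IDF weighting on top of the latter's basic word counting.
You will notice that the results differ from what we calculated earlier. With its default settings, TfidfVectorizer uses the raw term count as TF, applies a smoothed IDF, ln((1 + D) / (1 + Dt)) + 1, and then L2-normalizes each document vector (norm='l2'), which keeps all values within [0, 1]. The default max_df=1.0 only controls the filtering of overly frequent words and plays no part in the normalization.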

3. The result of count_vec.fit_transform is a large matrix. As the table above shows, most of its entries are 0, so sklearn stores it internally as a sparse matrix. The data in this example is small. If you are interested, you can try real data used by machine learning researchers, from Cornell University: http://www.cs.cornell.edu/people/pabo/movie-review-data/. This website provides many data sets, including several of about 2M in size, with about 700 positive and negative examples. Data of this scale is not large and can still be processed within 1 minute. I suggest you give it a try. However, be aware that these data sets may contain invalid characters. That is why decode_error = 'ignore' is passed when constructing count_vec, so that such characters are ignored.

The table above holds the result of training: 9 features for each of the 8 training samples. This result can then be fed to various classification algorithms.
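
To see where the numbers in the matrix come from, the sketch below recomputes the row for "a shit movie." by hand. It assumes TfidfVectorizer's default settings (raw term counts as TF, the smoothed IDF ln((1 + D) / (1 + Dt)) + 1, then L2 normalization). The helper names are illustrative and are not part of sklearn.

```python
import math

# The 8 training documents from the output above, with stop words already removed.
train_docs = [
    ["waste", "time"],
    ["shit", "movie"],
    ["nb", "movie"],
    ["love", "movie"],
    ["shit"],
    ["worth", "money"],
    ["sb", "movie"],
    ["worth"],
]


def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + D) / (1 + Dt)) + 1."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + containing)) + 1


def tfidf_row(doc, docs):
    """Raw count times smoothed IDF, then L2-normalize the document vector."""
    weights = {t: doc.count(t) * smoothed_idf(t, docs) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


row = tfidf_row(["shit", "movie"], train_docs)
print(round(row["movie"], 4), round(row["shit"], 4))  # 0.6034 0.7975
```

The two values match the second row of the printed matrix. "movie" gets the lower weight because it occurs in 4 of the 8 training documents, while "shit" occurs in only 2.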
