
How to use CountVectorizer in Python's sklearn?

2023-05-07 23:58:06


See the official CountVectorizer documentation for the full reference.

CountVectorizer converts a collection of text documents into a matrix of token counts.

If you do not provide an a priori dictionary and do not use an analyzer that performs some kind of feature selection, the number of features equals the size of the vocabulary found by analyzing the data.
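The relationship between vocabulary and feature count can be sketched as follows. The toy English corpus and the two-word vocabulary are assumptions made up for illustration; they are not from the original article.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat down"]  # illustrative toy corpus

# No vocabulary supplied: the features are the distinct tokens found in the data.
learned = CountVectorizer().fit(docs)
n_learned = len(learned.vocabulary_)
print(n_learned)  # 5 distinct tokens: cat, dog, down, sat, the

# A priori vocabulary: the feature set is fixed before any data is seen.
fixed = CountVectorizer(vocabulary=["cat", "dog"])
X_fixed = fixed.fit_transform(docs)
print(X_fixed.shape)  # (2, 2): two documents, two predefined features
```

With a fixed vocabulary, tokens outside it (such as "sat") are simply not counted.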

Data preprocessing

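There are two ways to prepare Chinese text: 1. pass it to the vectorizer directly, without word segmentation; 2. segment it into words first (here with jieba) and join the words with spaces.

The two methods produce very different vocabularies, because CountVectorizer's default tokenizer splits on whitespace and punctuation and only keeps tokens of two or more characters.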

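import re
import jieba
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus (the source listing is truncated after the first sentence)
text = ['很少在公众场合手机外放']
# Keep only runs of Chinese characters, joined by spaces
text = [' '.join(re.findall('[\u4e00-\u9fa5]+', tt, re.S)) for tt in text]
# Method 2 only: segment each sentence with jieba and join the words with spaces
text = [' '.join(jieba.lcut(tt)) for tt in text]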
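To see why the two methods give different vocabularies, here is a minimal sketch that avoids the jieba dependency. The segmented string is split by hand as a stand-in for jieba output; that split is an assumption, and jieba may segment the sentence differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw = ["很少在公众场合手机外放"]
# Hand-segmented stand-in for jieba output (assumption, not real jieba output).
segmented = ["很少 在 公众 场合 手机 外放"]

# Fit one vectorizer per variant and collect the learned terms.
vocab_raw = sorted(CountVectorizer().fit(raw).vocabulary_)
vocab_seg = sorted(CountVectorizer().fit(segmented).vocabulary_)

print(vocab_raw)  # the whole unsegmented sentence is a single token
print(vocab_seg)  # one entry per word of two or more characters
```

The single-character word 在 disappears from the segmented vocabulary because the default token_pattern only matches tokens with at least two word characters.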


Build the model

Train the model

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)
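As a rough illustration of what fit_transform returns, the sketch below uses a made-up English corpus (an assumption, chosen so the output is easy to read). It shows that the result is a SciPy sparse matrix and that a fitted vectorizer can transform new text.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["apple banana apple", "banana cherry"]  # illustrative corpus

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and count in one step

print(sparse.issparse(X))  # the result is a sparse matrix, not a NumPy array
print(X.shape)             # (number of documents, number of features)

# Words not seen during fitting ("durian") are ignored at transform time.
X_new = vectorizer.transform(["apple durian"])
print(X_new.toarray())
```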

All words: vectorizer.get_feature_names_out() (named get_feature_names() before scikit-learn 1.0; the old name was removed in 1.2)

feature_names = vectorizer.get_feature_names_out()
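Because the method name differs between scikit-learn releases, code that must run on several versions could use a small wrapper like the one below. The helper name list_feature_names is hypothetical, and the corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

def list_feature_names(vectorizer):
    """Hypothetical helper: return feature names as plain strings on old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        names = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
    else:
        names = vectorizer.get_feature_names()      # older releases
    return [str(name) for name in names]

vectorizer = CountVectorizer().fit(["red green", "green blue"])
names = list_feature_names(vectorizer)
print(names)  # terms in column order, sorted alphabetically
```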

Vocabulary generated without word segmentation


Vocabulary generated after word segmentation


Count matrix: X.toarray() (converts the sparse result into a dense NumPy array)

matrix = X.toarray()


import pandas as pd  # needed for the DataFrame view of the count matrix
df = pd.DataFrame(matrix, columns=feature_names)


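Vocabulary index: vectorizer.vocabulary_ (a dict mapping each term to its column index in the count matrix)

vocabulary = vectorizer.vocabulary_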
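A minimal sketch of how the vocabulary index can be used to look up one term's column in the count matrix. The corpus and the chosen term are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["apple banana apple", "banana cherry"]  # illustrative corpus

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_   # dict: term -> column index
col = int(vocab["banana"])       # column that holds the counts for "banana"
banana_counts = X.toarray()[:, col].tolist()

print(col)            # position of "banana" among the sorted terms
print(banana_counts)  # occurrences of "banana" in each document
```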




This article is reproduced at:yisu.com. If there is any infringement, please contact admin@php.cn delete