Table of Contents
1 Word frequency statistics
1.1 Simple word frequency statistics
1.2 Add stop words
2 Keyword Extraction
2.1 Keyword Extraction Principle
2.2 Keyword extraction code

How to use Jieba for word frequency statistics and keyword extraction in Python

May 02, 2023 pm 07:46 PM
python jieba

1 Word frequency statistics

1.1 Simple word frequency statistics

1. Import the jieba library and define the text

import jieba
text = "Python是一种高级编程语言,广泛应用于人工智能、数据分析、Web开发等领域。"

2. Segment the text

words = jieba.cut(text)

This step splits the text into words and returns a generator, words, which you can iterate over with a for loop. Because it is a generator, it can be consumed only once; call jieba.lcut(text) instead if you need a list.
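A generator can be iterated only once, which matters if the segmentation result is needed in more than one place. The sketch below illustrates this with fake_cut, a hypothetical stand-in that simply splits on whitespace; it is not jieba's segmentation.

```python
def fake_cut(text):
    """Hypothetical stand-in for jieba.cut(): lazily yields whitespace-separated tokens."""
    yield from text.split()


words = fake_cut("Python is a programming language")
first_pass = list(words)   # consumes the generator
second_pass = list(words)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

Converting the result to a list once and reusing that list avoids the empty second pass.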

3. Count word frequency

word_count = {}
for word in words:
    if len(word) > 1:
        word_count[word] = word_count.get(word, 0) + 1

This step iterates over all the words, counts how many times each one appears, and stores the counts in the dictionary word_count. Word frequency counts can be improved by removing stop words; here we simply drop words shorter than two characters.

4. Result output

for word, count in word_count.items():
    print(word, count)
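The same counting can be written more compactly with collections.Counter, which also makes it easy to list the most frequent words first. This is a minimal sketch: the tokens list is a hand-written stand-in for what jieba.cut() might yield, and the real segmentation may differ.

```python
from collections import Counter

# Hand-written stand-in for segmented tokens; not actual jieba output.
tokens = ["Python", "是", "一种", "高级", "编程", "语言", ",",
          "Python", "广泛", "应用", "于", "数据", "分析", "、", "数据", "开发"]

# Keep only tokens longer than one character, as in the loop above.
word_count = Counter(t for t in tokens if len(t) > 1)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = word_count.most_common(3)
for word, count in top_words:
    print(word, count)
```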


1.2 Add stop words

To count word frequencies more accurately, we can add a stop word list that removes common but uninformative words. The steps are as follows:

Define the stop word list

import jieba

# Stop word list
stopwords = ['是', '一种', '等']

Segment the text and filter the stop words

text = "Python是一种高级编程语言,广泛应用于人工智能、数据分析、Web开发等领域。"
words = jieba.cut(text)
words_filtered = [word for word in words if word not in stopwords and len(word) > 1]

Count word frequencies and output the results

word_count = {}
for word in words_filtered:
    word_count[word] = word_count.get(word, 0) + 1
for word, count in word_count.items():
    print(word, count)

After adding stop words, the output result is:

[Figure: output after filtering stop words]

As you can see, the stop word 一种 ("a kind of") no longer appears in the output.
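For longer stop word lists, a set gives faster membership checks than a list, and the words are often kept in a file with one word per line. The sketch below assumes that layout; load_stopwords and filter_tokens are hypothetical helper names, and the tokens are a hand-written stand-in for jieba's output.

```python
def load_stopwords(lines):
    """Build a stop word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(tokens, stopwords, min_len=2):
    """Drop stop words and tokens shorter than min_len characters."""
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]


# In practice the lines would come from an open file; a list stands in here.
stopwords = load_stopwords(["是\n", "一种\n", "等\n", "\n"])

# Hand-written stand-in for segmented tokens; not actual jieba output.
tokens = ["Python", "是", "一种", "高级", "编程语言", ",", "广泛", "应用", "于", "人工智能"]
filtered = filter_tokens(tokens, stopwords)
print(filtered)
```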

2 Keyword Extraction

2.1 Keyword Extraction Principle

Unlike word frequency statistics, which simply count words, jieba's keyword extraction is based on the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. TF-IDF is a widely used text feature extraction method that measures how important a word is to a text.

Specifically, the TF-IDF algorithm contains two parts:

  • Term Frequency: how often a word appears in the text, usually expressed as a simple statistic such as a raw count or a count normalized by the text's length. Term frequency reflects how prominent a word is within the text, but says nothing about how common the word is across the whole corpus.

  • Inverse Document Frequency: derived from the inverse of the fraction of documents that contain the word (usually its logarithm), and measures how rare the word is across the corpus. The higher the inverse document frequency, the rarer and more distinctive the word, and the more important it is; the lower the inverse document frequency, the more common the word and the less it matters.

The TF-IDF algorithm calculates the importance of each word in the text by comprehensively considering word frequency and inverse document frequency to extract keywords. In jieba, the specific implementation of keyword extraction includes the following steps:

  • Perform word segmentation on the text and obtain the word segmentation results.

  • Count the number of times each word appears in the text and calculate the word frequency.

  • Obtain the inverse document frequency of each word. By default jieba looks this up in a precomputed IDF table bundled with the library rather than computing it from your own documents; a custom table can be supplied with jieba.analyse.set_idf_path().

  • Combine the word frequency and the inverse document frequency to calculate the TF-IDF value of each word in the text.

  • Sort the TF-IDF values and select the words with the highest scores as keywords.
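The steps above can be sketched in plain Python. This is an illustration of the general idea, not jieba's implementation: extract_keywords is a hypothetical helper, the tokens and the IDF values are made up, and falling back to default_idf for unknown words is an assumption of this sketch.

```python
from collections import Counter


def extract_keywords(tokens, idf_table, default_idf, top_k=3):
    """Rank words by TF * IDF and return the top_k (word, score) pairs."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {
        # TF is the word's share of all tokens; IDF comes from the lookup table.
        word: (count / total) * idf_table.get(word, default_idf)
        for word, count in counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical IDF values: rarer words get larger numbers.
idf_table = {"Python": 6.0, "语言": 4.0, "数据": 3.0, "应用": 2.0}

# Hand-written stand-in for segmented tokens; not actual jieba output.
tokens = ["Python", "语言", "数据", "应用", "Python", "数据"]

keywords = extract_keywords(tokens, idf_table, default_idf=5.0)
for word, score in keywords:
    print(word, round(score, 3))
```

Note that 数据 and Python appear equally often, yet Python ranks higher because its assumed IDF is larger.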

For example:
TF (Term Frequency) is how often a word appears in a document, relative to the document's length. The formula is:
$$\mathrm{TF} = \frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$$
For example, if a word appears 10 times in a document of 100 words, its TF is
$$10 / 100 = 0.1$$
IDF (Inverse Document Frequency) is the logarithm of the ratio between the total number of documents in the collection and the number of documents that contain the word. The formula is:
$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the collection}}{\text{number of documents containing the word}}\right)$$
For example, if a word appears in 100 of the 1,000 documents in a collection, its IDF (using a base-10 logarithm) is $\log_{10}(1000 / 100) = 1.0$.
TF-IDF is the product of TF and IDF. The formula is:
$$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF}$$
With the example values above, this gives $0.1 \times 1.0 = 0.1$.
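The arithmetic in this example can be checked with a short script. It assumes a base-10 logarithm, which is what makes log(1000 / 100) equal 1.0; with a natural logarithm the IDF would be about 2.3 instead. The function names tf and idf are chosen here for illustration.

```python
import math


def tf(term_count, total_terms):
    """Term frequency: occurrences of the word divided by the document length."""
    return term_count / total_terms


def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)


tf_value = tf(10, 100)       # word appears 10 times in a 100-word document
idf_value = idf(1000, 100)   # word appears in 100 of 1,000 documents
tfidf = tf_value * idf_value

print(tf_value, idf_value, tfidf)
```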

It should be noted that the TF-IDF algorithm only considers the occurrence of words in the text and ignores the correlation between words. Therefore, in some specific application scenarios, other text feature extraction methods need to be used, such as word vectors, topic models, etc.

2.2 Keyword extraction code

import jieba.analyse

# Text to extract keywords from
text = "Python是一种高级编程语言,广泛应用于人工智能、数据分析、Web开发等领域。"

# Extract keywords with jieba
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)

# Print each keyword and its weight
for keyword, weight in keywords:
    print(keyword, weight)

In this example, we first import the jieba.analyse module and define text, the text to extract keywords from. We then call jieba.analyse.extract_tags(), where the topK parameter sets the number of keywords to return and the withWeight parameter controls whether each keyword's weight is returned with it. Finally, we iterate over the keyword list and print each keyword with its weight.
The output is:

[Figure: extracted keywords and their weights]

As you can see, jieba extracted several keywords in the input text based on the TF-IDF algorithm, and returned the weight value of each keyword.
