
The use of Python lightweight search tool Whoosh (summary sharing)

WBOY (forwarded)
2022-07-26 14:03:41 · 3016 views

This article briefly introduces Whoosh, a lightweight search tool for Python, and gives corresponding example code. I hope it will be helpful to everyone.


Whoosh Introduction

Whoosh was created by Matt Chaput. It started as a simple, fast search tool for the online documentation of the Houdini 3D animation software package, gradually grew into a mature search solution, and has since been open-sourced.

Whoosh is written purely in Python. It is a flexible, convenient, and lightweight search engine library that supports both Python 2 and 3. Its advantages are as follows:

  • Whoosh is written purely in Python yet is fast. It needs only a Python environment, with no compiler required;
  • It uses the Okapi BM25F ranking algorithm by default, and other ranking algorithms are also supported;
  • Compared with many other search libraries, Whoosh creates fairly small index files;
  • All text indexed by Whoosh must be unicode;
  • Whoosh can store arbitrary Python objects alongside indexed documents.
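
To make the ranking bullet above more concrete, here is a minimal sketch of plain single-field BM25 scoring in pure Python. It is not Whoosh's implementation, and it leaves out the per-field weighting that the F in BM25F adds. The function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, already split into words.
docs = [
    ["bright", "moon", "over", "the", "river"],
    ["bright", "moon", "bright", "moon", "hometown"],
    ["spring", "breeze", "south", "bank"],
]
scores = bm25_scores(["bright", "moon"], docs)
print(scores)
```

The document that repeats the query terms scores highest, and the document that contains neither term scores zero, which is the behaviour a ranking function is expected to show.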

Whoosh's official introduction is at: https://whoosh.readthedocs.io/en/latest/intro.html. Compared with mature search engines such as Elasticsearch or Solr, Whoosh is lighter and simpler to operate, and is worth considering for small search projects.

Index & query

For readers familiar with Elasticsearch (ES), the two key aspects of search are mapping and query, that is, building the index and querying it. Behind these lie complex index storage, query parsing, ranking algorithms, and so on. If you have experience with ES, Whoosh is very easy to get started with.

Based on the author's understanding and the official Whoosh documentation, the two main entry-level uses of Whoosh are index and query. One of the most powerful features of a search engine is full-text search, which depends both on the ranking algorithm, such as BM25, and on how the fields are stored. As a noun, index refers to the index built over the fields; as a verb, it refers to building that index. A query uses the ranking algorithm to return reasonable results for the search statement we submit.
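
The two senses of index can be sketched with a toy inverted index in pure Python. This is only an illustration of the idea, not how Whoosh stores its data on disk; the helper names `build_index` and `lookup` and the whitespace tokenization are assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Index (verb): map every term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Query: return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical toy documents keyed by id.
docs = {
    1: "bright moon before my bed",
    2: "the bright moon rises over the sea",
    3: "spring breeze greens the south bank",
}
index = build_index(docs)  # the index (noun)
print(sorted(lookup(index, "bright moon")))
```

A real engine adds ranking on top of this lookup, so that matching documents come back in a useful order rather than as an unordered set.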

The official documentation already explains the use of Whoosh in detail. The author only gives a simple example here to illustrate how Whoosh can easily improve our search experience.

Sample code

Data

The sample data for this project is poem.csv.

[Figure: the first ten rows of the data set]

Fields

Based on the characteristics of the data set, we create four fields: title, dynasty, poet, and content. The code that creates them is as follows:

# -*- coding: utf-8 -*-
import os
import json
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer

# Create the schema; stored=True keeps the field's value in the index
# so that it is returned together with the search results
schema = Schema(title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
                dynasty=ID(stored=True),
                poet=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer())
                )

Here, an ID field indexes its whole value as a single unit and does not split it into words. It is often used for file paths, URLs, dates, and categories.

A TEXT field holds body text: the text is indexed (and stored, when stored=True) and can be searched by word. The analyzer chosen here is the jieba Chinese word segmenter.

Create index file

Next, we need to create the index. The program first parses the poem.csv file, then indexes each row and writes the index to the indexdir directory. The Python code is as follows:

# Parse poem.csv, keeping only the lines that split into exactly four fields
with open('poem.csv', 'r', encoding='utf-8') as f:
    texts = [_.strip().split(',') for _ in f.readlines() if len(_.strip().split(',')) == 4]

# Store the schema information in the indexdir directory
indexdir = 'indexdir/'
if not os.path.exists(indexdir):
    os.mkdir(indexdir)
ix = create_in(indexdir, schema)

# Add the documents to be indexed, following the schema definition
# (the loop starts at 1 to skip the first row of the file)
writer = ix.writer()
for i in range(1, len(texts)):
    title, dynasty, poet, content = texts[i]
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()
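
One caveat about the parsing step: splitting each line on commas discards any row whose text itself contains a comma, because such a row no longer splits into exactly four parts. A hedged alternative is Python's standard `csv` module, sketched below. It assumes the file starts with a header row naming the four fields and that fields containing commas are quoted; the in-memory sample rows are made up and stand in for the real poem.csv.

```python
import csv
import io

# Stand-in for open('poem.csv', encoding='utf-8', newline=''); the rows are made up.
sample = io.StringIO(
    'title,dynasty,poet,content\n'
    'Poem A,Tang,Poet X,"first line, second line"\n'
    'Poem B,Song,Poet Y,single line\n'
)

# DictReader consumes the header row and keeps quoted commas inside a field.
rows = [row for row in csv.DictReader(sample)]

for row in rows:
    print(row["title"], "|", row["content"])
```

If the header names match the schema's field names, each row could then be passed on as `writer.add_document(**row)`.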

After the index is successfully created, the indexdir directory will contain the index files built from the fields of the poem.csv data above.

Query

After the index is successfully created, we will use it to query.

For example, if we want to query the poems containing 明月 in the content, we can enter the following code:

# Create a searcher
searcher = ix.searcher()

# Search for documents whose content contains '明月' (bright moon)
results = searcher.find("content", "明月")
print('A total of %d documents were found.' % len(results))
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))

The output results are as follows:

A total of 44 documents were found.
The first 10 documents are as follows:
{"content": "There is bright moonlight in front of the bed, which is suspected to be frost on the ground. Look up at the bright moon and lower your head to think about your hometown.", "dynasty": "Tang Dynasty", "poet": "Li Bai", "title": "Quiet Night Thoughts"}
{"content": "The grass on the edge, the grass on the edge, the grass on the edge are all here. The snow is clear in the south of the mountain and in the north, and the moon is bright for thousands of miles. The bright moon, the bright moon, the Hujia screamed with sorrow.", "dynasty": "Tang Dynasty", "poet": "Dai Shulun", "title": "Tiao Xiaoling·Biancao"}
{"content": "Sitting alone in the quiet bamboo inside, I play the piano and whistle loudly. People in the deep forest don't know that the bright moon comes to shine.", "dynasty": "Tang Dynasty", "poet": "Wang Wei", "title": "Zhuli Pavilion"}
{"content": "The bright moon of the Han River shines on people returning home, and the autumn wind spreads across thousands of miles. Don't wash your guest clothes lightly, there are still dust from the imperial capital.", "dynasty": "Ming Dynasty", "poet": "Bian Gong", "title": "A heavy gift to Wu Guobin"}
{"content": "The bright moon of the Qin Dynasty and the Pass of the Han Dynasty, and the people who marched thousands of miles have not returned. But the flying generals of Dragon City are here, and they will not teach Hu Ma to cross the Yin Mountains.", "dynasty": "Tang Dynasty", "poet": "Wang Changling", "title": "Two poems out of the fortress·One"}
{"content": "Between Jingkou and Guazhou, there is only one water, Zhongshan Mountain Countless mountains. The spring breeze turns green to the south bank of the river. When will the bright moon shine on me again?", "dynasty": "Song Dynasty", "poet": "Wang Anshi", "title": "Mooring at Guazhou"}
{"content": "Looking around, you can see the light of the mountains and the light of the water, and you can lean on the railing and smell the fragrance of wild flowers. There is no one to care about the clear breeze and the bright moon, and it is always cool as the south building.", "dynasty": "Song Dynasty", "poet": "Huang Tingjian", "title": "Ezhou Nanlou Calligraphy"}
{"content": "The green mountains are faint and the water is far away, and the grass in the south of the Yangtze River has not withered after autumn. On the moonlit night of the Twenty-Four Bridge, where can the beauty teach the flute?", "dynasty": "Tang Dynasty", "poet": "Du Mu", "title": "To Judge Han Chuo of Yangzhou"}
{"content": "The dew air is cold and the light is gathering, and the sun is shining under the Chuqiu. The ape is crying in the cave. Trees, people in the Mulan boat. The bright moon shines in Guangze, and the turbulent currents in the Cangshan Mountains. I don’t see you in the clouds, but I feel sad at night.", "dynasty": "Tang Dynasty", "poet": "Ma Dai", "title": "One of three nostalgic poems about the Chu River"}
{"content": "The bright moon rises on the sea, and we share this moment at the end of the world. Lovers complain about the distant night, but they miss each other at night. The candles are extinguished and the light is full of pity, and the clothes are covered with dew. Nourishing. I can't bear to give it away, but I still have a good night's sleep.", "dynasty": "Tang Dynasty", "poet": "Zhang Jiuling", "title": "Looking at the Moon and Huaiyuan / Looking at the Moon and Nostalgic for the Past"}


Statement:
This article is reproduced from jb51.net. If there is any infringement, please contact admin@php.cn to have it deleted.