
The use of ES in MySQL and PHP

coldplay.xixi (reposted), 2020-09-01 17:36:38

Introduction to ES

ES (Elasticsearch) is a highly scalable, open source full-text search and analytics engine that can store, search, and analyze massive amounts of data in near real time.
Full-text retrieval means that an indexing program scans every word in a document and builds an index entry for each word, recording how many times and where the word occurs in the document. When a user issues a query, the retrieval program looks the query up in this pre-built index and returns the matches to the user. The process is similar to looking up a word through the word list of a dictionary. A search engine performs this kind of full-text search over the data held in its own store.
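The index described above is usually called an inverted index. The following is a minimal sketch in Python, not how ES implements it internally; the sample documents and the function names build_inverted_index and search are made up for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Scan every word once and record, per word, the documents and positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Look the word up in the pre-built index instead of rescanning the documents."""
    return index.get(word.lower(), {})


# Hypothetical sample documents
docs = {
    1: "Elasticsearch is a search engine",
    2: "MySQL is a database and a search target",
}
index = build_inverted_index(docs)
print(search(index, "search"))  # document IDs mapped to word positions
```

The cost of scanning the text is paid once at indexing time, so a query only needs a dictionary lookup.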

Why is ES faster than MySQL

MySQL has only the term dictionary layer, which is stored on disk as a sorted B-tree. Retrieving a term requires several random disk accesses. Lucene adds a term index on top of the term dictionary to speed up retrieval. The term index is cached in memory as a tree; after locating the relevant block of the term dictionary through the term index, Lucene goes to disk to find the term itself, which greatly reduces the number of random disk accesses. In addition, the term index is kept in memory as an FST (finite state transducer), which is very memory-efficient, and the term dictionary is stored on disk in blocks. Each block is compressed by its common prefix: for example, if every word in a block starts with Ab, the Ab can be omitted. In this way the term dictionary takes less disk space than a B-tree.
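The common-prefix idea can be illustrated with a few lines of Python. This is only a sketch of the principle, not Lucene's actual on-disk block format; the term list and the function names are hypothetical.

```python
import os


def compress_block(terms):
    """Store the shared prefix once, followed by each term's remaining suffix."""
    prefix = os.path.commonprefix(terms)
    return prefix, [term[len(prefix):] for term in terms]


def decompress_block(prefix, suffixes):
    """Rebuild the original terms from the prefix and the suffixes."""
    return [prefix + suffix for suffix in suffixes]


# Hypothetical block of terms that all start with "Ab"
block = ["Abandon", "Abbey", "Able", "Abstract"]
prefix, suffixes = compress_block(block)
print(prefix, suffixes)
```

The longer the shared prefix and the more terms in a block, the more space this saves.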

Synchronized database

We store the data in MySQL and rely on MySQL transactions to keep it consistent, and we use Elasticsearch for retrieval and querying. With this split, the scheme for synchronizing ES with the database becomes particularly important.

Process

First, add the product to the database. After the product has been added successfully, index it into ES. If indexing into ES fails, push the failed product ID onto a redis queue and also write it to a log file (if redis goes down, the failed product IDs can be recovered from the log and indexed into ES later). A scheduled task polls the redis queue every second; when it obtains a product ID from the queue, it loads the product data from the database by that ID and indexes it into ES.
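The flow above can be sketched as follows. To keep the example self-contained, plain Python containers stand in for MySQL, ES, the redis queue, and the log file, and the es_available flag simulates an ES outage; all of these names are assumptions for illustration only.

```python
from collections import deque

database = {}          # stand-in for the MySQL product table
es_index = {}          # stand-in for the ES index
retry_queue = deque()  # stand-in for the redis queue of failed product IDs
failure_log = []       # stand-in for the log file


def index_into_es(product_id, product, es_available):
    """Simulated ES write that fails while ES is unavailable."""
    if not es_available:
        raise ConnectionError("ES unavailable")
    es_index[product_id] = product


def add_product(product_id, product, es_available=True):
    database[product_id] = product  # write to the database first
    try:
        index_into_es(product_id, product, es_available)
    except ConnectionError:
        retry_queue.append(product_id)  # remember the failed ID for the task
        failure_log.append(product_id)  # and keep a second copy in the log


def retry_failed(es_available=True):
    """One pass of the scheduled task: re-read each failed product and index it."""
    for _ in range(len(retry_queue)):
        product_id = retry_queue.popleft()
        try:
            index_into_es(product_id, database[product_id], es_available)
        except ConnectionError:
            retry_queue.append(product_id)  # still failing, try again next pass


add_product(1, {"name": "keyboard"})
add_product(2, {"name": "mouse"}, es_available=False)  # ES write fails
retry_failed()                                          # scheduled task catches up
print(sorted(es_index), list(retry_queue), failure_log)
```

Re-reading the product from the database during the retry means ES receives the current row rather than a stale copy from the time of the failure.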

Using logstash-input-jdbc

The logstash-input-jdbc plug-in can also be used to synchronize the database. Installation and configuration: create a .conf file that configures the database to synchronize and points to the .sql file containing the SQL statement to execute, and finally place a JDBC driver in the same folder so that the plug-in can connect to the MySQL database.


Possible problems

elasticsearch data duplication and incremental synchronization

In the default configuration, logstash tracks the time of the last run (a timestamp) rather than a column of the table, and no _id is supplied, so elasticsearch generates a new _id for every event it stores. The _id plays a role similar to a MySQL primary key in that it is unique, but because a fresh one is generated each time, every run of the select statement stores all of the returned rows in elasticsearch again, resulting in duplicate data.

Solution

In the table being queried, find the primary key or an auto-increment field and use its value as the _id. Because the _id is unique, a row that arrives again with the same _id overwrites the existing document instead of creating a duplicate.
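The effect of choosing the _id can be seen in a small simulation. This is a sketch in which a Python dict stands in for the ES index and a random UUID stands in for an auto-generated _id; the rows and the autoid column are sample data.

```python
import uuid


def index_document(index, doc, doc_id=None):
    """Store doc under doc_id; with no doc_id, a new random ID is generated."""
    if doc_id is None:
        doc_id = uuid.uuid4().hex  # stand-in for an auto-generated _id
    index[doc_id] = doc
    return doc_id


# Hypothetical rows returned by the same select statement on every run
rows = [{"autoid": 1, "name": "keyboard"}, {"autoid": 2, "name": "mouse"}]

auto_ids = {}  # index that receives no explicit _id
keyed = {}     # index that uses the primary key as _id
for _ in range(3):  # three scheduled runs of the same query
    for row in rows:
        index_document(auto_ids, row)
        index_document(keyed, row, doc_id=row["autoid"])

print(len(auto_ids), len(keyed))
```

With generated IDs every run adds new copies, while keying on the primary key keeps one document per row.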

Frequent data synchronization affects the performance of the mysql database

The SQL statement we write in the jdbc.sql file is hard-coded, so every run queries many rows that no longer need to be queried, especially when the statement is select * from table; each time. This puts a lot of pressure on the MySQL database.

Solution:

(1) Adjust the synchronization schedule to the business needs. My real-time requirements are relatively high, so I set a 10-minute schedule => "*/10 * * * *"
(2) Restrict the range of the MySQL query so that large queries do not drag down the database: select * from table WHERE autoid > :sql_last_value;
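The range query in (2) amounts to remembering the largest autoid seen so far and asking only for newer rows. The sketch below reproduces that logic with Python's built-in sqlite3 module in place of MySQL and logstash; the product table and the sync_once function are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (autoid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO product (name) VALUES (?)", [("keyboard",), ("mouse",)])


def sync_once(conn, sql_last_value):
    """Fetch only the rows whose autoid is greater than the last synchronized one."""
    rows = conn.execute(
        "SELECT autoid, name FROM product WHERE autoid > ? ORDER BY autoid",
        (sql_last_value,),
    ).fetchall()
    if rows:
        sql_last_value = rows[-1][0]  # remember the largest autoid seen
    return rows, sql_last_value


first_batch, last_value = sync_once(conn, 0)            # initial run reads everything
conn.execute("INSERT INTO product (name) VALUES ('monitor')")
second_batch, last_value = sync_once(conn, last_value)  # only the new row
third_batch, last_value = sync_once(conn, last_value)   # nothing new
print(first_batch, second_batch, third_batch)
```

Each run touches only rows added since the previous run, which is what keeps the load on the database low.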

elasticsearch storage capacity continues to increase

For data safety, when elasticsearch receives data it first writes it to memory and to the translog, and only then builds the index and writes it to disk. This way, even after a sudden power failure and restart, the data can be recovered from the translog. However, because each query returned a lot of duplicate data, and this duplicate data was not written into the elasticsearch index, it accumulated and caused elasticsearch's storage usage to keep growing.

Solution:

According to the official website, the data is flushed regularly and old logs are cleaned up automatically, so no action is required.

Incremental synchronization with a MySQL range query cannot pick up modifications to older rows

Querying only a small range each time relieves the pressure on the database, but it means that modifications to old data are no longer synchronized.

Solution:

This depends on the business. If your database is modified frequently, you can only do full updates. However, building the index by scanning the database frequently and over a wide range is worse than not indexing at all (because indexing also has a cost). Indexing is mainly meant for cases where the amount of data is large, the data is not modified frequently, and queries would otherwise consume database performance. In my case there are few modifications and they generally affect recent data, so I slightly widened the MySQL range used during synchronization.

Using ES from PHP

Install the client with Composer: composer require elasticsearch/elasticsearch
Include the autoload.php file and set the IP address of the ES host.
Create an index. An index corresponds to a database in a relational database (MySQL in what follows), not to an index in MySQL.
A database alone is not enough; you also need to create a table, and the same is true for ES. A type in ES corresponds to a table in MySQL. The type is not defined separately but together with its fields, in the request body; the ik analyzer can also be assigned to fields in the body. Note that mapping types were deprecated in Elasticsearch 7 and removed in 8, so this step applies to older versions.
Use EsClient->search() to run searches.
Configure and use synonyms.
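The request bodies sent in these steps are plain nested structures. The sketch below builds them as Python dicts and prints the JSON; the PHP client takes the same structure as nested arrays. The field names name and price are made up, the mapping is shown in the typeless form used by newer ES versions, and the analyzer name ik_max_word assumes a recent IK plug-in (older releases registered the analyzer under other names).

```python
import json

# Body for creating an index: fields are defined together with their analyzers
create_index_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "ik_max_word"},  # assumes IK is installed
            "price": {"type": "float"},
        }
    }
}

# Body for a simple full-text search on the hypothetical "name" field
search_body = {
    "query": {"match": {"name": "keyboard"}},
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(search_body, indent=2))
```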


Configure the word segmenter: configure IK

Download the ik package that matches your ES version.
Create an ik directory under plugins in the ES directory, and unzip all the files from the downloaded ik zip package into it.
Go to the config directory of ES, edit elasticsearch.yml, and add index.analysis.analyzer.default.type: "ik" on an empty line.
Pinyin analyzer configuration: use the precompiled package elasticsearch-analysis-pinyin-1.3.0.
In the plugins directory of elasticsearch, create a new analysis-pinyin folder, unzip the package, and put the jar it contains into the analysis-pinyin folder.
Configure the filter of the pinyin tokenizer in elasticsearch.yml.

Synonym tokenizer configuration

Configure the filter of the synonym tokenizer in elasticsearch.yml.
Configure the synonym dictionary: create a new sysnonym.txt in the config directory of elasticsearch.
Configure an analyzer that combines ik, pinyin, and synonyms. The definition mainly includes its name, its type, the tokenizer that splits the text, and the filters applied to the resulting tokens; here the pinyin and synonym filters are used.
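What a synonym filter does to the token stream can be sketched in a few lines. This is a simplified simulation rather than the ES implementation: it assumes the dictionary uses comma-separated groups, one group per line, and the sample entries are made up.

```python
def load_synonyms(lines):
    """Parse comma-separated synonym groups, one group per line."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [word.strip() for word in line.split(",")]
        for word in group:
            table[word] = group
    return table


def synonym_filter(tokens, table):
    """Replace each token by its whole synonym group, keeping token order."""
    expanded = []
    for token in tokens:
        expanded.extend(table.get(token, [token]))
    return expanded


# Hypothetical content of the synonym dictionary file
synonym_lines = ["# sample entries", "laptop, notebook", "tv, television"]
table = load_synonyms(synonym_lines)
print(synonym_filter(["cheap", "laptop"], table))
```

Because the filter runs after tokenization, a search for one word of a group can match documents that contain any other word of that group.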

ES Keyword Highlighting

When querying, ES can wrap the matched text in the returned field data with HTML tags, so that it can be styled by color or font when displayed on a web page. This is controlled by the highlight section of the query. The highlight part of the response contains the text fragments matched on the name field, wrapped in the HTML tags.
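As a rough illustration, a highlight request and the handling of its result might look like this in Python. The sample_response below is hand-written to mimic the shape of a search response and was not produced by a real cluster; the field name name, the CSS class hit, and the helper highlighted_names are assumptions.

```python
# Search body that asks ES to wrap matches on "name" in custom tags
search_body = {
    "query": {"match": {"name": "keyboard"}},
    "highlight": {
        "pre_tags": ['<em class="hit">'],
        "post_tags": ["</em>"],
        "fields": {"name": {}},
    },
}

# Hand-written sample shaped like a search response (not real output)
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_source": {"name": "mechanical keyboard"},
                "highlight": {"name": ['mechanical <em class="hit">keyboard</em>']},
            },
            {"_id": "2", "_source": {"name": "mouse pad"}},
        ]
    }
}


def highlighted_names(response):
    """Prefer the highlighted fragment and fall back to the stored field."""
    names = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get("name")
        names.append(fragments[0] if fragments else hit["_source"]["name"])
    return names


print(highlighted_names(sample_response))
```

The page template can then output the returned strings as HTML and style the tag through CSS.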

ES query paging

The data in Elasticsearch is stored in shards. When a search is performed, each shard is searched independently, and the results are then merged and returned.

The general query process is

1) The client request is sent to a node
2) The node forwards it to each shard, and each shard finds its first 10 items
3) The results are returned to the node, which merges the data and extracts the first 10 items
4) The result is returned to the requesting client

When we query items 10 to 20, there are two ways to do it: deep paging (from-size) and snapshot paging (scroll).

Deep paging (from-size)

from defines the offset of the target data, and size defines the number of hits returned. The defaults are from = 0 and size = 10, so a query returns only the first 10 items by default. To fetch items 10 to 20, ES queries the first 20 items on each shard, merges them, discards the first 10, and returns only items 10 to 20; the work spent on the first 10 is wasted. The deeper the page, the lower the efficiency: the larger the offset, the longer the paging query takes.
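The cost of a large offset can be made visible with a small simulation. In this sketch three Python lists of scores play the role of shards and the second return value counts how many candidates the coordinating node has to collect; it models the principle only, and all names are hypothetical.

```python
import heapq


def search_shard(shard, from_, size):
    """Each shard must return its own top from_ + size hits."""
    return heapq.nlargest(from_ + size, shard)


def paged_search(shards, from_=0, size=10):
    """Merge the per-shard candidates, then cut out the requested page."""
    candidates = []
    for shard in shards:
        candidates.extend(search_shard(shard, from_, size))
    merged = sorted(candidates, reverse=True)
    return merged[from_:from_ + size], len(candidates)


# Three hypothetical shards holding the scores 0..299 between them
shards = [list(range(i, 300, 3)) for i in range(3)]

page, fetched = paged_search(shards, from_=10, size=10)
deep_page, deep_fetched = paged_search(shards, from_=90, size=10)
print(page, fetched, deep_fetched)
```

Both calls return 10 items, but the deeper page forces every shard to produce 100 candidates instead of 20.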

Snapshot paging (scroll)

Compared with from and size paging, scroll simulates a traditional database cursor and records the position of the last document read. This kind of paging is not meant for real-time queries but for reading a large amount of data (or even all of it) in one pass. A scroll effectively keeps a snapshot of the current index segments as they were when the scroll query was executed; any data indexed after that will not be visible in the snapshot. Unlike from and size, it does not query all the data and then throw away the unneeded part, but records a reading position so that the next read can continue quickly.

Process:

Call index/type/_search?pretty&scroll=2m; the response contains a scroll_id.
Use the scroll_id directly to fetch the next batch.
Clear the scroll. When we open a scroll we set its survival time, but closing it explicitly once we are done releases resources early and reduces the burden on ES.
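The three steps above can be sketched with a toy cursor store. The class below is a simulation in plain Python and not the ES API: it copies the documents when the scroll is opened, remembers the read position, and drops the context when cleared. The survival time is left out for brevity, and the class and method names are hypothetical.

```python
import uuid


class ScrollStore:
    """Toy stand-in for the scroll contexts that ES keeps open."""

    def __init__(self, documents):
        self.documents = documents  # live data, may grow after a scroll is opened
        self.contexts = {}          # scroll_id -> [snapshot, position, size]

    def open_scroll(self, size):
        """Take a snapshot and return a scroll_id together with the first batch."""
        scroll_id = uuid.uuid4().hex
        self.contexts[scroll_id] = [list(self.documents), 0, size]
        return scroll_id, self.next_batch(scroll_id)

    def next_batch(self, scroll_id):
        """Continue from the recorded position instead of re-reading earlier hits."""
        snapshot, position, size = self.contexts[scroll_id]
        batch = snapshot[position:position + size]
        self.contexts[scroll_id][1] = position + len(batch)
        return batch

    def clear_scroll(self, scroll_id):
        """Release the snapshot as soon as the client is done."""
        self.contexts.pop(scroll_id, None)


store = ScrollStore(["doc1", "doc2", "doc3"])
scroll_id, first = store.open_scroll(size=2)
store.documents.append("doc4")        # indexed after the snapshot was taken
second = store.next_batch(scroll_id)
third = store.next_batch(scroll_id)   # empty batch signals the end
store.clear_scroll(scroll_id)
print(first, second, third)
```

The document added after the scroll was opened never appears in the batches, which matches the snapshot behavior described above.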


Statement: This article is reproduced from learnku.com. If there is any infringement, please contact admin@php.cn to have it removed.
