Deeply understand the B+Tree index principle of Mysql-Mysql Tutorial-php.cn

First of all, correctly creating appropriate indexes is the basis for improving database query performance.

What is the index?

An index is a decentralized storage data structure created to speed up the retrieval of data rows in a table.

How does the index work?

As shown in the picture above, if there is a sql statement select * from teacher where id = 101, if there is no index, If we want to find this record, we need to scan the entire table and match the data with id = 101. If we have an index, we can quickly find the address of the row corresponding to 101 recorded on the disk through the index, and then retrieve the corresponding row data based on the given address.

Why does the MYSQL database use B TREE as the data structure of the index?

To speed up the retrieval of data, the first thing that comes to mind is the binary tree. The search time complexity of the binary tree can reach O(log2(n)). Let's take a look at the storage structure of the binary tree:

Deeply understand the B+Tree index principle of Mysql

Binary tree search is equivalent to a binary search. Binary search can greatly improve the efficiency of queries, but it has a problem: the binary tree uses the first inserted data as the root node. As shown in the figure above, if you only look at the right side, you will find that it is a linear linked list structure. If our current data only contains 1, 2, 3, 4, 5, 6, the following situation will occur:

Deeply understand the B+Tree index principle of Mysql

If the data we want to query is 6, we need Only by traversing all the nodes can we find 6, which is equivalent to a full table scan. Because of this problem, the binary search tree is not suitable for use as an index data structure.

Based on such deduction, in order to solve the problem of linear linked list, it is easy to think of balanced binary search tree. Let’s take a look at what a balanced binary tree looks like:

Deeply understand the B+Tree index principle of Mysql

##The balanced binary search tree is defined as: the height difference between the child nodes of a node cannot exceed 1, as shown in node 20 in the figure above, left The node height is 1, the right node height is 0, and the difference is 1, so the above picture does not violate the definition. It is a balanced binary tree. The ways to ensure the balance of a binary tree are left-hand, right-hand and other operations. As for left-hand and right-hand operations, you can search for relevant knowledge by yourself.

If the balanced binary tree in the above figure saves the id index, now to start with the data with id = 8, first load the root node into the memory, compare 8 and 10, and find that 8 is smaller than 10, continue Load the left subtree of 10. Load 5 into memory and compare 8 with 5. In the same way, load the right subtree of node 5. At this time, a hit is found, and now the data corresponding to the index with id 8 is loaded.

How to find the data corresponding to the index?

There are generally two ways to save data in the index. The first is to save all the specific data content of the row data with id = 8 in the data area of the node. Another way, the data area saves the disk address where the data is actually stored.

At this point, the balanced binary tree solves the problem of linear linked lists. The efficiency of data query seems to be okay, basically reaching O(log2(n)). So why doesn’t MySQL choose such a data structure? What kind of problems does he have?

Problem 1: Insufficient search efficiency. Generally speaking, in the tree structure, the depth of the data determines the number of IOs during search. As shown in the figure above, searching for data with id = 8 requires 3 IOs. When the amount of data reaches millions, the height of the tree will be terrifying.

Question 2: The query is not unstable. If the queried data falls on the root node, only one IO is required. If it is a leaf node or a branch node, multiple IOs will be required.

Problem 3: The data content stored by the node is too little. It does not make good use of the operating system and disk data exchange features, nor does it make good use of the read-ahead capability of disk IO. Because a data exchange between the operating system and the disk is in units of pages, one page = 4K, that is, the operating system will load 4K data into the memory for each IO. However, the structure of each node in the binary tree only saves one keyword, one data area, and two child node references, which cannot fill 4K of content. Fortunately, I worked hard on an IO operation, but only one keyword was loaded. When the height of the tree is very high, and the searched keyword happens to be located at a leaf node or a branch node, it takes many times to retrieve a keyword. IO.

Is there a structure that can solve this problem of binary trees?

Yes, multi-way balanced search tree: (Balance Tree):

B Tree is an absolutely balanced tree, all leaf nodes are at the same height, as shown in the following figure:

Deeply understand the B+Tree index principle of Mysql

What are the advantages of B Tree, and how does it solve some problems?

Let’s look at the definition first. The picture above shows a 2-3 tree (each node stores 2 keywords and has 3 ways). A multi-way balanced search tree means multi-fork. From the above As can be seen from the figure, the relationship between the number of keywords saved in each node and the number of paths is:

Number of keywords = Number of paths – 1.

Assume that you want to find the data with id = 28 from the above picture. The B TREE search process is as follows:

First load the root node into the memory and load the two keywords 17 and 35. The judgment rule is:

Deeply understand the B+Tree index principle of Mysql

After hitting 28 according to the above rules, then load the data corresponding to 28, and then find the data area corresponding to 28. The data area stores the specific The data or a pointer to the data.

Why is this structure able to solve the problem of balanced binary trees?

It can make good use of the interactive characteristics of the operating system and the disk. In order to make good use of the read-ahead ability of the disk, MYSQL sets the page size to 16K, which is to set the size of a node (disk block). is 16K, one IO loads the contents of one node (16K) into memory. Here, assume that the keyword type is int, that is, 4 bytes. If the data area corresponding to each keyword is also 4 bytes, without considering the reference of child nodes, each node in the above figure can store approximately ( 16 * 1000) / 8 = 2000 keywords, then there are 2001 ways in total. For a binary tree with three levels of height, up to 7 keywords can be saved. However, for this B-tree with 2001 paths, the number of keywords that can be searched for with three levels of height is much larger than that of a binary tree.

In the process of B TREE ensuring the balance of the tree, every change of keywords will lead to great changes in the structure. This process is particularly time-consuming, so when creating an index, you must create a suitable index. , instead of creating indexes for all fields, creating redundant indexes will only increase performance consumption when adding, deleting, and modifying data.

Since B-tree has solved the problem very well, why does MYSQL still use B-TREE?

Let’s first look at what B TREE looks like. B TREE is a variant of B TREE. In B tree species, the relationship between the number of paths in B tree species and the number of keywords no longer holds true. B In TREE, the data retrieval rule uses a left closed interval, and the relationship between the number of paths and the number of keys is 1:1, as shown in the following figure:

Deeply understand the B+Tree index principle of Mysql

If in the above figure It is an index made by ID. If you are searching for data with id = 1, the search rules are as follows:

Deeply understand the B+Tree index principle of Mysql

According to the above rules, the data will eventually be hit in the leaf node. According to the leaf node Get the real data from the data area of node 1.

What is the difference between B TREE and B TREE?

1. The B TREE keyword search uses a left closed interval. The reason why the left closed interval is used is because it wants to best support auto-incrementing ids. This is also the original design intention of mysql. That is, if id = 1 hits, the search will continue until 1 in the leaf node is found.

2. B TREE root node and branch node have no data area, and the data corresponding to the keyword is only saved in the leaf node. That is, only the keyword data area in the leaf node will save the real data content or the address of the content. In the B tree species, if the root node is hit, the data will be returned directly. And in B TREE, leaf nodes will not save references to child nodes.

3. B TREE leaf nodes are arranged sequentially, and adjacent nodes have a sequential reference relationship. As shown in the figure above, the leaf nodes are connected by pointers.

Why does MYSQL choose B TREE in the end?

1. B TREE is a variant of B TREE. The problems that B TREE can solve can also be solved by B TREE (reduce the height of the tree and increase the amount of data stored in nodes)

2. B TREE has stronger database and table scanning capabilities. If we want to scan the data table based on the index, scanning the B TREE requires traversing the entire tree, while B TREE only needs to traverse all its leaf nodes. That’s it (there are references between leaf nodes).

3. B TREE has stronger disk reading and writing capabilities. Its root node and support nodes do not save data areas. When all root nodes and support nodes are of the same size, the number of saved keywords is larger than that of B TREE. many. Leaf nodes do not save child node references. Therefore, B TREE reads and writes more disk-loaded keywords than B TREE.

4. B TREE has stronger sorting ability. As can be seen from the above figure, B TREE naturally has sorting function.

5. B TREE query efficiency is more stable. Every time data is queried, the number of IO queries must be stable. Of course, everyone's understanding of this is different, because in B TREE, if the root node hits, it returns directly, which is indeed more efficient.

The specific implementation form of MYSQL B TREE

The main explanation here is the implementation of MYSQL's two storage engines (MYISAM and INNODB) based on different B TREE index structures. First, find the folder where MYSQL saves data and see how mysql saves data:

Deeply understand the B+Tree index principle of Mysql

Enter this directory. All databases are stored in this directory, and then enter a specific database directory. Here, there are a variety of data storage engines. Here we explain MYISAM and innodb, as shown in the figure:

Deeply understand the B+Tree index principle of Mysql

MYISAM storage engine index:

From As can be seen from the figure, there are three files in total using the MYISAM storage engine to store database data:

Frm, the table definition file. MYD: Data file, all data is saved in this file. MYI: index file.

In the MYISAM storage engine, the relationship between data and indexes is as follows:

Deeply understand the B+Tree index principle of Mysql

How to find data? If you want to query the data with id = 101, first find the node with id = 101 based on the MYI index file (as shown on the left in the figure above), get the disk address that actually saves the data through the data area of this node, and then use this address to get the data from the MYD data file (As shown on the right in the picture above) Load the corresponding record.

If there are multiple indexes, the expression is as follows:

Deeply understand the B+Tree index principle of Mysql

So in the MYISAM storage engine, the primary key index and the auxiliary index are at the same level, and there is no primary key index. Secondly.

Innodb storage engine:

First, let’s take a look at the concept of a clustered index. A clustered index is defined as: the physical order of the data in the database table rows is the same as the logical order of the key values.

Innodb uses primary keys as indexes to aggregate and organize data storage. Let’s take a look at how Innodb organizes data.

Innodb has only two files, the FRM file: the table definition file, and the Ibd file. There is no file specifically for saving data. Data is aggregated and stored using primary keys, and the real data is stored in leaf nodes. The original design intention of innodb is that the primary key is the most important index. Specifically as shown in the figure below:

Deeply understand the B+Tree index principle of Mysql

As shown in the figure above, the data area of the leaf node saves the real data. When retrieving through the index, if the leaf node is hit, Row data can be retrieved directly from leaf nodes. Before mysql5.5 version, the MYISAM engine was used, and after 5.5, the innodb engine was used.

In innodb, the format of the auxiliary index is as shown in the figure below?

Deeply understand the B+Tree index principle of Mysql

#As shown above, the leaf nodes of the primary key index store the real data. The data area of the auxiliary index leaf node stores the value of the primary key index key. The search process is: If you want to query the data with name = seven, first query in the auxiliary index and finally find the primary key id = 101, then search for the data with id 101 in the primary key index, and finally obtain the real data from the leaf node of the primary key index. data. Therefore, retrieval through the auxiliary index requires retrieval of the index twice.

Put the difference between Innodb and MYISAM in a picture, as shown below:

Deeply understand the B+Tree index principle of Mysql

Several principles for creating indexes:

1. Discrete type of column:

The calculation formula of discrete type: count(distinct col):count(col). The higher the discrete type, the better the selection type.

In each field in the following table, which column has the best discrete type:

Deeply understand the B+Tree index principle of Mysql

From the above figure, it is obvious that the discrete type of name is the best , if you use sex to create an index:

Why is it said that the higher the discrete type, the better the selective type?

As shown below, if you create an index on Sex, the index structure will be as follows:

Deeply understand the B+Tree index principle of Mysql

If you retrieve the data of sex = 1 at this time, When the root node is judged, the result is to query the left subtree, but when the judgment is made at the second level of the left subtree, because both the left and right branches meet the conditions, it is difficult to decide which branch to choose to continue the search, or to combine the two branches. Branches are searched simultaneously.

2. Leftmost matching principle

When comparing keywords in the index, the comparison must be from left to right and cannot be skipped. The ids explained before are all int data. If the id is a string, it is as shown below:

Deeply understand the B+Tree index principle of Mysql

When matching, the string will be converted into ascll code, such as abc becomes 97 98 99, and then compared character by character from left to right. Therefore, when using like %a in a SQL query, the index will be invalid, because % means a full match. If there is a full match, there is no need for an index. It is better to scan the whole table directly.

3. Least Space Principle

As mentioned before, when the space occupied by keywords is smaller, the number of keywords saved in each node will be more, which will be loaded into the memory each time. The more keywords there are, the higher the search efficiency will be.

Joint index:

Single column index: Keywords in the node [name]

Joint index: Keywords in the node [name, phoneNum]

Single-column indexes can be regarded as special joint indexes. The comparison of joint indexes is also based on the leftmost matching principle.

Principles for selecting joint index columns:

(1) Frequently used column priority (leftmost matching principle)

(2) Column priority with high discreteness (discrete Principle of high degree)

(3) Column priority with small width, (principle of least space)

The following is a simple example of problems that are often encountered:

For example, usually The frequently used query sql is as follows:

Select * from users where name = ?

Select * from users where name = ? and pahoneNum = ?

In order to speed up retrieval, create an index for the above query sql as follows:

Create index idx_name on users(name)

Create index idx_name_phoneNum on users(name, phoneNum)

In the above solution, according to the leftmost matching principle, idx_name is a redundant index, where name = ?The index idx_name_phoneNum can also be used for retrieval. Redundant indexes will increase or decrease the performance consumption of maintaining B TREE balance and occupy disk space.

Covered index:

If the queried column can be directly returned through the information of the index item, the index is called a covering index for querying SQL. Covering indexes can improve query efficiency.

The following explains the covering index through an example.

Table: teacher

Index:PK(id), key(name, phoneNum), unique(teacherNo)

Which of the following SQL uses covering index?

Select teacherNo from teacher where teacherNo = ?: When used, when teacherNo is retrieved, the teacherNo value in the index can be directly returned without entering the data area.

Select id, teacherNo from teacher where teacherNo = ?: When used, the leaf node of the auxiliary index saves the value of the primary index, so when the leaf node of the auxiliary index is retrieved, return id between.

Select name, phoneNum from teacher where teacherNo = ?: Not used

Select phoneNum from teacher where name = ?, used.

After knowing the covering index, you will know why SQL requires you not to use select * as much as possible, but to specify the specific fields to be queried. One reason is that when using the covering index, you do not need to enter Once in the data area, the data can be returned directly, improving query efficiency.

Through the previous study, we can easily understand the following conclusions:

1. The data length of the index column can be as small as possible if it meets the business needs.

2. The more indexes in the table, the better.

3. In the Where condition, like 9%, like %9%, like%9, the three methods do not use the index. The latter two methods are invalid for indexes. The first 9% is uncertain and depends on the discrete type of the column. In conclusion, it can be used. If the discrete situation is found to be particularly bad, the query optimizer feels that the index query performance is worse and it is not as good as a full table scan. .

4. Indexes cannot be used for NOT IN in the Where condition.

5. Use specified queries more often to return only the columns you want, and use select * less.

6. If a function is used in the query condition, the index will be invalid. This is related to the discrete type of the column. Once the function is used, the function is uncertain.

7. In the joint index, if the search is not started from the leftmost column of the index, the index cannot be used.

8. For joint indexes, the index can be used to exactly match the leftmost column and range match another column.

9. In the joint index, if the query has a range query of a certain column, all columns to the right of it cannot use the index.

Recommended Mysql tutorial "Mysql tutorial"

The above is the detailed content of Deeply understand the B+Tree index principle of Mysql. For more information, please follow other related articles on the PHP Chinese website!