What is an index in MySQL? Brief analysis of index storage model
This MySQL tutorial takes an in-depth look at indexes in MySQL and introduces the key concepts behind them. I hope it will be helpful to everyone!
MySQL is one of the most commonly used databases, found in companies large and small. How well do you know it? To use it well, we must first understand it. As the saying goes, a worker who wants to do his job well must first sharpen his tools. This article covers what an index is, how the index storage model is derived, and why the underlying data structure is a B+ tree.
What is an index?
Consider the following query on the user_innodb table, run while the name field has no index:
select * from user_innodb where name ='小马';
What if there is an index on the name field? Create an index on the name field and execute the same query again.
ALTER TABLE user_innodb DROP INDEX idx_name;
ALTER TABLE user_innodb ADD INDEX idx_name (name);
Compared with the query without an index, the query with an index is dozens of times faster.
This case shows very intuitively that an index can greatly improve data retrieval performance.
So what exactly is an index? Why does it have such a big impact on our queries? What happens when an index is created?
Index definition
Data is stored on disk in the form of files, and each row has a disk address. Without an index, to retrieve one row from a table of 5 million rows we can only scan the whole table until we find it.
With an index, we only need to look the row up in the index, because an index is a special data structure designed for fast retrieval. Once we find the disk address where the row is stored, we can fetch the data.
Index types
Normal: Also called a non-unique index. It is the most common kind of index and has no restrictions.
Unique: A unique index requires that key values are not repeated. Note that the primary key index is a special unique index with one additional restriction: the key value cannot be NULL. Primary key indexes are created with PRIMARY KEY.
Fulltext: For relatively large values, for example message content of several KB, a full-text index can address the low efficiency of LIKE queries. Full-text indexes can only be created on text-type columns such as CHAR, VARCHAR, and TEXT.
An index is a data structure, so what kind of data structure should it use to achieve efficient data retrieval?
Index storage model deduction
Binary search
Consider a number-guessing game. 10000? Too low. 30000? Too high. What would you guess next? 20000. Why not 11000 or 29000?
That is the idea behind binary search, also called half-interval search: each guess cuts the candidate data in half. This method is efficient when the data is already sorted, so first we can consider using an ordered array as the index data structure.
Equality and comparison queries on an ordered array are very efficient, but updates are a problem: a large amount of data may have to be moved (changing its index positions), so an ordered array is only suitable for static data.
To support frequent modifications such as inserts, we need a linked list. But a singly linked list still cannot be searched efficiently.
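To make the halving idea concrete, here is a minimal Python sketch of binary search over an ordered array. The function name and the sample data are illustrative assumptions, not anything MySQL exposes.

```python
def binary_search(sorted_keys, target):
    """Return (position or -1, number of comparisons) for target in sorted_keys."""
    low, high = 0, len(sorted_keys) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2      # guess the middle of the remaining range
        steps += 1
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            low = mid + 1            # too low: discard the left half
        else:
            high = mid - 1           # too high: discard the right half
    return -1, steps


# Hypothetical data: 500,000 sorted even numbers.
keys = list(range(0, 1_000_000, 2))
position, steps = binary_search(keys, 123_456)
print(position, steps)
```

Each comparison halves the remaining range, so even 500,000 keys need at most 19 comparisons. Inserting into the middle of `keys`, by contrast, would shift every element after the insertion point, which is the update cost described above.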
So, is there a linked list that supports binary search?
To solve this problem, the BST (Binary Search Tree) was born.
Binary Search Tree
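A binary search tree supports both fast lookup and fast insertion.
But it has a problem: lookup time depends on the depth of the tree, and in the worst case the time complexity degrades to O(n).
What is the worst case?
Take the same batch of numbers as before. Suppose the data we insert happens to be sorted: 2, 10, 12, 15, 21, 28.
The BST then turns into a linked list (a "skewed tree"). In this case it no longer speeds up retrieval, and is no more efficient than a sequential scan.
What causes the skew?
The depth difference between the left and right subtrees is too large. The left subtree of this tree has no nodes at all, so the tree is not balanced enough.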
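The degeneration can be reproduced with a small Python sketch. The class and function names are illustrative assumptions. The sorted sequence is the one from the text, and the second insertion order is a hypothetical shuffle of the same numbers.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Plain BST insert with no rebalancing; returns the (unchanged) root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def height(node):
    """Number of levels in the tree rooted at node."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def build(keys):
    root = None
    for key in keys:
        root = bst_insert(root, key)
    return root


sorted_tree = build([2, 10, 12, 15, 21, 28])   # sorted input: every node goes right
mixed_tree = build([15, 10, 21, 2, 12, 28])    # same keys, different order
print(height(sorted_tree), height(mixed_tree))
```

With sorted input the tree has one node per level, so a lookup walks all six nodes. With the mixed order the same six keys fit in three levels.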
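So, is there a more balanced tree, in which the depths of the left and right subtrees do not differ so much?
That is the balanced binary search tree, also known as the AVL tree.
Definition of a balanced binary tree: the absolute difference between the depths of the left and right subtrees cannot exceed 1.
What does that mean? For example, if the depth of the left subtree is 2, the depth of the right subtree can only be 1, 2, or 3.
Now if we insert 1, 2, 3, 4, 5, 6 in order, the tree stays balanced and does not become a skewed tree.
How does an AVL tree stay balanced? How does it guarantee that the depth difference between the left and right subtrees does not exceed 1? Take inserting 1, 2, 3 as an example.
After inserting 1 and 2, by the definition of a binary search tree 3 must go to the right of 2. The right subtree of the root node 1 now has depth 2 while the left subtree has depth 0, because there are no nodes on the left, which violates the definition of a balanced binary tree.
What should we do? The new node hangs off the right child of a right child (the right-right case), so we lift 2 up to become the root. This operation is called a left rotation.
Similarly, if we insert 7, 6, 5, we get the left-left case, and a right rotation lifts 6 up.
So to stay balanced, an AVL tree performs a series of calculations and adjustments whenever data is inserted or updated.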
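The rotations can be sketched in Python as follows. This is a minimal textbook-style AVL insert under the assumption of distinct keys. The names are illustrative and the code is not taken from MySQL.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1


def h(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def rotate_left(x):
    """Lift x's right child above x (fixes the right-right case)."""
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y


def rotate_right(y):
    """Lift y's left child above y (fixes the left-left case)."""
    x = y.left
    y.left = x.right
    x.right = y
    update(y)
    update(x)
    return x


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    update(node)
    balance = h(node.left) - h(node.right)
    if balance > 1:                      # left side too deep
        if key >= node.left.key:         # left-right case: rotate the child first
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                     # right side too deep
        if key < node.right.key:         # right-left case: rotate the child first
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


right_right = build([1, 2, 3])           # left rotation lifts 2
left_left = build([7, 6, 5])             # right rotation lifts 6
six_keys = build([1, 2, 3, 4, 5, 6])
print(right_right.key, left_left.key, six_keys.height)
```

Inserting 1, 2, 3 ends with 2 at the root, and inserting 7, 6, 5 ends with 6 at the root, matching the two cases described above. Six sorted keys end up in a tree of three levels.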
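We have solved the balance problem. How, then, does a balanced binary tree serve as an index for querying data? In a balanced binary tree each node has a fixed size. What should it store when used as an index?
First: the index key value. For example, if we create an index on id, a query with the condition where id = 1 finds the id key value in the index.
Second: the disk address of the data, because the purpose of an index is to locate where the data is stored.
Third: because this is a binary tree, the node must also hold references to its left and right child nodes so that we can find the next node. For example, when the value is greater than 26, we go right to the next node and continue comparing.
Let's see what problems arise if data is stored this way.
First, index data is stored on disk. To check the size of the data and the indexes: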
select CONCAT(ROUND(SUM(DATA_LENGTH/1024/1024),2),'MB') AS data_len,
       CONCAT(ROUND(SUM(INDEX_LENGTH/1024/1024),2),'MB') AS index_len
from information_schema.TABLES
where table_schema='gupao' and table_name='user_innodb';
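When we store an index in a tree structure, each block of data we fetch has to be compared to see whether it is the data we need. If it is not, the disk has to be read again. Visiting one node means one IO against the disk. The smallest unit InnoDB uses when operating on disk is a page (also called a disk block), which is 16K (16384 bytes) by default.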
So each tree node is 16K in size. If a node stores only one key, one data address, and its child references, then for an integer field, say, it may use only a dozen or a few dozen bytes, far below the 16K capacity, so a great deal of space is wasted on each IO that reads a tree node.
And if each node stores so little, finding the data we need in the index means visiting more nodes, which means more interactions with the disk.
In the era of mechanical hard disks, each read from disk costs about 10ms of seek time. The more interactions, the more time is consumed.
For example, in the tree pictured above the table has 6 rows. To query id=37 we have to go down through two levels of child nodes, which takes three disk interactions. What if we have millions of rows? The time is even harder to estimate.
So what is our solution?
First, let each node store more data.
Second, the more keys a node holds, the more pointers it has, which means it can have more branches.
The more branches there are, the shallower the tree becomes (the root node has depth 0). Our tree changes from tall and thin to short and fat.
At this point the tree is no longer binary but multi-way.
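A rough back-of-the-envelope calculation in Python shows how much the number of branches matters. It assumes an idealized, completely full tree. The fan-out of 1000 is an assumed round number for illustration, not a measured InnoDB value, while the 5 million rows and the 10ms per read are the figures used earlier in this article.

```python
def levels_needed(rows, fanout):
    """Levels of branching needed before `fanout ** levels` covers `rows` entries."""
    levels, reachable = 1, fanout
    while reachable < rows:
        reachable *= fanout
        levels += 1
    return levels


ROWS = 5_000_000          # table size used earlier in the article
SEEK_MS = 10              # approximate cost of one disk read on a mechanical disk

for fanout in (2, 1000):  # binary tree versus an assumed 1000-way tree
    levels = levels_needed(ROWS, fanout)
    print(fanout, levels, levels * SEEK_MS, "ms")
```

Under these assumptions a binary tree needs more than twenty levels, and therefore more than twenty disk reads, to reach a row, while a tree with a thousand branches per node needs only three.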
Like the AVL tree, the B-tree stores key values, data addresses, and node references in both branch nodes and leaf nodes.
It has one characteristic: the number of branches (ways) is always one more than the number of keys. For example, in the tree we drew, each node stores two keys, so there are three pointers to three child nodes.
What are the search rules of a B-tree?
For example, suppose we want to find 15 in this tree. 15 is less than 17, so go left. 15 is greater than 12, so go right. 15 is found in disk block 7, after only 3 IOs.
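The search rule can be sketched in Python. The original figure is not available here, so the tree below is a hypothetical reconstruction: only the values 17, 12, and 15 come from the text, and every other key and all names are assumptions made for illustration.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys stored in this node
        self.children = children or []    # len(keys) + 1 children, or none for a leaf


def search(node, key):
    """Return (found, nodes_read); each node read stands for one disk IO."""
    reads = 0
    while node is not None:
        reads += 1
        i = bisect_left(node.keys, key)   # first position whose key is >= target
        if i < len(node.keys) and node.keys[i] == key:
            return True, reads
        if not node.children:             # reached a leaf without a match
            return False, reads
        node = node.children[i]           # descend into the i-th branch
    return False, reads


# Hypothetical three-level tree with two keys and three branches per node.
root = BTreeNode([17, 35], [
    BTreeNode([8, 12], [BTreeNode([3, 5]), BTreeNode([9, 10]), BTreeNode([13, 15])]),
    BTreeNode([26, 30], [BTreeNode([18, 22]), BTreeNode([28, 29]), BTreeNode([31, 33])]),
    BTreeNode([65, 87], [BTreeNode([36, 60]), BTreeNode([75, 79]), BTreeNode([90, 99])]),
])

print(search(root, 15))
print(search(root, 17))
```

Searching for 15 reads three nodes, as in the walk-through above. Searching for 17 stops at the root, because a B-tree keeps data in branch nodes as well as leaf nodes.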
Is this more efficient than an AVL tree? And how does a B-tree let one node hold multiple keys while still staying balanced? How does it differ from an AVL tree?
For example, with a Max Degree (number of ways) of 3, we insert 1, 2, and 3. When 3 is inserted it should go into the first disk block, but a node with three keys would have 4 pointers and 4 child branches, so the node must split. The middle value 2 is lifted up, and 1 and 3 become child nodes of 2.
If a node is deleted, the reverse operation, a merge, takes place.
Note that these are splits and merges, which differ from the left and right rotations of the AVL tree.
If we continue to insert 4 and 5, the B-tree splits again.
From this we can see that updating an index can trigger many adjustments to the index structure. This explains why we should not build indexes on frequently updated columns, and why the primary key should not be updated.
The splitting and merging of nodes are actually the splitting and merging of InnoDB pages.
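The split described above can be sketched in Python. This is a simplified in-memory B-tree insert with a Max Degree of 3, written for illustration only: deletion and merging are omitted, and the names are assumptions rather than InnoDB's actual page-split code.

```python
from bisect import bisect_left, insort

MAX_DEGREE = 3  # at most 3 children per node, so at most 2 keys


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []


def _insert(node, key):
    """Insert below node; return (middle_key, new_right_node) if node had to split."""
    if not node.children:
        insort(node.keys, key)                      # leaf: insert in sorted order
    else:
        i = bisect_left(node.keys, key)
        promoted = _insert(node.children[i], key)
        if promoted:                                # child split: absorb its middle key
            mid_key, right = promoted
            node.keys.insert(i, mid_key)
            node.children.insert(i + 1, right)
    if len(node.keys) < MAX_DEGREE:
        return None                                 # still fits, no split needed
    mid = len(node.keys) // 2
    mid_key = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return mid_key, right


def insert(root, key):
    promoted = _insert(root, key)
    if promoted:                                    # root split: the tree grows a level
        mid_key, right = promoted
        return Node([mid_key], [root, right])
    return root


def build(keys):
    root = Node()
    for key in keys:
        root = insert(root, key)
    return root


after_three = build([1, 2, 3])
after_five = build([1, 2, 3, 4, 5])
print(after_three.keys, [c.keys for c in after_three.children])
print(after_five.keys, [c.keys for c in after_five.children])
```

After inserting 1, 2, 3 the middle key 2 becomes the root with 1 and 3 as children. After 4 and 5 the right leaf overflows and splits, lifting 4 into the root next to 2.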
The B-tree is already very efficient. Why does MySQL still improve on it and use the B+ tree in the end?
Generally speaking, the B+ tree, this improved version of the B-tree, solves the problems more comprehensively than the B-tree does.
Let's take a look at the storage structure of the B+ tree in InnoDB:
The B+ tree in MySQL has several characteristics:
The number of its keys is equal to the number of branches;
The B+ tree does not store data in its root node or branch nodes. Only leaf nodes store data. A key lookup does not return as soon as the key is matched but continues down to the leaf nodes in the last level. For example, if we search for id=28, the key is hit in the first level, but all the data is in the leaf nodes, so the search continues downwards all the way to a leaf node.
Each leaf node of the B+ tree has an extra pointer to the adjacent leaf node: its last entry points to the first entry of the next leaf node, forming an ordered linked list.
It retrieves data based on intervals that are closed on the left and open on the right, [ ).
The B+ tree's data search process:
First, if we search for 28, we find the key value at the root node, but because the root is not a leaf node we continue searching downwards. 28 is the boundary value of the left-closed, right-open interval [28,66), so we take the middle child node and continue. 28 is also the boundary value of the interval [28,34), so we take the left child node, and finally find the required data on a leaf node.
Second, for a range query, for example data from 22 to 60, once 22 is found we only need to follow the leaf nodes and their pointers in order to reach all the data nodes in one pass. This greatly improves the efficiency of range queries (there is no need to go back to the parent nodes and search again).
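The range scan over linked leaf nodes can be sketched in Python. This is a minimal two-level model: the range 22 to 60 and the boundaries 28 and 66 come from the text, while the remaining keys and all class and function names are assumptions made for illustration.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys          # sorted keys (the row data would live here too)
        self.next = None          # pointer to the adjacent leaf


class Branch:
    def __init__(self, keys, children):
        self.keys = keys          # keys[i] is the smallest key under children[i]
        self.children = children  # same count as keys: child i covers [keys[i], keys[i+1])


def find_leaf(node, key):
    """Descend to the leaf whose left-closed, right-open interval contains key."""
    while isinstance(node, Branch):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_scan(root, low, high):
    """Collect all keys in [low, high] by walking the leaf linked list."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result     # past the upper bound: stop without revisiting parents
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result


# Hypothetical leaves, linked left to right.
leaves = [Leaf([1, 8, 22]), Leaf([28, 34, 60]), Leaf([66, 75, 99])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Branch([1, 28, 66], leaves)

print(range_scan(root, 22, 60))
```

The tree is descended only once, to locate 22. Everything after that is a sequential walk along the `next` pointers until a key exceeds 60.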
Characteristics of the B+ tree in InnoDB:
It is a variant of the B-tree, so it solves every problem the B-tree solves. What are the two major problems the B-tree solves? (Each node stores more keys; more branches.);
Stronger database and table scanning capabilities (for a full table scan we only need to traverse the leaf nodes; there is no need to traverse the entire B+ tree to get all the data);
The B+ tree reads and writes disk more efficiently than the B-tree (the root node and branch nodes do not store the data area, so a node can hold more keys and more keys are loaded from disk at a time);
Stronger sorting ability (the leaf nodes hold a pointer to the next data area, so the data forms a linked list);
More stable efficiency (the B+ tree always fetches data at the leaf nodes, so the number of IOs is stable).
After reading this, you should understand why MySQL chooses the B+ tree as the data structure model for its indexes.