Is MySQL's join function too weak?-Mysql Tutorial-php.cn

Todaymysql tutorial column introduces the join function.

Is MySQL's join function too weak?

Regarding MySQL join, you must have known a lot of "anecdotes" about it. For example, a two-table join requires a small table to drive a large table. Alibaba developer specifications prohibit three For join operations of more than one table, MySQL's join function is extremely weak and so on. These norms or remarks may be true or false, sometimes right or sometimes wrong. You need to have an in-depth understanding of join to understand clearly.

Next, let’s take a comprehensive look at MySQL’s join operation.

Text

In daily database queries, we often need to join multiple tables to obtain the merged data of multiple tables at one time. This requires the use of join in the database. grammar. Join is a very common operation in the data field to merge two data sets. If you know more about it, you will find that MySQL, Oracle, PostgreSQL and Spark all support this operation. The protagonist of this article is MySQL. If there is no special explanation below, MySQL's join will be used as the main subject. Oracle, PostgreSQL and Spark can be regarded as the big bosses that beat them. Their algorithm optimization and implementation of join are better than MySQL.

MySQL's join has many rules. If you are not careful, a bad join statement may not only cause a full table query on a certain table, but also may affect The cache of the database causes most hot data to be replaced, dragging down the performance of the entire database.

Therefore, the industry has summarized many norms or principles for MySQL joins, such as small tables driving large tables and prohibiting join operations of more than three tables. Below we will introduce the MySQL join algorithm in turn, compare it with the join implementation of Oracle and Spark, and intersperse answers to why the above specifications or principles are formed.

For the implementation of join operations, there are probably three more common algorithms: Nested Loop Join (loop nested join), Hash Join (hash join) and Sort Merge Join (sort merge join), they each have their own advantages, disadvantages and applicable conditions, we will introduce them in turn next.

Implementation of Nested Loop Join in MySQL

Nested Loop Join scans the driver table. Every time a record is read, the corresponding data is queried in the driven table according to the index on the associated field of the join. . It is suitable for scenarios where the subset of data to be connected is small. It is also the only algorithm implementation of MySQL join. We will explain its details in detail next.

There are two variants of the Nested Loop Join algorithm in MySQL, namely Index Nested-Loop Join and Block Nested-Loop Join.

Index Nested-Loop Join algorithm

Next, let’s initialize the related table structure and data

CREATE TABLE `t1` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`)
) ENGINE=InnoDB;

delimiter ;;
# 定义存储过程来初始化t1
create procedure init_data()
begin
  declare i int;
  set i=1;
  while(i<p>As can be seen from the above command, both tables have a The primary key index id and an index a, there is no index on field b. The stored procedure init_data inserts 10,000 rows of data into table t1 and 500 rows of data into table t2. </p><p>In order to prevent the MySQL optimizer from selecting the table as the driver table and affecting the execution process of the analyzed SQL statement, we directly use straight_join to let MySQL use a fixed connection table order for querying. In the following statement, t1 is the driver Table, t2 is the driven table. </p><pre class="brush:php;toolbar:false">select * from t2 straight_join t1 on (t2.a=t1.a);复制代码

Copy after login

Use the explain command introduced in our previous article to view the execution plan of the statement.

As you can see from the above figure, the a field on the t1 table is indexed, and the index is used in the join process. Therefore, the execution flow of the SQL statement is as follows:

Read a row of data L1 from the t2 table;
Use the a field of L1 to query the t1 table as a condition;
Get out the data that satisfies the condition in t1 The conditional rows form corresponding rows with L1 and become part of the result set;
Repeated execution until the t2 table is scanned.

We call this process Index Nested-Loop Join, or NLJ for short, and its corresponding flow chart is as follows.

It should be noted that in the second step, when querying table t1 based on the a field, an index is used, so each scan will only scan one row (from explain the results, which vary according to different case scenarios).

Assume that the number of rows in the driving table is N and the number of rows in the driven table is M. Because during the execution of this join statement, the driving table performs a full table scan, while the driven table uses an index, and each row of data in the driving table must be indexed in the driven table for index query, so the entire join process is The approximate complexity is N2log2M. Obviously, N has a greater impact on the number of scanned rows, so in this case, a small table should be used as the driving table.

Of course, the premise of all this is that the associated field of join is a, and there is an index on the a field of the t1 table.

如果没有索引时，再用上图的执行流程时，每次到 t1 去匹配的时候，就要做一次全表扫描。这也导致整个过程的时间复杂度编程了 N * M，这是不可接受的。所以，当没有索引时，MySQL 使用 Block Nested-Loop Join 算法。

Block Nested-Loop Join

Block Nested-Loop Join的算法，简称 Is MySQLs join function too weak?，它是 MySQL 在被驱动表上无可用索引时使用的 join 算法，其具体流程如下所示：

把表 t2 的数据读取当前线程的 join_buffer 中，在本篇文章的示例 SQL 没有在 t2 上做任何条件过滤，所以就是讲 t2 整张表放入内存中；
扫描表 t1，每取出一行数据，就跟 join_buffer 中的数据进行对比，满足 join 条件的，则放入结果集。

比如下面这条 SQL

select * from t2 straight_join t1 on (t2.b=t1.b);复制代码

Copy after login

这条语句的 explain 结果如下所示。可以看出

可以看出，这次 join 过程对 t1 和 t2 都做了一次全表扫描，并且将表 t2 中的 500 条数据全部放入内存 join_buffer 中，并且对于表 t1 中的每一行数据，都要去 join_buffer 中遍历一遍，都要做 500 次对比，所以一共要进行 500 * 10000 次内存对比操作，具体流程如下图所示。

主要注意的是，第一步中，并不是将表 t2 中的所有数据都放入 join_buffer，而是根据具体的 SQL 语句，而放入不同行的数据和不同的字段。比如下面这条 join 语句则只会将表 t2 中符合 b >= 100 的数据的 b 字段存入 join_buffer。

select t2.b,t1.b from t2 straight_join t1 on (t2.b=t1.b) where t2.b >= 100;复制代码

Copy after login

join_buffer 并不是无限大的，由 join_buffer_size 控制，默认值为 256K。当要存入的数据过大时，就只有分段存储了，整个执行过程就变成了：

扫描表 t2，将符合条件的数据行存入 join_buffer，因为其大小有限，存到100行时满了，则执行第二步；
扫描表 t1，每取出一行数据，就跟 join_buffer 中的数据进行对比，满足 join 条件的，则放入结果集；
清空 join_buffer；
再次执行第一步，直到全部数据被扫描完，由于 t2 表中有 500行数据，所以一共重复了 5次

这个流程体现了该算法名称中 Block 的由来，分块去执行 join 操作。因为表 t2 的数据被分成了 5 次存入 join_buffer，导致表 t1 要被全表扫描 5次。

	全部存入	分5次存入
内存操作	10000 * 500	10000 * (100 + 100 + 100 + 100 + 100)
扫描行数	10000 + 500	10000 * 5 + 500

As shown above, compared with table data that can be all stored in join_buffer, the number of memory judgments has not changed. It is the product of the row numbers of the two tables, which is 10000 * 500, but the driven table will be scanned multiple times. , every time it is saved, the driven table will be scanned again, which affects the final execution efficiency.

Based on the above two algorithms, we can draw the following conclusion, which is also the standard for most MySQL join statements on the Internet.

There is an index on the driven table, that is, when the Index Nested-Loop Join algorithm can be used, the join operation can be used.
Whether it is the Index Nested-Loop Join algorithm or Block Nested-Loop Join, a small table must be used as the driving table.

Because the time complexity of the above two join algorithms is at least also has a first-order relationship with the number of rows in the table involved, and it takes a lot of memory space , so it is understandable that Alibaba developer specifications strictly prohibit join operations of more than three tables.

But the above two algorithms are only one of the join algorithms. There are also more efficient join algorithms, such as Hash Join and Sorted Merged Join. Unfortunately, these two algorithms are not currently available in the mainstream version of MySQL, but Oracle, PostgreSQL and Spark all support it. This is also the reason why online complaints about MySQL are so weak(MySQL 8.0 version supports Hash join, but 8.0 currently supports it. Not yet a mainstream version).

In fact, Alibaba developer specifications also stipulated that when migrating from Oracle to MySQL, the join operation performance of MySQL was too poor to prohibit the join operation of more than three tables.

Hash Join Algorithm

Hash Join scans the driver table, uses the associated fields of the join to create a hash table in the memory, and then scans the driven table, reading out each row of data from the hash table Find the corresponding data. It is a common method for large data set connection operations. It is suitable for scenarios where the amount of data in the driver table is small and can be put into memory. It can be used for large tables without indexes and parallel queries. Best performance. Unfortunately, it only applies to equi-join scenarios, such as on a.id = where b.a_id.

It is still the join statement of the above two tables, and its execution process is as follows

Take out the qualified data from the driver table t2, and The join field value of each row is hashed and then stored in the hash table in memory;
Traverse the driven table t1, and each row of data that meets the conditions is taken out, and the join field value is also hashed. Get the result into the hash table in memory to find a match. If found, it becomes part of the result set.

It can be seen that This algorithm is similar to Block Nested-Loop Join, except that the unordered Join Buffer is changed to a hash table hash table, so that the data does not match It is no longer necessary to traverse all the data in the join buffer, but directly use hashing to obtain the matching row with a time complexity close to O(1), which greatly improves the join speed of the two tables.

However, due to the characteristics of hash, this algorithm can only be applied to the equivalent connection scenario, and this algorithm cannot be used in other connection scenarios.

Sorted Merge Join algorithm

Sort Merge Join first sorts the two tables according to the associated fields of the join (if they are already sorted, such as if there is an index on the field, there is no need to sort again) , and then perform a merge operation on the two tables. If the two tables have been sorted, there is no need to sort them again when performing a sort merge join. In this case, the performance of Merge Join will be better than Hash Join. Merge Join can be adapted to non-equivalent Joins (>, =, ).

It should be noted that if the connected field already has an index, that is to say, if it has been sorted, the merge operation can be performed directly. However, if the connected field does not have an index, its execution process is as follows shown.

Traverse table t2, read the data that meets the conditions, and sort them according to the value of the connection field a;
Traverse table t1, Read the data that meets the conditions and sort them according to the value of the connection field a;
Merge the two sorted data to obtain the result set.

The main time consumption of the Sorted Merge Join algorithm is the sorting operation of the two tables, so if the two tables have been sorted according to the connection field, the algorithm is even faster than the Hash Join algorithm. In one case, this algorithm is faster than the Nested Loop Join algorithm.

Now, let’s summarize the differences, advantages and disadvantages of the above three algorithms.

##Nested Loop JoinHash JoinSorted Merge JoinConnection conditionsApplies to any conditionApplies only to equivalence connections (=)Equal or non-equivalent connections ( >, =, ' Mainly consumes resourcesCPU, disk I/ OMemory, temporary spaceMemory, temporary spaceFeaturesWhen there is a highly selective index or restrictive The search efficiency is relatively high and the first search result can be quickly returnedWhen there is a lack of index or the index conditions are ambiguous, Hash Join is more effective than Nested Loop. Usually faster than Merge Join. In a data warehouse environment, if the table has a large number of records, the efficiency is highWhen there is a lack of indexes or the index conditions are ambiguous, Sort Merge Join is more effective than Nested Loop. When the connection field has an index or is sorted in advance, it is faster than hash join and supports more connection conditionsDisadvantagesEfficiency when there is no index or there are many table records LowBuilding a hash table requires a lot of memory, and the first result return is slowAll tables need to be sorted. It is designed for optimal throughput and does not return data until all results are foundRequires an indexYes (no indexing is too inefficient)NoNo

Understanding of Join operation

After talking about Join-related algorithms, let’s also talk about it here Let’s talk about the business understanding of join operation.

When the business is not complex, most joins are not irreplaceable. For example, the order record generally only contains the user_id of the order user. When returning information, the user's name needs to be obtained. The possible implementation solutions are as follows:

Two database operations, two queries, the first time to obtain the order information and user_id, the second time to get the name based on the user_id, and use the code program to merge the information;
Use redundant user names or read from non-relational databases such as ES.

The above solutions can solve the problem of data aggregation, and are processed based on program code, which is easier to debug and optimize than database join. For example, the user name is not fetched from the database, but first from the cache. Find in .

Of course, the join operation is not without merit, so the technology has its usage scenarios. The above solutions or rules are summarized by the Internet development team and are suitable for high concurrency, light writing and heavy reading, distribution, and business logic. In simple cases, these scenarios generally do not have high requirements for data consistency, and even dirty reads are allowed.

However, in enterprise application scenarios such as financial banking or finance, join operations are indispensable. These applications are generally low concurrency, frequent complex data writing, CPU-intensive rather than IO-intensive, and their main business The logic is processed through the database and even systems that contain a large number of stored procedures and have high requirements for consistency and integrity.

More related free learning recommendations: mysql tutorial(Video)

The above is the detailed content of Is MySQL's join function too weak?. For more information, please follow other related articles on the PHP Chinese website!