
How does MySQL handle large data tables? Solution sharing

青灯夜游 · Reposted: 2022-10-12 19:58:42 · 2287 views

How does MySQL handle large data tables? This article walks through several ways of dealing with big tables in MySQL; I hope it helps.

Scenario:

As the tables in a business database keep growing, you may run into situations like the following. Let's work through the problem together.

Inserting and querying data takes a long time.
Adding new columns to the table for later business requirements has a large impact.
Not all the data in the table is valid data; queries only need part of it (for example, the data within a certain time range).

Evaluating table data volume

Data volume can be evaluated from three angles: table capacity, disk space, and instance capacity. Let's look at each in turn.

Table capacity:

Table capacity is evaluated mainly from the table's row count, average row length, growth rate, read/write volume, and total size. For OLTP tables, the general recommendation is that a single table should not exceed 20 million rows and 15 GB in total size. Access volume: the read/write volume of a single table should stay within 1600/s.

How to query the row count: the classic SQL statements for checking how many rows a table holds are as follows:

select count(*) from table;
select count(1) from table;

When the table is very large, however, such a query may time out, so we need a different way to query:

use database_name;
show table status like 'table_name';
or
show table status like 'table_name'\G

This method not only reports the row count but also prints detailed information about the table, including the table name, storage engine, version, number of rows, and average bytes per row. Adding \G formats the output vertically. Note that for InnoDB tables the reported row count is an estimate. You can try it yourself.

Disk space

View the capacity of every table across all databases

select
  table_schema as 'Database',
  table_name as 'Table',
  table_rows as 'Rows',
  truncate(data_length/1024/1024, 2) as 'Data size (MB)',
  truncate(index_length/1024/1024, 2) as 'Index size (MB)'
from information_schema.tables
order by data_length desc, index_length desc;

Query the disk usage of all tables in a single database

select
  table_schema as 'Database',
  table_name as 'Table',
  table_rows as 'Rows',
  truncate(data_length/1024/1024, 2) as 'Data size (MB)',
  truncate(index_length/1024/1024, 2) as 'Index size (MB)'
from information_schema.tables
where table_schema='mysql'
order by data_length desc, index_length desc;

The query results are as follows:

[Figure: query results]

It is recommended that data occupy no more than 70% of the disk. For data that grows rapidly, consider archiving it to large, slow disks (for archiving, see Option three).

Instance capacity

MySQL uses a thread-based service model, so in some high-concurrency scenarios a single instance cannot make full use of the server's CPU resources, and throughput gets stuck at the MySQL layer. Choose an instance deployment model that suits your business.

Causes of the problem

We now know how to measure the size of our tables. So what is the root cause: why does the business get slower as a single table grows?

Once a table reaches tens or hundreds of millions of rows, adding an index no longer helps as much. Performance degrades because the B+ tree that maintains the index gets taller, so looking up a row takes more disk I/O and queries become slower.

Do you still remember how much data a B+ tree can store?

The smallest storage unit of the InnoDB storage engine is the page, and a page is 16 KB in size.

In a B+ tree, the leaf nodes store the data and the internal nodes store key values and pointers. An index-organized table uses binary search within the non-leaf nodes, together with the pointers, to determine which page holds the data, and then looks for the required row in that data page.

Assume the B+ tree has a height of 2, that is, one root node and several leaf nodes. The total number of records in this B+ tree = number of pointers in the root node * number of rows in a single leaf node.

If a row of data is 1 KB, then a single leaf node can store 16 KB / 1 KB = 16 rows.
How many pointers does a non-leaf node store? Assume the primary key ID is a bigint, 8 bytes long (if an interviewer asks about int, an int is 32 bits, 4 bytes), and the pointer size is set to 6 bytes in the InnoDB source code. Each entry therefore takes 8 + 6 = 14 bytes, and a page holds 16 KB / 14 B = 16 * 1024 B / 14 B ≈ 1170 pointers.

Therefore, a B+ tree with a height of 2 can store 1170 * 16 = 18720 such records. In the same way, a B+ tree with a height of 3 can store 1170 * 1170 * 16 = 21902400 records, roughly 20 million. The height of a B+ tree is generally 1 to 3 levels, which is enough to store tens of millions of rows.
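The arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the function name and parameters are made up for illustration, and it uses the same simplifying assumptions as the text (16 KB pages, 1 KB rows, an 8-byte bigint key, a 6-byte pointer, and no page overhead).

```python
PAGE_SIZE = 16 * 1024  # InnoDB page size in bytes (16 KB, as in the text)


def btree_capacity(height, row_size=1024, key_size=8, pointer_size=6):
    """Estimate how many rows an index tree of the given height can hold."""
    if height < 1:
        raise ValueError("height must be at least 1")
    rows_per_leaf = PAGE_SIZE // row_size            # 16 rows of 1 KB each
    fanout = PAGE_SIZE // (key_size + pointer_size)  # 1170 pointers per page
    # Every level above the leaves multiplies the number of reachable leaf pages.
    return fanout ** (height - 1) * rows_per_leaf


for height in (1, 2, 3, 4):
    print(height, btree_capacity(height))
```

Under these assumptions, going from height 3 to height 4 multiplies the capacity by 1170 again, which is why the tree only gains a level once a table grows well past the tens of millions of rows.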

If the B+ tree has to store more data, it gets taller. Looking up a row then requires more disk I/O, so queries become slower.

How to solve the problem of too much data in a single table and slow queries

Now that we know the root cause, we need to consider how to optimize the database to solve it.

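Three solutions are presented here: data table partitioning, database and table sharding, and hot/cold data archiving. Once you understand them, you can pick the one that fits your business.

Option one: data table partitioning

Why partition: with partitioning, a query can look for its data within the relevant range only, which narrows the search. Partitioned indexes can further improve the hit rate and query efficiency.

Partitioning means distributing a table's data across different files according to some condition. Before partitioning, everything is stored in one file. After partitioning it is still the same table; the data has merely been spread across different files.

Let's first look at the pros and cons of partitioning:

What are the benefits of table partitioning?

More data can be stored than fits on a single disk or file system partition.
Data that is no longer worth keeping can usually be removed easily by dropping the partitions that contain it. Conversely, in some cases adding new data is made much easier by adding a new partition specifically for that data.
Some queries can be greatly optimized, because the data satisfying a given WHERE clause may be stored in only one or a few partitions, so the remaining partitions need not be searched. Because partitions can be altered after a partitioned table has been created, you can reorganize the data to speed up frequent queries that were not anticipated when the partitioning scheme was first set up.
Queries involving aggregate functions such as SUM() and COUNT() can in principle be parallelized easily. A simple example is "SELECT salesperson_id, COUNT(orders) AS order_total FROM sales GROUP BY salesperson_id;". "Parallelized" means the query can run on every partition simultaneously, and the final result is obtained simply by summing the per-partition results.
Greater query throughput can be achieved by spreading data seeks across multiple disks.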
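To make the "only search the relevant partitions" benefit concrete, here is a small Python sketch that mimics how a range lookup narrows down the partitions to scan. It is not MySQL's implementation; the partition names, the year column, and both helper functions are hypothetical, and the bounds follow the "values less than" convention of RANGE partitioning.

```python
import bisect

# Hypothetical RANGE partitions on a "year" column: each partition holds rows
# whose value is strictly less than its upper bound.
PARTITIONS = [("p2020", 2021), ("p2021", 2022), ("p2022", 2023), ("pmax", float("inf"))]
UPPER_BOUNDS = [bound for _, bound in PARTITIONS]


def partition_for(value):
    """Return the name of the partition that stores the given value."""
    # The first partition whose upper bound is greater than the value owns it.
    return PARTITIONS[bisect.bisect_right(UPPER_BOUNDS, value)][0]


def prune(low, high):
    """Return the partitions a query with low <= year <= high has to scan."""
    first = bisect.bisect_right(UPPER_BOUNDS, low)
    last = bisect.bisect_right(UPPER_BOUNDS, high)
    return [name for name, _ in PARTITIONS[first:last + 1]]


print(partition_for(2021))  # p2021
print(prune(2021, 2022))    # ['p2021', 'p2022']
```

A query restricted to 2021 and 2022 touches two of the four partitions in this sketch; a query without a condition on the partitioning column would have to touch all of them.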

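Limitations of table partitioning

A table can have at most 1024 partitions (8192 starting with MySQL 5.6.7).
In MySQL 5.1, the partitioning expression must be an integer or an expression that returns an integer. MySQL 5.5 added support for partitioning on non-integer expressions.
If the table has a primary key or unique indexes, every one of them must include all columns used in the partitioning expression. In other words, the partitioning columns must be part of every primary key and unique key on the table; a table with neither has no such restriction.
Foreign key constraints cannot be used on partitioned tables.
MySQL partitioning applies to all of a table's data and indexes: you cannot partition the data but not the indexes, or the indexes but not the data, or only part of the table's data.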
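The restriction on primary and unique keys is the one that trips people up most often. MySQL requires every primary key and unique key to contain all columns used in the partitioning expression, and the rule can be checked before running any DDL. The sketch below is a plain Python illustration; the function, table, and key names are all hypothetical.

```python
def violating_keys(partition_columns, unique_keys):
    """Return the names of the keys that do not contain every partitioning column.

    partition_columns: columns used in the partitioning expression.
    unique_keys: mapping of key name -> list of columns (primary key included).
    """
    required = set(partition_columns)
    return [name for name, cols in unique_keys.items() if not required <= set(cols)]


# Hypothetical orders table that we want to partition by created_at.
keys = {"PRIMARY": ["id"], "uk_order_no": ["order_no", "created_at"]}
print(violating_keys(["created_at"], keys))  # ['PRIMARY']

# Extending the primary key to (id, created_at) satisfies the rule.
keys["PRIMARY"] = ["id", "created_at"]
print(violating_keys(["created_at"], keys))  # []
```

In practice this usually means widening the primary key to include the partitioning column, as in the second half of the example.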

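Before partitioning, you can check whether your database supports it as follows:

mysql> show variables like '%partition%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| have_partitioning | YES   |
+-------------------+-------+
1 row in set (0.00 sec)

The have_partitioning variable was removed in MySQL 5.6; on 5.6 and 5.7, check the output of show plugins for the partition plugin instead.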

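Option two: database table splitting

Why split tables: after splitting, each table obviously holds less data, the tree gets shorter, queries incur less disk I/O, and efficiency improves.

MySQL table splitting comes in two forms: horizontal splitting and vertical splitting.

Database and table sharding exists to solve the performance degradation caused by excessive data volume. The original single database is split into several databases and a large table is split into several tables, so that each database and each table holds less data and database performance improves.

Horizontal table splitting

Definition: splitting by rows. Put simply, the data is split by some rule into multiple tables or multiple databases. It covers both splitting tables within one database and splitting across databases. For example, a table with 40 million rows that is slow to query can be split into four tables of 10 million rows each.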

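Vertical table splitting

Definition: splitting by columns, based on how closely related the data is. A common case is splitting a table's frequently used columns from its rarely used ones and linking the two by primary key. Another is a database that holds both an orders table and a users table, each very large, split vertically so that a user database stores the users table and an order database stores the orders table.

Disadvantages: the drawback of vertical splitting is fairly obvious. The data is no longer in one table, which adds operations such as join or union.

With these two concepts in hand, let's look at the sharding schemes.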

1. Modulo scheme:

Estimate the data volume before splitting. For example, the user table has 40 million rows and needs to be split into 4 tables: user1, user2, user3, and user4. For id = 17, 17 modulo 4 is 1, plus 1 gives 2, so this row is stored in the user2 table.

Note: after horizontal splitting, auto_increment should be removed from the tables. IDs can then be generated with a dedicated auto-increment ID table or with the Redis INCR command.
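The routing rule is simple enough to sketch in a few lines of Python. The function name is made up for illustration; the table names and the "modulo, then add 1" rule come from the example above.

```python
from collections import Counter

TABLE_COUNT = 4  # user1 .. user4, as in the example above


def table_for(user_id, table_count=TABLE_COUNT):
    """Route an id to a table: id modulo the table count, plus 1 for the suffix."""
    return f"user{user_id % table_count + 1}"


print(table_for(17))  # user2

# Sequential ids spread evenly across the four tables.
print(Counter(table_for(i) for i in range(1, 40001)))
```

Because consecutive ids land in different tables, recent (and usually hotter) rows are spread across all four tables instead of piling up in one.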


Advantages: the data is spread evenly across the tables, so hot spots are very unlikely.

Disadvantages: later expansion and data migration are difficult. When the data grows and the 4 tables have to become 8, the modulus changes and the existing data has to be migrated again.
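How much data has to move when the modulus changes? A quick Python experiment gives a feel for it. This is a sketch over a sample of sequential ids, with a hypothetical helper name, going from 4 tables to 8.

```python
def shard_index(row_id, shard_count):
    """Index of the table a row lands in under plain modulo routing."""
    return row_id % shard_count


ids = range(1, 8001)
# A row must be migrated if its table index differs between the old and new layout.
moved = [i for i in ids if shard_index(i, 4) != shard_index(i, 8)]
print(len(moved) / len(ids))  # 0.5
```

Doubling the table count moves half of the rows in this sample. If the new table count is not a multiple of the old one, the share of rows that move is usually even higher.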

2. Range scheme

Split the data by range, so that rows whose ids fall within a given range are stored in a given table. For example, id = 12 is stored in the user1 table and id = 13 million is stored in the user2 table.

Advantages: makes future data expansion easy.

Disadvantages: if the hot data all sits in one table, that table takes all the load while the other tables sit idle.
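Range routing can be sketched the same way. The width of each range is an assumption here: 10 million ids per table, which matches the earlier example of 40 million rows in four tables. The function name is hypothetical.

```python
ROWS_PER_TABLE = 10_000_000  # assumed width of each id range


def table_for(user_id):
    """Route an id to user1, user2, ... by the range it falls into."""
    return f"user{user_id // ROWS_PER_TABLE + 1}"


print(table_for(12))          # user1
print(table_for(13_000_000))  # user2
print(table_for(45_000_000))  # user5
```

Expansion is easy because ids beyond 40 million simply map to a new table (user5 here) without touching existing rows. The flip side is that the newest ids, which are often the hottest, all land in the last table.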

We can see that both schemes have shortcomings, but they complement each other. So what happens if we combine them?

3. Combination of hash modulo and range scheme

As shown in the figure below, a group stores the data with ids from 0 to 40 million and contains three databases: DB_0, DB_1, and DB_2. DB_0 has four tables, and DB_1 and DB_2 have three tables each.

If the id is 15000, first take it modulo 10 (10 because there are 10 tables in total). The result is 0, so the row falls into DB_0, and then, by range, into Table_0.

[Figure: a group covering ids 0 to 40 million, split across DB_0, DB_1, and DB_2]

Summary: combining hash modulo with the range scheme avoids the hot-data problem and also makes future data expansion easier.
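Here is one way the combined routing could look in Python. The text only tells us that id 15000 gives remainder 0, lands in DB_0, and then in Table_0. The full remainder-to-database mapping and the even split of the id range among each database's tables are assumptions made for this sketch, as are all the names.

```python
GROUP_MAX_ID = 40_000_000  # the group covers ids 0 to 40 million

# Assumed mapping of id % 10 to databases: one remainder per table.
DB_BY_REMAINDER = {0: "DB_0", 1: "DB_0", 2: "DB_0", 3: "DB_0",
                   4: "DB_1", 5: "DB_1", 6: "DB_1",
                   7: "DB_2", 8: "DB_2", 9: "DB_2"}
TABLES_PER_DB = {"DB_0": 4, "DB_1": 3, "DB_2": 3}


def route(row_id):
    """Pick the database by modulo, then the table inside it by id range."""
    if not 0 <= row_id < GROUP_MAX_ID:
        raise ValueError("id belongs to another group")
    db = DB_BY_REMAINDER[row_id % 10]
    # Assumed: each database splits the group's id range evenly among its tables.
    table_index = row_id * TABLES_PER_DB[db] // GROUP_MAX_ID
    return db, f"Table_{table_index}"


print(route(15000))  # ('DB_0', 'Table_0')
```

The modulo step spreads consecutive ids over the three databases, which takes care of hot spots, while growth beyond 40 million would be handled by adding another group with its own databases rather than by re-hashing existing rows.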

Now that we know about MySQL partitioning and table splitting, let's look at the differences between the two techniques and where each one applies.

The difference between partitioning and table splitting:

1. In terms of implementation

With table splitting, the tables really are separate. After one table is split into many, each small table is a complete table with its own three files (with the MyISAM engine): a .MYD data file, a .MYI index file, and a .frm table structure file.
Partitioning is different. After a large table is partitioned it is still one table and does not become two; it simply has more blocks storing its data.

2. In terms of improving performance

Table splitting focuses on improving MySQL's concurrency when accessing data.
Partitioning focuses on breaking through the read/write capacity of the disk in order to improve MySQL performance.

3. In terms of implementation difficulty

There are many ways to split tables. Using MERGE tables is the simplest: it is about as difficult as partitioning and can be transparent to the program code. Other table-splitting methods are more troublesome than partitioning.
Partitioning is relatively simple to implement. Creating a partitioned table is little different from creating an ordinary table, and it is transparent to the code.

The relationship between partitioning and table splitting

1. Both can improve MySQL performance and hold up well under high concurrency.

2. Table splitting and partitioning do not conflict and can be used together. For tables with heavy traffic and a large amount of data, we can combine the two. For tables with light traffic but a large amount of data, partitioning alone is enough.

Problems with database and table sharding

1. Transaction problems

After sharding, the data is stored in different databases, so transaction management becomes difficult. Relying on the database's own distributed transaction support comes at a high performance cost; having the application coordinate and build a logical transaction in code adds a programming burden.

2. Cross-database and cross-table join problems

After sharding, data that was closely related logically inevitably ends up in different tables and different databases, which restricts join operations. We cannot join tables located in different shard databases, nor tables split at different granularities, so work that used to take a single query may now take several.

3. Extra data management burden and data computation pressure

The most obvious extra burdens are locating the data and repeating insert, delete, update, and query operations across shards. The application can handle these, but only at the cost of additional logic. For example, take a table userTable that records user scores, where the business needs the 100 best scores. Before splitting, a single ORDER BY statement does the job. After splitting, it takes n ORDER BY statements to find the top 100 rows of each split table, and the results must then be merged to get the final answer.
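The merge step for the top-100 example can be sketched in Python. Each inner list stands in for the rows of one split table; in a real system each per-table top list would come from its own ORDER BY ... LIMIT query. The function name is hypothetical.

```python
import heapq


def top_scores(shards, limit=100):
    """Take the top `limit` scores of every shard, then merge them."""
    per_shard = [heapq.nlargest(limit, shard) for shard in shards]
    # The global top `limit` must be among the per-shard top lists.
    return heapq.nlargest(limit, (score for top in per_shard for score in top))


shards = [[5, 1, 9], [7, 3], [8, 2, 6]]
print(top_scores(shards, limit=3))  # [9, 8, 7]
```

Each shard has to return its own full top list, because in the worst case all of the best scores live in a single table.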

Option three: hot and cold archiving

Why hot and cold archiving: the reasoning is similar to option two. It reduces the amount of data in a single table, so the tree gets shorter, queries incur less disk I/O, and efficiency improves. If your business data has a clear hot/cold split, for example only the past week or month of data needs to be displayed, then that week or month of data is hot data and the rest is cold data. We can archive the cold data into other database tables to make operations on the hot data faster.
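The hot/cold split itself comes down to a cutoff on a timestamp. The Python sketch below illustrates the idea on in-memory rows. The created_at field, the 30-day window, and the function name are assumptions; a real archive job would move the cold rows with batched INSERT ... SELECT and DELETE statements instead.

```python
from datetime import datetime, timedelta


def split_hot_cold(rows, now, hot_days=30):
    """Split rows into (hot, cold) by comparing created_at with a cutoff."""
    cutoff = now - timedelta(days=hot_days)
    hot = [row for row in rows if row["created_at"] >= cutoff]
    cold = [row for row in rows if row["created_at"] < cutoff]
    return hot, cold


now = datetime(2022, 10, 12)
rows = [
    {"id": 1, "created_at": datetime(2022, 10, 10)},
    {"id": 2, "created_at": datetime(2022, 6, 1)},
    {"id": 3, "created_at": datetime(2022, 9, 20)},
]
hot, cold = split_hot_cold(rows, now)
print([row["id"] for row in hot], [row["id"] for row in cold])  # [1, 3] [2]
```

Rows 1 and 3 fall inside the 30-day window and stay in the hot table; row 2 is older than the cutoff and would be moved to the archive table.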

Next, let's go through the archiving process:

Create the archive table. In principle, the archive table must be consistent with the original table.
Initialize the archive table data.

Business incremental data processing flow

Data retrieval flow

How do we choose among the three options above?

Option: data table partitioning
Applicable scenarios: 1. Large data volume. 2. Queries only touch a certain range. 3. No join queries.
Advantages: Partitioning physically splits the files behind the table while the table name stays the same, so the SQL of existing business logic is not affected.
Disadvantages: Partitioning creates corresponding objects, which adds some overhead; aggregating data across partitions also takes a long time; not suitable for data volumes beyond the tens of millions.

Option: data table splitting
Applicable scenarios: Large data volume with no clear hot/cold distinction, where the data can be divided cleanly by range.
Advantages: Suitable for data whose hot/cold boundary is not very clear, and the same approach can be reused for similar data later. Splitting a large table into small tables improves query and insert efficiency.
Disadvantages: As the large table keeps growing, more and more tables are needed, and each of them has to be split in turn; the ranges are relatively fixed, so if a single table's data grows later, performance suffers again; the implementation is more complex than option three and the whole flow needs testing; the splitting is handled at the coding layer and affects the existing business.

Option: hot and cold archiving (separate database)
Applicable scenarios: Large data volume; a clear hot/cold distinction in the data; cold data is used very rarely.
Advantages: The data migration process has little impact on the business; less development work and lower cost.
Disadvantages: The table splitting rules need to be confirmed.

You can choose the solution that suits your own business scenario; I am only offering some ideas here.

That is about all I wanted to say. If anything is wrong or unclear, corrections and suggestions are welcome!


Statement:

This article is reproduced from juejin.cn. If there is any infringement, please contact admin@php.cn to have it removed.
