Understand the paging query after tens of billions of data are divided into tables-Mysql Tutorial-php.cn

mysql video tutorial column introduces paging query of tens of billions of data.

Understand the paging query after tens of billions of data are divided into tables

#When the business scale reaches a certain scale, Taobao’s daily order volume is more than 50 million orders, and Meituan’s daily order volume is more than 30 million orders. When the database is faced with massive data pressure, sub-database and table sub-operation are necessary. After the database is divided into tables, some regular queries may cause problems. The most common ones are paging queries. Generally, we call the fields of sharding tables as shardingkey. For example, the order table uses user ID as shardingkey. So how to do paging if the query condition does not include user ID? For example, how can more multi-dimensional queries be queried if there is no sharding key?

Unique Primary Key

Generally, the primary keys of our database are auto-incremented, so the problem of primary key conflict after splitting the table is an unavoidable problem. The simplest way is to use a unique business The field serves as the only primary key. For example, the order number of the order table must be globally unique.

There are many common distributed ways to generate unique IDs, the most common ones are Snowflake algorithm, Didi Tinyid, and Meituan Leaf. Taking the snowflake algorithm as an example, multiple IDs can be generated in one millisecond.

The first bit

is not used, the default is 0, 41 digit timestamp is accurate to milliseconds, can accommodate 69 years, 10 digits work The high 5 digits of the machine ID are the data center ID, the low 5 digits are the node ID, 12-digit serial number Each node accumulates every millisecond, and the total can reach 2^12 4096 IDs.

Partitioning

The first step is to ensure that the order number is unique after dividing the table. Now consider the issue of dividing the table. First, consider the size of the sub-table based on its own business volume and increment.

For example, our daily order volume is now 100,000 orders, and it is estimated that it will reach 1 million orders per day in one year. According to business attributes, we generally support querying orders within half a year, and orders that exceed half a year. Archiving is required.

So based on the order of 1 million orders per day for half a year, without separate tables, our order volume will reach 1 million , even if you can handle RT's time, you simply can't accept it. According to experience, there is no pressure on the database if the number of a single table is in the millions, so it is enough to divide it into 256 tables, 180 million/256 ≈ 700,000. If you are on the safe side, you can also divide it into 512 tables. Then think about it, if the business volume increases another 10 times to 10 million orders per day, sub-table 1024 is a more suitable choice.

After splitting tables and archiving data for more than half a year, 700,000 data in a single table is enough to cope with most scenarios. Next, hash the order number, and then take the modulo of 256 to determine which table it falls on.

Well, because the only primary key is based on the order number, the queries you wrote based on the primary key ID in the past cannot be used. This involves some history. Modifications to query functionality. But this is not a problem, right? Just change it to check by the order number. None of this is a problem, the problem is what our title says.

C-side query

After talking for a long time, I finally got to the point. So how to solve the problems of query and paging query after table partitioning?

First of all, let’s talk about the query with sharding key. For example, query by order number. No matter what you do, you can directly locate the specific table for query. Obviously, there will be no problem with the query.

If it is not the sharding key, if the order number is used as the sharding key in the example above, APPs and small programs are generally queried through the user ID. So what should we do with the sharding done through the order number? Many companies' order tables directly use the user ID as the sharding key, which is very simple and can be checked directly. So what to do with the order number? A very simple way is to add the user ID attribute to the order number. To give a very simple example, you think you can’t use up the original 41-digit timestamp. The user ID is 10 digits. The order number generation rule contains the user ID. When entering the specific table, the 10-digit user ID hash in the order number is used. Take the modulus so that the query effect is the same regardless of the order number or user ID.

Of course, this method is just an example. The specific order number generation rules, how many digits, and what factors are included are determined according to your own business and implementation mechanism.

Okay, then whether you use the order number or user ID as the sharding key, you can solve the problem by following the above two methods. Then there is another question: What should I do if it is neither an order number nor a user ID query? The most intuitive example is the query from the merchant side or the backend. The merchant side uses the ID of the merchant or seller as the query condition. The query conditions in the background may be more complicated, like some background query conditions I encountered. There could be dozens of them. How to check? ? ? Don't worry, let's talk about the complex queries on the B-side and the backend separately.

In reality, most of the real traffic comes from the user-side C-side, so it essentially solves the problem on the user-side. This problem is mostly solved, and the rest comes from the merchant-seller-side B-side and the backend support operation business. The query traffic is not very large, so this problem is easy to solve.

Other side query

There are two ways to solve the non-shardingkey query on the B side.

Double writing. Double writing means that the order data is stored in two copies. The C side and the B side each save one copy. For C side, you can use the order number and user ID as the sharding key. OK, the B-side can just use the merchant's seller's ID as the sharding key. Some classmates will say, will it not affect performance if you double-write? Because a slight delay is acceptable for the B-side, an asynchronous method can be used to place the B-side order. Think about it, if you go to Taobao to buy something and place an order, does it matter if the seller delays receiving the order message for a second or two? Does it have any big impact on the takeaway merchant you ordered to receive the order a second or two late?

This is a solution. Another solution is to use the offline data warehouse or ES query. After the order data is dropped into the database, whether you use binlog or MQ messages are all in the form of synchronizing data to a data warehouse or ES. The order of magnitude they support is very simple for this kind of query conditions. There is definitely a slight delay in this method, but this controllable delay is acceptable.

For queries on the management backend, such as operations, business, and products that need to look at data, they naturally require complex query conditions, which can also be done through ES or data warehouse. . If you don't use this solution and do paging query without shardingkey, brother, you can only scan the entire table to query the aggregated data, and then do paging manually, but the results found in this way are limited.

For example, if you have 256 shards, when querying, you scan all shards cyclically, fetch 20 pieces of data from each shard, and finally aggregate the data and manually page it, it will be impossible to find the full amount of data.

Summary

The query problem after database and table partitioning is actually known to experienced students, but I believe that most students may not have done the business yet. At this order of magnitude, sub-databases and tables may still be in the conceptual stage. After being asked about it in the interview, I feel at a loss because I don’t know what to do because I have no experience.

Sub-database and sub-table are first judged based on the existing business volume and future increment. For example, if Pinduoduo has a daily order volume of 50 million, the data in half a year must be in the tens of billions. That’s all The score is 4096 tables, right? But the actual operation is the same. For your business, it is not necessary to score 4096. Make a reasonable choice based on the business.

We can easily solve queries based on shardingkey. Querying on non-shardingkey can be solved by dropping double copies of data, data warehouse, and ES. Of course, if the amount of data after splitting is small, , it is not a problem to build the index and scan the entire table to query.

Related free learning recommendations: mysql video tutorial

The above is the detailed content of Understand the paging query after tens of billions of data are divided into tables. For more information, please follow other related articles on the PHP Chinese website!