Recently there is a data table with 2000W records that needs to be optimized and migrated. 20 million data is very embarrassing for MySQL, because the index creation speed is still very fast, and no matter how much optimization is done, the speed cannot be improved much. However, these data have a large number of redundant fields and error information, which is extremely inconvenient for statistics and analysis. So I need to create a new table, take out the data from the old table one by one, optimize it, and put it back into the new table;
In the 2000W data, we know in advance the fields that can be used as query conditions. Therefore, create a new field separately for this part of data, and reasonably change the field structure for regular data, such as ID card is varchar(18). For unimportant data, we merge it and there is a field with a text structure.
We need to calculate some related data. Common ones such as ID card type can obtain accurate gender, birthplace, birthday, and age.
We take out an old data from the database, then get the new data we want through calculation and processing, and finally insert the new data into the new table . However, the following problems were encountered when obtaining new data.
The amount of data is too large to be obtained at once (2000W data thrown into the memory is quite scary);
We can obtain it in batches through MySQL's limit syntax. For example, each time you get 50,000, the SQL statement is as follows:
select * from table_name limit 15000000,50000;
This method can solve the problem of too much data, but as the first parameter of the limit gets larger and larger, the query speed will be frighteningly slow. (The execution of the above SQL will take 35 seconds). Time is life, so we started to optimize the SQL statement. After optimization, it becomes the following:
select * from table_name order by id desc limit 5000000,50000;
2000W data can be split by dichotomy. When 1000W data is executed, the data will be reversed. After optimization, the SQL execution efficiency has been significantly improved, from 35 seconds to 9 seconds;
But it is still very slow, time is life... Fortunately, we have self-increasing ID (the first law of creating a data table, must There are auto-increment fields), the optimized SQl is as follows:
1. select * from table_name where id>15000000 and id<15050000; 2. select * from table_name where id>15000000 limit 50000;
For intuitive demonstration, I wrote two SQLs with the same function. Compared with the first one, the second limit will cause the SQL index hit to become worse, and the efficiency will also decrease. The execution time of the first SQL is 2 milliseconds, and the execution time of the second is 5 milliseconds (the average value I took). The speed of each data query is directly reduced from 35 seconds to 2 milliseconds...
The amount of data is too large and the data cannot be estimated. Some special data will cause the data import to fail;
We have three options to store new data into the new table, which are as follows:
Insert data one by one;
I will definitely think about it at the beginning This solution will definitely not work, because each insertion will cause a database IO operation. However, the advantage of this solution is that it can detect problematic data in time and continue execution after modification; using "bind variables" in Oracle can improve performance, and MySQL also provides the "bind variable" function. So without changing the logic, try to optimize the data storage speed. The code is as follows:
public function actionTest(array $data) { $mysqli = new mysqli("192.168.1.106", "username", "password", "test"); $sql = "insert into table_name(name,identity) values (?,?)"; $stmt = $connection->prepare($sql); $name = ""; $identity = ""; //使用绑定变量 $stmt->bind_param("si", $name, $identity); foreach($data as $val) { $name = $val[name]; $identity = $val[card_id]; //执行 $stmt->execute(); } $stmt->close(); }
The final effect is not very good. MySQL's "bind variable" does not bring obvious speed improvement, but it can effectively prevent SQL injection;
Insert 50,000 pieces of data at a time;
This is the solution I finally selected. First, it can detect problematic data in time, and second, the imported data is very stable. Just like supporting breakpoint resumption, you can see the effect at every step. When executing the script, you can also start writing analysis logic simultaneously;
is assembled into a SQL file, and finally imported uniformly;
is assembled into a large SQL file, and finally passed through MySQL The built-in tool import is also excellent. But if there is a problem with one of the SQL statements, you may need to rerun the script. Because it is very painful to modify a symbol in a 9G text file...
Through various optimizations, the script execution time was finally reduced to less than 20 minutes. After optimization, the data quality has been highly guaranteed. Next time, we will try to optimize and migrate 200 million data...
The above is the content of MySQL 20 million data optimization and migration, more For related content, please pay attention to the PHP Chinese website (m.sbmmt.com)!