Interviewer: How would you query 10 million rows?
Recently I have been doing mock interviews and resume reviews, and I noticed that many people go weak the moment they see an interview question about tables with tens of millions of rows.
Perhaps they have simply never worked with a table that large, so they have no idea what actually happens when you query tens of millions of rows.
Today I will walk you through it in practice. The tests below are based on MySQL 5.7.26.
Preparing data
What should I do if I don't have 10 million rows?
If you don't have the data, can't you just generate it yourself?
Is generating data difficult?
Generate 10 million rows in application code?
Not a chance; that would be far too slow, possibly a whole day. A database script runs much faster.
Create table
CREATE TABLE `user_operation_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
Create data script
Batch INSERTs are much more efficient, committing once every 1,000 rows. Note that if each batch grows too large, the batch insert itself slows down.
DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  set @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  set @execData = '';
  WHILE i <= 10000000 DO
    set @attr = "'测试很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长的属性'";
    set @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', '用户登录操作'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    if i % 1000 = 0 then
      set @stmtSql = concat(@execSql, @execData, ";");
      prepare stmt from @stmtSql;
      execute stmt;
      DEALLOCATE prepare stmt;
      commit;
      set @execData = "";
    else
      set @execData = concat(@execData, ",");
    end if;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
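With the procedure in place, kick off the data load with a plain CALL (as noted below, this can run for a long time):

CALL batch_insert_log();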
Start testing
My machine is fairly low-spec: Windows 10, a standard-voltage i5, and an SSD with read/write speeds of roughly 500 MB/s.
Because of the low specs, I only prepared 3,148,000 rows for this test, taking up 5 GB of disk (without indexes); the insert ran for 38 minutes. Readers with better hardware can insert more rows to test with.
SELECT count(1) FROM `user_operation_log`
Result: 3148000
The three query times are:
14060 ms 13755 ms 13447 ms
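Incidentally, COUNT on a large InnoDB table is slow because InnoDB keeps no exact row count and has to scan an index. If a rough figure is enough, the optimizer's estimate can be read instantly from information_schema; note that TABLE_ROWS is only an approximation for InnoDB:

SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'user_operation_log'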
Ordinary paginated queries
MySQL supports the LIMIT clause for selecting a specified number of rows; Oracle uses ROWNUM for the same purpose.
The MySQL pagination syntax is:
SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
The first parameter specifies the offset of the first row returned; the second specifies the maximum number of rows to return.
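For example, the two statements below are equivalent ways of skipping the first 5 rows and returning the next 10:

SELECT * FROM `user_operation_log` LIMIT 5, 10
SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 5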
Now let's start testing:
SELECT * FROM `user_operation_log` LIMIT 10000, 10
The three query times were:
59 ms 49 ms 50 ms
That looks acceptably fast, but this is a local database, so it is naturally a bit quicker.
Let's test from a different angle.
Same offset, different row counts
SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000
Query times:
Quantity | First time | Second time | Third time |
---|---|---|---|
10 rows | 53ms | 52ms | 47ms |
100 rows | 50ms | 60ms | 55ms |
1000 rows | 61ms | 74ms | 60ms |
10000 rows | 164ms | 180ms | 217ms |
100000 rows | 1609ms | 1741ms | 1764ms |
1000000 rows | 16219ms | 16889ms | 17081ms |

From these results we can conclude: the larger the row count, the longer the query takes.

Same row count, different offsets
Offset | First time | Second time | Third time |
---|---|---|---|
100 | 36ms | 40ms | 36ms |
1000 | 31ms | 38ms | 32ms |
10000 | 53ms | 48ms | 51ms |
100000 | 622ms | 576ms | 627ms |
1000000 | 4891ms | 5076ms | 4856ms |
From these results we can conclude: the larger the offset, the longer the query takes.
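The reason is that LIMIT offset, n cannot jump straight to the offset: MySQL still has to read offset + n rows and discard the first offset of them. You can see the scale of the scan in the rows estimate when you prefix the query with EXPLAIN (exact output varies by version and table statistics):

EXPLAIN SELECT * FROM `user_operation_log` LIMIT 1000000, 10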
For comparison, we also ran the same small page fetching all columns versus just two (the table has no plain `attr` column, so `attr1` is used):

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100
How to optimize
Having drawn our conclusions from all the experimentation above, we can now tackle the two problems separately: large offsets and large result sets.
Optimizing the large-offset problem
Using a subquery
We can first locate the id at the offset position, and then query the rows from there:
SELECT * FROM `user_operation_log` LIMIT 1000000, 10
SELECT id FROM `user_operation_log` LIMIT 1000000, 1
SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10
The results are as follows:
SQL | Time taken |
---|---|
First statement | 4818ms |
Second statement (without index) | 4329ms |
Second statement (with index) | 199ms |
Third statement (without index) | 4319ms |
Third statement (with index) | 201ms |

From these results we can conclude:
The first statement takes the longest, and the third does only slightly better than the first; a subquery that can use an index is much faster.
Drawback: this only works when the id is monotonically increasing.
If the id is not sequential, you can use the form below instead; its drawback is that the pagination has to live inside the subquery.
Note: some MySQL versions do not support LIMIT inside an IN clause, which is why multiple nested SELECTs are used.
SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)
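For reference, this is what the rejected direct form looks like on those versions; a sketch of the statement and the error it triggers (error text may vary by version):

-- Fails on MySQL versions that do not allow LIMIT in an IN subquery:
-- ERROR 1235 (42000): This version of MySQL doesn't yet support
-- 'LIMIT & IN/ALL/ANY/SOME subquery'
SELECT * FROM `user_operation_log` WHERE id IN (SELECT id FROM `user_operation_log` LIMIT 1000000, 10)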
Using an id range
This method is more demanding: the id must be strictly consecutive and increasing, and you have to calculate the id range yourself, then use BETWEEN. The SQL looks like this:
SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100
SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100
The results are as follows:
SQL | Time taken |
---|---|
First statement | 22ms |
Second statement | 21ms |

The results show that this approach is very fast.
Note: the LIMIT here only caps the number of rows; no offset is involved.
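In application code, the second form naturally becomes keyset pagination: remember the largest id of the page you just served and use it as the lower bound for the next page, so the cost stays flat no matter how deep the user pages. A minimal sketch, assuming the previous page ended at id 1000100:

-- Next page: start strictly after the last id already returned
SELECT * FROM `user_operation_log` WHERE id > 1000100 ORDER BY id LIMIT 100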
Optimizing the large-result-set problem
The amount of data returned directly affects query speed as well:
SELECT * FROM `user_operation_log` LIMIT 1, 1000000
SELECT id FROM `user_operation_log` LIMIT 1, 1000000
SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000
The results are as follows:
SQL | Time taken |
---|---|
First statement | 15676ms |
Second statement | 7298ms |
Third statement | 15960ms |
The results show that cutting unneeded columns noticeably improves query speed.
The first and third queries run at almost the same speed, at which point you will surely grumble: why bother writing out all those fields? Just use * and be done with it.
Note, though, that my MySQL server and client sit on the same machine, which is why their timings are so close. If you have the means, test with the client on a separate machine from MySQL.
Isn't SELECT * tempting?
While we're here, let me also explain why SELECT * is banned. Isn't it tempting precisely because it's simple and brainless?
Two main points:
1. With SELECT *, the database needs to resolve more objects: fields, permissions, attributes, and other related metadata. When SQL statements are complex and hard parses are frequent, this places a heavy burden on the database.
2. It increases network overhead. With *, useless and large text fields such as log or IconMD5 are easily dragged in by mistake, and the size of the transferred data grows enormously. Especially when MySQL and the application are not on the same machine, this overhead is very noticeable.