Interviewer: How would you query 10 million rows?
Recently I have been doing mock interviews and resume reviews, and I noticed that many people go weak the moment they see an interview question about tables with tens of millions of rows.
Perhaps they have simply never worked with a table that large, so they have no idea what actually happens when you query tens of millions of rows.
Today I will walk you through it in practice. The tests below are based on MySQL 5.7.26.
Preparing data
What should I do if I don't have 10 million rows?
If you don't have the data, can't you just generate it yourself?
Is generating data difficult?
Generate 10 million rows in application code?
Not a chance; that would be far too slow, possibly a whole day. A database script runs much faster.
Create table
CREATE TABLE `user_operation_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
Create data script
Batch INSERTs are much more efficient, committing once every 1,000 rows. Note that if each batch grows too large, the batch insert itself slows down.
DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  set @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  set @execData = '';
  WHILE i <= 10000000 DO
    set @attr = "'测试很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长的属性'";
    set @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', '用户登录操作'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    if i % 1000 = 0 then
      set @stmtSql = concat(@execSql, @execData, ";");
      prepare stmt from @stmtSql;
      execute stmt;
      DEALLOCATE prepare stmt;
      commit;
      set @execData = "";
    else
      set @execData = concat(@execData, ",");
    end if;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
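With the procedure in place, kick off the data load with a plain CALL (as noted below, this can run for a long time):

CALL batch_insert_log();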
Start testing
My machine is fairly low-spec: Windows 10, a standard-voltage i5, and an SSD with read/write speeds of roughly 500 MB/s.
Because of the low specs, I only prepared 3,148,000 rows for this test, taking up 5 GB of disk (without indexes); the insert ran for 38 minutes. Readers with better hardware can insert more rows to test with.
SELECT count(1) FROM `user_operation_log`
Result: 3148000
The three query times are:
14060 ms 13755 ms 13447 ms
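Incidentally, COUNT on a large InnoDB table is slow because InnoDB keeps no exact row count and has to scan an index. If a rough figure is enough, the optimizer's estimate can be read instantly from information_schema; note that TABLE_ROWS is only an approximation for InnoDB:

SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'user_operation_log'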
Ordinary paginated queries
MySQL supports the LIMIT clause for selecting a specified number of rows; Oracle uses ROWNUM for the same purpose.
The MySQL pagination syntax is:
SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
The first parameter specifies the offset of the first row returned; the second specifies the maximum number of rows to return.
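For example, the two statements below are equivalent ways of skipping the first 5 rows and returning the next 10:

SELECT * FROM `user_operation_log` LIMIT 5, 10
SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 5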
Now let's start testing:
SELECT * FROM `user_operation_log` LIMIT 10000, 10
The three query times were:
59 ms 49 ms 50 ms
That looks acceptably fast, but this is a local database, so it is naturally a bit quicker.
Let's test from a different angle.
Same offset, different row counts
SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000
Query times:
Quantity | First time | Second time | Third time |
---|---|---|---|
10 rows | 53ms | 52ms | 47ms |
100 rows | 50ms | 60ms | 55ms |
1000 rows | 61ms | 74ms | 60ms |
10000 rows | 164ms | 180ms | 217ms |
100000 rows | 1609ms | 1741ms | 1764ms |
1000000 rows | 16219ms | 16889ms | 17081ms |

From these results we can conclude: the larger the row count, the longer the query takes.

Same row count, different offsets
Offset | First time | Second time | Third time |
---|---|---|---|
100 | 36ms | 40ms | 36ms |
1000 | 31ms | 38ms | 32ms |
10000 | 53ms | 48ms | 51ms |
100000 | 622ms | 576ms | 627ms |
1000000 | 4891ms | 5076ms | 4856ms |
From these results we can conclude: the larger the offset, the longer the query takes.
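The reason is that LIMIT offset, n cannot jump straight to the offset: MySQL still has to read offset + n rows and discard the first offset of them. You can see the scale of the scan in the rows estimate when you prefix the query with EXPLAIN (exact output varies by version and table statistics):

EXPLAIN SELECT * FROM `user_operation_log` LIMIT 1000000, 10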
For comparison, we also ran the same small page fetching all columns versus just two (the table has no plain `attr` column, so `attr1` is used):

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100
How to optimize
Having drawn our conclusions from all the experimentation above, we can now tackle the two problems separately: large offsets and large result sets.
Optimizing the large-offset problem
Using a subquery
We can first locate the id at the offset position, and then query the rows from there:
SELECT * FROM `user_operation_log` LIMIT 1000000, 10
SELECT id FROM `user_operation_log` LIMIT 1000000, 1
SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10
The results are as follows:
SQL | Time taken |
---|---|
First statement | 4818ms |
Second statement (without index) | 4329ms |
Second statement (with index) | 199ms |
Third statement (without index) | 4319ms |
Third statement (with index) | 201ms |

From these results we can conclude:
The first statement takes the longest, and the third does only slightly better than the first; a subquery that can use an index is much faster.
Drawback: this only works when the id is monotonically increasing.
If the id is not sequential, you can use the form below instead; its drawback is that the pagination has to live inside the subquery.
Note: some MySQL versions do not support LIMIT inside an IN clause, which is why multiple nested SELECTs are used.
SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)
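For reference, this is what the rejected direct form looks like on those versions; a sketch of the statement and the error it triggers (error text may vary by version):

-- Fails on MySQL versions that do not allow LIMIT in an IN subquery:
-- ERROR 1235 (42000): This version of MySQL doesn't yet support
-- 'LIMIT & IN/ALL/ANY/SOME subquery'
SELECT * FROM `user_operation_log` WHERE id IN (SELECT id FROM `user_operation_log` LIMIT 1000000, 10)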
Using an id range
This method is more demanding: the id must be strictly consecutive and increasing, and you have to calculate the id range yourself, then use BETWEEN. The SQL looks like this:
SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100
SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100
The results are as follows:
SQL | Time taken |
---|---|
First statement | 22ms |
Second statement | 21ms |

The results show that this approach is very fast.
Note: the LIMIT here only caps the number of rows; no offset is involved.
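In application code, the second form naturally becomes keyset pagination: remember the largest id of the page you just served and use it as the lower bound for the next page, so the cost stays flat no matter how deep the user pages. A minimal sketch, assuming the previous page ended at id 1000100:

-- Next page: start strictly after the last id already returned
SELECT * FROM `user_operation_log` WHERE id > 1000100 ORDER BY id LIMIT 100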
Optimizing the large-result-set problem
The amount of data returned directly affects query speed as well:
SELECT * FROM `user_operation_log` LIMIT 1, 1000000
SELECT id FROM `user_operation_log` LIMIT 1, 1000000
SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000
The results are as follows:
SQL | Time taken |
---|---|
First statement | 15676ms |
Second statement | 7298ms |
Third statement | 15960ms |
The results show that cutting unneeded columns noticeably improves query speed.
The first and third queries run at almost the same speed, at which point you will surely grumble: why bother writing out all those fields? Just use * and be done with it.
Note, though, that my MySQL server and client sit on the same machine, which is why their timings are so close. If you have the means, test with the client on a separate machine from MySQL.
Isn't SELECT * tempting?
While we're here, let me also explain why SELECT * is banned. Isn't it tempting precisely because it's simple and brainless?
Two main points:
1. With SELECT *, the database needs to resolve more objects: fields, permissions, attributes, and other related metadata. When SQL statements are complex and hard parses are frequent, this places a heavy burden on the database.
2. It increases network overhead. With *, useless and large text fields such as log or IconMD5 are easily dragged in by mistake, and the size of the transferred data grows enormously. Especially when MySQL and the application are not on the same machine, this overhead is very noticeable.