Table of Contents
Create data script
Start testing
Basic paginated queries
Same offset, different row counts
Same row count, different offsets
How to optimize
Optimizing the large-offset problem
Using a subquery
Using an id bound
Optimizing the large-result-set problem

Interviewer: How did you query 10 million data?

Aug 15, 2023, 04:34 PM
java java interview

Recently I have been running mock interviews and reviewing resumes for readers, and I noticed that many candidates go weak the moment they see an interview question about tens of millions of rows.

Perhaps some of them have never worked with a table holding tens of millions of rows, so they have no idea what actually happens when you query one.

Today I will walk you through a hands-on experiment, this time based on MySQL 5.7.26.

Preparing data

What should I do if I don't have 10 million rows of data?

Can't you just create them?

Is creating data hard?

Should application code insert the 10 million rows one by one?

That would be far too slow; it might take a whole day. A database script executes much faster.

Create table
CREATE TABLE `user_operation_log`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
Create data script

Batch inserts are much more efficient, so the script commits every 1,000 rows. If each batch is too large, the batch insert itself becomes slow.

DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  SET @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  SET @execData = '';
  WHILE i <= 10000000 DO
    -- Each row carries one long placeholder string in every attr column
    SET @attr = "'测试很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长很长的属性'";
    SET @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', '用户登录操作'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    -- Flush a batch of 1,000 rows with one prepared INSERT, then commit
    IF i % 1000 = 0 THEN
      SET @stmtSql = concat(@execSql, @execData, ";");
      PREPARE stmt FROM @stmtSql;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;
      COMMIT;
      SET @execData = "";
    ELSE
      SET @execData = concat(@execData, ",");
    END IF;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
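
Once the delimiter is restored, running the script is a single call; this line is implied by the article but not shown in it:

CALL batch_insert_log();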

Start testing

My machine is fairly low-end: Windows 10, a standard-voltage i5, and an SSD that reads and writes at roughly 500 MB/s.

Because of that low-end configuration, I only prepared 3,148,000 rows for this test, which occupied 5 GB of disk (without indexes) and took 38 minutes to insert. Readers with better hardware can insert more rows to test with.

SELECT count(1) FROM `user_operation_log`

Return result: 3148000

Running this count three times took:

  • 14060 ms
  • 13755 ms
  • 13447 ms

Basic paginated queries

MySQL uses the LIMIT clause to select a specified number of rows; Oracle can use ROWNUM. The MySQL pagination syntax is:

SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
  • The first parameter specifies the offset of the first row returned
  • The second parameter specifies the maximum number of rows to return

Now let's start testing:

SELECT * FROM `user_operation_log` LIMIT 10000, 10

Running the query three times took:

  • 59 ms
  • 49 ms
  • 50 ms

That looks reasonably fast, but this is a local database, so it is naturally quick.

Let's test from a different angle.

Same offset, different row counts

SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000

The query times are as follows:

Quantity        First time    Second time   Third time
10 rows         53 ms         52 ms         47 ms
100 rows        50 ms         60 ms         55 ms
1000 rows       61 ms         74 ms         60 ms
10000 rows      164 ms        180 ms        217 ms
100000 rows     1609 ms       1741 ms       1764 ms
1000000 rows    16219 ms      16889 ms      17081 ms

From the results above we can conclude: the more rows returned, the longer the query takes.

Same row count, different offsets

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT * FROM `user_operation_log` LIMIT 1000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 100000, 100
SELECT * FROM `user_operation_log` LIMIT 1000000, 100
The query times are as follows:

Offset      First time    Second time   Third time
100         36 ms         40 ms         36 ms
1000        31 ms         38 ms         32 ms
10000       53 ms         48 ms         51 ms
100000      622 ms        576 ms        627 ms
1000000     4891 ms       5076 ms       4856 ms

From the results above we can conclude: the larger the offset, the longer the query takes.
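
To see where the time goes, you can ask MySQL for the execution plan. A minimal sketch, assuming the same table as above; the key point is that MySQL still has to read and discard the first 1,000,000 rows before returning the 100 requested ones:

EXPLAIN SELECT * FROM `user_operation_log` LIMIT 1000000, 100;
-- On a query like this the plan typically shows type=ALL (a full table
-- scan): LIMIT does not let MySQL skip straight to the offset.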

As a quick side-by-side, compare fetching every column with fetching only two columns at the same offset; this foreshadows the result-set optimization later on:

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100

How to optimize

After all the experimenting above, we have our conclusions, and there are two problems to address: large offsets and large result sets. Let's optimize each in turn.

Optimizing the large-offset problem

Using a subquery

We can first locate the id at the offset position, and then query the data from there:

SELECT * FROM `user_operation_log` LIMIT 1000000, 10

SELECT id FROM `user_operation_log` LIMIT 1000000, 1

SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10

The query results are as follows:

SQL                                  Time taken
First statement                      4818 ms
Second statement (without index)     4329 ms
Second statement (with index)        199 ms
Third statement (without index)      4319 ms
Third statement (with index)         201 ms

From the results above we can conclude:

  • The first statement takes the longest; the third does slightly better than the first
  • The subquery is much faster when it can use an index (see the sketch below)
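
The article never shows which index the "with index" rows used. Since id is already the primary key, one plausible reading is a small secondary index on id alone, giving the subquery a narrow covering index to scan instead of the full clustered rows. A minimal sketch, assuming that interpretation:

ALTER TABLE `user_operation_log` ADD INDEX `idx_id` (`id`);
-- Assumption: a secondary index on `id` is far smaller than the clustered
-- primary key (whose leaf pages carry every column), so the subquery
-- `SELECT id ... LIMIT 1000000, 1` can scan it much faster.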

Drawback: this only works when id increases monotonically.

When id is not monotonically increasing, you can use the following form instead, though its drawback is that the paginated query has to live inside a subquery.

Note: some MySQL versions do not support LIMIT inside an IN clause, which is why nested SELECTs are used here:

SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)
Using an id bound

This method is more demanding: id must be consecutive and increasing, and you also have to compute the id range yourself, then query with BETWEEN. The SQL is as follows:

SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100

SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100

The query results are as follows:

SQL                Time taken
First statement    22 ms
Second statement   21 ms

The results show that this approach is very fast.

Note: the LIMIT here caps the number of rows; it does not apply an offset.
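
For completeness, here is how a caller would derive the BETWEEN bounds. The page size of 100 and ids starting at 1 are illustrative assumptions, not something the test data guarantees:

-- Hypothetical bounds for page n at 100 rows per page, ids starting at 1:
--   lower = (n - 1) * 100 + 1,  upper = n * 100
-- e.g. page 3:
SELECT * FROM `user_operation_log` WHERE id BETWEEN 201 AND 300;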

Optimizing the large-result-set problem

The amount of data in the returned result also directly affects query speed:

SELECT * FROM `user_operation_log` LIMIT 1, 1000000

SELECT id FROM `user_operation_log` LIMIT 1, 1000000

SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000

The query results are as follows:

SQL                Time taken
First statement    15676 ms
Second statement   7298 ms
Third statement    15960 ms

The results show that cutting unnecessary columns noticeably improves query efficiency.

The first and third queries run at almost the same speed. At this point you will surely complain: why bother writing out so many fields? Just use * and be done with it.

Note that my MySQL server and client are on the same machine, which is why their query times are so close. If you have the means, test with the client and the MySQL server on separate machines.

Isn't SELECT * convenient?

While we're at it, let me add why SELECT * is banned in so many coding standards. Isn't it attractive precisely because it's simple and mindless?

Two main points:

  1. With "SELECT *", the database has to resolve more objects: fields, permissions, attributes, and other related metadata. When the SQL statement is complex and hard parses are frequent, this puts a heavy burden on the database.
  2. It increases network overhead. With *, useless and bulky text fields such as log or IconMD5 can be dragged along by mistake, and the size of the transferred data grows enormously. Especially when MySQL and the application are not on the same machine, this overhead is very noticeable.

