Retrieve the last record of each group using MySQL

Question

There is a table called messages that contains data like this:

Id   Name   Other_Columns
-------------------------
1    A      A_data_1
2    A      A_data_2
3    A      A_data_3
4    B      B_data_1
5    B      B_data_2
6    C      C_data_1

If I run the query select * from messages group by name, I get the following result:

1    A      A_data_1
4    B      B_data_1
6    C      C_data_1

P粉973899567 · Answer

UPD 2017-03-31: MySQL 5.7.5 enabled the ONLY_FULL_GROUP_BY switch by default, so non-deterministic GROUP BY queries are rejected. In addition, the GROUP BY implementation was updated, and the solution may no longer work as expected even with the switch disabled. You need to check this on your own version.

Bill Karwin's solution works well when the number of items within each group is small, but the query slows down as the groups grow, because the solution requires about n*n/2 + n/2 IS NULL comparisons alone.

I tested on an InnoDB table containing 18684446 rows and 1182 groups. This table contains test results for functional tests, and (test_id, request_id) is the primary key. So test_id is a group and I'm looking for the last request_id for each test_id.

Bill's solution has been running on my Dell e4310 for several hours now, and I do not know when it will finish, even though it operates on a covering index (hence Using index in EXPLAIN).
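
To see why this gets expensive on large groups, the per-group estimate n*n/2 + n/2 = n*(n+1)/2 can be plugged into the table sizes quoted above. The sketch below is only a rough back-of-the-envelope calculation in Python. It assumes all groups are the same size, which the answer does not state, and the names are made up for illustration.

```python
# Rough estimate only: assumes all 1182 groups are equally sized,
# which the answer does not claim.
TOTAL_ROWS = 18_684_446
GROUPS = 1_182


def null_checks(n: int) -> int:
    """IS NULL comparisons for one group of n rows: n*n/2 + n/2 = n*(n+1)/2."""
    return n * (n + 1) // 2


rows_per_group = TOTAL_ROWS // GROUPS          # average group size, rounded down
estimated_total = GROUPS * null_checks(rows_per_group)

print(f"rows per group: {rows_per_group}")
print(f"estimated comparisons: {estimated_total:.3e}")
```

Under that equal-size assumption the total lands on the order of 10^11 comparisons, which is consistent with a query that does not finish in a few hours on a laptop.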

I have a couple of other solutions that rely on the same ideas:

If the underlying index is a BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, which is the first one for each group_id if we traverse the index in descending order;
If we read the values that are covered by an index, they are read in the order of the index;
Each index implicitly contains the primary key columns appended to it (that is, the primary key is in the covering index). In the solutions below I operate directly on the primary key; in your case, you just need to add the primary key columns to the result;
In many cases it is much cheaper to collect the required row IDs in the required order in a subquery and then join the subquery result to the table on the ID. Since MySQL needs a single primary-key fetch for each row of the subquery result, the subquery is placed first in the join, and the rows are output in the order of the IDs in the subquery (if we omit an explicit ORDER BY for the join).
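
The last idea, collecting the IDs in a subquery and joining back on the primary key, can be sketched on the six sample rows from the question. This is a minimal illustration, not the author's benchmark: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the index name idx_messages_name_id is hypothetical. An explicit ORDER BY is kept so that the output order does not depend on the engine.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT);
    INSERT INTO messages VALUES
        (1, 'A', 'A_data_1'), (2, 'A', 'A_data_2'), (3, 'A', 'A_data_3'),
        (4, 'B', 'B_data_1'), (5, 'B', 'B_data_2'), (6, 'C', 'C_data_1');
    -- Hypothetical covering index on (group column, id).
    CREATE INDEX idx_messages_name_id ON messages (name, id);
""")

# Step 1 (subquery): find the largest id per name using only indexed columns.
# Step 2 (join): fetch the remaining columns by primary key.
last_rows = conn.execute("""
    SELECT m.id, m.name, m.other_columns
    FROM (SELECT name, MAX(id) AS id FROM messages GROUP BY name) AS ids
    JOIN messages AS m ON m.id = ids.id
    ORDER BY m.id
""").fetchall()

print(last_rows)
```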

The article "3 ways MySQL uses indexes" is a good read for some of the details.

Solution 1

This solution is very fast, taking about 0.8 seconds for my 18 million rows of data:

SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;

If you want the result in ascending order, put the query into a subquery that returns only the IDs, and use that subquery to join to the rest of the columns:

SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC) as ids
ORDER BY test_id;

For my data, this solution takes about 1.2 seconds.

Solution 2

Here is another solution. It takes about 19 seconds on my table:

SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC

It also returns test results in descending order. It's slower because it does a full index scan, but it can give you an idea of how to output the N maximum rows for each group.

The disadvantage of this query is that its result cannot be cached by the query cache.

P粉267791326 · Answer

MySQL 8.0 now supports window functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

This and other approaches to finding groupwise maximal rows are illustrated in the MySQL manual.
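
The same ranking query extends to the top N rows per group by relaxing the filter from rn = 1 to rn <= N. Below is a minimal sketch on the sample rows from the question. It assumes Python's built-in sqlite3 module as a stand-in for MySQL 8.0 (SQLite 3.25 or newer is needed for window functions), and the helper name top_n_per_group is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL 8.0 (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT);
    INSERT INTO messages VALUES
        (1, 'A', 'A_data_1'), (2, 'A', 'A_data_2'), (3, 'A', 'A_data_3'),
        (4, 'B', 'B_data_1'), (5, 'B', 'B_data_2'), (6, 'C', 'C_data_1');
""")


def top_n_per_group(connection, n):
    """Return (id, name, rn) for the n newest rows of each name."""
    return connection.execute("""
        WITH ranked_messages AS (
          SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
          FROM messages AS m
        )
        SELECT id, name, rn FROM ranked_messages
        WHERE rn <= ?
        ORDER BY name, rn
    """, (n,)).fetchall()


print(top_n_per_group(conn, 1))  # the last row of each group
print(top_n_per_group(conn, 2))  # the two newest rows of each group
```

Group C has only one row, so it contributes a single row regardless of N.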

The following is the original answer I wrote for this question in 2009:

I wrote the solution like this:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
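
To make the mechanics of this join visible, the sketch below lists every (m1, m2) pair the LEFT JOIN produces on the sample rows from the question, and then applies the IS NULL filter. It is only an illustration: Python's built-in sqlite3 module is assumed as a stand-in for MySQL, and the variable names are made up.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT);
    INSERT INTO messages VALUES
        (1, 'A', 'A_data_1'), (2, 'A', 'A_data_2'), (3, 'A', 'A_data_3'),
        (4, 'B', 'B_data_1'), (5, 'B', 'B_data_2'), (6, 'C', 'C_data_1');
""")

# Each row is paired with every newer row of the same name.
# A row with no newer sibling gets a single pair whose m2 side is NULL (None).
pairs = conn.execute("""
    SELECT m1.id, m2.id
    FROM messages m1 LEFT JOIN messages m2
      ON (m1.name = m2.name AND m1.id < m2.id)
    ORDER BY m1.id, m2.id
""").fetchall()

# Keeping only the unmatched rows leaves the last row of each group.
last_rows = conn.execute("""
    SELECT m1.*
    FROM messages m1 LEFT JOIN messages m2
      ON (m1.name = m2.name AND m1.id < m2.id)
    WHERE m2.id IS NULL
    ORDER BY m1.id
""").fetchall()

print(pairs)
print(last_rows)
```

Row 1 pairs with rows 2 and 3, row 2 pairs with row 3, and row 3 pairs with nothing, so only row 3 survives the filter for group A.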

Regarding performance, one solution or the other can be better, depending on the nature of your data. You should therefore test both queries and use the one that performs better on your database.

For example, I have a copy of the StackOverflow August data dump. I will use it for benchmarking. There are 1,114,357 rows of data in the Posts table. This is running MySQL 5.0.75 on my Macbook Pro 2.40GHz.

I will write a query to find the latest posts for a given user ID (mine).

First, using the technique shown by Eric, with GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             |
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

This is the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want another benchmark with a different version of MySQL, a different data set, or a different table design, feel free to run it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.