SQL query to select only rows with maximum value

Question

I have a table of documents (simplified version here):

id   rev   content
1    1     ...
2    1     ...
1    2     ...
1    3     ...

How do I select one row per id, keeping only the one with the greatest rev? With the above data, the result should contain two rows: [1, 3, ...] and [2, 1, ...]. I'm using MySQL.

Currently I use checks in a while loop to detect and overwrite old revs in the result set. But is this the only way to achieve the result? Isn't there a SQL solution?

P粉667649253 · Answer

I prefer to use as little code as possible...

You can do this with IN. Try this:

SELECT *
FROM t1
WHERE (id, rev) IN (
    SELECT id, MAX(rev)
    FROM t1
    GROUP BY id
)

In my opinion, this is simpler... easier to read and maintain.
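
To make that concrete, here is a minimal sketch run against the data from the question. The column types, the primary key and the content strings are my own placeholders (the question only shows "..."), so treat them as assumptions. Also note that the row comparison `(id, rev) IN (...)` works in MySQL but is not accepted by every RDBMS (SQL Server rejects it, for example).

```sql
-- Hypothetical sample table mirroring the question's data.
-- Types, key and content values are placeholders, not from the question.
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
    id      INT NOT NULL,
    rev     INT NOT NULL,
    content VARCHAR(100),
    PRIMARY KEY (id, rev)
);

INSERT INTO t1 (id, rev, content) VALUES
    (1, 1, 'doc 1, rev 1'),
    (2, 1, 'doc 2, rev 1'),
    (1, 2, 'doc 1, rev 2'),
    (1, 3, 'doc 1, rev 3');

-- The subquery yields one (id, MAX(rev)) pair per id: (1, 3) and (2, 1).
-- The outer query keeps only the rows whose (id, rev) matches one of those pairs.
SELECT *
FROM t1
WHERE (id, rev) IN (
    SELECT id, MAX(rev)
    FROM t1
    GROUP BY id
)
ORDER BY id;
-- Returns (1, 3, 'doc 1, rev 3') and (2, 1, 'doc 2, rev 1').
```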

P粉517475670 · Answer

At first glance...

All you need is a GROUP BY clause with the MAX aggregate function:

SELECT id, MAX(rev)
FROM YourTable
GROUP BY id

Things are never that simple, are they?

I just noticed that you also need the content column.

This is a very common problem in SQL: finding the whole row that holds the maximum value in some column for each group identifier. I've heard this question a lot in my career. In fact, it was one of the questions I answered in the technical interview for my current job.

This question is so common that the Stack Overflow community created a dedicated tag for problems like it: greatest-n-per-group.

Basically, you have two ways to solve this problem:

Joining with a simple `group-identifier, max-value-in-group` subquery

In this approach, you first find the group-identifier, max-value-in-group pairs (already solved above) in a subquery. You then join your table to the subquery with equality on both group-identifier and max-value-in-group:

SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) AS rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev
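
As a rough worked example, assuming the question's table is called YourTable and filled with placeholder content (the column types and content strings are my assumptions), the two stages look like this:

```sql
-- Hypothetical sample data; types and content values are placeholders.
DROP TABLE IF EXISTS YourTable;
CREATE TABLE YourTable (
    id      INT NOT NULL,
    rev     INT NOT NULL,
    content VARCHAR(100)
);

INSERT INTO YourTable (id, rev, content) VALUES
    (1, 1, 'doc 1, rev 1'),
    (2, 1, 'doc 2, rev 1'),
    (1, 2, 'doc 1, rev 2'),
    (1, 3, 'doc 1, rev 3');

-- Stage 1: the subquery on its own, one row per id.
SELECT id, MAX(rev) AS rev
FROM YourTable
GROUP BY id
ORDER BY id;
-- Returns (1, 3) and (2, 1).

-- Stage 2: join back to the table to pick up the remaining columns.
SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) AS rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev
ORDER BY a.id;
-- Returns (1, 3, 'doc 1, rev 3') and (2, 1, 'doc 2, rev 1').
```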

Left joining the table with itself, tweaking the join conditions and filters

In this approach, you left join the table with itself. Equality goes on the group-identifier. Then, two smart moves:

1. The second join condition requires the left-side value to be less than the right-side value.
2. Because of step 1, the row that actually holds the maximum value ends up with NULL on the right side (it's a LEFT JOIN, remember?). We then filter the joined result, keeping only the rows where the right side is NULL.

So, you end up with:

SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
    ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;
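
If the NULL trick feels like magic, it may help to look at the join before the filter is applied. The sketch below assumes the question's data in a table called YourTable, with placeholder types and content values; higher_rev is just an alias I picked for illustration.

```sql
-- Hypothetical sample data; types and content values are placeholders.
DROP TABLE IF EXISTS YourTable;
CREATE TABLE YourTable (
    id      INT NOT NULL,
    rev     INT NOT NULL,
    content VARCHAR(100)
);

INSERT INTO YourTable (id, rev, content) VALUES
    (1, 1, 'doc 1, rev 1'),
    (2, 1, 'doc 2, rev 1'),
    (1, 2, 'doc 1, rev 2'),
    (1, 3, 'doc 1, rev 3');

-- Same join as above, but without the WHERE clause, so every pairing is visible.
SELECT a.id, a.rev, b.rev AS higher_rev
FROM YourTable a
LEFT OUTER JOIN YourTable b
    ON a.id = b.id AND a.rev < b.rev
ORDER BY a.id, a.rev, b.rev;
-- id | rev | higher_rev
--  1 |  1  | 2
--  1 |  1  | 3
--  1 |  2  | 3
--  1 |  3  | NULL   <- no higher rev exists for id 1
--  2 |  1  | NULL   <- no higher rev exists for id 2
```

Only the rows with NULL in higher_rev survive the `WHERE b.id IS NULL` filter, and those are exactly the rows with the largest rev per id.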

Conclusion

Both approaches return exactly the same result.

If two rows share the same group-identifier and the same max-value-in-group, both approaches return both rows.
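
Here is a small sketch of the tie case. It assumes a table without a unique key on (id, rev), so that two rows can share the same id and the same highest rev; the table layout and content values are made up for illustration.

```sql
-- Hypothetical data with a tie: id 1 has two rows at rev 3.
DROP TABLE IF EXISTS YourTable;
CREATE TABLE YourTable (
    id      INT NOT NULL,
    rev     INT NOT NULL,
    content VARCHAR(100)
);

INSERT INTO YourTable (id, rev, content) VALUES
    (1, 1, 'old revision'),
    (1, 3, 'first copy of rev 3'),
    (1, 3, 'second copy of rev 3'),
    (2, 1, 'only revision');

-- Subquery join: (1, 3) matches both tied rows.
SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) AS rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev;

-- Self left join: neither tied row has a strictly higher rev, so both keep NULL on the right.
SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
    ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;

-- Each query returns three rows: both rev 3 rows for id 1, plus the single row for id 2.
```

If you need exactly one row per id even with ties, you would have to add a tie-breaker on another column; neither query does that by itself.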

Both approaches are ANSI SQL compatible, so they will work with your favorite RDBMS regardless of its "flavor".

Both approaches are also performance friendly, but your mileage may vary (RDBMS, database structure, indexes, etc.). So benchmark before you pick one approach over the other, and make sure you pick the one that makes the most sense to you.
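
One possible starting point for such a comparison in MySQL is sketched below. The index name idx_yourtable_id_rev and the table layout are my assumptions, and whether the optimizer actually uses the index depends on your data and server version, so read the plans rather than trusting this blindly. EXPLAIN syntax and output also differ between RDBMSs.

```sql
-- Hypothetical table layout; adjust to your real schema.
DROP TABLE IF EXISTS YourTable;
CREATE TABLE YourTable (
    id      INT NOT NULL,
    rev     INT NOT NULL,
    content VARCHAR(100)
);

-- Assumed composite index covering the grouping column and the compared column.
CREATE INDEX idx_yourtable_id_rev ON YourTable (id, rev);

-- Execution plan for the subquery join.
EXPLAIN
SELECT a.id, a.rev, a.content
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) AS rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev;

-- Execution plan for the self left join.
EXPLAIN
SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
    ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;
```

Plans on an empty table tell you very little, so load a realistic amount of data before comparing them, and time the actual queries as well.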
