How to select the first/least/max row per group in SQL

Thu, Dec 7, 2006 in Databases

Here are some common SQL problems, all of which have related solutions: how do I find the most recent log entry for each program? How do I find the most popular item from each category? How do I find the top score for each player? In general, these types of “select the extreme from each group” queries can be solved with the same techniques. I’ll explain how to do that in this article, including the harder problem of selecting the top N entries, not just the top 1.

This topic is related to numbering rows, which I just wrote about (see my articles about MySQL-specific and generic techniques to assign a number to each row in a group). Therefore I’ll use nearly the same table and data as I used in those articles, with the addition of a price column:

+——–+————+——-+

| type | variety | price |

+——–+————+——-+

| apple | gala | 2.79 |

| apple | fuji | 0.24 |

| apple | limbertwig | 2.87 |

| orange | valencia | 3.59 |

| orange | navel | 9.36 |

| pear | bradford | 6.05 |

| pear | bartlett | 2.14 |

| cherry | bing | 2.55 |

| cherry | chelan | 6.33 |

+——–+————+——-+

Selecting the one maximum row from each group

Let’s say I want to select the most recent log entry for each program, or the most recent changes in an audit table, or something of the sort. This question comes up over and over on IRC channels and mailing lists. I’ll re-phrase the question in terms of fruits. I want to select the cheapest fruit from each type. Here’s the desired result:

+——–+———-+——-+

| type | variety | price |

+——–+———-+——-+

| apple | fuji | 0.24 |

| orange | valencia | 3.59 |

| pear | bartlett | 2.14 |

| cherry | bing | 2.55 |

+——–+———-+——-+

There are a few common solutions to this problem. All involve two steps: finding the desired value of price, and then selecting the rest of the row based on that.

One common solution is a so-called self-join. Step one is to group the fruits by type (apple, cherry etc) and choose the minimum price:

select type, min(price) as minprice

from fruits

group by type;

+——–+———-+

| type | minprice |

+——–+———-+

| apple | 0.24 |

| cherry | 2.55 |

| orange | 3.59 |

| pear | 2.14 |

+——–+———-+

Step two is to select the rest of the row by joining these results back to the same table. Since the first query is grouped, it needs to be put into a subquery so it can be joined against the non-grouped table:

select f.type, f.variety, f.price

from (

select type, min(price) as minprice

from fruits group by type

) as x inner join fruits as f on f.type = x.type and f.price = x.minprice;

+——–+———-+——-+

| type | variety | price |

+——–+———-+——-+

| apple | fuji | 0.24 |

| cherry | bing | 2.55 |

| orange | valencia | 3.59 |

| pear | bartlett | 2.14 |

+——–+———-+——-+

Another common way to do this is with a correlated subquery. This can be much less efficient, depending on how good your system’s query optimizer is. You might find it clearer, though.

select type, variety, price

from fruits

where price = (select min(price) from fruits as f where f.type = fruits.type);

+——–+———-+——-+

| type | variety | price |

+——–+———-+——-+

| apple | fuji | 0.24 |

| orange | valencia | 3.59 |

| pear | bartlett | 2.14 |

| cherry | bing | 2.55 |

+——–+———-+——-+

Both queries are logically equivalent, though they may not perform the same.

Select the top N rows from each group

This is a slightly harder problem to solve. Finding a single row from each group is easy with SQL’s aggregate functions (MIN(), MAX(), and so on). Finding the first several from each group is not possible with that method because aggregate functions only return a single value. Still, it’s possible to do.

Let’s say I want to select the two cheapest fruits from each type. Here’s a first try:

select type, variety, price

from fruits

where price = (select min(price) from fruits as f where f.type = fruits.type)

or price = (select min(price) from fruits as f where f.type = fruits.type

and price > (select min(price) from fruits as f2 where f2.type = fruits.type));

+——–+———-+——-+

| type | variety | price |

+——–+———-+——-+

| apple | gala | 2.79 |

| apple | fuji | 0.24 |

| orange | valencia | 3.59 |

| orange | navel | 9.36 |

| pear | bradford | 6.05 |

| pear | bartlett | 2.14 |

| cherry | bing | 2.55 |

| cherry | chelan | 6.33 |

+——–+———-+——-+

Yuck! That can be written as a self-join, but it’s just as bad (I leave it as an exercise for the reader). This gets worse as you go to higher numbers (top 3, top 4…). There are other ways to phrase the statement, but they all boil down to the same thing, and they’re all pretty unwieldy and inefficient.

There’s a better way: select the variety from each type where the variety is no more than the second-cheapest of that type.

select type, variety, price

from fruits

where (

select count(*) from fruits as f

where f.type = fruits.type and f.price <= fruits.price

) <= 2;

This is elegant, and lets you vary N without rewriting your query (a very good thing!), but it’s functionally the same as the previous query. Both are essentially a quadratic algorithm relative to the number of varieties in each type. And again, some query optimizers may not do well with this and make it quadratic with respect to the number of rows in the table overall (especially if no useful index is defined), and the server might get clobbered. Are there better ways? Can it be done with one pass through the data, instead of the many passes required by a correlated subquery? You know it can, or I wouldn’t be writing this, now would I?