Home Uncategorized Who’s On First? Solving the Top per Group Problem (Part 1: Technique)

Who’s On First? Solving the Top per Group Problem (Part 1: Technique)

February 8, 2008

3118

Relative comparison is a simple matter of human nature. From early childhood we compare and contrast what we see in the world around us, building a means by which to rate what we experience. And as it turns out, this desire to discover top and bottom, rightmost and leftmost, or best and worst happens to extend quite naturally into business scenarios. Which product is the top seller? How about the one that’s simply not moving off the shelves? Which of our customers has placed the most expensive order? What are the most recent orders placed at each of our outlets?

In the world of common business questions, the edge cases are generally of most interest. What’s in the middle is unimportant; it’s often too difficult for the mind to compare and comprehend when there are hundreds, thousands, or even millions of items, transactions, or facts that are all within a similar range. Instead, we focus on those that stick out in some extraordinary way.

Those of us who work with SQL products on a regular basis are faced with solving this same problem time and again as we work through various business requirements. Over time, I have noticed four basic query patterns that can be used to solve the problem; each are logically equivalent (within certain restrictions — more on that later), but can have surprisingly different performance characteristics depending on the data being queried. In this first post, I will outline the available patterns/methods. In the following posts, I will show the results of testing each pattern against a variety of scenarios in an attempt to discover where and when each should be used.

The four basic patterns are outlined below. Each of the methods is illustrated using a query to show all customers’ names, plus their most recent order date, and the amount of that order. I’ve included notes that indicate where logic differences can arise among the various methods.

Method 1: Join to full group and use correlated subquery with a MIN/MAX aggregate to filter

In this method we use an inner join to get all required columns, then filter the resultant set using a correlated subquery in the WHERE clause.

SELECT
    c.FirstName,
    c.LastName,
    o.OrderDate,
    o.OrderAmount
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE 
    o.OrderDate  =
    (
        SELECT MAX(o1.OrderDate)
        FROM Orders o1
        WHERE 
            o1.CustomerId = o.CustomerId
    )

Logic notes: With this method ties are automatically included in the output, unless a tiebreaker is specified (which can be tricky given that you only have one column to work with). This method does not allow you to pull back an arbitrary number of rows, such as top 10 per customer; you are limited to the edge and any ties that might exist.

Method 1a: Join to full group and use correlated subquery with TOP(n) and ORDER BY to filter

This method is almost identical to Method 1 (which is why it is classified here as 1a), but the TOP and ORDER BY allow for a bit more flexibility than the aggregates.

SELECT
    c.FirstName,
    c.LastName,
    o.OrderDate,
    o.OrderAmount
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE 
    o.OrderDate  =
    (
        SELECT TOP(1)
            o1.OrderDate
        FROM Orders o1
        WHERE 
            o1.CustomerId = o.CustomerId
        ORDER BY
            o1.OrderDate DESC
    )

Logic notes: With this method you can more easily integrate a tiebreaker than with Method 1; the comparison column can be anything, including a primary key, and you can still order on whatever column makes most sense. In addition, you can take more rows than with Method 1 by using IN instead of = in the WHERE clause, and increasing the argument value to TOP.

Method 2: CROSS APPLY to ordered TOP(n)

In this method, SQL Server 2005’s CROSS APPLY operator is used. This operator allows us to essentially create a table-valued correlated subquery — something that impossible in previous versions of SQL Server. By using TOP in conjunction with ORDER BY we can get as many rows per group as needed.

SELECT
    c.FirstName,
    c.LastName,
    x.OrderDate,
    x.OrderAmount
FROM Customers c
CROSS APPLY
(
    SELECT TOP(1)
        o.OrderDate,
        o.OrderAmount
    FROM Orders o
    WHERE
        o.CustomerId = c.CustomerId
    ORDER BY
        o.OrderDate DESC
) x

Logic notes: This method is almost identical, from a logic point of view, with Method 1a modified to use IN on a primary key column. With both methods WITH TIES can be added to the TOP in order to get ties.

Method 3: Join to derived table that uses a partitioned, ordered windowing function, and filter in the outer query based on the row number

In this method a derived table or CTE is used, in conjunction with a windowing function partitioned based on the required grain of the final query. So for the “most recent order per customer” query, the row number is partitioned based on the customer. This gives us a count starting at 1 for each customer, which can be filtered in the outer query.

SELECT
    c.FirstName,
    c.LastName,
    x.OrderDate,
    x.OrderAmount
FROM Customers c
INNER JOIN
(
    SELECT 
        o.OrderDate,
        o.OrderAmount,
        o.CustomerId,
        ROW_NUMBER() OVER
        (
            PARTITION BY o.CustomerId
            ORDER BY o.OrderDate DESC
        ) AS r
    FROM Orders o
) x ON
    x.CustomerId = c.CustomerId
    AND x.r = 1

Logic notes: If ties are important, use DENSE_RANK instead of ROW_NUMBER. ROW_NUMBER is good for arbitrary TOP(n), similar to Method 2. Unlike the previously described methods, in conjunction with DENSE_RANK this method can return an arbitrary TOP(n) rows, all of which can include ties. So if you would like to see the three most recent order dates and each happens to have multiple orders, this method will be able to return them all by simply filtering on x.r = 3. This would not be directly possible with any of the other methods described here.

Method 4: “Carry-along sort”

This is the only “tricky” method, and not one that I recommend using, except as a last resort. I’m including it here only for completeness and comparison because it happens to be a very high performance method in some cases. This method involves converting each of the required inner columns into a string, concatenating them, then applying an aggregate to the string as a whole. By putting the “sort” column first, the other data is “carried along” — thus the name for the method. The concatenated data is then “unpacked” in the outer query. This can be surprisingly efficient from an I/O standpoint, but the resultant code is a maintenance nightmare and it is quite easy to introduce errors. In addition, this method can only return the top 1 per group — no ties or multiple return items are supported.

SELECT
    c.FirstName,
    c.LastName,
    CONVERT(DATETIME, SUBSTRING(x.OrderInfo, 1, 8)) AS OrderDate,
    CONVERT(MONEY, SUBSTRING(x.OrderInfo, 9, 15)) AS OrderAmount
FROM Customers c
INNER JOIN
(
    SELECT 
        o.CustomerId,
        MAX
        (
            CONVERT(CHAR(8), OrderDate, 112) +
            CONVERT(CHAR(15), SubTotal)
        ) OrderInfo
    FROM Orders o
    GROUP BY
        o.CustomerId
) x ON
    x.CustomerId = c.CustomerId

This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.

27 COMMENTS

Dave Markle February 10, 2008 At 5:56 am

Adam:
I use the top-per-group problem (off of the Northwind database) to help evaluate potential developer candidates. We give them one simple top-per-group question, BOL, and Query Analyzer, and NO time limit. You’d be shocked at how FEW candidates can get any one of these solutions. I’d say 5-10% of the people that make it though the phone interview get past this question. I’d normally curse you for giving our secret sauce away, but the sad truth is that none of the 90-95% of people that don’t get it will be reading this…
Gil February 10, 2008 At 9:21 am

In the method 3 discussion, did you mean x.r<=3 ?
Adam Machanic February 10, 2008 At 9:59 am

Gil: Absolutely! Thanks for pointing that out.
Adam Machanic February 10, 2008 At 10:02 am

And in retrospect, I think that discussion is slightly incorrect — you should get the same results using Method 1A, by using IN in the outer query and SELECT DISTINCT TOP(3) in the inner query. I’ll make sure to revisit that in a later part of the series.
Adam Machanic February 10, 2008 At 10:05 am

David: Having done a fair amount of interviewing myself, I can only say that I feel your pain and agree that this post will not make the slightest bit of difference 🙂
jerryhung February 11, 2008 At 10:07 pm

Good post, I have done 1, 1a, 3 myself, same logic, different usages
I really should *think* of using cross join often, but it’s new, and so bad (n x m size) that I normally don’t think about it
steve dassin February 12, 2008 At 8:05 am

For all those who would fail this developer test fret not. The whole weight of MS is not to further the 5% who pass but to enlighten the rest. I will be driving home this point to the AZ user group, http://www.azsqlserver.com/. Isn’t fair and balanced wonderful 🙂
Christopher Tan February 23, 2008 At 3:45 pm

It’s ok
Prabhakaran.S March 23, 2008 At 7:53 am

I have read the blog about "Carry-along sort" which was written by ADAM.
I am looking for more information about this sorting in more detail including performance as well.
Could you provide me some link or any pdf to get to know more about this sorting method
Expecting your response on this is much appreciated.
Regards,
Prab
Peter April 21, 2008 At 11:14 pm

Thanks for the reminder about CROSS and OUTER APPLY. I was struggling with some really hard to read code where there were a ton of correlated subqueries to pull in values for about 20-30 columns, all against the same data. No really good way of changing it because it is a view, but the OUTER APPLY let me pull in the appropriate data with little trouble and about 1/2 the cost of the original query.
Ben November 10, 2010 At 4:26 pm

Thanks for this info. Very useful.
But I would like to say I had MASSIVE improvement in query runtimes when using the "tricky" method. FROM 25 seconds using the method 1a to under one second using "tricks". Well ok, perhaps the indexes were not THAT great.
A tip for all trickers. I use bit-math instead of strings. The highest priority goes into the higest bits. Plenty of space for states and such in a BigInt.
All good wishes to you all
Ben
Joseph September 6, 2012 At 11:50 pm

Thank you so much for this! Great content!
Steve October 26, 2012 At 6:32 pm

I’m a developer, not a DBA, but I’ve worked with SQL Server for 15 years. I’d like to know if my solution is correct for this problem. My head hurt when I tried to understand your solutions so I went back to basics and came up with something that seems simpler to me.
We have a table of employees where the INACTIVE field can be 1 or 0. Sometimes an employee moves to a different store and (right or wrong) HR sets the old record to INACTIVE and creates a new active record with a new EMPLOYID (which is the primary key). I want to find the most recent EMPLOYID that is inactive to update the user’s account in ADSI. (I think that’s a TOP 1 of GROUP problem, but I’m not sure) So I added a PRIOREMPID field to hold the most recent EMPLOYID into my query of active employees. Here’s how I did it and it seems to work:
SELECT A.SOCSCNUM, A.EMPLOYID, B.EMPLOYID AS PRIOREMPID
FROM EMPLOYEES A
LEFT JOIN (SELECT SOCSCNUM, EMPLOYID, MAX(DEMPINAC) AS MAXDATE FROM EMPLOYEES WHERE INACTIVE = 1 GROUP BY SOCSCNUM, EMPLOYID) B
ON A.SOCSCNUM = B.SOCSCNUM
WHERE A.INACTIVE = 0
I doubt that I would pass Dave Markle’s candidate test, but I think he’d be passing on a good developer. Anyway, I’m not available right now.
http://www.stevegaines.us
Steve October 26, 2012 At 6:44 pm

After glancing at your solutions again, I think mine looks a lot like Method 4. Do you agree? You say it can only return the top 1 per group, which is exactly what I wanted. Multiples would be a problem in my case.
Thanks
Adam Machanic October 27, 2012 At 6:45 pm

Hi Steve,
What you’re describing does sound like a top 1 per group problem, but your solution doesn’t, to me, look correct. It’s unclear to me what the purpose is of finding the maximum "DEMPINAC" per social security number and employee ID combination. Also, what if HR marks TWO employee IDs inactive for a given SSN? Then there will be three different employee IDs, and your query will return two rows for that SSN. Is that correct?
Either way, this is not similar to method 4. Probably closest to method 1 — finding the max value of some correlation column and then using it to identify the "top" row(s). But that’s not exactly what your query is doing.
–Adam
crokusek March 7, 2013 At 9:47 pm

This is also a hard/fun problem to optimize when the top(n) values need to be distinct.
It might be an interesting article to see various ways of dealing with "distinct top(n) P, V" efficiently.
One way around this is to use row_number() on partition P atop a distinct non-top sub query. From my experience, SQL Server is not able to end the subquery for distinct values early enough (e.g. predicate pushing of row_number()).
dakra March 12, 2013 At 7:37 pm

If you only need one row per group, just do:
SELECT DISTINCT ColA, MAX(ColB), MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE), VAR(colF)
FROM Table1
GROUP BY ColA
ORDER BY ColA;
I included the MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE) to show that you can mix and match aggregation functions in one query.
Adam Machanic March 12, 2013 At 8:54 pm

@dakra: That’s not the same thing. You risk mixing and matching values from different rows if you do that. That’s the whole point of these techniques — to get exactly one row, with only that row’s values. No mix and match allowed.
Larry Rosenfeld June 25, 2013 At 1:14 am

this was very helpful. thanks so much!
Allan December 13, 2013 At 9:30 pm

Great job, I’ve worked endlessly to solve this with queries not as effective. Cheers
Brad Stiritz May 9, 2014 At 4:18 am

Hi Adam,
Thanks for the great post. I’m sorry, I couldn’t find your promised follow-up :
>This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.
If you did actually post that follow-up, would you mind please posting a link here? I’m very curious to see description & analysis of performance figures for your various solutions.
Thanks for your consideration
Ariski May 30, 2014 At 5:19 am

May I find the second part of this nice work please? Really need to dig on which one has the lowest I/O usage. Big thanks!
Maciej July 29, 2014 At 5:02 pm

I not tested but:
Query return only customers that have order not "all customers’ names".
What if many orders have the same date? Metod 1 return all of them, method 4 only one with max amount?
What about the First in Group when there are two or more criteria (filter) August 29, 2014 At 4:59 pm

Searching the internet for a solution.
There are two columns in a table that must only contain specific values (using an IN statement)
Now that the filter is applied, only the top 1 for the FK is required.
Looking for a FK with two null columns or a FK with any of the two filtered values.
For now I am using the general format of:
SELECT DISTINCT ColA, MAX(ColB), MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE), VAR(colF)
FROM Table1
GROUP BY ColA
ORDER BY ColA;
since I somewhat don’t care about specific values being mixed, it is more of a rule check that there is a record with this criteria or there isn’t.
When I take your excellent examples and try to add the filters, I just cant get it to work. Any advice or suggestions would be very welcome.
Tom September 16, 2014 At 1:38 pm

Great example it just solved a problem of getting the last update for each group/tag for me. I modified example 1a.
Mark Mazz March 9, 2015 At 11:39 pm

How would this problem be solved in a SQL View (where you can’t use ORDER BY)?
Adam Machanic March 10, 2015 At 3:26 am

@Mark
You can use ORDER BY in a view for logical purposes, just not for presentation purposes. (In other words, an ORDER BY in a subquery or as part of a window function is fine; it’s only output ORDER BY that’s not supported.)

Comments are closed.

Who’s On First? Solving the Top per Group Problem (Part 1: Technique)

27 COMMENTS

Popular Posts

Looking Forward 100 Months (T-SQL Tuesday #100): The Roundup

T-SQL Tuesday #200 (v1.0): Last of the DBAs

Invitation: T-SQL Tuesday #100 – Looking Forward 100 Months

The Career Influence of Usenet Strangers (T-SQL Tuesday #096)

Playing the Third-Party Recruiter Game (T-SQL Tuesday #093)

The SQL Hall of Shame

The Guru (T-SQL Tuesday #089)

Solving the Net Changes Problem with Temporal Tables (T-SQL Tuesday #087)