Home Uncategorized Who’s On First? Solving the Top per Group Problem (Part 1: Technique)

    Who’s On First? Solving the Top per Group Problem (Part 1: Technique)

    626
    27

    Relative comparison is a simple matter of human nature. From early childhood we compare and contrast what we see in the world around us, building a means by which to rate what we experience. And as it turns out, this desire to discover top and bottom, rightmost and leftmost, or best and worst happens to extend quite naturally into business scenarios. Which product is the top seller? How about the one that’s simply not moving off the shelves? Which of our customers has placed the most expensive order? What are the most recent orders placed at each of our outlets?

    In the world of common business questions, the edge cases are generally of most interest. What’s in the middle is unimportant; it’s often too difficult for the mind to compare and comprehend when there are hundreds, thousands, or even millions of items, transactions, or facts that are all within a similar range. Instead, we focus on those that stick out in some extraordinary way.

    Those of us who work with SQL products on a regular basis are faced with solving this same problem time and again as we work through various business requirements. Over time, I have noticed four basic query patterns that can be used to solve the problem; each are logically equivalent (within certain restrictions — more on that later), but can have surprisingly different performance characteristics depending on the data being queried. In this first post, I will outline the available patterns/methods. In the following posts, I will show the results of testing each pattern against a variety of scenarios in an attempt to discover where and when each should be used.

    The four basic patterns are outlined below. Each of the methods is illustrated using a query to show all customers’ names, plus their most recent order date, and the amount of that order. I’ve included notes that indicate where logic differences can arise among the various methods.

    Method 1: Join to full group and use correlated subquery with a MIN/MAX aggregate to filter

    In this method we use an inner join to get all required columns, then filter the resultant set using a correlated subquery in the WHERE clause.

    SELECT
        c.FirstName,
        c.LastName,
        o.OrderDate,
        o.OrderAmount
    FROM Customers c
    JOIN Orders o ON o.CustomerId = c.CustomerId
    WHERE 
        o.OrderDate  =
        (
            SELECT MAX(o1.OrderDate)
            FROM Orders o1
            WHERE 
                o1.CustomerId = o.CustomerId
        )
    

    Logic notes: With this method ties are automatically included in the output, unless a tiebreaker is specified (which can be tricky given that you only have one column to work with). This method does not allow you to pull back an arbitrary number of rows, such as top 10 per customer; you are limited to the edge and any ties that might exist.

    Method 1a: Join to full group and use correlated subquery with TOP(n) and ORDER BY to filter

    This method is almost identical to Method 1 (which is why it is classified here as 1a), but the TOP and ORDER BY allow for a bit more flexibility than the aggregates.

    SELECT
        c.FirstName,
        c.LastName,
        o.OrderDate,
        o.OrderAmount
    FROM Customers c
    JOIN Orders o ON o.CustomerId = c.CustomerId
    WHERE 
        o.OrderDate  =
        (
            SELECT TOP(1)
                o1.OrderDate
            FROM Orders o1
            WHERE 
                o1.CustomerId = o.CustomerId
            ORDER BY
                o1.OrderDate DESC
        )

    Logic notes: With this method you can more easily integrate a tiebreaker than with Method 1; the comparison column can be anything, including a primary key, and you can still order on whatever column makes most sense. In addition, you can take more rows than with Method 1 by using IN instead of = in the WHERE clause, and increasing the argument value to TOP.

    Method 2: CROSS APPLY to ordered TOP(n)

    In this method, SQL Server 2005’s CROSS APPLY operator is used. This operator allows us to essentially create a table-valued correlated subquery — something that impossible in previous versions of SQL Server. By using TOP in conjunction with ORDER BY we can get as many rows per group as needed.

    SELECT
        c.FirstName,
        c.LastName,
        x.OrderDate,
        x.OrderAmount
    FROM Customers c
    CROSS APPLY
    (
        SELECT TOP(1)
            o.OrderDate,
            o.OrderAmount
        FROM Orders o
        WHERE
            o.CustomerId = c.CustomerId
        ORDER BY
            o.OrderDate DESC
    ) x
    

    Logic notes: This method is almost identical, from a logic point of view, with Method 1a modified to use IN on a primary key column. With both methods WITH TIES can be added to the TOP in order to get ties.

    Method 3: Join to derived table that uses a partitioned, ordered windowing function, and filter in the outer query based on the row number

    In this method a derived table or CTE is used, in conjunction with a windowing function partitioned based on the required grain of the final query. So for the “most recent order per customer” query, the row number is partitioned based on the customer. This gives us a count starting at 1 for each customer, which can be filtered in the outer query.

    SELECT
        c.FirstName,
        c.LastName,
        x.OrderDate,
        x.OrderAmount
    FROM Customers c
    INNER JOIN
    (
        SELECT 
            o.OrderDate,
            o.OrderAmount,
            o.CustomerId,
            ROW_NUMBER() OVER
            (
                PARTITION BY o.CustomerId
                ORDER BY o.OrderDate DESC
            ) AS r
        FROM Orders o
    ) x ON
        x.CustomerId = c.CustomerId
        AND x.r = 1
    

    Logic notes: If ties are important, use DENSE_RANK instead of ROW_NUMBER. ROW_NUMBER is good for arbitrary TOP(n), similar to Method 2. Unlike the previously described methods, in conjunction with DENSE_RANK this method can return an arbitrary TOP(n) rows, all of which can include ties. So if you would like to see the three most recent order dates and each happens to have multiple orders, this method will be able to return them all by simply filtering on x.r = 3. This would not be directly possible with any of the other methods described here.

    Method 4: “Carry-along sort”

    This is the only “tricky” method, and not one that I recommend using, except as a last resort. I’m including it here only for completeness and comparison because it happens to be a very high performance method in some cases. This method involves converting each of the required inner columns into a string, concatenating them, then applying an aggregate to the string as a whole. By putting the “sort” column first, the other data is “carried along” — thus the name for the method. The concatenated data is then “unpacked” in the outer query. This can be surprisingly efficient from an I/O standpoint, but the resultant code is a maintenance nightmare and it is quite easy to introduce errors. In addition, this method can only return the top 1 per group — no ties or multiple return items are supported.

    SELECT
        c.FirstName,
        c.LastName,
        CONVERT(DATETIME, SUBSTRING(x.OrderInfo, 1, 8)) AS OrderDate,
        CONVERT(MONEY, SUBSTRING(x.OrderInfo, 9, 15)) AS OrderAmount
    FROM Customers c
    INNER JOIN
    (
        SELECT 
            o.CustomerId,
            MAX
            (
                CONVERT(CHAR(8), OrderDate, 112) +
                CONVERT(CHAR(15), SubTotal)
            ) OrderInfo
        FROM Orders o
        GROUP BY
            o.CustomerId
    ) x ON
        x.CustomerId = c.CustomerId
    

    This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.
     

    Previous articleA Gift of Script for 2008: Who’s Active, What Are They Doing, and Who is Blocked?
    Next articleFaster, More Scalable SQLCLR String Splitting
    Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.

    27 COMMENTS

    1. Adam:
      I use the top-per-group problem (off of the Northwind database) to help evaluate potential developer candidates.  We give them one simple top-per-group question, BOL, and Query Analyzer, and NO time limit.  You’d be shocked at how FEW candidates can get any one of these solutions.  I’d say 5-10% of the people that make it though the phone interview get past this question.  I’d normally curse you for giving our secret sauce away, but the sad truth is that none of the 90-95% of people that don’t get it will be reading this…

    2. And in retrospect, I think that discussion is slightly incorrect — you should get the same results using Method 1A, by using IN in the outer query and SELECT DISTINCT TOP(3) in the inner query.  I’ll make sure to revisit that in a later part of the series.

    3. David: Having done a fair amount of interviewing myself, I can only say that I feel your pain and agree that this post will not make the slightest bit of difference 🙂

    4. Good post, I have done 1, 1a, 3 myself, same logic, different usages
      I really should *think* of using cross join often, but it’s new, and so bad (n x m size) that I normally don’t think about it

    5. For all those who would fail this developer test fret not. The whole weight of MS is not to further the 5% who pass but to enlighten the rest. I will be driving home this point to the AZ user group, http://www.azsqlserver.com/. Isn’t fair and balanced wonderful 🙂

    6. I have read the blog about "Carry-along sort" which was written by ADAM.
      I am looking for more information about this sorting in more detail including performance as well.
      Could you provide me some link or any pdf to get to know more about this sorting method
      Expecting your response on this is much appreciated.
      Regards,
      Prab

    7. Thanks for the reminder about CROSS and OUTER APPLY.  I was struggling with some really hard to read code where there were a ton of correlated subqueries to pull in values for about 20-30 columns, all against the same data.  No really good way of changing it because it is a view, but the OUTER APPLY let me pull in the appropriate data with little trouble and about 1/2 the cost of the original query.

    8. Thanks for this info. Very useful.
      But I would like to say I had MASSIVE improvement in query runtimes when using the "tricky" method. FROM 25 seconds using the method 1a to under one second using "tricks". Well ok, perhaps the indexes were not THAT great.
      A tip for all trickers. I use bit-math instead of strings. The highest priority goes into the higest bits. Plenty of space for states and such in a BigInt.
      All good wishes to you all
      Ben

    9. I’m a developer, not a DBA, but I’ve worked with SQL Server for 15 years.  I’d like to know if my solution is correct for this problem.  My head hurt when I tried to understand your solutions so I went back to basics and came up with something that seems simpler to me.  
      We have a table of employees where the INACTIVE field can be 1 or 0.  Sometimes an employee moves to a different store and (right or wrong) HR sets the old record to INACTIVE and creates a new active record with a new EMPLOYID (which is the primary key).  I want to find the most recent EMPLOYID that is inactive to update the user’s account in ADSI.  (I think that’s a TOP 1 of GROUP problem, but I’m not sure)  So I added a PRIOREMPID field to hold the most recent EMPLOYID into my query of active employees.  Here’s how I did it and it seems to work:
      SELECT A.SOCSCNUM, A.EMPLOYID, B.EMPLOYID AS PRIOREMPID
      FROM EMPLOYEES A
      LEFT JOIN (SELECT SOCSCNUM, EMPLOYID, MAX(DEMPINAC) AS MAXDATE FROM EMPLOYEES WHERE INACTIVE = 1 GROUP BY SOCSCNUM, EMPLOYID) B
      ON A.SOCSCNUM = B.SOCSCNUM
      WHERE A.INACTIVE = 0
      I doubt that I would pass Dave Markle’s candidate test, but I think he’d be passing on a good developer.  Anyway, I’m not available right now.
      http://www.stevegaines.us

    10. After glancing at your solutions again, I think mine looks a lot like Method 4.  Do you agree?  You say it can only return the top 1 per group, which is exactly what I wanted.  Multiples would be a problem in my case.
      Thanks

    11. Hi Steve,
      What you’re describing does sound like a top 1 per group problem, but your solution doesn’t, to me, look correct. It’s unclear to me what the purpose is of finding the maximum "DEMPINAC" per social security number and employee ID combination. Also, what if HR marks TWO employee IDs inactive for a given SSN? Then there will be three different employee IDs, and your query will return two rows for that SSN. Is that correct?
      Either way, this is not similar to method 4. Probably closest to method 1 — finding the max value of some correlation column and then using it to identify the "top" row(s). But that’s not exactly what your query is doing.
      –Adam

    12. This is also a hard/fun problem to optimize when the top(n) values need to be distinct.  
      It might be an interesting article to see various ways of dealing with "distinct top(n) P, V" efficiently.
      One way around this is to use row_number() on partition P atop a distinct non-top sub query.  From my experience, SQL Server is not able to end the subquery for distinct values early enough (e.g. predicate pushing of row_number()).

    13. If you only need one row per group, just do:
      SELECT DISTINCT ColA, MAX(ColB), MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE), VAR(colF)
      FROM Table1
      GROUP BY  ColA
      ORDER BY  ColA;
      I included the MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE) to show that you can mix and match aggregation functions in one query.

    14. @dakra: That’s not the same thing. You risk mixing and matching values from different rows if you do that. That’s the whole point of these techniques — to get exactly one row, with only that row’s values. No mix and match allowed.

    15. Hi Adam,
      Thanks for the great post. I’m sorry, I couldn’t find your promised follow-up :
      >This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.
      If you did actually post that follow-up, would you mind please posting a link here? I’m very curious to see description & analysis of performance figures for your various solutions.
      Thanks for your consideration

    16. May I find the second part of this nice work please? Really need to dig on which one has the lowest I/O usage. Big thanks!

    17. I not tested but:
      Query return only customers that have order not "all customers’ names".
      What if many orders have the same date? Metod 1 return all of them, method 4 only one with max amount?

    18. Searching the internet for a solution.
      There are two columns in a table that must only contain specific values (using an IN statement)
      Now that the filter is applied, only the top 1 for the FK is required.
      Looking for a FK with two null columns or a FK with any of the two filtered values.
      For now I am using the general format of:
      SELECT DISTINCT ColA, MAX(ColB), MIN(ColC), AVG(ColD), Sum(ColD), Count(ColE), VAR(colF)
      FROM Table1
      GROUP BY  ColA
      ORDER BY  ColA;
      since I somewhat don’t care about specific values being mixed, it is more of a rule check that there is a record with this criteria or there isn’t.
      When I take your excellent examples and try to add the filters, I just cant get it to work. Any advice or suggestions would be very welcome.

    19. @Mark
      You can use ORDER BY in a view for logical purposes, just not for presentation purposes. (In other words, an ORDER BY in a subquery or as part of a window function is fine; it’s only output ORDER BY that’s not supported.)

    Comments are closed.