Relative comparison is a simple matter of human nature. From early childhood we compare and contrast what we see in the world around us, building a means by which to rate what we experience. And as it turns out, this desire to discover top and bottom, rightmost and leftmost, or best and worst happens to extend quite naturally into business scenarios. Which product is the top seller? How about the one that’s simply not moving off the shelves? Which of our customers has placed the most expensive order? What are the most recent orders placed at each of our outlets?
In the world of common business questions, the edge cases are generally of most interest. What’s in the middle is unimportant; it’s often too difficult for the mind to compare and comprehend when there are hundreds, thousands, or even millions of items, transactions, or facts that are all within a similar range. Instead, we focus on those that stick out in some extraordinary way.
Those of us who work with SQL products on a regular basis are faced with solving this same problem time and again as we work through various business requirements. Over time, I have noticed four basic query patterns that can be used to solve the problem; each are logically equivalent (within certain restrictions — more on that later), but can have surprisingly different performance characteristics depending on the data being queried. In this first post, I will outline the available patterns/methods. In the following posts, I will show the results of testing each pattern against a variety of scenarios in an attempt to discover where and when each should be used.
The four basic patterns are outlined below. Each of the methods is illustrated using a query to show all customers’ names, plus their most recent order date, and the amount of that order. I’ve included notes that indicate where logic differences can arise among the various methods.
Method 1: Join to full group and use correlated subquery with a MIN/MAX aggregate to filter
In this method we use an inner join to get all required columns, then filter the resultant set using a correlated subquery in the WHERE clause.
SELECT c.FirstName, c.LastName, o.OrderDate, o.OrderAmount FROM Customers c JOIN Orders o ON o.CustomerId = c.CustomerId WHERE o.OrderDate = ( SELECT MAX(o1.OrderDate) FROM Orders o1 WHERE o1.CustomerId = o.CustomerId )
Logic notes: With this method ties are automatically included in the output, unless a tiebreaker is specified (which can be tricky given that you only have one column to work with). This method does not allow you to pull back an arbitrary number of rows, such as top 10 per customer; you are limited to the edge and any ties that might exist.
Method 1a: Join to full group and use correlated subquery with TOP(n) and ORDER BY to filter
This method is almost identical to Method 1 (which is why it is classified here as 1a), but the TOP and ORDER BY allow for a bit more flexibility than the aggregates.
SELECT c.FirstName, c.LastName, o.OrderDate, o.OrderAmount FROM Customers c JOIN Orders o ON o.CustomerId = c.CustomerId WHERE o.OrderDate = ( SELECT TOP(1) o1.OrderDate FROM Orders o1 WHERE o1.CustomerId = o.CustomerId ORDER BY o1.OrderDate DESC )
Logic notes: With this method you can more easily integrate a tiebreaker than with Method 1; the comparison column can be anything, including a primary key, and you can still order on whatever column makes most sense. In addition, you can take more rows than with Method 1 by using IN instead of = in the WHERE clause, and increasing the argument value to TOP.
Method 2: CROSS APPLY to ordered TOP(n)
In this method, SQL Server 2005’s CROSS APPLY operator is used. This operator allows us to essentially create a table-valued correlated subquery — something that impossible in previous versions of SQL Server. By using TOP in conjunction with ORDER BY we can get as many rows per group as needed.
SELECT c.FirstName, c.LastName, x.OrderDate, x.OrderAmount FROM Customers c CROSS APPLY ( SELECT TOP(1) o.OrderDate, o.OrderAmount FROM Orders o WHERE o.CustomerId = c.CustomerId ORDER BY o.OrderDate DESC ) x
Logic notes: This method is almost identical, from a logic point of view, with Method 1a modified to use IN on a primary key column. With both methods WITH TIES can be added to the TOP in order to get ties.
Method 3: Join to derived table that uses a partitioned, ordered windowing function, and filter in the outer query based on the row number
In this method a derived table or CTE is used, in conjunction with a windowing function partitioned based on the required grain of the final query. So for the “most recent order per customer” query, the row number is partitioned based on the customer. This gives us a count starting at 1 for each customer, which can be filtered in the outer query.
SELECT c.FirstName, c.LastName, x.OrderDate, x.OrderAmount FROM Customers c INNER JOIN ( SELECT o.OrderDate, o.OrderAmount, o.CustomerId, ROW_NUMBER() OVER ( PARTITION BY o.CustomerId ORDER BY o.OrderDate DESC ) AS r FROM Orders o ) x ON x.CustomerId = c.CustomerId AND x.r = 1
Logic notes: If ties are important, use DENSE_RANK instead of ROW_NUMBER. ROW_NUMBER is good for arbitrary TOP(n), similar to Method 2. Unlike the previously described methods, in conjunction with DENSE_RANK this method can return an arbitrary TOP(n) rows, all of which can include ties. So if you would like to see the three most recent order dates and each happens to have multiple orders, this method will be able to return them all by simply filtering on x.r = 3. This would not be directly possible with any of the other methods described here.
Method 4: “Carry-along sort”
This is the only “tricky” method, and not one that I recommend using, except as a last resort. I’m including it here only for completeness and comparison because it happens to be a very high performance method in some cases. This method involves converting each of the required inner columns into a string, concatenating them, then applying an aggregate to the string as a whole. By putting the “sort” column first, the other data is “carried along” — thus the name for the method. The concatenated data is then “unpacked” in the outer query. This can be surprisingly efficient from an I/O standpoint, but the resultant code is a maintenance nightmare and it is quite easy to introduce errors. In addition, this method can only return the top 1 per group — no ties or multiple return items are supported.
SELECT c.FirstName, c.LastName, CONVERT(DATETIME, SUBSTRING(x.OrderInfo, 1, 8)) AS OrderDate, CONVERT(MONEY, SUBSTRING(x.OrderInfo, 9, 15)) AS OrderAmount FROM Customers c INNER JOIN ( SELECT o.CustomerId, MAX ( CONVERT(CHAR(8), OrderDate, 112) + CONVERT(CHAR(15), SubTotal) ) OrderInfo FROM Orders o GROUP BY o.CustomerId ) x ON x.CustomerId = c.CustomerId
This post is just the beginning; watch this space in the coming days for a series of performance tests and analysis of these methods, and some results that I personally found to be quite surprising.