ROW_NUMBER(), RANK(), and DENSE_RANK() – Flexibility at a Price
One of the most handy features introduced in SQL 2005 were the ranking functions; ROW_NUMBER(), RANK(), and DENSE_RANK(). For anyone who hasn’t been introduced to these syntactic gems, here’s a quick rundown (for those of you who are very familiar with these functions already, feel free to read through, or skip right down to “There’s No Such Thing as a Free Ride” below).
OK – so the general syntax for any one of these commands is more or less the same:
ROW_NUMBER() OVER ([<partition_by_clause>] <order_by_clause>)
RANK() OVER ([<partition_by_clause>] <order_by_clause>)
DENSE_RANK() OVER ([<partition_by_clause>] <order_by_clause>)
So the PARTITION BY part is optional, but everything else is required. An example of a non partitioned, and then a partitioned ROW_NUMBER() clause are listed below:
ROW_NUMBER() OVER (ORDER BY TotalDue DESC)
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC)
The difference between the three functions is best explained using an example. Here’s the data I’m using for this example, in case you want to follow the bouncing ball at home ;-)
CREATE TABLE OrderRanking
(
OrderID INT IDENTITY(1,1) NOT NULL,
CustomerID INT,
OrderTotal decimal(15,2)
)
INSERT OrderRanking (CustomerID, OrderTotal)
SELECT 1, 1000
UNION
SELECT 1, 500
UNION
SELECT 1, 650
UNION
SELECT 1, 3000
UNION
SELECT 2, 1000
UNION
SELECT 2, 2000
UNION
SELECT 2, 500
UNION
SELECT 2, 500
UNION
SELECT 3, 500
I’ll use the following (admittedly ugly) query to demonstrate the difference between each function:
SELECT *,
ROW_NUMBER() OVER (ORDER BY OrderTotal DESC) AS RN,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RNP,
RANK() OVER (ORDER BY OrderTotal DESC) AS R,
RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS RP,
DENSE_RANK() OVER (ORDER BY OrderTotal DESC) AS DR,
DENSE_RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS DRP
FROM OrderRanking
ORDER BY OrderTotal DESC
Excuse the terrible aliases. Anything longer and the code snippets and output in this blog entry get really, really ugly. When we run the query, this is what we get:
So from the example above, we can see that:
- ROW_NUMBER() assigns sequential numbers to each partition in a result set (an unpartitioned result set simply has a single partition), based upon the order of the results as specified in the ORDER BY clause. If you look carefully, you’ll see that the values in column RN are based upon a simple sort of TotalDue, while the values in Column RNP (Row_Number partitioned) are first partitioned or “grouped” by CustomerID, and then numbered by TotalDue, with the row number resetting on change of customer.
- Contrary to popular belief, RANK() does not sort rows based upon how bad they smell. RANK() does much the same thing as ROW_NUMBER(), only it acknowledges ties in the columns specified in the ORDER BY clause, and assigns them the same rank. Where a tie occurs (as was the case for orders 6/3, and 1/5/8), the numbers that would otherwise have been “used up” are skipped, and numbering resumes at the next available number. As you can see, RANK() leaves a gap whenever there is a tie.
- DENSE_RANK() doesn’t like gaps. It’s more of an Abercrombie & Fitch kind of function (ba-dum-ching!). Ohhhhh…that was terrible. My sense of humour may give me up for lent. You might follow it. Anyway….DENSE_RANK() “fills in the gaps”. It starts from the next number after a tie occurs, so instead of 1, 2, 3, 3, 5 you get 1, 2, 3, 3, 4.
There’s No Such Thing as a Free Ride
Ranking functions are not only useful for simple ranking – they’re also great for solving complex problems. In fact, once you get to know them, you’ll find that you’re using them for waaaaaay more than just ranking. In anything from splitting strings to deleting duplicates, ranking functions are the cat’s meow.
But just like most things in life SQL Server, less is more, when you can get away with it. For example, let’s say that you need to get the top order (by TotalDue) for each Customer in AdventureWorks (AdventureWorks2008 in my examples below). You can definitely use ROW_NUMBER (or any of the other ranking functions, for that matter) to do this:
SELECT soh.*
FROM (SELECT CustomerID, SalesOrderID, TotalDue,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TotalDue DESC) AS RowNumber
FROM Sales.SalesOrderHeader) AS soh
WHERE soh.RowNumber = 1
The WHERE soh.RowNumber = 1 restricts our results to the top order for each customer. Lovely. And the really beautiful thing about this is, if you need the top 2 orders for each customer, or 3, or 4, or x, all you need to do is replace the = 1 with <=2 (for example), and you’re good to go. Now, with that in mind, let’s look at this query:
SELECT soh.CustomerID, soh.TotalDue
FROM Sales.SalesOrderHeader soh
JOIN (SELECT CustomerID, MAX(TotalDue) AS MaxTotalDue
FROM Sales.SalesOrderHeader
GROUP BY CustomerID) AS ttls ON soh.CustomerID = ttls.CustomerID
AND soh.TotalDue = ttls.MaxTotalDue
If you plug this bad boy in, and run it, you might be surprised by the outcome. It’s actually about half as expensive as the Row_Number solution – but why? Well, as you may or may not know, sorts can be very, very expensive in SQL Server. If we’re only fetching the highest $ sales order for a given customer, the MAX solution does it without a sort, whereas the ROW_NUMBER solution needs to sort (the ORDER BY clause is mandatory, remember).
But there are some caveats to the MAX solution – most notably, how in the world can we get the top 5 orders for each customer? Well…the short answer is, we can’t. We need to change the query up, and in doing so, we’re once again going to incur a sort. once we get beyond a query that the MAX or MIN tricks can satisfy – for instance, if we need to fetch the top 5 orders for each customer, we may as well take advantage of the ease of coding, and the improved readability of the Row_Number solution. If we want a solution for the “top 5” problem without invoking a ranking function, we’re going to end up with something like this:
SELECT soh.CustomerID, soh.TotalDue
FROM Sales.SalesOrderHeader soh
WHERE soh.SalesOrderID IN
(SELECT TOP 5 SalesOrderID
FROM Sales.SalesOrderHeader soh2
WHERE soh2.CustomerID = soh.CustomerID
ORDER BY TotalDue DESC)
Which in this case is a very, very crappy alternative to a ranking function. Not only is it uglier, but the query plan isn’t nearly as efficient, and the execution times were consistently about 20% longer in my tests.
Now that said, my tests are against a single data set only, and based upon the nature of your data, your mileage may vary. As a general rule, I would use aggregate functions if I’m only looking for the highest or lowest data point in a series, and a ranking function for anything that can’t be solved by simple aggregation.
'메모' 카테고리의 다른 글
북한 여행 가능한가요?...펌 (0) | 2013.07.20 |
---|---|
애니팡 토파즈...펌 (0) | 2013.07.17 |
mssql backup 쿼리, noskip, noformat, noinit, norewind, nounload, stats...펌 (0) | 2013.07.16 |
create iif function in sql...펌 (0) | 2013.07.16 |
도스 dir /s 파일 폴더 찾기...펌 (0) | 2013.07.13 |
11 comments:
Hi,
How can I convert these two rows to access sql?
dense_rank() over(partition by field1 order by field2) as name1,
row_number() over(partition by fld1, (dense_rank() over(partition by fldnm1 order by fldnm2)) order by fld2) as name2
thanks in advance
Hi hatic,
Sorry - I don't know Access SQL syntax from a hole in the wall. You can give the Microsoft Access Newsgroups a try though. Good luck!
Hi Aaron...
For the TOP 5 orders per customer, the following CROSS APPLY method is a more efficient approach than the SalesOrderID IN method. The query plan is roughly half the cost. (Excuse the formatting... it may not come through):
SELECT CustomerID,TotalDue
FROM (SELECT DISTINCT CustomerID
FROM Sales.SalesOrderHeader) soh
CROSS APPLY (SELECT TOP 5 TotalDue
FROM Sales.SalesOrderHeader soh2
WHERE soh2.CustomerID=soh.CustomerID
ORDER BY TotalDue DESC) F1
That being said, though, the ROW_NUMBER() approach is the way to go (no question) for this type of thing.
--Brad
I've been using CROSS APPLY for only one record at a time. I forgot/didn't realize it can return an entire set.
Thank you so much.
Excellent Article ThankQ so much for posting this article
Excellent post, thanks for the article.
Hi Great article.This tutorial saved me a lot in trying to write my own custom ranking function.By simply following your example i had to substitute row_number() with rank() in three of my stored procedures and voila!!!, my problem of detecting ties in records was solved.Keep up the good work if you were near i would have bought you a cup of coffee.
Thanks your article it saved me a lot of time that I would have spent tying to write a custom ranking function.Substituted the ROW_NUMBER() function with RANK() function in three of my stored procedures and voila!!!!, ties are being smoothly detected.Would have bought you a cup of coffee if you were near.We need people in the world like you.
I know this is a bit late ...
But I'm curious about the queries you tested in the 'no free ride' section. The 'MAX' query is actually sensibly less-expensive, namely because everything but the inner subquery is extraneous. I wonder what your performance test results would be like if the outer query also returned an additional column from the SalesOrderHeader table. I have a hunch that they'll be different and I think that's what I've observed before.
Very Helpful :)
Thank You