

Ranking Functions and Performance in SQL Server 2005

By Alex Kozak 20 April 2006

Ranking functions, introduced in SQL Server 2005, are a great enhancement to Transact-SQL. Many tasks, such as creating arrays, generating sequential numbers, and finding ranks, which required many lines of code in pre-2005 versions, can now be implemented much more easily and quickly.

Let's look at the syntax of ranking functions:

ROW_NUMBER () OVER ([<partition_by_clause>] <order_by_clause>)
RANK () OVER ([<partition_by_clause>] <order_by_clause>)
DENSE_RANK () OVER ([<partition_by_clause>] <order_by_clause>)
NTILE (integer_expression) OVER ([<partition_by_clause>] <order_by_clause>)

All four functions have "partition by" and "order by" clauses, which makes them very flexible and useful. However, there is one nuance in the syntax that deserves your attention: the "order by" clause is not optional.
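To make the differences between the four functions concrete, here is a minimal sketch that runs all of them over the same four values. The table variable @t and its contents are made up for illustration only and are not part of the article's listings.

```sql
-- Hypothetical sample data: two duplicates (4, 4) and two distinct values.
DECLARE @t TABLE (orderID int NOT NULL);

INSERT INTO @t
SELECT 4 UNION ALL SELECT 4 UNION ALL SELECT 7 UNION ALL SELECT 11;

SELECT orderID,
       ROW_NUMBER() OVER (ORDER BY orderID) AS rowNum,       -- unique sequence
       RANK()       OVER (ORDER BY orderID) AS rankNum,      -- ties share a rank, gaps follow
       DENSE_RANK() OVER (ORDER BY orderID) AS denseRankNum, -- ties share a rank, no gaps
       NTILE(2)     OVER (ORDER BY orderID) AS tileNum       -- two groups of equal size
FROM @t
ORDER BY rowNum;

-- orderID  rowNum  rankNum  denseRankNum  tileNum
-- 4        1       1        1             1
-- 4        2       1        1             1
-- 7        3       3        2             2
-- 11       4       4        3             2
```

Every OVER clause here carries an ORDER BY, because the syntax requires it.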

Why should you worry about the "order by" clause?

Well, as a DBA or database programmer you know that sorting is a fairly expensive operation in terms of time and resources. And if you were forced to use it always, even in a situation where you didn't need it, you could expect degradation of performance, especially in large databases.

Is it possible to avoid sorting in ranking functions? If possible, how would it improve performance?

Let's try to answer these questions.


How to Avoid Sorting in Ranking Functions

Create a sample table (Listing 1):

-- Listing 1. Create a sample table.
CREATE TABLE RankingFunctions(orderID int NOT NULL);
INSERT INTO RankingFunctions VALUES(7);
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(21);
INSERT INTO RankingFunctions VALUES(15);

Run the next query with the ROW_NUMBER() function:

SELECT ROW_NUMBER () OVER (ORDER BY orderID) AS rowNum, orderID
     FROM RankingFunctions;

If you check the execution plan for that query (see Figure 1), you will find that the Sort operator is very expensive: it accounts for 78 percent of the estimated query cost.
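If you are not using the graphical plan in Management Studio, one way to see which operators the optimizer chose is SET SHOWPLAN_TEXT. The sketch below assumes the RankingFunctions table from Listing 1 exists. While the option is on, the query is not executed and only its estimated plan is returned.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO
-- Not executed while SHOWPLAN_TEXT is ON; the plan text is returned instead.
SELECT ROW_NUMBER() OVER (ORDER BY orderID) AS rowNum, orderID
FROM RankingFunctions;
GO
SET SHOWPLAN_TEXT OFF;
GO
```

Look for a Sort operator in the returned plan text. The same wrapper can be put around any of the later queries to check whether the Sort has disappeared.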

Run the same query, leaving the OVER() clause blank:

SELECT ROW_NUMBER () OVER () AS rowNum, orderID
     FROM RankingFunctions;

You will get an error:

Msg 4112, Level 15, State 1, Line 1
The ranking function "row_number" must have an ORDER BY clause.

Since the parser doesn't allow you to avoid the "order by" clause, maybe you can force the query optimizer to stop using the Sort operator. For example, you could create a computed column that consists of a simple integer, 1, and then use that virtual column in the "order by" clause (Listing 2):

-- Listing 2. ORDER BY computed column.
-- Query 1: Using derived table.
SELECT ROW_NUMBER () OVER (ORDER BY const) AS rowNum, orderID
     FROM (SELECT orderID, 1 as const
          FROM RankingFunctions) t1
GO
-- Query 2: Using common table expression (CTE).
WITH OriginalOrder AS
     (SELECT orderID, 1 as const
     FROM RankingFunctions)
SELECT ROW_NUMBER () OVER (ORDER BY const) AS rowNum, orderID
     FROM OriginalOrder;

If you check the execution plans now (see Figure 2), you will find that the query optimizer no longer uses the Sort operator. Both queries generate the row numbers and return the orderID values in their original order.

rowNum  orderID
1       7
2       11
3       4
4       21
5       15

There is a small problem with the queries in Listing 2 — they need time (resources) to create and populate the virtual column. As a result, the performance gains that you achieve by avoiding the sort operation may disappear when you populate the computed column. Is there any other way to skip the sort operation?

Let's try to answer this question.

The "order by" clause accepts expressions. An expression can be a constant, a variable, a column, and so on, and simple expressions can be combined into complex ones.

What if you talk to the query optimizer in the language of expressions? For example, try to use a subquery as an expression:

SELECT ROW_NUMBER () OVER (ORDER BY (SELECT orderID FROM RankingFunctions)) AS rowNum, orderID
     FROM RankingFunctions;

No, you can't bypass the parser. You will get an error:

Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

O-o-o-p-s, here's the hint! The expression (or in our case, the subquery) has to produce a single value.

This should work:

SELECT ROW_NUMBER () OVER (ORDER BY (SELECT MAX(orderID) FROM RankingFunctions)) AS rowNum, orderID
     FROM RankingFunctions;

Bingo! That query is working exactly as you wanted — no Sort operator has been used.

Now you can write an expression in the "order by" clause that returns a single value, forcing the query optimizer to refrain from using a sort operation.

By the way, the solutions in Listing 2 worked because the same integer value was repeated in every row of the computed column, and for that reason it was treated as a single value.

Here are some more examples of expression usage in an "order by" clause (Listing 3):

-- Listing 3. Using an expression in an ORDER BY clause.
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1 FROM sysobjects WHERE 1<>1)) AS rowNum, orderID
     FROM RankingFunctions;
GO
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum, orderID
     FROM RankingFunctions;
GO
DECLARE @i as bit;
SELECT @i = 1;
SELECT ROW_NUMBER () OVER (ORDER BY @i) AS rowNum, orderID
     FROM RankingFunctions;

Figure 3 shows the execution plans for the queries in Listing 3.

RANK(), DENSE_RANK() and NTILE() Functions with Expressions in an ORDER BY Clause


Before we move forward, we should check the correctness of the solutions for the rest of the ranking functions.

Let's create a few duplicates in the RankingFunctions table and start testing the RANK() and DENSE_RANK() functions:

-- Listing 4. RANK() and DENSE_RANK() functions with expressions in an ORDER BY clause.
-- Create duplicates in table RankingFunctions.
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(4);
GO
-- Query 1: (ORDER BY orderID).
SELECT RANK () OVER (ORDER BY orderID) AS rankNum,
          DENSE_RANK () OVER (ORDER BY orderID) AS denseRankNum,
          orderID
     FROM RankingFunctions;
GO
-- Query 2: (ORDER BY expression).
SELECT RANK () OVER (ORDER BY (SELECT 1)) AS rankNum,
          DENSE_RANK () OVER (ORDER BY (SELECT 1)) AS denseRankNum,
          orderID
     FROM RankingFunctions;
GO

If you check the execution plans (see Figure 4), you will find that the first query in Listing 4 requires a lot of resources for sorting. The second query doesn't have a Sort operator. So the queries behave as expected.

However, when you run the queries, the second result will be wrong:

Query 1 retrieves the correct result:

rankNum  denseRankNum  orderID
1        1             4
1        1             4
1        1             4
4        2             7
5        3             11
5        3             11
7        4             15
8        5             21

Query 2 retrieves the wrong result:

rankNum  denseRankNum  orderID
1        1             7
1        1             11
1        1             4
1        1             21
1        1             15
1        1             11
1        1             4
1        1             4

Even though the expressions in the "order by" clause help to skip sorting, they can't be applied to the RANK() and DENSE_RANK() functions. Because the ordering expression evaluates to the same value for every row, all rows tie and each one receives rank 1. These ranking functions must have a sorted input to produce the correct result.

Now let's look at the NTILE() function:

-- Listing 5. NTILE() function with expressions in an ORDER BY clause.
-- Query 1: ORDER BY orderID.
SELECT NTILE(3) OVER (ORDER BY orderID) AS NTileNum, orderID
     FROM RankingFunctions;
GO
-- Query 2: ORDER BY expression.
SELECT NTILE(3) OVER (ORDER BY (SELECT 1)) AS NTileNum, orderID
     FROM RankingFunctions;

Analyzing the execution plans for both queries (see Figure 5), you will find that:

The second query skips sorting, meaning the solution is working.

The results of both queries are correct.

The optimizer is using Nested Loops, which in some situations can be heavy.
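One quick way to convince yourself that NTILE() splits the rows correctly is to count the rows per tile. The sketch below assumes the RankingFunctions table holds the eight rows present after Listing 4. The derived-table alias t is arbitrary.

```sql
-- Count how many rows NTILE(3) assigns to each tile.
SELECT NTileNum, COUNT(*) AS rowsInTile
FROM (SELECT NTILE(3) OVER (ORDER BY orderID) AS NTileNum
      FROM RankingFunctions) AS t
GROUP BY NTileNum
ORDER BY NTileNum;

-- With eight rows and three tiles, the larger groups come first:
-- NTileNum  rowsInTile
-- 1         3
-- 2         3
-- 3         2
```

Replacing ORDER BY orderID with ORDER BY (SELECT 1) inside the OVER clause should give the same counts, since the tile sizes depend only on the number of rows.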

Performance of Ranking Functions

Now that you know how to avoid sorting in ranking functions, you can test their performance.

Let's insert more rows into the RankingFunctions table (Listing 6):

-- Listing 6. Insert more rows into the RankingFunctions table.
IF EXISTS (SELECT * FROM sys.objects
     WHERE object_id = OBJECT_ID(N'[dbo].[RankingFunctions]') AND type in (N'U'))
DROP TABLE RankingFunctions

SET NOCOUNT ON
CREATE TABLE RankingFunctions(orderID int NOT NULL);
INSERT INTO RankingFunctions VALUES(7);
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(21);
INSERT INTO RankingFunctions VALUES(15);

-- Each iteration doubles the number of rows in the table.
DECLARE @i as int, @LoopMax int, @orderIDMax int;
SELECT @i = 1, @LoopMax = 19;
WHILE (@i <= @LoopMax)
BEGIN
     SELECT @orderIDMax = MAX(orderID) FROM RankingFunctions;
     INSERT INTO RankingFunctions(OrderID)
     SELECT OrderID + @orderIDMax FROM RankingFunctions;
     SELECT @i = @i + 1;
END

SELECT COUNT(*) FROM RankingFunctions; -- 2,621,440.

UPDATE RankingFunctions
     SET orderID = orderID/5
     WHERE orderID%5 = 0;

The INSERT and SELECT parts of the INSERT…SELECT statement are using the same RankingFunctions table.

The number of generated rows can be calculated as:

generated rows number = initial rows number * power(2, number of loop iterations)

Since RankingFunctions initially has 5 rows and @LoopMax = 19, the number of generated rows will be:

5 * POWER(2,19) = 2,621,440
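The arithmetic can be checked directly in T-SQL. This one-liner is only a sanity check of the formula above. The column alias generatedRows is arbitrary.

```sql
-- 5 initial rows, doubled 19 times.
SELECT 5 * POWER(2, 19) AS generatedRows;  -- returns 2621440
```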

To increase the entropy in the row order, I updated the orderID values in the rows where orderID is evenly divisible by 5.

Then I tested the INSERT and DELETE commands, using ranking functions with and without sorting (Listing 7 and Listing 8).

-- Listing 7. Performance tests 1 (Inserts, using SELECT ...INTO).
-- Query 1: Using ORDER BY orderID.
IF EXISTS (SELECT * FROM sys.objects
     WHERE object_id = OBJECT_ID(N'[dbo].RankingFunctionsInserts') AND type in (N'U'))
DROP TABLE RankingFunctionsInserts;
GO
SELECT ROW_NUMBER () OVER (ORDER BY OrderID) AS rowNum, OrderID
     INTO RankingFunctionsInserts
     FROM RankingFunctions;

-- Drop table RankingFunctionsInserts and run Query 2.
-- Query 2: Without sorting.
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum, OrderID
     INTO RankingFunctionsInserts
     FROM RankingFunctions;

-- Drop table RankingFunctionsInserts and run Query 3.
-- Query 3: Using a pre-2005 solution.
SELECT IDENTITY(int,1,1) AS rowNum, orderID
     INTO RankingFunctionsInserts
     FROM RankingFunctions;

Each of the three queries in Listing 7 inserts the generated row number and orderID into the RankingFunctionsInserts table, using the SELECT…INTO statement. (This technique is very helpful when you are trying to create pseudo-arrays in SQL.)

For the sake of curiosity, I tested a solution with an IDENTITY column (Query 3). That solution is very common in pre-2005 versions of SQL Server.
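The article does not say how the timings were collected. One simple way to measure a single query, shown here only as an assumed approach, is to capture GETDATE() before and after the statement. The variable @start and the alias elapsedMs are illustrative names.

```sql
-- Remove the target table left over from a previous run, if any.
IF OBJECT_ID(N'dbo.RankingFunctionsInserts', N'U') IS NOT NULL
    DROP TABLE dbo.RankingFunctionsInserts;
GO

DECLARE @start datetime;
SET @start = GETDATE();

-- Query under test (Query 2 from Listing 7, without sorting).
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rowNum, orderID
INTO RankingFunctionsInserts
FROM RankingFunctions;

-- Elapsed wall-clock time in milliseconds.
SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsedMs;
```

The datetime type is only accurate to a few milliseconds, which is adequate for runs that take seconds. SET STATISTICS TIME ON is an alternative that also reports CPU time.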

-- Listing 8. Performance tests 2 (Delete every fifth row in the RankingFunctions table).
-- Query 1: Without sorting.
-- Run the script from Listing 6 to insert 2,621,440 rows into RankingFunctions.
WITH originalOrder AS
     (SELECT ROW_NUMBER ( ) OVER (ORDER BY (SELECT 1)) AS rowNum, OrderID
     FROM RankingFunctions)
DELETE originalOrder WHERE rowNum%5 = 0;

-- Query 2: With ORDER BY OrderID.
-- Run the script from Listing 6 to insert 2,621,440 rows into RankingFunctions.
WITH originalOrder AS
     (SELECT ROW_NUMBER ( ) OVER (ORDER BY OrderID) AS rowNum, OrderID
     FROM RankingFunctions)
DELETE originalOrder WHERE rowNum%5 = 0;

Deleting every Nth row or removing duplicates from a table is a common task for a DBA or database programmer. In Listing 8, I used a CTE to delete every fifth row in the RankingFunctions table.
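The same CTE pattern extends to the duplicate-removal task mentioned above. The following is a sketch rather than one of the tested listings: it numbers the rows within each orderID value using PARTITION BY and deletes everything after the first row of each group. The CTE name numbered is arbitrary.

```sql
-- Keep one row per orderID value and delete the remaining duplicates.
WITH numbered AS
    (SELECT ROW_NUMBER() OVER (PARTITION BY orderID
                               ORDER BY orderID) AS rowNum,
            orderID
     FROM RankingFunctions)
DELETE numbered
WHERE rowNum > 1;
```

Because the rows are grouped by orderID, expect the optimizer to need ordered input for this query unless a suitable index exists, so the no-sort trick does not carry over directly.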

Test Results

Here are the results that I got on a regular Pentium 4 desktop computer with 512 MB RAM running Windows 2000 Server and Microsoft SQL Server 2005 Developer Edition:

                        ROW_NUMBER()  RANK()   DENSE_RANK()  NTILE(3)
INSERT 2,621,440 rows
  without sorting       5 sec.        N/A      N/A           35 sec.
  with sorting          14 sec.       14 sec.  14 sec.       40 sec.
  with IDENTITY         8 sec.        N/A      N/A           N/A
DELETE each 5th row
  without sorting       5 sec.
  with sorting          24 sec.

As you can see, the ROW_NUMBER() function works much faster without sorting. It also performs better than the IDENTITY solution, which is unsorted as well.

The RANK() and DENSE_RANK() functions, as we found earlier, don't work properly without sorting. NTILE() shows a very small improvement, about 10 percent. This can be explained.

As I mentioned earlier, the optimizer is using Nested Loops to implement the NTILE() function. For large data sets, without the indexes (as in our case), Nested Loops can be very inefficient. However, you will find that they are inexpensive in the execution plan (see Figure 6), because sorting helps to make Nested Loops lighter.

When sorting is missing (see Figure 7), the Nested Loops become much heavier and almost "eat" the performance gains that you achieve by avoiding sorting.

How Indexes Can Help

As you know, all the pages of non-clustered indexes, and the intermediate-level pages of clustered indexes, are linked together and sorted in key sequence order. The leaf-level of a clustered index consists of data pages that are physically sorted in the same key sequence order as the clustered index key. All that means is that you already store some part(s) of your table's data in a particular order. If your query can use that sorted data (and this is what happens when you have a covering index), you will increase the performance of your query dramatically.

Take any table with many columns and rows (or create and populate one using the technique from Listing 6). Then create different indexes and test the ranking functions. You will find that for covered queries the optimizer won't use a Sort operator. This is what makes the ranking function as fast as, or even faster than, the functions with an expression in an "order by" clause.
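As a starting point for that experiment, the sketch below adds a clustered index on the sample table and reruns the sorted ROW_NUMBER() query. The index name IX_RankingFunctions_orderID is made up for illustration, and the plan you get may vary with your data and server settings.

```sql
-- Store the rows in orderID order (hypothetical index name).
CREATE CLUSTERED INDEX IX_RankingFunctions_orderID
    ON RankingFunctions(orderID);
GO

-- The ORDER BY column matches the index key, so the optimizer can read
-- the rows in key order instead of sorting them.
SELECT ROW_NUMBER() OVER (ORDER BY orderID) AS rowNum, orderID
FROM RankingFunctions;
```

Check the execution plan after creating the index: if the query is covered, you would expect an ordered index scan and no Sort operator.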

Conclusion

This article explains ranking functions and helps you understand how they work. The techniques shown here, in some situations, can increase the performance of ranking functions 3-5 times. In addition, this article discusses some common trends in the behavior of an ORDER BY clause with expressions.