Wednesday, November 14, 2007

Some T-SQL code in 2005 and 2000.

I have sometimes been fazed by CTE (Common Table Expressions) and PIVOT in SQL Server 2005. I like the way they are structured but hate the way they perform. I tried comparing them with our old-style drived tables. And to my surprise, code written using derived tables performed much better that the code written using CTE or PIVOT or both combined. Let me share some code here using the SQL Server 2005 Adventureworks database.

--Shows Amount per customer:

WITH Value_from_Customer (CustomerID, SalesOrderID, Total_Value_of_Sale)
AS
(SELECT SOH.CustomerID, SOD.SalesOrderID, sum(SOD.LineTotal)
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOH.CustomerID, SOD.SalesOrderID
)
SELECT CustomerID
,count(SalesOrderID) as Number_of_Sales
,sum(Total_Value_of_Sale) Value_from_Customer
FROM Value_from_Customer
Group by CustomerID
ORDER BY CustomerID
GO

--The same would have been done in SQL Server 2000 as follows:

SELECT CustomerID
,count(SalesOrderID) as Number_of_Sales
,sum(Total_Value_of_Sale) as Value_from_Customer
from
(SELECT SOH.CustomerID, SOD.SalesOrderID,
sum (SOD.LineTotal) as Total_Value_of_Sale
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOH.CustomerID, SOD.SalesOrderID) abc
Group by CustomerID
ORDER BY CustomerID
GO

The CTE above looks very structured and systematic. The equivalent query with derived table looks more like a block of buildings. But I have found in all my tests that derived tables always outperform CTE's. In fact, I can share a real story about it. One of my colleagues had a terabyte-plus sized database for a data warehouse. Somebody wrote a 200 line CTE that would fetch a few rows from a combination of several multi-million row tables. It gave the results but after executing for eight hours. It is hard to withstand eight hours when you know that you need just a few rows at the end. I suggested that CTE's in the query be replaced by regular derived tables. It was a big task but when it was done, the query gave out the same results in 55 seconds. That's the kind of difference I am talking about. After that some more improvements were done to make it finally run in 10 seconds. However, we would not go into the details of that story here but would stick to differences in performance between new and old T-SQL constructs.

My sample queries for 2005 and 2000 above would show only marginal difference in execution times becuase the number of rows in the referenced tables is very small. When the number of rows are increased in the underlying tables, the difference in execution times keeps on rising with the 2000 style code outperforming the 2005 CTE code.

Now let's take up a PIVOT construct.

---Shows Sale per year by territory. PIVOT.

SELECT YEAR_of_SALE
, [1] AS Trtry_1
, [2] AS Trtry_2
, [3] AS Trtry_3
, [4] AS Trtry_4
, [5] AS Trtry_5
, [6] AS Trtry_6
FROM
(SELECT TerritoryID, TotalDue, year (OrderDate) AS YEAR_of_SALE
FROM Sales.SalesOrderHeader) SOH
PIVOT (SUM (TotalDue) FOR TerritoryID IN ( [1], [2], [3], [4], [5], [6] )) AS pvt
ORDER BY YEAR_of_SALE
GO


--The same would have been done in SQL 2000 as follows:

select YEAR_of_SALE
,sum(Trtry_1) as Trtry_1
,sum(Trtry_2) as Trtry_2
,sum(Trtry_3) as Trtry_3
,sum(Trtry_4) as Trtry_4
,sum(Trtry_5) as Trtry_5
,sum(Trtry_6) as Trtry_6
from
(select YEAR_of_SALE
,isnull((select Total_Revenue where TerritoryID = 1), 0) as Trtry_1
,isnull((select Total_Revenue where TerritoryID = 2), 0) as Trtry_2
,isnull((select Total_Revenue where TerritoryID = 3), 0) as Trtry_3
,isnull((select Total_Revenue where TerritoryID = 4), 0) as Trtry_4
,isnull((select Total_Revenue where TerritoryID = 5), 0) as Trtry_5
,isnull((select Total_Revenue where TerritoryID = 6), 0) as Trtry_6

from
(SELECT TerritoryID
,year (OrderDate) AS YEAR_of_SALE
,sum(TotalDue) Total_Revenue
FROM Sales.SalesOrderHeader
where TerritoryID between 1 and 6
GROUP BY TerritoryID, year (OrderDate)) abc) def

GROUP BY YEAR_of_SALE
ORDER BY YEAR_of_SALE
GO

Now as you can see, the 2005 query with PIVOT looks simple and neat. The equivalent query performing the same functionality but written using derived tables looks tedious. However, the second query with derived tables outperforms the first one with PIVOT. I agree that the first query with PIVOT requires much less coding effort and looks much less complex than the second one using derived tables. But what are clients interested in? Coding complexity or performance. It is always performance that everybody is interested in. When a customer clicks on a link, he/she expects results fast and doesn't really care what code is executing in the background. So we have to use code which performs better.

Now lets take another example where I combine a CTE with a PIVOT structure. Please note that below the 2005 style query with CTE and PIVOT combined, I have an equivalent for 2000 using derived tables:

--2005.
--Find number of physical sales by year and their total value.
--Break down the yearly value by territory upto six territories.

WITH Value_of_Sale (SalesOrderID, Transaction_Date, Total_Value_of_Sale)
AS
(SELECT SOD.SalesOrderID, SOH.ModifiedDate, sum(SOD.LineTotal)
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOD.SalesOrderID, SOH.ModifiedDate
)
SELECT count(a.SalesOrderID) Total_Number_of_Sales
,b.YEAR_of_SALE
,sum(a.Total_Value_of_Sale) Total_Value_of_Sale
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
FROM Value_of_Sale a
inner join
(SELECT YEAR_of_SALE
,[1] AS Trtry_1
,[2] AS Trtry_2
,[3] AS Trtry_3
,[4] AS Trtry_4
,[5] AS Trtry_5
,[6] AS Trtry_6
FROM (SELECT TerritoryID, TotalDue, year (OrderDate) AS YEAR_of_SALE
FROM
Sales.SalesOrderHeader) SOH
PIVOT (SUM (TotalDue) FOR TerritoryID IN ( [1], [2], [3], [4], [5], [6] ) ) AS pvt ) b

on year(a.Transaction_Date) = b.YEAR_of_SALE
Group By b.YEAR_of_SALE
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
ORDER BY b.YEAR_of_SALE Asc
GO

--The same would have been done in SQL Server 2000 as follows:
---2000.

select count(a.SalesOrderID) Number_of_Sales
,b.YEAR_of_SALE
,sum(a.Total_Value_of_Sale) Total_Value_of_Sale
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
from

(SELECT SOD.SalesOrderID
,year(SOH.ModifiedDate) AS Transaction_Date
,sum(SOD.LineTotal) AS Total_Value_of_Sale
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOD.SalesOrderID, year(SOH.ModifiedDate)
) a
INNER JOIN
(select YEAR_of_SALE
,sum(Trtry_1) as Trtry_1
,sum(Trtry_2) as Trtry_2
,sum(Trtry_3) as Trtry_3
,sum(Trtry_4) as Trtry_4
,sum(Trtry_5) as Trtry_5
,sum(Trtry_6) as Trtry_6
from
(select YEAR_of_SALE
,isnull((select Total_Revenue where TerritoryID = 1), 0) as Trtry_1
,isnull((select Total_Revenue where TerritoryID = 2), 0) as Trtry_2
,isnull((select Total_Revenue where TerritoryID = 3), 0) as Trtry_3
,isnull((select Total_Revenue where TerritoryID = 4), 0) as Trtry_4
,isnull((select Total_Revenue where TerritoryID = 5), 0) as Trtry_5
,isnull((select Total_Revenue where TerritoryID = 6), 0) as Trtry_6
from
(SELECT TerritoryID
,year (OrderDate) AS YEAR_of_SALE
,sum(TotalDue) Total_Revenue
FROM Sales.SalesOrderHeader
where TerritoryID between 1 and 6
GROUP BY TerritoryID, year (OrderDate)) abc) def
GROUP BY YEAR_of_SALE) b

on a.Transaction_Date = b.YEAR_of_SALE
Group By b.YEAR_of_SALE
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
ORDER BY b.YEAR_of_SALE Asc
GO

The 2000 code with derived tables involves much more coding and code complexity than the equivalent one in 2005. But the 2000 code outperforms the 2005 code everytime. When I incease the number of rows in the underlying referenced tables within the two queries, the different in execution times gets wider.

I am not denying SQL Server 2005 credit that it deserves. But I still love the SQL Server 2000 coding style becuase of its supeior performance. Please note that all the code was run on SQL Server 2005 Adventureworks database in 90 compatibility mode.

However, there are a few coding improvements that I love in SQL Server 2005. These are the NTILE and ROW_NUMBER() OVER constructs. The reason why I like them is that not only are they short and simple, they also perform better than their equivalnt coding constructs in SQL Server 2000. Let me share some examples below.

Let me create a situation first. An Engineering College conducts nationwide enterance exams. Thousands of students appear in the tests. Let's say only the top 12 scorers qualify for admission. The college first selects the top 12 scorers in the exam. The college then needs to rank these top 12 students based on their total score first. If for any two or more of the top 12 students the total scores match, then the one with the higher score in Mathematics is ranked higher. If total scores and score in Maths are also same, then the one with higher score in Physics is ranked higher. If the total score and individual scores in Maths, Physics and Chemistry also match, then the student with the higher age is to be ranked higher as per the College policy. Once ranks have bene allotted to the top 12 students, they are to be invited in groups/batches of four to permit them to elect their field of interest (Computer Engineering, Electronics Engineering, Electrical Engineering, Mechanical Engineering, Chemical Engineering, Civil Engineering, etc). Each group is based on the ranks of the students and needs to contain exactly four students.

Taking these requirements into consideration, lets design a small table with the top twleve scorers as follows:

--Create the table.
create table Student_Ranks
(Student_Name varchar(2)
,Maths smallint
,Physics smallint
,Chemistry smallint
,Total_Marks AS Maths + Physics + Chemistry
,Age tinyint)
GO

--Populate the table with marks.
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('A', 90, 50, 70, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('B', 70, 60, 90, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('C', 50, 70, 90, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('D', 50, 90, 70, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('E', 100, 90, 45, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('F', 90, 50, 70, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('G', 45, 80, 60, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('H', 52, 49, 35, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('I', 75, 35, 45, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('J', 35, 35, 25, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('K', 10, 45, 35, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('L', 5, 35, 56, 17)
GO

select * from Student_Ranks
GO

--SQL Server 2005 code for ranking students.
--Ranks students by Total Marks. If equal, then checks for Maths, then Physics, then
--Chemistry. If all equal, then ranks the older one up.

select Row_Number() Over (Order by Total_Marks desc
,Maths desc
,Physics desc
,Chemistry desc
,Age Desc)
as Student_Rank, Student_Name
from
Student_Ranks
GO

--The code otherwise in SQL 2000 that would do the ranking.
--Identity allowed only with SELECT INTO.
--To gaurentee order, Top 100 Percent, Maxdop 1 and Order By are required.

select top 100 percent Student_Name
,Student_Rank = identity(int, 1, 1)
into #Temp_Student_Ranks
from Student_Ranks
Order by Total_Marks desc, Maths desc, Physics desc, Chemistry desc, Age Desc
option (maxdop 1)

select * from #Temp_Student_Ranks
GO
--Clean up.
Drop table #Temp_Student_Ranks
GO

--The college has to call candidates for disciple allotment in batches of 4.
--Who is in which batch/group as per their rank? The first group is called first, the second next
--and so on.

select NTILE(3) Over(Order by Total_Marks desc
,Maths desc
,Physics desc
,Chemistry desc
,Age Desc)
as Batch, Student_Name from Student_Ranks
GO

--In 2000, the same grouping of batches would have been done in the following way:
select top 100 percent Student_Name
,Student_Rank = identity(int, 1, 1)
,0 as Batch
into #Temp_Student_Ranks
from Student_Ranks
Order by Total_Marks desc, Maths desc, Physics desc, Chemistry desc, Age Desc
option (maxdop 1)
GO

update #Temp_Student_Ranks
set Batch = case
when Student_Rank between 1 and 4 then 1
when Student_Rank between 5 and 8 then 2
when Student_Rank between 9 and 12 then 3
else 4
end

select Student_Name, Batch from #Temp_Student_Ranks
GO
--Clean up.
Drop table #Temp_Student_Ranks
GO

You may observe, how easy it is in 2005 to rank and group students using ROW_NUMBER OVER() and NTILE() respectively. The same holds good for other items as far as ranking and grouping is concerned and as far as I have tried. And it performs better with SQL Server 2005 code than with SQL Server 2000 style code.

In my next article, I will come up with some tricky T-SQL code. Please let me know your opinions for scope of improvement.

No comments: