Tuesday, December 11, 2007

MERGE statement in SQL Server 2008

Over the last few days, I have been spending some time studying the new features in the Katmai (SQL Server 2008) betas. One of the coding additions I like most in Microsoft SQL Server 2008 is the MERGE statement. I like the way it provides the option to UPDATE, DELETE and INSERT data simultaneously in a single T-SQL statement.

Using it for simple operations is a costly affair compared to equivalent T-SQL code in 2000/2005. However, using it for complex operations is simple and efficient compared to the equivalent T-SQL code. This is the paradox. I find most prolific coders prefer to write compact code, even if it is very complex, rather than lengthy but simple code. The reason most often given is that the shorter the code, the faster it performs in most cases. Here is where the paradox associated with the MERGE statement comes in handy.

Let me create a simple situation first:

--Create a small table.
create table abc
(productid int identity (1,1)
,Quantity decimal(4,0) default 100
,Order_Date datetime default getdate())

--Populate the table, abc.

insert into abc default values
insert into abc (Quantity) values (125)
insert into abc (Quantity) values (150)
insert into abc (Quantity) values (175)
insert into abc (Quantity) values (200)
insert into abc (Quantity) values (225)
insert into abc (Quantity) values (250)
insert into abc (Quantity) values (275)
insert into abc (Quantity) values (300)
insert into abc (Quantity) values (325)
insert into abc (Quantity) values (350)
insert into abc (Quantity) values (375)

--Check the contents of the table, abc.
select * from abc


--Add a nullable column to the table, abc. This one column is to be populated (updated) later.
alter table abc
add OrderID int null


--Now the MERGE construct updates the OrderID column of the table abc as follows.
--Divide the ProductID by 2. If remainder comes out to be zero, then update
--OrderID with 2. If remainder comes out to a non-zero value, then update
--OrderID as ProductID minus the Quotient.
--The code is given below. Note that the variable @l_OrderID_value used inside
--the MERGE must be declared and set first:

declare @l_OrderID_value int
set @l_OrderID_value = 2

merge abc as master_table
using (select ProductId, ProductId / @l_OrderID_value, ProductId % @l_OrderID_value
from abc) as input (ProductId, Quotient, Remainder)
ON master_Table.ProductId = Input.ProductID
When Matched
THEN UPDATE set master_table.OrderID = case
when input.Remainder = 0 then @l_OrderID_value
else master_Table.ProductId - Input.Quotient
end;

It was very structured but long. This was a simple operation and didn't need this much code.

--Now let's back up the data in the table abc.

select * into def from abc

Now, let's write the equivalent T-SQL code as a replacement of the MERGE statement above.

declare @l_OrderID_value int
set @l_OrderID_value = 2

update abc
set OrderID = case
when ProductID%@l_OrderID_value = 0 then @l_OrderID_value
else ProductId - (ProductID/@l_OrderID_value)
end;

This was short and simple; just one UPDATE statement was required to do the entire operation.

If you look at the query cost relative to the batch when you run both statements together and capture the actual execution plans, the MERGE costs 76% while the single UPDATE costs 24%. The MERGE statement does a SELECT from the table abc, then uses a derived table, and then does the UPDATE. The second statement above is a direct UPDATE, so it is expected to be faster. So in this example, MERGE was not really useful. Since the above example was a simple case, we won't bother ourselves much with MERGE in simple scenarios.
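If you would like to reproduce the comparison yourself rather than rely on the plan percentages alone, a minimal sketch is to switch on the STATISTICS settings and then run the two batches one after the other:

--Switch on per-statement IO and timing figures before running the tests.
set statistics io on
set statistics time on
GO

--Run the MERGE batch here, then the single UPDATE batch, and compare the
--logical reads and the CPU/elapsed times reported for each in the Messages tab.

set statistics io off
set statistics time off
GO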

Now, let's take up something slightly more involved where MERGE can be really useful.

--Create a new table.
create table abc_Summary
(OrderID int
,Order_Date datetime
,Quantity int)

--It is empty.
select * from abc_Summary


MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
When NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity;

--Verify that the empty table has gotten populated with aggregated data from the main table, abc.
select * from abc_Summary

--So this time the INSERT took place. Now let's UPDATE the column abc_Summary.Quantity to zero. And then we will run the above code again.

update abc_summary
set Quantity = 0
GO

--Verify that Quantity is 0.
select * from abc_Summary
GO

--Run the same code again:
MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
When NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity;

--Verify that the Quantity column in the abc_Summary table is updated now with actual summed quantities per OrderID from the parent table, abc.

select * from abc_Summary


Let's do something different now, a bit more involved. Modify the table abc_Summary so that OrderIDs 1, 2 and 3 are removed and Quantity is set to zero for the remaining three rows. Also add one dirty row to the abc_Summary table. Note that the table abc_Summary should contain only those OrderIDs that already exist in the parent table, abc. Since the newly inserted OrderID in the dirty row does not exist in the parent table, it should be deleted from abc_Summary. So in all, what we are trying to do here is the following:
INSERT missing rows,
UPDATE Quantity for the existing three rows, and
DELETE one dirty row.

All this is to be accomplished in one query.
Let's prepare the data now.

--Preparing data.
update abc_summary
set Quantity = 0
GO
delete from abc_summary
where OrderID in (1, 2, 3)
GO
insert into abc_summary (OrderID, Order_Date, Quantity)
values (2001, getdate(), 50000)
go

--Check the data. OrderIDs 4, 5 and 6 have their Quantity zeroed out. OrderIDs 1, 2 and 3 no longer exist. A new OrderID 2001 has come in, but it doesn't exist in the parent table, abc.

select * from abc_Summary

--Now let's get the correct data into the table abc_summary in one single T-SQL query using the MERGE statement, which issues INSERT, UPDATE and DELETE simultaneously.

MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
When NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity
WHEN NOT MATCHED BY SOURCE THEN
DELETE;

--Check the data. OrderIDs 4, 5 and 6 have their Quantity UPDATEd correctly. Rows for OrderIDs 1, 2 and 3 have been INSERTed now. The dirty row for OrderID 2001 has been DELETEd as it didn't exist in the parent table, abc.

select * from abc_Summary

So this time the INSERT took place and brought in the three missing OrderIDs (1, 2, 3) from
the table abc into the table abc_Summary. Then it UPDATEd the abc_Summary.Quantity
column for OrderIDs 4, 5 and 6 from zero to the correct aggregated quantity from the
parent table, abc. Lastly, it also DELETEd the erroneous row with OrderID 2001 because this
OrderID did not exist in the parent table, abc. All three operations were
performed in one single query.
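As an aside, SQL Server 2008 also lets a MERGE statement report what it did to each row through an OUTPUT clause and the $action function. The statement above did not use it, but a sketch of the same MERGE with the clause added would look like this:

MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
WHEN NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity
WHEN NOT MATCHED BY SOURCE THEN
DELETE
--$action returns 'INSERT', 'UPDATE' or 'DELETE' for every affected row.
OUTPUT $action, inserted.OrderID, deleted.OrderID;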

All this can be done without the MERGE statement as well, but the code would be very lengthy and expensive. It would contain several IF-ELSE constructs, IF [NOT] EXISTS clauses, and multiple references to the tables abc and abc_Summary, apart from the INSERT, UPDATE and DELETE statements themselves. We don't want to do that, do we? The cost of that approach would be much higher than that of the MERGE statement above. Let's verify this. I will try to build as short a replacement for the MERGE statement as possible. Please see the replacement code below:


--Prepare the data first.
update abc_summary
set Quantity = 0
GO
delete from abc_summary
where OrderID in (1, 2, 3)
GO
insert into abc_summary (OrderID, Order_Date, Quantity)
values (2001, getdate(), 50000)
go

--Check the data.
select * from abc_Summary

--Now let's get the correct data into the table abc_summary.
--UPDATE Quantity to the correct aggregated value.
update abc_summary
set abc_summary.Quantity = (select sum(abc.Quantity) from abc
where abc_Summary.OrderID = abc.OrderID)
where exists (select OrderID from abc
where abc_Summary.OrderID = abc.OrderID)

--Check the data. Quantity has been updated correctly.
select * from abc_Summary

--INSERT missing ORDERID's 1,2 and 3.
insert into abc_summary (OrderID, Order_Date, Quantity)
select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
where OrderID not in (select OrderID from abc_Summary)
group by OrderID, convert(varchar, Order_Date, 112)

--Check the data. Missing ORDERID's have been re-inserted correctly with the right Quantity.
select * from abc_Summary

--DELETE dirty row.
delete from abc_Summary
where OrderID not in (select OrderID from abc)

--Check the data. The row with ORDERID 2001 has been deleted as required.
select * from abc_Summary

So multiple joins between the parent table abc and the child table abc_Summary had to be used.
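One more point worth noting: the MERGE runs as a single atomic statement, while the replacement is three independent statements. To make the replacement behave equivalently when something fails midway, it would also need an explicit transaction around it, along these lines:

begin try
begin transaction
--The UPDATE, INSERT and DELETE statements from above go here, in that order.
commit transaction
end try
begin catch
--Undo all three statements if any one of them fails.
if @@trancount > 0 rollback transaction
end catch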

The execution plans showed a 40% cost for the MERGE statement and a combined 60% cost for the replacement statements above. And here we are talking about just a few rows of data; the benefits rise as the number of rows increases. You may want to try it out.
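For instance, here is a quick sketch to inflate the row count in abc before re-running the comparison (each pass doubles the table; re-run the OrderID update afterwards so the new rows are not left NULL):

--Double the rows in abc ten times, taking it from 12 rows to over 12,000.
--ProductID is an identity, so fresh IDs are generated automatically, and
--Order_Date picks up its default.
declare @i int
set @i = 0
while @i < 10
begin
insert into abc (Quantity)
select Quantity from abc
set @i = @i + 1
end
GO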

So, in all, the MERGE statement in Microsoft SQL Server 2008 is really good. It reduces code, does multiple operations in a single statement, and is less expensive and more efficient when we need to perform somewhat complex operations. Please note that the syntax for the INSERT, UPDATE and DELETE constructs within the MERGE statement is slightly different from that of their standalone counterparts. The MERGE statement opens up a lot of possibilities. Imagine how interesting it would be to use it within an AFTER trigger to perform UPDATE, DELETE and INSERT operations on another table in one shot. It is possible to do this without the MERGE statement as well, but it gets very lengthy and tedious, as I found when I had to implement it at one of my clients for certain reasons. MERGE can make it compact, simple and efficient.
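Just to illustrate the idea, here is a sketch of my own (the trigger name is made up, and this is not code from any client engagement) that keeps abc_Summary synchronized with abc through a single MERGE inside an AFTER trigger. For simplicity it re-aggregates the whole table on every change:

create trigger trg_abc_sync_summary
on abc
after insert, update, delete
as
begin
set nocount on
--Re-aggregate from abc and push the changes into abc_Summary in one shot.
MERGE abc_Summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity)
from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
WHEN NOT MATCHED THEN
INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
end
GO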

In my next article, we will solve some puzzles.

Wednesday, November 14, 2007

Some T-SQL code in 2005 and 2000.

I have sometimes been fazed by CTEs (Common Table Expressions) and PIVOT in SQL Server 2005. I like the way they are structured but hate the way they perform. I tried comparing them with our old-style derived tables, and to my surprise, code written using derived tables performed much better than code written using CTE or PIVOT or both combined. Let me share some code here using the SQL Server 2005 AdventureWorks database.

--Shows Amount per customer:

WITH Value_from_Customer (CustomerID, SalesOrderID, Total_Value_of_Sale)
AS
(SELECT SOH.CustomerID, SOD.SalesOrderID, sum(SOD.LineTotal)
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOH.CustomerID, SOD.SalesOrderID
)
SELECT CustomerID
,count(SalesOrderID) as Number_of_Sales
,sum(Total_Value_of_Sale) Value_from_Customer
FROM Value_from_Customer
Group by CustomerID
ORDER BY CustomerID
GO

--The same would have been done in SQL Server 2000 as follows:

SELECT CustomerID
,count(SalesOrderID) as Number_of_Sales
,sum(Total_Value_of_Sale) as Value_from_Customer
from
(SELECT SOH.CustomerID, SOD.SalesOrderID,
sum (SOD.LineTotal) as Total_Value_of_Sale
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOH.CustomerID, SOD.SalesOrderID) abc
Group by CustomerID
ORDER BY CustomerID
GO

The CTE above looks very structured and systematic. The equivalent query with a derived table looks more like a block of buildings. But I have found in all my tests that derived tables consistently outperform CTEs. In fact, I can share a real story about it. One of my colleagues had a terabyte-plus data warehouse database. Somebody wrote a 200-line CTE query that would fetch a few rows from a combination of several multi-million-row tables. It gave the results, but only after executing for eight hours. Eight hours is hard to stomach when you know you need just a few rows at the end. I suggested that the CTEs in the query be replaced by regular derived tables. It was a big task, but when it was done, the query gave the same results in 55 seconds. That's the kind of difference I am talking about. After that, some more improvements were made to bring it down to 10 seconds. However, we won't go into the details of that story here; we will stick to the differences in performance between the new and old T-SQL constructs.

My sample queries for 2005 and 2000 above show only a marginal difference in execution times because the number of rows in the referenced tables is very small. When the number of rows in the underlying tables is increased, the difference in execution times keeps rising, with the 2000-style code outperforming the 2005 CTE code.
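If you want to observe this yourself, one way (the table name SalesOrderDetail_Big is just an example) is to build an inflated copy of the detail table and point both versions of the query at it in place of Sales.SalesOrderDetail:

--Create a copy of the detail table with ten times the rows by cross joining
--to a ten-row derived table. Constraints, indexes and the identity property
--are not copied by SELECT INTO here.
select SOD.*
into SalesOrderDetail_Big
from Sales.SalesOrderDetail as SOD
cross join (select 1 as n union all select 2 union all select 3 union all
select 4 union all select 5 union all select 6 union all select 7
union all select 8 union all select 9 union all select 10) as Multiplier
GO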

Now let's take up a PIVOT construct.

---Shows Sale per year by territory. PIVOT.

SELECT YEAR_of_SALE
, [1] AS Trtry_1
, [2] AS Trtry_2
, [3] AS Trtry_3
, [4] AS Trtry_4
, [5] AS Trtry_5
, [6] AS Trtry_6
FROM
(SELECT TerritoryID, TotalDue, year (OrderDate) AS YEAR_of_SALE
FROM Sales.SalesOrderHeader) SOH
PIVOT (SUM (TotalDue) FOR TerritoryID IN ( [1], [2], [3], [4], [5], [6] )) AS pvt
ORDER BY YEAR_of_SALE
GO


--The same would have been done in SQL 2000 as follows:

select YEAR_of_SALE
,sum(Trtry_1) as Trtry_1
,sum(Trtry_2) as Trtry_2
,sum(Trtry_3) as Trtry_3
,sum(Trtry_4) as Trtry_4
,sum(Trtry_5) as Trtry_5
,sum(Trtry_6) as Trtry_6
from
(select YEAR_of_SALE
,isnull((select Total_Revenue where TerritoryID = 1), 0) as Trtry_1
,isnull((select Total_Revenue where TerritoryID = 2), 0) as Trtry_2
,isnull((select Total_Revenue where TerritoryID = 3), 0) as Trtry_3
,isnull((select Total_Revenue where TerritoryID = 4), 0) as Trtry_4
,isnull((select Total_Revenue where TerritoryID = 5), 0) as Trtry_5
,isnull((select Total_Revenue where TerritoryID = 6), 0) as Trtry_6

from
(SELECT TerritoryID
,year (OrderDate) AS YEAR_of_SALE
,sum(TotalDue) Total_Revenue
FROM Sales.SalesOrderHeader
where TerritoryID between 1 and 6
GROUP BY TerritoryID, year (OrderDate)) abc) def

GROUP BY YEAR_of_SALE
ORDER BY YEAR_of_SALE
GO

Now as you can see, the 2005 query with PIVOT looks simple and neat. The equivalent query performing the same function, written using derived tables, looks tedious. However, the second query with derived tables outperforms the first one with PIVOT. I agree that the PIVOT query requires much less coding effort and looks much less complex than the derived-table version. But what are clients interested in: coding complexity or performance? It is always performance that everybody is interested in. When a customer clicks on a link, he or she expects results fast and doesn't really care what code is executing in the background. So we have to use the code that performs better.

Now let's take another example, where I combine a CTE with a PIVOT structure. Below the 2005-style query with CTE and PIVOT combined, I have placed an equivalent for 2000 using derived tables:

--2005.
--Find number of physical sales by year and their total value.
--Break down the yearly value by territory, up to six territories.

WITH Value_of_Sale (SalesOrderID, Transaction_Date, Total_Value_of_Sale)
AS
(SELECT SOD.SalesOrderID, SOH.ModifiedDate, sum(SOD.LineTotal)
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOD.SalesOrderID, SOH.ModifiedDate
)
SELECT count(a.SalesOrderID) Total_Number_of_Sales
,b.YEAR_of_SALE
,sum(a.Total_Value_of_Sale) Total_Value_of_Sale
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
FROM Value_of_Sale a
inner join
(SELECT YEAR_of_SALE
,[1] AS Trtry_1
,[2] AS Trtry_2
,[3] AS Trtry_3
,[4] AS Trtry_4
,[5] AS Trtry_5
,[6] AS Trtry_6
FROM (SELECT TerritoryID, TotalDue, year (OrderDate) AS YEAR_of_SALE
FROM
Sales.SalesOrderHeader) SOH
PIVOT (SUM (TotalDue) FOR TerritoryID IN ( [1], [2], [3], [4], [5], [6] ) ) AS pvt ) b

on year(a.Transaction_Date) = b.YEAR_of_SALE
Group By b.YEAR_of_SALE
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
ORDER BY b.YEAR_of_SALE Asc
GO

--The same would have been done in SQL Server 2000 as follows:
---2000.

select count(a.SalesOrderID) Number_of_Sales
,b.YEAR_of_SALE
,sum(a.Total_Value_of_Sale) Total_Value_of_Sale
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
from

(SELECT SOD.SalesOrderID
,year(SOH.ModifiedDate) AS Transaction_Date
,sum(SOD.LineTotal) AS Total_Value_of_Sale
FROM Sales.SalesOrderDetail AS SOD INNER JOIN Sales.SalesOrderHeader SOH
ON (SOH.SalesOrderID = SOD.SalesOrderID)
GROUP BY SOD.SalesOrderID, year(SOH.ModifiedDate)
) a
INNER JOIN
(select YEAR_of_SALE
,sum(Trtry_1) as Trtry_1
,sum(Trtry_2) as Trtry_2
,sum(Trtry_3) as Trtry_3
,sum(Trtry_4) as Trtry_4
,sum(Trtry_5) as Trtry_5
,sum(Trtry_6) as Trtry_6
from
(select YEAR_of_SALE
,isnull((select Total_Revenue where TerritoryID = 1), 0) as Trtry_1
,isnull((select Total_Revenue where TerritoryID = 2), 0) as Trtry_2
,isnull((select Total_Revenue where TerritoryID = 3), 0) as Trtry_3
,isnull((select Total_Revenue where TerritoryID = 4), 0) as Trtry_4
,isnull((select Total_Revenue where TerritoryID = 5), 0) as Trtry_5
,isnull((select Total_Revenue where TerritoryID = 6), 0) as Trtry_6
from
(SELECT TerritoryID
,year (OrderDate) AS YEAR_of_SALE
,sum(TotalDue) Total_Revenue
FROM Sales.SalesOrderHeader
where TerritoryID between 1 and 6
GROUP BY TerritoryID, year (OrderDate)) abc) def
GROUP BY YEAR_of_SALE) b

on a.Transaction_Date = b.YEAR_of_SALE
Group By b.YEAR_of_SALE
,b.Trtry_1
,b.Trtry_2
,b.Trtry_3
,b.Trtry_4
,b.Trtry_5
,b.Trtry_6
ORDER BY b.YEAR_of_SALE Asc
GO

The 2000 code with derived tables involves much more coding and complexity than the equivalent 2005 code. But the 2000 code outperforms the 2005 code every time. When I increase the number of rows in the underlying tables referenced by the two queries, the difference in execution times gets wider.

I am not denying SQL Server 2005 the credit it deserves. But I still love the SQL Server 2000 coding style because of its superior performance. Please note that all the code was run on the SQL Server 2005 AdventureWorks database in 90 compatibility mode.

However, there are a few coding improvements that I love in SQL Server 2005: the NTILE() and ROW_NUMBER() OVER constructs. The reason I like them is that not only are they short and simple, they also perform better than their equivalent coding constructs in SQL Server 2000. Let me share some examples below.

Let me set up a situation first. An engineering college conducts nationwide entrance exams. Thousands of students appear for the tests. Let's say only the top 12 scorers qualify for admission. The college first selects the top 12 scorers in the exam. It then needs to rank these top 12 students, by total score first. If the total scores of any two or more students match, the one with the higher score in Mathematics is ranked higher. If the total scores and the Maths scores are also the same, the one with the higher score in Physics is ranked higher, and likewise next for Chemistry. If the total score and the individual scores in Maths, Physics and Chemistry all match, the older student is ranked higher, as per college policy. Once ranks have been allotted to the top 12 students, they are to be invited in groups/batches of four to choose their field of interest (Computer Engineering, Electronics Engineering, Electrical Engineering, Mechanical Engineering, Chemical Engineering, Civil Engineering, etc.). Each group is based on the ranks of the students and must contain exactly four students.

Taking these requirements into consideration, let's design a small table with the top twelve scorers as follows:

--Create the table.
create table Student_Ranks
(Student_Name varchar(2)
,Maths smallint
,Physics smallint
,Chemistry smallint
,Total_Marks AS Maths + Physics + Chemistry
,Age tinyint)
GO

--Populate the table with marks.
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('A', 90, 50, 70, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('B', 70, 60, 90, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('C', 50, 70, 90, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('D', 50, 90, 70, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('E', 100, 90, 45, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('F', 90, 50, 70, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('G', 45, 80, 60, 16)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('H', 52, 49, 35, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('I', 75, 35, 45, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('J', 35, 35, 25, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('K', 10, 45, 35, 17)
insert into Student_Ranks (Student_Name, Maths, Physics, Chemistry, Age)
values ('L', 5, 35, 56, 17)
GO

select * from Student_Ranks
GO

--SQL Server 2005 code for ranking students.
--Ranks students by Total Marks. If equal, then checks for Maths, then Physics, then
--Chemistry. If all equal, then ranks the older one up.

select Row_Number() Over (Order by Total_Marks desc
,Maths desc
,Physics desc
,Chemistry desc
,Age Desc)
as Student_Rank, Student_Name
from
Student_Ranks
GO

--The code that would do the same ranking in SQL Server 2000.
--The IDENTITY() function is allowed only with SELECT INTO.
--To guarantee the order, TOP 100 PERCENT, MAXDOP 1 and ORDER BY are required.

select top 100 percent Student_Name
,Student_Rank = identity(int, 1, 1)
into #Temp_Student_Ranks
from Student_Ranks
Order by Total_Marks desc, Maths desc, Physics desc, Chemistry desc, Age Desc
option (maxdop 1)

select * from #Temp_Student_Ranks
GO
--Clean up.
Drop table #Temp_Student_Ranks
GO

--The college has to call candidates for discipline allotment in batches of 4.
--Who is in which batch/group as per their rank? The first group is called first, the second next
--and so on.

select NTILE(3) Over(Order by Total_Marks desc
,Maths desc
,Physics desc
,Chemistry desc
,Age Desc)
as Batch, Student_Name from Student_Ranks
GO

--In 2000, the same grouping of batches would have been done in the following way:
select top 100 percent Student_Name
,Student_Rank = identity(int, 1, 1)
,0 as Batch
into #Temp_Student_Ranks
from Student_Ranks
Order by Total_Marks desc, Maths desc, Physics desc, Chemistry desc, Age Desc
option (maxdop 1)
GO

update #Temp_Student_Ranks
set Batch = case
when Student_Rank between 1 and 4 then 1
when Student_Rank between 5 and 8 then 2
when Student_Rank between 9 and 12 then 3
else 4
end

select Student_Name, Batch from #Temp_Student_Ranks
GO
--Clean up.
Drop table #Temp_Student_Ranks
GO

You may observe how easy it is in 2005 to rank and group students using ROW_NUMBER() OVER and NTILE() respectively. As far as I have tried, the same holds good for other ranking and grouping tasks. And the SQL Server 2005 code performs better than the SQL Server 2000-style code.
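For instance, the same family of window functions also includes RANK() and DENSE_RANK(), which differ from ROW_NUMBER() only in how they number ties; a quick sketch on the same table:

--ROW_NUMBER() always produces unique numbers. RANK() gives tied rows the same
--number and leaves gaps afterwards; DENSE_RANK() does not leave gaps.
--Students A, C, D and F all tie on Total_Marks (210) in our sample data.
select Student_Name
,Row_Number() Over (Order by Total_Marks desc) as Row_Num
,Rank() Over (Order by Total_Marks desc) as Rnk
,Dense_Rank() Over (Order by Total_Marks desc) as Dense_Rnk
from Student_Ranks
GO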

In my next article, I will come up with some tricky T-SQL code. Please let me know your opinions on the scope for improvement.

Friday, October 12, 2007

SQL Server: Making a good product.

I am a new blogger. Basically an SQL Server DBA and performance-tuning guy, I am also interested in poetry, literature and music. The aim of this blog is to express ideas as they come on any of these topics, i.e. SQL Server, poetry, literature and music. All are welcome to join me and express their ideas. The thoughts and ideas expressed in this blog are personal to the blogger alone, and nobody need take any offence at them.

I love MS SQL Server 2000 more than any other version of SQL Server and over all other RDBMSs. It was a well-made product. Performance-wise and feature-wise, it was a product in which every feature present worked well. I always feel that the people who developed it did so with commitment and were very meritorious. It did not have too many bells and whistles, but it worked much better than the so-called 'feature-rich' RDBMSs from other vendors. It is so important to hire well-qualified people, e.g. people with a four-year bachelor's degree in engineering/technology or higher from premier institutes and good experience, at least in the technical field. Then the product they make, or are asked to make, will be competent and work efficiently.
These days, companies hire just about anybody based on a voluminous resume. The purpose is to save money, because such people can be paid less. In outsourcing scenarios, such people add only to headcount and consequently to the revenue of the contracted company. But what happens in the bargain is that although the assignment or the product may give results, it may not be efficient or robust. Making something and making something good are two different things. Companies are satisfied with making something, and when experts start using these products after purchasing them, the products do not stand the tests and pressures of production environments, and the purchaser ends up hiring somebody else to make them better, or buying higher- and higher-end hardware to make them work to his satisfaction. So in all, the customer loses his investment and gets poor returns. That is why it is so important to make a good product like SQL Server 2000.

I remember, a few years ago, running some poorly written code on SQL Server 7.0. It took around 13 minutes to execute to completion. Then I ran the same lengthy query on SQL Server 2000. There was no difference in hardware or resource allocation between the two setups, yet the same query ran on SQL Server 2000 in 7 seconds. (Of course, I later optimized it and it ran in practically 0 seconds.) My point is that without any changes to the code, SQL Server 2000 gave me a drastic improvement out of the box, with no input from my end. This is just one example. I have never seen similar out-of-the-box performance improvements in any other version or RDBMS. Although my clients like to move to newer versions or switch RDBMSs, and I participate in their upgrades and migrations, I still feel they should stick with a good database like SQL Server 2000 and not get swayed by rhetoric, bells and whistles. In my opinion, nothing is greater than performance.
I had another client who felt he had excellent 64-bit hardware, and yet the performance was not up to the mark. Somebody asked him to upgrade the hardware further, and the client was not interested. Users of his website, i.e. his customers, had to wait 10 to 15 seconds on average for responses. I took their top 10 stored procedures and optimized them. Each of these SPs was called thousands of times a day. After optimizing, testing and promoting these 10 SPs to production, the total CPU utilization came down from 312 hours to 49 hours. This was on SQL Server 2000. The beauty of this product was that it gave you ample flexibility to code, with your imagination being the only limiting factor. Not so with other databases, though.
But nowadays there are so many bells and whistles in RDBMSs that real T-SQL is getting drowned out. That is a sad state of affairs. T-SQL beats CLR in speed in almost all cases (except possibly string manipulation within variables), and yet nobody realises how important T-SQL is. In a good product, all other coding techniques should be part of T-SQL, and NOT T-SQL part of other methods. I reiterate: simplicity, efficiency, flexibility and performance are more important than the cacophony the bells and whistles make.
One more thing I liked about SQL Server 2000 was the short learning curve. You learn quickly and then use your imagination to bring out the best possible results. I have seen in some other RDBMSs that the learning curve is so long and involved, and the flexibility so little, that learners keep learning until the next version comes out and then start learning that one, without ever getting ample time to apply what they learnt with imagination. This is sad and ridiculous. Products should be made keeping the interests of the customer in mind, not the limitations in the thinking of the architects.

In my next post, we will look at some possibly cool coding in T-SQL. Bye for now.