Tuesday, December 11, 2007

MERGE statement in SQL Server 2008

Since the last few days, I have been spending some time on studying new features in the Katmai betas. One of the better coding introductions I like in Microsoft SQL Server 2008 is the MERGE statement. I like the way it provides option to UPDATE, DELETE and INSERT data simultaneously in a single T-SQL statement.

Using it for simple operations is a costly affair when compared to equivalent T-SQL code in 2000/2005. However using it for complex operations is simple and efficient compared to equivalent T-SQL code in 2000/2005. This is the paradox. I find most prolific coders prefer to write compact code even if is is very complex rather than using lengthy but simple code. The reason most often is that the shorter the code, the faster it performs in most cases. Here is where the paradox associated with MERGE statement comes in handy.

Let me create a simple situation first:

--Create a small table.
create table abc
(productid int identity (1,1)
,Quantity decimal(4,0) default 100
,Order_Date datetime default getdate())

--Populate the tables, abc.

insert into abc default values
insert into abc (Quantity) values (125)
insert into abc (Quantity) values (150)
insert into abc (Quantity) values (175)
insert into abc (Quantity) values (200)
insert into abc (Quantity) values (225)
insert into abc (Quantity) values (250)
insert into abc (Quantity) values (275)
insert into abc (Quantity) values (300)
insert into abc (Quantity) values (325)
insert into abc (Quantity) values (350)
insert into abc (Quantity) values (375)

--Check the contents of the table, abc.
select * from abc


--Add a nullable column to the table, abc. This one column is to be populated (updated) later.
alter table abc
add OrderID int null


--Now the MERGE construct updates the OrderID column of the table abc as follows.
--Divide the ProductID by 2. If remainder comes out to be zero, then update
--OrderID with 2. If remainder comes out to a non-zero value, then update
--OrderID as ProductID minus the Quotient.
--The code is given below:

merge abc as master_table
using (select ProductId, ProductId / @l_OrderID_value, ProductId % @l_OrderID_value
from abc) as input (ProductId, Quotient, Remainder)
ON master_Table.ProductId = Input.ProductID
When Matched
THEN UPDATE set master_table.OrderID = case
when input.Remainder = 0 then @l_OrderID_value
else master_Table.ProductId - Input.Quotient
end;

It was very structured but long. This was a simple operation and didn't need such length of code.

--Now let's backup the data in the tables abc.

select * into def from abc

Now, let's write the equivalent T-SQL code as a replacement of the MERGE statement above.

declare @l_OrderID_value int
set @l_OrderID_value = 2

update abc
set OrderID = case
when ProductID%@l_OrderID_value = 0 then @l_OrderID_value
else ProductId - (ProductID/@l_OrderID_value)
end;

This was short, simple and just one UPDATE statement was required to do the entire operation.

If you look for the query cost relative to batch when you run both the statements together and get actual execution plans, the MERGE costs 76% while the single UPDATE costs 24%. The MERGE statement had a SELECT from the table abc, then it uses a derived table, and then it does the UPDATE. In the second statement above, there is a direct UPDATE. So the second statement is expected to be faster. So in this example, MERGE was not really useful. The above example was a simple case, so we won't bother ourselves much with MERGE in simple scenarios.

Now, let's take up something slightly more involved where MERGE can be really useful.

--Create a new table.
create table abc_Summary
(OrderID int
,Order_Date datetime
,Quantity int)

--It is empty.
select * from abc_Summary


MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
When NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity;

--Verify that the empty table has gotten populated with aggregated data from the main table, abc.
select * from abc_Summary

--So this time the INSERT took place. Now lets UPDATE the column abc_Summary.Quantity to --zero. And then we will run the above code again.

update abc_summary
set Quantity = 0
GO

--Verify that Quantity is 0.
select * from abc_Summary
GO

--Run the same code again:
MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
When NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity;

--Verify that the Quantity column in the abc_Summarfy table is updated now with actual summed quantities per OrderID from the parent table, abc.

select * from abc_Summary


Lets do something different now, a bit more involved. UPDATE the table abc_Summary such that OrderID's 1,2 and 3 are removed and Quantity is set to zero for the remaining three rows. Also add one dirty row to the abc_summary table. Note that the table abc_summary should contain only those OrderID's that already exist in the parent table, abc. Since the newly inserted OrderID in the dirty row does not exist in the parent table, abc, it should be deleted from abc_summary. So in all, what we are trying to do here is following:
INSERT missing rows,
UPDATE Quantity for existing three rows, and
DELETE one dirty row.

All this is to be accomplished in one query.
Let's prepare the data now.

--Preparing data.
update abc_summary
set Quantity = 0
GO
delete from abc_summary
where OrderID in (1, 2, 3)
GO
insert into abc_summary (OrderID, Order_Date, Quantity)
values (2001, getdate(), 50000)
go

--Check the data. OrderID's 4,5 and 6 have their Quantity zero'ed out. OrderID's 1,2 and 3 no ---longer exist. A new OrderID 2001 has come in but it doesn't exist in parent table, abc.

select * from abc_Summary

--Now let's get the correct data into the table abc_summary in one single T-SQL query using the MERGE statement which issues INSERT, UPDATE and DELETE simultaneously.

MERGE abc_summary as TARGET_Table
using (select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
group by OrderID, convert(varchar, Order_Date, 112)) as Data (OrderID, Order_Date, Total_Quantity)
on TARGET_Table.OrderID = Data.OrderID
When NOT MATCHED THEN INSERT (OrderID, Order_Date, Quantity)
VALUES (Data.OrderID, Data.Order_Date, Data.Total_Quantity)
WHEN MATCHED THEN
UPDATE SET TARGET_Table.Quantity = Data.Total_Quantity
WHEN SOURCE NOT MATCHED THEN
DELETE;

--Check the data. OrderID's 4,5 and 6 have their Quantity UPDATEd correctly. --Rows for OrderID's 1,2 and 3 have been INSERTed now. The dirty row for OrderID 2001 has been DELETed as it didn't exist in parent table, abc.

select * from abc_Summary

So this time the INSERT took place and brought in the three mmissing ORDERID's (1,2,3) from
the table abc into the table abc_Summary. Then it UPDATEd the abc_Summary.Quantity
column for ORDERID's 3,4,5 from zero to the correct aggregated quantity from the
parent table, abc. Lastly, it also DELETEd the erroneous row with ORDERID 2001 because this
ORDERID did not exist in the parent table, abc. All these three operations were
performed in one single query.

All this can be done without the MERGE statement as well. But it would be really very lengthy and expensive code. It would contain several IF-ELSE constructs, IF [NOT] EXISTS clauses, multiple instances of usage of the tables abc and abc_Summary, apart from the INSERT, UPDATE, DELETE statements. We don't want to do that, do we? The cost of that query would be much more than the MERGE statement above. Let's verify the same. Let me try to build as short a code as possible to replace the MERGE statement above. Please see the replacement code below:


--Prepare the data first.
update abc_summary
set Quantity = 0
GO
delete from abc_summary
where OrderID in (1, 2, 3)
GO
insert into abc_summary (OrderID, Order_Date, Quantity)
values (2001, getdate(), 50000)
go

--Check the data.
select * from abc_Summary

--Now let's get the correct data into the table abc_summary.
--UPDATE Quantity to the correct aggregated value.
update abc_summary
set abc_summary.Quantity = (select sum(abc.Quantity) from abc
where abc_Summary.OrderID = abc.OrderID)
where exists (select OrderID from abc
where abc_Summary.OrderID = abc.OrderID)

--Check the data. Quantity has been updated correctly.
select * from abc_Summary

--INSERT missing ORDERID's 1,2 and 3.
insert into abc_summary (OrderID, Order_Date, Quantity)
select OrderID, convert(varchar, Order_Date, 112), sum(Quantity) from abc
where OrderID not in (select OrderID from abc_Summary)
group by OrderID, convert(varchar, Order_Date, 112)

--Check the data. Missing ORDERID's have been re-inserted correctly with the right Quantity.
select * from abc_Summary

--DELETE dirty row.
delete from abc_Summary
where OrderID not in (select OrderID from abc)

--Check the data. The row with ORDERID 2001 has been deleted as required.
select * from abc_Summary

So multiple instances of joins between the parent table abc and the child table abc_Summary had to be used.

The execution plan showed 40% cost for the MERGE statement and 60% cost for the replacement statemenets above combined. Here we are talking about just va few rows of data. The benefits rise with increase in the number of rows. You may want to try it out.

So in all, MERGE statement in Microsoft SQL Server 2008 is really good. It reduces code, does multiple operations in a single statement and is less expensive or is more efficient when we need to perform a bit complex operations. Please note that the syntax for INSERT, UPDATE and DELETE constructs within the MERGE statement are slightly different from their original counterparts. The MERGE statement opens up a lot of possibilities. Imagine how interesting it would be to use within an AFTER TRIGGER to perform UPDATE, DELETE and INSERT operations on another table in one shot through the trigger. Although it is possible to do it without the MERGE statement as well, but it gets too lengthy and tedious as I had to implement at one of my clients for certain reasons. MERGE can make it compact, simple and efficient.

In my next article, we will solve some puzzles.