Sunday, February 14, 2010

Eliminating Duplicate Rows - The PARTITION BY clause

Often we come across situations where duplicate rows exist in a table. And a need arises to eliminate such duplicates/triplcates, etc. Microsoft SQL Server 2005 & of course 2008 offer a simple option of eliminating such duplicates. In SQL server 2000, a program to eliminate duplicates used to be a bit long. It is a combination of derived tables and Row_Number() Over() funtion in 2005 & 2008 that makes this elimination process so simple. Let's look into it from an example:

--create a table like then one below:

create table Emp_Details (Emp_Name varchar(10), Company varchar(15), Join_Date datetime, Resigned_Date datetime)
go

--Insert some values into this table. Ensure that you insert atleast one duplicate and one triplicate for an easier understanding of the help that the Row_Number() Over() feature offers us:

insert into Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date)
values ('John', 'Software Inc', '20060101', '20061231')
,('John', 'Software Inc', '20060101', '20061231')
,('John', 'Software Inc', '20060101', '20061231')
,('John', 'SuperSoft Inc', '20070101', '20071231')
,('John', 'UltraSoft Inc', '20070201', '20080131')
,('John', 'ImproSoft Inc', '20080201', '20081231')
,('John', 'ImproSoft Inc', '20080201', '20081231')
,('Mary', 'Software Inc', '20060101', '20081231')
,('Mary', 'SuperSoft Inc', '20090101', '20090531')
,('Mary', 'SuperSoft Inc', '20090101', '20090531')
,('Mary', 'UltraSoft Inc', '20090601', '20100531')
,('Mary', 'UltraSoft Inc', '20090601', '20100531')

--Let's see the duplicates/triplicates, etc now using the old method:

select *, COUNT(*) from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date

Emp_Name Company Join_Date Resigned_Date Occurrences
--------- --------------- ----------------------- ----------------------- -----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 3
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 2
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 2

--Although the table has twelve row, only seven are good ones. The remaining are duplicates or triplicates as we see from the Occurrences column above.

--A better was of seeing the duplicates & more is the query below where Row_Number() Over() has been used with teh Partition By clause:

select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details

Emp_Name Company Join_Date Resigned_Date RowNumber
---------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 1
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 1
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 3
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 2
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 1
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 2

--Wherever RowNumber is greater than 1, its is a duplicate row.

--Now lets remove the duplicates/triplicates in one query in an efficient way using Row_Number() Over() using Partition By clause:

delete from a
from
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a
where a.RowNumber > 1

--Let's verify:

select *, COUNT(*) Occurrences from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date

Emp_Name Company Join_Date Resigned_Date Occurrences
--------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 1
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 1
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 1
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 1


The duplicates/triplcates are gone.

A slightly more involved way of eliminating them would have been using Row_Number() Over() without the Partition By clause. Please remember to reload the table, Emp_Details before proceeding further. The code below shows the extra rows that need not exist because they are either duplicates or triplicates. Note that in this query, Partition By has not been used:

select a.* from
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a
inner join
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) b
on a.Emp_Name = b.Emp_Name
and a.Company = b.Company
and a.Join_Date = b.Join_Date
and a.Resigned_Date = b.Resigned_Date
where a.RowNumber = b.RowNumber + 1


Emp_Name Company Join_Date Resigned_Date RowNumber
---------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 4
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 5
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 10
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 12

--We can delete them as follows as well:

delete from a
from
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a
inner join
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) b
on a.Emp_Name = b.Emp_Name
and a.Company = b.Company
and a.Join_Date = b.Join_Date
and a.Resigned_Date = b.Resigned_Date
where a.RowNumber = b.RowNumber + 1

--Verify:

select *, COUNT(*) Occurrences from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date

Emp_Name Company Join_Date Resigned_Date Occurrences
--------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 1
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 1
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 1
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 1

--The duplicates/triplicates are gone.

But note that difference in the two methods. Although Row_Number() Over() has been used in both the methods, Partition By has been used only in the first method.

The partition By clause was more efficient in the following way:

a.] Need to scan the Emp_Details table only once.
b.] Need to have only one derived table instead of two.
c.] No need to join two derived tables.
d.] Less CPU & Memory utilization.

Doing the same in Microsoft SQL Server 2000 and earlier would have been quite a lenghy code which I don't want to write now. However, I want to stress the contribution of the Partition By clause in keeping the code shorter , faster and less demanding on resources.