Friday, June 25, 2010

Database File Usage & Growth

In this article, I would focus on finding empty claimable spaces within databases. Once found, the space can be claimed back using various methods which we would discuss in a later article.

The following script provides the details on a database, its data and log files, the respective filegroups which they logically belong to, the current fize size, the actual space used in the file and the growth factor for the file. I have used this code with the system stored procedure, sp_msforeachdb so as to gather this data for all databases on the server. It may be noted that the temporary table that is being used here, #temp_Data_Sizes is compressed. The next SELECT statement then displays those database files that have 50% or more free space in them. The TSQL code here is for Microsoft SQL Server 2008 and later.

if OBJECT_ID ('tempdb..#temp_Data_Sizes') is not null
drop table #temp_Data_Sizes
go

create table #temp_Data_Sizes
(DatabaseName varchar(255) not null
,Fully_Qualified_File_Name varchar(512) not null
,File_Type varchar(45) not null
,FileGroup varchar(25) not null
,Current_File_Size_GB varchar(25) not null
,Used_File_Space_GB varchar(25) not null
,Growth_Factor varchar(25) not null
,Constraint PK Primary Key clustered (DatabaseName, Fully_Qualified_File_Name) with (fillfactor = 50)
)
with (data_compression = Page)
go

set nocount on;
insert into #temp_Data_Sizes (DatabaseName, Fully_Qualified_File_Name, File_Type, FileGroup, Current_File_Size_GB, Used_File_Space_GB, Growth_Factor)
exec sp_msforeachdb ' use [?]
select DB_NAME(database_id) DatabaseName
,physical_name Fully_Qualified_File_Name
,case [type]
when 0 then ''Data''
when 1 then ''Log''
end +
case fileproperty (name, ''IsLogFile'')
when 0 then case fileproperty (name, ''IsPrimaryFile'')
when 0 then '' - Secondary Data File''
when 1 then '' - Primary Data File''
end
when 1 then ''''
end File_Type
,isnull(Filegroup_name (data_space_id), '''') FileGroup
,convert(decimal(28,2), (size * 8192.0) / (1024.0 * 1024.0 * 1024.0)) Current_File_Size_GB
,convert(decimal(28,2), (fileproperty (name, ''spaceused'') * 8192.0) / (1024.0 * 1024.0 * 1024.0)) Used_File_Space_GB
,case
when is_percent_growth = 0 then convert(varchar(20), convert(decimal(28, 4), (growth * 8192.0) / (1024.0 * 1024.0 * 1024.0))) + '' GB''
when is_percent_growth = 1 then CONVERT(varchar(3), growth) + ''%''
end Growth_Factor
from sys.master_files
where database_id = DB_ID ()'

select * from #temp_Data_Sizes
where 100.00 * ((CONVERT(Decimal(28, 4), Current_File_Size_GB) - CONVERT(Decimal(28, 4), Used_File_Space_GB)) / CONVERT(Decimal(28, 4), Current_File_Size_GB)) >= 50.00
and CONVERT(Decimal(28, 4), Current_File_Size_GB) > 0.0000
order by CONVERT(Decimal(28, 4), Current_File_Size_GB) desc

set nocount off;

if OBJECT_ID ('tempdb..#temp_Data_Sizes') is not null
drop table #temp_Data_Sizes
go

Sunday, February 14, 2010

Eliminating Duplicate Rows - The PARTITION BY clause

Often we come across situations where duplicate rows exist in a table. And a need arises to eliminate such duplicates/triplcates, etc. Microsoft SQL Server 2005 & of course 2008 offer a simple option of eliminating such duplicates. In SQL server 2000, a program to eliminate duplicates used to be a bit long. It is a combination of derived tables and Row_Number() Over() funtion in 2005 & 2008 that makes this elimination process so simple. Let's look into it from an example:

--create a table like then one below:

create table Emp_Details (Emp_Name varchar(10), Company varchar(15), Join_Date datetime, Resigned_Date datetime)
go

--Insert some values into this table. Ensure that you insert atleast one duplicate and one triplicate for an easier understanding of the help that the Row_Number() Over() feature offers us:

insert into Emp_Details (Emp_Name, Company, Join_Date, Resigned_Date)
values ('John', 'Software Inc', '20060101', '20061231')
,('John', 'Software Inc', '20060101', '20061231')
,('John', 'Software Inc', '20060101', '20061231')
,('John', 'SuperSoft Inc', '20070101', '20071231')
,('John', 'UltraSoft Inc', '20070201', '20080131')
,('John', 'ImproSoft Inc', '20080201', '20081231')
,('John', 'ImproSoft Inc', '20080201', '20081231')
,('Mary', 'Software Inc', '20060101', '20081231')
,('Mary', 'SuperSoft Inc', '20090101', '20090531')
,('Mary', 'SuperSoft Inc', '20090101', '20090531')
,('Mary', 'UltraSoft Inc', '20090601', '20100531')
,('Mary', 'UltraSoft Inc', '20090601', '20100531')

--Let's see the duplicates/triplicates, etc now using the old method:

select *, COUNT(*) from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date

Emp_Name Company Join_Date Resigned_Date Occurrences
--------- --------------- ----------------------- ----------------------- -----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 3
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 2
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 2

--Although the table has twelve row, only seven are good ones. The remaining are duplicates or triplicates as we see from the Occurrences column above.

--A better was of seeing the duplicates & more is the query below where Row_Number() Over() has been used with teh Partition By clause:

select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details

Emp_Name Company Join_Date Resigned_Date RowNumber
---------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 1
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 1
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 3
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 2
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 1
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 2

--Wherever RowNumber is greater than 1, its is a duplicate row.

--Now lets remove the duplicates/triplicates in one query in an efficient way using Row_Number() Over() using Partition By clause:

delete from a
from
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a
where a.RowNumber > 1

--Let's verify:

select *, COUNT(*) Occurrences from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date

Emp_Name Company Join_Date Resigned_Date Occurrences
--------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 1
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 1
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 1
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 1


The duplicates/triplcates are gone.

A slightly more involved way of eliminating them would have been using Row_Number() Over() without the Partition By clause. Please remember to reload the table, Emp_Details before proceeding further. The code below shows the extra rows that need not exist because they are either duplicates or triplicates. Note that in this query, Partition By has not been used:

select a.* from
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a
inner join
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) b
on a.Emp_Name = b.Emp_Name
and a.Company = b.Company
and a.Join_Date = b.Join_Date
and a.Resigned_Date = b.Resigned_Date
where a.RowNumber = b.RowNumber + 1


Emp_Name Company Join_Date Resigned_Date RowNumber
---------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 2
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 4
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 5
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 10
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 12

--We can delete them as follows as well:

delete from a
from
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) a
inner join
(select Emp_Name, Company, Join_Date, Resigned_Date, ROW_NUMBER() over (order by Emp_Name, Company, Join_Date, Resigned_Date) RowNumber from Emp_Details) b
on a.Emp_Name = b.Emp_Name
and a.Company = b.Company
and a.Join_Date = b.Join_Date
and a.Resigned_Date = b.Resigned_Date
where a.RowNumber = b.RowNumber + 1

--Verify:

select *, COUNT(*) Occurrences from Emp_Details
group by Emp_Name, Company, Join_Date, Resigned_Date

Emp_Name Company Join_Date Resigned_Date Occurrences
--------- --------------- ----------------------- ----------------------- ----------
John ImproSoft Inc 2008-02-01 00:00:00.000 2008-12-31 00:00:00.000 1
John Software Inc 2006-01-01 00:00:00.000 2006-12-31 00:00:00.000 1
John SuperSoft Inc 2007-01-01 00:00:00.000 2007-12-31 00:00:00.000 1
John UltraSoft Inc 2007-02-01 00:00:00.000 2008-01-31 00:00:00.000 1
Mary Software Inc 2006-01-01 00:00:00.000 2008-12-31 00:00:00.000 1
Mary SuperSoft Inc 2009-01-01 00:00:00.000 2009-05-31 00:00:00.000 1
Mary UltraSoft Inc 2009-06-01 00:00:00.000 2010-05-31 00:00:00.000 1

--The duplicates/triplicates are gone.

But note that difference in the two methods. Although Row_Number() Over() has been used in both the methods, Partition By has been used only in the first method.

The partition By clause was more efficient in the following way:

a.] Need to scan the Emp_Details table only once.
b.] Need to have only one derived table instead of two.
c.] No need to join two derived tables.
d.] Less CPU & Memory utilization.

Doing the same in Microsoft SQL Server 2000 and earlier would have been quite a lenghy code which I don't want to write now. However, I want to stress the contribution of the Partition By clause in keeping the code shorter , faster and less demanding on resources.