Thursday, August 25, 2011

Table_Sizes

Introduction:

Designing databases is a common activity for most of us, and assigning proper datatypes to attributes within entities is part of it. Once a database is promoted to production and thousands of users start pounding its tables, the DBA usually tracks the growth trends of the database and the tables within it. Maintaining these statistics (among numerous others) is important for seeing how the application is doing, how user response times are varying, and how the hardware is being utilized.

When an application slows down and database response deteriorates, there can be several contributing factors: bad database layout, sub-optimal database design, an ill-designed application (no stored procedures, ad hoc T-SQL), poorly written stored procedures, improper allocation of CPU or memory, and various combinations of these. A common problem I have come across as a DBA is inappropriate datatypes chosen for columns during the design phase and detected only months after implementation. This is especially rampant when the DBA is not part of the design or is hired after the fact. I would like to share one such case with you today.

Building up a scenario:

The first step for me is to get a list of the largest tables (in terms of row count) in a database. Then I need the size of each of these tables on disk as well. With both figures in hand, I can identify the largest tables by row count and by size on disk.

I would also like to see how many rows, on average, reside on each data page allocated to a table. Let us call this "row density" for our discussion.

To get this data, I used different methods in Microsoft SQL Server 2000 and 2005. In SQL Server 2008, however, I use the undocumented system function sys.fn_PhysLocFormatter together with the system stored procedure sp_MSForEachTable. Since these objects are unsupported, I use them for a quick view of the database tables for my own benefit rather than for a report to the CTO.
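As an aside, a documented way to approximate the same row-density figure is the DMV sys.dm_db_partition_stats, which reports row counts and in-row data pages per partition. The sketch below is only an approximation (it looks at the heap or clustered index, i.e. the table data itself, and ignores out-of-row storage):

-- Documented alternative: rows, in-row data pages and rows per page for each user table
select object_schema_name(ps.object_id) + '.' + object_name(ps.object_id) as Table_Name,
       sum(ps.in_row_data_page_count)                                     as Data_Pages,
       sum(ps.row_count)                                                  as Total_Rows,
       case when sum(ps.in_row_data_page_count) = 0 then 0
            else sum(ps.row_count) * 1.0 / sum(ps.in_row_data_page_count)
       end                                                                as Rows_Per_Page
from sys.dm_db_partition_stats ps
where ps.index_id in (0, 1)                              -- heap or clustered index = the data
  and objectproperty(ps.object_id, 'IsUserTable') = 1    -- skip system tables
group by ps.object_id
order by Data_Pages desc

That said, the rest of this article sticks to the undocumented route because it also lets us look at individual pages.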

 

select sys.fn_PhysLocFormatter(%%physloc%%) File_Page_Row from Orders

The results look like the following:

 

File_Page_Row
-------------------------
(1:90536:0)
(1:90536:1)
(1:90536:2)

Each row in the result set represents one row in Orders table. So there would be as many rows in this result set as there are rows in the table, Orders.

Let's break down the three portions of the first result above, 1:90536:0. The first portion, 1, is the id of the data file; the second, 90536, is the page number within that file; and the third is the slot (row) number within the page, which starts at 0 rather than 1. You can see that the next two rows in the result set (slots 1 and 2) reside on the same page, 90536. In effect, page 90536 holds 3 rows of this table, so its row density is 3.0.
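To turn that observation into a number for the whole table, here is a minimal sketch against the same Orders table: it strips off the trailing slot number, counts how many rows sit on each file:page combination and then averages those counts, which is exactly the row density we are after:

-- Rows per page for the Orders table: group the physical locations by file:page
;with loc as
(
    -- (file:page:slot) for every row, e.g. (1:90536:0)
    select sys.fn_PhysLocFormatter(%%physloc%%) as File_Page_Row
    from Orders
),
per_page as
(
    -- drop the trailing ":slot" portion so that rows on the same page group together
    select left(File_Page_Row, len(File_Page_Row) - charindex(':', reverse(File_Page_Row))) as File_Page,
           count(*) as Rows_On_Page
    from loc
    group by left(File_Page_Row, len(File_Page_Row) - charindex(':', reverse(File_Page_Row)))
)
select count(*)                as Pages_Used,
       avg(Rows_On_Page * 1.0) as Avg_Rows_Per_Page   -- the "row density"
from per_page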

Most of us are familiar with the system stored procedure sp_msforeachtable. It runs the same prescribed statement against every table in a given database. As it has been discussed so often on the forums, I will skip a detailed description here; a one-line example follows.
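Just as a reminder of how it works: the '?' placeholder inside the quoted statement is replaced with each table's two-part name, so this one-liner, for example, reports space usage for every table in the current database:

-- '?' is substituted with each table name, e.g. [dbo].[Order_Details]
exec sp_msforeachtable 'exec sp_spaceused ''?'''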

Now let's combine sys.fn_PhysLocFormatter and sp_msforeachtable into a single script so that we get user-friendly results for every table, ready for analysis and action:

 

/*******************************
NAME : Table Size script.
Author : Suresh K Maganti
Purpose: Get pages, rows per page, and table size for each table. Find large tables with low row density.
********************************/
if OBJECT_ID('tempdb..#TEMP') is not null
drop table #TEMP

create table #TEMP (Table_Name varchar(100) null
,Pages int null
,Rows_Per_Page decimal(28,4) null
,Rows_in_Table as convert(decimal(28, 0), Pages * Rows_Per_Page)
,Table_Size_MB as convert(decimal(28, 0), (convert(decimal(28,2), Pages) * 8192.00) / (1024.00 * 1024.00)))


insert into #TEMP (Table_Name, Pages, Rows_Per_Page)
exec sp_msforeachtable 'select ''?'', count(Page_Number) Number_of_Pages_in_Table, isnull(avg(isnull(convert(decimal(28, 4), Rows_Per_Page), 0.0000)), 0.0000) Avg_Rows_Per_Page
from
(select substring(SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1, charindex('':'', SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1)-1) Page_Number, COUNT(1) Rows_Per_Page
from
(select sys.fn_PhysLocFormatter (%%physloc%%) File_Page_Row from ?) a
group by substring(SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1, charindex('':'', SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1)-1)) b'



select * from #TEMP
order by 2 desc, 3 asc

if OBJECT_ID('tempdb..#TEMP') is not null
drop table #TEMP



Here is the result set:

Table_Name                  Pages    Rows_Per_Page   Rows_in_Table   Table_Size_MB
--------------------------  -------  --------------  --------------  --------------
[dbo].[Order_Details]       100021   35.0380         3511193         783
[dbo].[Order_Master]        3345     83.1629         278180          26
[dbo].[Customer_Details]    2376     526.0118        1249804         19
[dbo].[Customer_Master]     358      404.1006        144668          3
...

 

I have shown only a partial result set above, which is enough for our discussion.

From past designs and analyses, I generally mark tables with under 100 rows per page as problematic, and I try to bring this figure to 100 rows per page or more in consultation with the application development teams. Sometimes that requires further normalization as well. In our example, the Order_Details table is our largest (~3.5 million rows), occupying ~783 MB of disk space, yet its row density is only around 35. If we could increase the row density, we would automatically bring down the table's size on disk.
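To see why, remember that the on-disk size is roughly (rows / row density) pages of 8 KB each. A quick back-of-the-envelope sketch using the row count from the result set above (the real figures differ slightly because page headers and free space are ignored):

-- Estimated pages and size for ~3.5 million rows at different row densities
declare @rows decimal(18, 0)
set @rows = 3511193                                               -- Order_Details row count from above
select d.rows_per_page,
       ceiling(@rows / d.rows_per_page)                          as estimated_pages,
       ceiling(@rows / d.rows_per_page) * 8192.0 / (1024 * 1024) as estimated_size_mb
from (values (35.04), (70.00), (100.00)) as d(rows_per_page)

Doubling the row density to around 70 would cut the estimate to roughly 390 MB; at 100 rows per page it would be around 275 MB.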

Let's see the table structure now:

sp_help [Order_Details]

Column_name      Type       Computed   Length   Prec   Scale   Nullable
---------------  ---------  ---------  -------  -----  ------  ---------
Order_ID         int        no         4        10     0       no
Product_ID       int        no         4        10     0       no
Specifications   nvarchar   no         800                     yes
Status           varchar    no         255                     yes

 

Observations and verifications:

The column [Specifications] has the datatype nvarchar(400), yet as per the application team we do not store Unicode values in this column.
There is a composite clustered index on the columns (Order_ID, Product_ID) with fillfactor = 100.
There is no compression on this table or its index.
This is a pretty normalized table with just 4 columns.
This table is not partitioned.
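These points can be verified from the documented catalog views; a quick check against the same table (the exact index names will of course differ in your environment):

-- Indexes, fill factor and current compression setting for dbo.Order_Details
select i.name as index_name,
       i.type_desc,
       i.fill_factor,
       p.partition_number,
       p.data_compression_desc
from sys.indexes i
join sys.partitions p
  on p.object_id = i.object_id
 and p.index_id  = i.index_id
where i.object_id = object_id('dbo.Order_Details')
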
Recommendations:

If we change the datatype of the column [Specifications] to varchar(400), we can reduce the storage space.
If we compress the lone index (and with it the table data), we can reduce the table size further.
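Before going further, it is worth a rough estimate of what the nvarchar-to-varchar change alone buys. Since nvarchar stores 2 bytes per character and plain varchar stores 1, and the column holds no Unicode data as per the application team, roughly half of the column's current in-row bytes would be freed; a quick sketch:

-- Approximate in-row bytes freed by changing [Specifications] from nvarchar to varchar
select count(*)                                                              as total_rows,
       avg(datalength([Specifications]))                                     as avg_bytes_now,     -- 2 bytes per character
       avg(datalength([Specifications])) / 2                                 as avg_bytes_after,   -- ~1 byte per character
       sum(convert(bigint, datalength([Specifications]))) / 2.0 / 1048576.0  as approx_mb_saved
from dbo.Order_Details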
 

Evaluation of recommendations:
Let us estimate our gains through page compression:

sp_estimate_data_compression_savings @schema_name = 'dbo', @object_name = 'Order_Details', @index_id = 1, @partition_number = 1, @data_compression = 'Page'

Here is the result:
object_name     schema_name   index_id   partition_number   size_with_current_compression_setting(KB)   size_with_requested_compression_setting(KB)
Order_Details   dbo           1          1                  983688                                       758784

 

That's a saving of almost 23% in space. Now let us estimate our gains through row compression:

sp_estimate_data_compression_savings @schema_name = 'dbo', @object_name = 'Order_Details', @index_id = 1, @partition_number = 1, @data_compression = 'Row'

Here is the result:
object_name     schema_name   index_id   partition_number   size_with_current_compression_setting(KB)   size_with_requested_compression_setting(KB)
Order_Details   dbo           1          1                  983688                                       969192

With row compression, our gain is only around 1.5%, so page compression is clearly the better choice in this case.

 

Implementation:
So here we go with the implementation:

--Change the datatype from nvarchar(400) to varchar(400)
alter table [dbo].[Order_Details] alter column [Specifications] varchar(400) null

--Compress the clustered index using PAGE compression:
alter index all on [dbo].[Order_Details] rebuild with (fillfactor = 95, Data_Compression = PAGE, maxdop = 1)

--Compress the table using PAGE compression (Not necessary if the above step is done):
alter table [dbo].[Order_Details] rebuild Partition = All with (Data_Compression = PAGE, maxdop = 1)
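A quick sanity check right after the rebuild, before re-running the full script, is the documented sp_spaceused procedure:

-- Reports row count plus reserved, data and index sizes for the table
exec sp_spaceused 'dbo.Order_Details'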



Testing:

Now let's run our script again to see the row density and table sizes:

/*******************************
NAME : Table Size script.
Author : Suresh K Maganti
Purpose: Get pages, rows per page, and table size for each table. Find large tables with low row density.
********************************/
if OBJECT_ID('tempdb..#TEMP') is not null
drop table #TEMP

create table #TEMP (Table_Name varchar(100) null
,Pages int null
,Rows_Per_Page decimal(28,4) null
,Rows_in_Table as convert(decimal(28, 0), Pages * Rows_Per_Page)
,Table_Size_MB as convert(decimal(28, 0), (convert(decimal(28,2), Pages) * 8192.00) / (1024.00 * 1024.00)))


insert into #TEMP (Table_Name, Pages, Rows_Per_Page)
exec sp_msforeachtable 'select ''?'', count(Page_Number) Number_of_Pages_in_Table, isnull(avg(isnull(convert(decimal(28, 4), Rows_Per_Page), 0.0000)), 0.0000) Avg_Rows_Per_Page
from
(select substring(SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1, charindex('':'', SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1)-1) Page_Number, COUNT(1) Rows_Per_Page
from
(select sys.fn_PhysLocFormatter (%%physloc%%) File_Page_Row from ?) a
group by substring(SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1, charindex('':'', SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1)-1)) b'



select * from #TEMP
order by 2 desc, 3 asc

if OBJECT_ID('tempdb..#TEMP') is not null
drop table #TEMP



Here is the result set:

Table_Name                  Pages    Rows_Per_Page   Rows_in_Table   Table_Size_MB
--------------------------  -------  --------------  --------------  --------------
[dbo].[Order_Details]       51103    68.7081         3511190         399
[dbo].[Order_Master]        3345     83.1629         278180          26
[dbo].[Customer_Details]    2376     526.0118        1249804         19
[dbo].[Customer_Master]     358      404.1006        144668          3
...

Please note that for the table [dbo].[Order_Details], the row density (rows per page) has gone up from about 35 to about 68. At the same time, the table size has come down from 783 MB to 399 MB, a saving of almost 50% of the disk space. But that is not all. Let's compare the I/O statistics before and after our changes by simply getting the table's row count with the following code:

set statistics io on
set statistics time on
set statistics profile on

select COUNT(1) from dbo.Order_Details with (nolock)

set statistics io off
set statistics time off
set statistics profile off

 

Before changes:
Table 'Order_Details'. Scan count 3, logical reads 79825, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 500 ms, elapsed time = 247 ms.

After changes:

Table 'Order_Details'. Scan count 3, logical reads 51193, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 483 ms, elapsed time = 241 ms.

 

You may note that the logical reads have come down by almost 36%. CPU time has dropped only marginally because compression adds CPU overhead to decompress pages as they are read. The reduction in logical reads, however, pays off further when this table is joined with other tables.
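For illustration only, here is the kind of join that benefits; the Order_Master table's Order_ID column is assumed here, since its structure was not shown above:

-- Hypothetical join: fewer Order_Details pages means fewer logical reads feeding the join
select om.Order_ID, count(*) as line_items
from dbo.Order_Master om
join dbo.Order_Details od
  on od.Order_ID = om.Order_ID
group by om.Order_ID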

Conclusion:

The script helps identify where design changes are required.
Implementing optimal datatypes and an appropriate compression level reduced disk space usage by around 50% and logical reads by around 36% in this case. (These results will vary from table to table depending on the datatypes and characteristics of the columns.)
Various studies show an increase in CPU utilization of 1% to 2% due to compression (though not in our specific case), so please do test and verify your improvements before implementation.
