Who Has Busy Files? Identifying Biggest Performance Users and Bottlenecks (Part 1)
This is Part 1 of a guest blogging series by longtime DBA, blogger, and all-around SQL Server enthusiast Merrill Aldrich. Merrill is a DBA for a financial services organization in Seattle and a graduate of the Rice University School of Architecture. Merrill started working with SQL Server creating early dynamic web sites in the mid-1990s, using some long-forgotten version of JJ Allaire’s ColdFusion. It took several years of designing and visualizing buildings with early 3D rendering and collaborative BIM applications, and teaching those tools, before he found his true calling and became a full-time SQL Server DBA. His current job centers primarily on wrangling an unreasonable number of different ISV databases and SQL Server instances, and finding or creating tools, monitoring and automation to keep them all running smoothly. You can find Merrill on Twitter as @onupdatecascade.
This is the first part of a blog series that will show how to use some SQL Server out-of-the-box tools (DMOs, SSIS and SSRS) to monitor and trend which database files in your organization are the busiest. If you are a DBA with multiple databases on multiple instances, you can help your organizationand perhaps look very smartby being able to easily pinpoint which applications are using the most storage IO performance across your enterprise.
SQL Server databases are commonly IO-bound, or at least IO-demandingthat is, disk access is one of the main performance bottlenecks. If you run on SAN storage, what often happens is that you need to communicate to your SAN admin not the capacity that you need for SQL Server (GB), but more importantly the required storage performance profile, or number of IOPS. That performance profile becomes a cost driver, determining how many of what type of media devicesfrom fewer, slower spinning disks up to faster SSDsare required to support the SQL Server workload. Because it’s a cost driver, the question that often surfaces is, “Who is using all this expensive IO capacity?”
As a DBA I find I get that question a lot, so I’ve put together a small collection of monitoring tools that can report to me, each morning, the busiest databases on the SAN, no matter which server they are running on. I’d like to share it, because it was a great way to learn how to collect this information, and also to learn how to use some tools in the SQL Server stack that might be unfamiliar to a traditional systems DBA. Consider it a nice, low-risk introduction to using Integration Services and Reporting Services, with really useful content.
Here’s what we will cover:
- Sampling the IO usage on each database file, into a local Admin database, using the system DMO fn_virtualfilestats.
- Using Integration Services to loop through the SQL Server instances in the organization and move that IO counter data into a central repository.
- A procedure that will aggregate the samples into summary data for longer blocks of time.
- Some Reporting Services reports that will graphically display a prioritized summary of the IO activity from the top n databases across the organization.
Step 1: Sample and keep data from fn_virtualfilestats
This is not a new idea, but it does seem to still be a relatively advanced DBA technique. Personally, I learned this from blogs and training, mainly Wes Brown and Paul Randal. There are many versions out on the web, but I found Wes has a particularly good version over here, which I imitated.
For consistency with the later posts in this series, I’m going to start here with the exact details of the version that I am using. First, create an admin database, if you don’t have one already defined, then create a table to hold the IO counter results. The table is composed of essentially the same columns as the function output, plus a handful of extras including an “as of date” datetime column to record the sample time each time we collect data:
USE [Your Admin Database] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE TABLE dbo.TrackFileIO( AsOfDate datetime NOT NULL, database_name sysname NOT NULL, physical_name nvarchar(260) NOT NULL, sample_duration_sec decimal(4, 2) NOT NULL, num_of_reads bigint NOT NULL, avg_num_of_reads_per_sec bigint NOT NULL, num_of_KB_read bigint NOT NULL, avg_num_of_KB_read_per_sec bigint NOT NULL, io_stall_read_ms bigint NOT NULL, avg_io_stall_read_ms_per_sec bigint NOT NULL, num_of_writes bigint NOT NULL, avg_num_of_writes_per_sec bigint NOT NULL, num_of_KB_written bigint NOT NULL, avg_num_of_KB_written_per_sec bigint NOT NULL, io_stall_write_ms bigint NOT NULL, avg_io_stall_write_ms_per_sec bigint NOT NULL, io_stall bigint NOT NULL, avg_io_stall_per_sec bigint NOT NULL ) ON [PRIMARY] GO CREATE UNIQUE CLUSTERED INDEX clust ON dbo.TrackFileIO ( AsOfDate ASC, physical_name ASC ) ON [PRIMARY] GO /* Extra Credit: File space usage is also handy to know, perhaps create tracking for that at the same time. I use a table like the following, populated from the sys.database_files catalog view: */ CREATE TABLE dbo.TrackFileInfo( AsOfDate datetime NULL, database_name sysname NOT NULL, Physical_Name nvarchar(260) NOT NULL, [Type] tinyint NOT NULL, Type_Desc nvarchar(60) NULL, [State] tinyint NULL, SizeMB int NULL, Max_SizeMB numeric(18, 4) NULL, Growth numeric(18, 4) NULL, NextGrowthIncMB numeric(18, 4) NULL, SpaceUsedMB numeric(18, 4) NULL, Is_Autogrow int NOT NULL, Is_Read_Only bit NOT NULL, Is_Percent_Growth bit NOT NULL ) ON [PRIMARY] GO
Next, we need a procedure to run on intervals that will collect the most recent IO activity from the function. This is only complicated in one respect: The counters are cumulative from server startup, and what we need is really the differential in those counters over some short period of time, to be able to tell if a file is busy, and how. There are a few ways to compute that difference, but this procedure uses the following method:
- Collect all the counters into a temp table.
- Wait for some sample interval.
- Collect them all again, into a second temp table.
- Use a CTE to compute the deltas between the values in the two tables, and the time span between the samples. Insert the results, with a timestamp, into the data collection table.
This data allows us to answer questions like “What was the average reads per second for file ‘x’ around 3:00?” or “On average, which files are experiencing the most IO stalls at mid-afternoon?”
USE [Your Admin Database] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE PROCEDURE [dbo].[recordFileIO] ( @sampleDuration varchar(20) = '00:00:05.00' ) AS BEGIN /* Sample virtual io stats for all files on the instance and record in the TrackFileIO table */ SET NOCOUNT ON; -- Take two samples of the file stats counters in order to be able -- to compute delta per second: SELECT files.name database_name, files.physical_name, fstats.sample_ms, fstats.num_of_reads, fstats.num_of_bytes_read, fstats.io_stall_read_ms, fstats.num_of_writes, fstats.num_of_bytes_written, fstats.io_stall_write_ms, fstats.io_stall INTO #StatsSample1 FROM master.sys.Dm_io_virtual_file_stats(-1, -1) fstats INNER JOIN master.sys.master_files files ON fstats.database_id = files.database_id AND fstats.file_id = files.file_id WHERE files.state_desc='ONLINE'; WAITFOR DELAY @sampleDuration; SELECT files.name database_name, files.physical_name, fstats.sample_ms, fstats.num_of_reads, fstats.num_of_bytes_read, fstats.io_stall_read_ms, fstats.num_of_writes, fstats.num_of_bytes_written, fstats.io_stall_write_ms, fstats.io_stall INTO #StatsSample2 FROM master.sys.Dm_io_virtual_file_stats(-1, -1) fstats INNER JOIN master.sys.master_files files ON fstats.database_id = files.database_id AND fstats.file_id = files.file_id WHERE files.state_desc = 'ONLINE'; DECLARE @now datetime; -- Note: truncate time to seconds to avoid a precision issue with SSIS -- in the package that will collect this data later: SELECT @now = CONVERT( datetime, CONVERT ( char(19), GETDATE(), 121 ) , 121 ); -- Compute delta for the counters and store the result WITH Samples AS ( SELECT @now AS AsOfDate, S1.database_name, S1.physical_name, ( S2.sample_ms - S1.sample_ms ) / 1000.0 AS sample_duration_sec, S2.num_of_reads - S1.num_of_reads AS num_of_reads, ( S2.num_of_bytes_read - S1.num_of_bytes_read ) / 1024 AS num_of_KB_read, S2.io_stall_read_ms - S1.io_stall_read_ms AS io_stall_read_ms, S2.num_of_writes - S1.num_of_writes AS num_of_writes, ( S2.num_of_bytes_written - S1.num_of_bytes_written ) / 1024 AS num_of_KB_written, S2.io_stall_write_ms - S1.io_stall_write_ms AS io_stall_write_ms, S2.io_stall - S1.io_stall AS io_stall FROM #StatsSample1 S1 INNER JOIN #StatsSample2 S2 ON S1.database_name = S2.database_name AND S1.physical_name = S2.physical_name ) INSERT INTO dbo.TrackFileIO ( AsOfDate, database_name, physical_name, sample_duration_sec, num_of_reads, avg_num_of_reads_per_sec, num_of_KB_read, avg_num_of_KB_read_per_sec, io_stall_read_ms, avg_io_stall_read_ms_per_sec, num_of_writes, avg_num_of_writes_per_sec, num_of_KB_written, avg_num_of_KB_written_per_sec, io_stall_write_ms, avg_io_stall_write_ms_per_sec, io_stall, avg_io_stall_per_sec ) SELECT AsOfDate, database_name, physical_name, sample_duration_sec, num_of_reads, CAST( num_of_reads / sample_duration_sec AS bigint ) AS avg_num_of_reads_per_sec, num_of_KB_read, CAST( num_of_KB_read / sample_duration_sec AS bigint ) AS avg_num_of_KB_read_per_sec, io_stall_read_ms, CAST( io_stall_read_ms / sample_duration_sec AS bigint ) AS avg_io_stall_read_ms_per_sec, num_of_writes, CAST( num_of_writes / sample_duration_sec AS bigint ) AS avg_num_of_writes_per_sec, num_of_KB_written, CAST( num_of_KB_written / sample_duration_sec AS bigint ) AS avg_num_of_KB_written_per_sec, io_stall_write_ms, CAST( io_stall_write_ms / sample_duration_sec AS bigint ) AS avg_io_stall_write_ms_per_sec, io_stall, CAST( io_stall / sample_duration_sec AS bigint ) AS avg_io_stall_per_sec FROM Samples ; DROP TABLE #StatsSample1; DROP TABLE #StatsSample2; END GO
The stored procedure can be scheduled through SQL Agent to run as often as is practical, taking into account the growth of the data collection table and the possible impact (though small) on performance. Depending on your environment, it might be reasonable to run intervals of 5-15 minutes, being aware that the table may become quite large if left to grow without purging.
Note: For convenience, the AsOfDate values are collected as local time with the GetDate() function. If you have servers in multiple time zones and/or you don’t want a little hiccup at changes to Daylight Savings Time, you may elect to use UTC time instead. The basic structure works either way.
Once this is tested and deployed across all the SQL Server instances, we have the raw data to aggregate across the enterprise and then report against. In the next installment of this series, we’ll look at a simple method to make an SSIS package loop through all the SQL Servers and collect this counter data into a central warehouse for reporting, purging the oldest data at the same time.