Time Intervals in SQL Server: Part 1
Given the commonality of the requirement in virtually every possible application to store temporal data, it’s surprising how involved a discussion of techniques for dealing with the data can become. Depending on the business requirements, systems may need to be able to deal with users in disparate time zones, logically deal with time intervals, or work within the possibility of late-arriving updates to data that changes the facts about what happened days or weeks prior.
T-SQL includes very limited facilities to help developers deal with temporal data, but with proper design and a bit of forethought, it should not be a problem. This and following posts will discuss various temporal data design issues, and show both how to set up your database to store various types of temporal data, and how to effectively query the data once it’s been stored.
Very few real-world events happen in a single moment. Time is continuous, and any given state change has a clearly defined start time and end time. For example, you might say, “I drove from Stockbridge to Boston at 10:00 o’clock.” But you really didn’t drive only at 10:00 o’clock, unless you happen to be in possession of some futuristic time/space-folding technology (and that’s clearly beyond the scope of this post).
When working with databases, we often consider only the start or end time of an event, rather than the full interval. A column called OrderDate is an almost ubiquitous feature in databases that handle orders; but this column only stores the date/time that the order ended—when the user submitted the final request. It does not reflect how long the user spent browsing the site, filling the shopping cart, and entering credit card information. Likewise, every time we check our e-mail we see a “Sent Date” field, which captures the moment that the sender hit the send button, but does not help identify how long that person spent thinking about or typing the e-mail, activities that constitute part of the “sending” process.
The fact is, we don’t often see this extended data because it’s generally unnecessary. For most sites, it really doesn’t matter for the purpose of order fulfillment how long the user spent browsing (although that information may be useful to UI designers and other user experience people). And it doesn’t really matter, once an e-mail is sent, how much effort went into sending it. The important thing is, it was sent (and later received, another data point that many e-mail clients don’t expose).
Despite these examples to the contrary, for many applications both start and end times are necessary for a complete analysis. Take, for instance, your employment history. As you move from job to job, you carry intervals during which you had a certain title, were paid a certain amount, or had certain job responsibilities. Failing to include both the start and end dates with this data can create some interesting challenges.
Modeling and Querying Continuous Intervals
If a table uses only a starting time or an ending time (but not both) to represent intervals, all of the rows in that table can be considered to belong to one continuous interval that spans the entire time period represented. Each row in this case would represent a subinterval during which some status change occurred. Let’s take a look at some simple examples to clarify this. Start with the following table and rows:
CREATE TABLE JobHistory ( Company VARCHAR(100), Title VARCHAR(100), Pay DECIMAL(9, 2), StartDate SMALLDATETIME ) GO INSERT JobHistory ( Company, Title, Pay, StartDate ) SELECT 'Acme Corp', 'Programmer', 50000.00, '1995-06-26' UNION SELECT 'Software Shop', 'Programmer/Analyst', 62000.00, '1998-10-05' UNION SELECT 'Better Place', 'Junior DBA', 82000.00, '2001-01-08' UNION SELECT 'Enterprise', 'Database Developer', 95000.00, '2005-11-14' GO
A note on this table: I am not going to get into the question of primary keys and how to constrain this data just yet, as the examples that follow will clarify that issue. However, I would like to take this opportunity to point out a way that you can make life somewhat easier when working with this and similar temporal data. In this case, you’ll notice that each of the dates uses a default time component. No one—except the worst micromanager—cares, looking at a job history record, if someone got in to work at 8:00 a.m. or 8:30 a.m. on the first day. What matters is that the date in the table is the start date. Encoding this rule can make querying the data a bit simpler—no need for the range queries discussed earlier in this post. The following check constraint can be added to make times other than midnight illegal:
ALTER TABLE JobHistory ADD CONSTRAINT CK_StartDate_Midnight CHECK (StartDate = DATEADD(day, DATEDIFF(day, 0, StartDate), 0))
Note: In order to make the code in this post compatible with both SQL Server 2005 and 2008 I’ve ignored the latter version’s DATE data type. Using that type will solve the time problem without the need for an additional constraint.
The data in the JobHistory table is easy enough to transform into a more logical format; to get the full subintervals we can assume that the end date of each job is the start date of the next. The end date of the final job, it can be assumed, is the present date (or, if you prefer, NULL). Converting this into a start/end report based on these rules requires T-SQL along the following lines:
SELECT J1.*, ISNULL ( ( SELECT MIN(J2.StartDate) FROM JobHistory AS J2 WHERE J2.StartDate > J1.StartDate ), GETDATE() ) AS EndDate FROM JobHistory AS J1
The outer query gets the job data and the start times, and the subquery finds the first start date after the current row’s start date. If no such start date exists, the current date is used. Of course, an obvious major problem here is lack of support for gaps in the job history. This table may, for instance, hide the fact that the subject was laid off from Software Shop in July, 2000. This is why I stressed the continuous nature of data modeled in this way.
Despite the lack of support for gaps, let’s try a bit more data and see what happens. As the subject’s career progressed, he received various title and pay changes during the periods of employment with these different companies:
INSERT JobHistory ( Company, Title, Pay, StartDate ) SELECT 'Acme Corp', 'Programmer', 55000.00, '1996-09-01' UNION SELECT 'Acme Corp', 'Programmer 2', 58000.00, '1997-09-01' UNION SELECT 'Acme Corp', 'Programmer 3', 58000.00, '1998-09-01' UNION SELECT 'Software Shop', 'Programmer/Analyst', 62000.00, '1998-10-05' UNION SELECT 'Software Shop', 'Programmer/Analyst', 67000.00, '2000-01-01' UNION SELECT 'Software Shop', 'Programmer', 40000.00, '2000-03-01' UNION SELECT 'Better Place', 'Junior DBA', 84000.00, '2002-06-01' UNION SELECT 'Better Place', 'DBA', 87000.00, '2004-06-01'
This data follows the subject along a path of relative job growth. A few raises and title adjustments—including one title adjustment with no associated pay raise—and an unfortunate demotion along with a downsized salary, just before getting laid off in 2000 (the gap which, as mentioned, is not able to be represented here). Luckily, after studying databases while laid off, the subject bounced back with a much better salary and, of course, a more satisfying career track!
Ignoring the gap, let’s see how one might answer a resume-style question using this data. As a modification to the previous query, show the start and end date of tenure with each company, along with the maximum salary earned at the company, and what title was held when the highest salary was being earned.
The first step commonly taken in tackling this kind of challenge is to use a correlated subquery to find the rows that have the maximum value per group. In this case, that means the maximum pay per company:
SELECT Pay, Title FROM JobHistory AS J2 WHERE J2.Pay = ( SELECT MAX(Pay) FROM JobHistory AS J3 WHERE J3.Company = J2.Company )
One key modification that must be made is to the basic query that finds start and end dates. Due to the fact that there are now multiple rows per job, the
MIN aggregate will have to be employed to find the real start date, and the end date subquery modified to look not only at date changes, but also company changes. The following T-SQL finds the correct start and end dates for each company:
SELECT J1.Company, MIN(J1.StartDate) AS StartDate, ISNULL ( ( SELECT MIN(J2.StartDate) FROM JobHistory AS J2 WHERE J2.Company <> J1.Company AND J2.StartDate > MIN(J1.StartDate) ), GETDATE() ) AS EndDate FROM JobHistory AS J1 GROUP BY J1.Company ORDER BY StartDate
A quick note: this query would not work properly if the person had been hired back by the same company after a period of absence during which he was working for another firm. To solve that problem, you might use a query similar to the following, in which a check is done to ensure that the “previous” row (based on
StartDate) does not have the same company name (meaning that the subject switched firms):
SELECT J1.Company, J1.StartDate AS StartDate, ISNULL ( ( SELECT MIN(J2.StartDate) FROM JobHistory AS J2 WHERE J2.Company <> J1.Company AND J2.StartDate > J1.StartDate ), GETDATE() ) AS EndDate FROM JobHistory AS J1 WHERE J1.Company <> ISNULL ( ( SELECT TOP(1) J3.Company FROM JobHistory J3 WHERE J3.StartDate < J1.StartDate ORDER BY J3.StartDate DESC ), '' ) GROUP BY J1.Company, J1.StartDate ORDER BY J1.StartDate
This example complicates things a bit too much for the sake of this post, but I feel that it is important to point this technique out in case readers find it necessary to write these kinds of queries in production applications. This pattern is useful in many scenarios, especially when logging the status of an automated system and trying to determine downtime statistics or other metrics.
Getting back to the primary task at hand, showing the employment history along with peak salaries and job titles, the next step is to merge the query that finds the correct start and end dates with the query that finds the maximum salary and associated title. The simplest way of accomplishing this is with the
CROSS APPLY operator, which behaves similarly to a correlated subquery but returns a table rather than a scalar value. The following T-SQL shows how to accomplish this:
SELECT x.Company, x.StartDate, x.EndDate, p.Pay, p.Title FROM ( SELECT J1.Company, MIN(J1.StartDate) AS StartDate, ISNULL ( ( SELECT MIN(J2.StartDate) FROM JobHistory AS J2 WHERE J2.Company <> J1.Company AND J2.StartDate > MIN(J1.StartDate) ), GETDATE() ) AS EndDate FROM JobHistory AS J1 GROUP BY J1.Company ) x CROSS APPLY ( SELECT Pay, Title FROM JobHistory AS J2 WHERE J2.StartDate >= x.StartDate AND J2.StartDate < x.EndDate AND J2.Pay = ( SELECT MAX(Pay) FROM JobHistory AS J3 WHERE J3.Company = J2.Company ) ) p ORDER BY x.StartDate
This T-SQL correlates the
CROSS APPLY subquery using the
EndDate columns from the outer query in order to find the correct employment intervals that go along with each position. The
StartDate/EndDate pair for each period of employment is a half-open interval (or semiopen, depending on which mathematics textbook you’re referring to); the
StartDate end of the interval is closed (inclusive of the end point), and the
EndDate is open (exclusive). This is because the
EndDate for one interval is actually the
StartDate for the next interval, and these intervals do not overlap.
Although the query does work, it has an issue; the
CROSS APPLY subquery will return more than one row if a title change was made at the maximum pay level, without an associated pay increase (as happens in this data set), thereby producing duplicate rows in the result. The solution is to select the appropriate row by ordering the result by the
Pay column, descending. The modified subquery, which will return only one row per position, is:
SELECT TOP(1) Pay, Title FROM JobHistory AS J2 WHERE J2.StartDate >= x.StartDate AND J2.StartDate < x.EndDate ORDER BY Pay DESC
The important things that I hope you can take away from these examples are the patterns used for manipulating the intervals, as well as the fact that modeling intervals in this way may not be sufficient for many cases.
In terms of query style, the main thing to notice is that in order to logically manipulate this data, some form of an “end” for the interval must be synthesized within the query. Any time you’re faced with a table that maps changes to an entity over time, but uses only a single date/time column to record the temporal component, try to think of how to transform the data so that you can work with the start and end of the interval. This will make querying much more straightforward.
From a modeling perspective, this setup is clearly deficient. I’ve already mentioned the issue with gaps in the sequence, which are impossible to represent in this table. Another problem is overlapping intervals. What if the subject took on some after-hours contract work during the same time period as one of the jobs? Trying to insert that data into the table would make it look as though the subject had switched companies.
This is not to say that no intervals should be modeled this way. There are many situations in which gaps and overlaps may not make sense, and the extra bytes needed for a second column will be a waste. A prime example is a server uptime monitor. Systems are often used by IT departments that ping each monitored server on a regular basis and record changes to their status. Following is a simplified example table and a few rows:
CREATE TABLE ServerStatus ( ServerName VARCHAR(50), Status VARCHAR(15), StatusTime DATETIME ) GO INSERT ServerStatus ( ServerName, Status, StatusTime ) SELECT 'WebServer', 'Available', '2005-04-20 03:00:00.000' UNION SELECT 'DBServer', 'Available', '2005-04-20 03:00:00.000' UNION SELECT 'DBServer', 'Unavailable', '2005-06-12 14:35:23.100' UNION SELECT 'DBServer', 'Available', '2005-06-12 14:38:52.343' UNION SELECT 'WebServer', 'Unavailable', '2005-06-15 09:16:03.593' UNION SELECT 'WebServer', 'Available', '2005-06-15 09:28:17.006' GO
Applying almost the exact same query as was used for start and end of employment periods, we can find out the intervals during which each server was unavailable:
SELECT S1.ServerName, S1.StatusTime, ISNULL ( ( SELECT MIN(S2.StatusTime) FROM ServerStatus AS S2 WHERE S2.StatusTime > S1.StatusTime ), GETDATE() ) AS EndTime FROM ServerStatus AS S1 WHERE S1.Status = 'Unavailable'
Some systems will send periodic status updates if the system status does not change. The monitoring system might insert additional “unavailable” rows every thirty seconds or minute until the target system starts responding again. As-is, this query would report each interim status update as a separate interval starting point. To get around this problem, the query could be modified as follows:
SELECT S1.ServerName, MIN(S1.StatusTime) AS StartTime, p.EndTime FROM ServerStatus AS S1 CROSS APPLY ( SELECT ISNULL ( ( SELECT MIN(S2.StatusTime) FROM ServerStatus AS S2 WHERE S2.StatusTime > S1.StatusTime AND S2.Status = 'Available' ), GETDATE() ) AS EndTime ) p WHERE S1.Status = 'Unavailable' GROUP BY S1.ServerName, p.EndTime
This new version finds the first “available” row that occurs after the current “unavailable” row; that row represents the actual end time for the full interval during which the server was down. The outer query uses the
MIN aggregate to find the first reported “unavailable” time for each
In the second part of this post, I’ll dive into modeling and querying independent intervals.