Time Intervals in SQL Server: Part 2
In Part 1 of this post, Machanic discussed modeling and querying continuous intervals. In this second part, he discusses independent intervals. He explains that both gaps and overlaps can be represented.
As mentioned previously, modeling intervals as a start time/end time combination may make more sense in many cases. With both a start and end time, each subinterval has no direct dependence on any other interval or subinterval. Therefore, both gaps and overlaps can be represented. The remainder of this section details how to work with intervals modeled in that way.
Going back to the employment example, assume that a system is required for a company to track internal employment histories. Following is a sample table, simplified for this example:
CREATE TABLE EmploymentHistory ( Employee VARCHAR(50) NOT NULL, Title VARCHAR(50) NOT NULL, StartDate SMALLDATETIME NOT NULL, EndDate SMALLDATETIME NULL, CONSTRAINT CK_Start_End CHECK (StartDate < EndDate) ) GO
The main thing I’ve left out of this example is proper data integrity. Ignore the obvious need for a table of names and titles to avoid duplication of that data—that would overcomplicate the example. The holes I’m referring to deal with the employment history-specific data that the table is intended for. The primary issue is that although I did include one
CHECK constraint to make sure that the
EndDate is after the
StartDate (we hope that the office isn’t so bad that people are quitting on their first day), I failed to include a primary key.
Deciding what constitutes the primary key in this case requires a bit of thought.
Employee alone is not sufficient; then employees would not be able to get new titles during the course of their employment (or at least, it would no longer be a “history” of those changes). The next candidate might be
Title, but this also has a problem. What if an employee leaves the company for a while, and later comes to his senses and begs to be re-hired with the same title? The good thing about this table is that such a gap can be represented; but constraining on both the
Title columns will not allow that situation to happen.
StartDate into the mix seems like it would fix the problem, but in actuality it creates a whole new issue. An employee cannot be in two places (or offices) at the same time, and the combination of the three columns would allow the same employee to start on the same day with two different titles. And although it’s common in our industry to wear many different hats, that fact is generally not reflected in our job title.
As it turns out, what we really need to constrain in the primary key is an employee starting on a certain day; uniqueness of the employee’s particular title is not important in that regard. The following key can be added:
ALTER TABLE EmploymentHistory ADD PRIMARY KEY (Employee, StartDate)
This primary key takes care of an employee being in two places at once on the same day, but how about different days? Even with this constraint in place, the following two rows would be valid:
INSERT EmploymentHistory ( Employee, Title, StartDate, EndDate ) SELECT 'Jones', 'Developer', '2006-01-01', NULL UNION ALL SELECT 'Jones', 'Senior Developer', '2006-06-01', NULL
According to this data, Jones is both Developer and Senior Developer, as of June 1, 2006—quite a bit of stress for one person! The first idea for a solution might be to add a unique constraint on the
EndDate columns. In SQL Server, unique constraints allow for one
NULL-valued row—so only one
NULL EndDate would be allowed per employee. That would fix the problem with these rows, but it would still allow the following rows:
INSERT EmploymentHistory ( Employee, Title, StartDate, EndDate ) SELECT 'Jones', 'Developer', '2006-01-01', '2006-07-01' UNION ALL SELECT 'Jones', 'Senior Developer', '2006-06-01', NULL
Now, Jones was both Developer and Senior Developer for a month. Again, this is probably not what was intended.
Fixing this problem will require more than some combination of primary and unique key constraints, and a bit of background is necessary before I present the solution. Therefore, I will return to this topic in the next section, which covers overlapping intervals.
To finish off this section, a few notes on basic queries on intervals represented with both start and end points. The main benefit of this type of model over the single-date model, beside support for overlaps, is support for gaps. Ignore for a moment the lack of proper constraints, and consider the following rows (which would be valid even with the constraints):
INSERT EmploymentHistory ( Employee, Title, StartDate, EndDate ) SELECT 'Jones', 'Developer', '2004-01-05', '2004-09-01' UNION ALL SELECT 'Jones', 'Senior Developer', '2004-09-01', '2005-09-01' UNION ALL SELECT 'Jones', 'Principal Developer', '2005-09-01', '2005-10-07' UNION ALL SELECT 'Jones', 'Principal Developer', '2006-02-06', NULL
The scenario shown here is an employee named Jones, who started as a developer in January of 2004 and was promoted to Senior Developer later in the year. Jones was promoted again, to Principal Developer, in 2005, but quit a month later. However, a few months after that he decided to rejoin the company, and has not yet left or been promoted again.
The two main questions that you can pose when dealing with intervals that represent gaps are “What intervals are covered by the data?” and “What holes are present?” These types of questions are ubiquitous when working with any kind of interval data. Real-world scenarios include such requirements as tracking of service-level agreements for server uptime and managing worker shift schedules—and of course, employment history.
In this case, the questions can be phrased as “During what periods did Jones work for the firm?” and the opposite, “During which periods was Jones not working for the firm?” To answer the first question, the first requirement is to find all subinterval start dates—dates that are not connected to a previous end date. The following T-SQL accomplishes that goal:
SELECT theStart.StartDate FROM EmploymentHistory theStart WHERE theStart.Employee = 'Jones' AND NOT EXISTS ( SELECT * FROM EmploymentHistory Previous WHERE Previous.EndDate = theStart.StartDate AND theStart.Employee = Previous.Employee )
This query finds rows for Jones (remember, there could be rows for other employees in the table), then filters them down to rows where there is no end date for a Jones subinterval that matches the start date of the row. The start dates for these rows are the start dates for the intervals covered by Jones’s employment.
The next step is to find the ends of the covering intervals. The end rows can be identified similarly to the starting rows; they are rows where the end date has no corresponding start date in any other rows. To match the end rows to the start rows, find the first end row that occurs after a given start row. The following T-SQL finds start dates using the query above, and end dates using a subquery that uses the algorithm just described:
SELECT theStart.StartDate, ( SELECT MIN(EndDate) FROM EmploymentHistory theEnd WHERE theEnd.EndDate > theStart.StartDate AND theEnd.Employee = theStart.Employee AND NOT EXISTS ( SELECT * FROM EmploymentHistory After WHERE After.StartDate = theEnd.EndDate AND After.Employee = theEnd.Employee ) ) AS EndDate FROM EmploymentHistory theStart WHERE theStart.Employee = 'Jones' AND NOT EXISTS ( SELECT * FROM EmploymentHistory Previous WHERE Previous.EndDate = theStart.StartDate AND theStart.Employee = Previous.Employee )
Finding non-covered intervals (i.e., holes) is a bit simpler. First find all end dates, using the same syntax used to find end dates in the covered intervals query. These dates are the start of non-covered intervals. Make sure to filter out rows where the
NULL—these subintervals have not yet ended, so it does not make sense to include them as holes. In the subquery for the end of the hole, find the first start date (if one exists) after the beginning of the hole—this is, in fact, the end of the hole. The following T-SQL finds these non-covered intervals:
SELECT theStart.EndDate AS StartDate, ( SELECT MIN(theEnd.StartDate) FROM EmploymentHistory theEnd WHERE theEnd.StartDate > theStart.EndDate AND theEnd.Employee = theStart.Employee ) AS EndDate FROM EmploymentHistory theStart WHERE theStart.Employee = 'Jones' AND theStart.EndDate IS NOT NULL AND NOT EXISTS ( SELECT * FROM EmploymentHistory After WHERE After.StartDate = theStart.EndDate )