Time Intervals in SQL Server: Part 2

In Part 1 of this post, Machanic discussed modeling and querying continuous intervals. In this second part, he discusses independent intervals. He explains that both gaps and overlaps can be represented.

As mentioned previously, modeling intervals as a start time/end time combination may make more sense in many cases. With both a start and end time, each subinterval has no direct dependence on any other interval or subinterval. Therefore, both gaps and overlaps can be represented. The remainder of this section details how to work with intervals modeled in that way.

Going back to the employment example, assume that a system is required for a company to track internal employment histories. Following is a sample table, simplified for this example:

CREATE TABLE EmploymentHistory
(
    Employee VARCHAR(50) NOT NULL,
    Title VARCHAR(50) NOT NULL,
    StartDate SMALLDATETIME NOT NULL,
    EndDate SMALLDATETIME NULL,
    CONSTRAINT CK_Start_End CHECK (StartDate < EndDate)
)
GO

The main thing I’ve left out of this example is proper data integrity. Ignore the obvious need for a table of names and titles to avoid duplication of that data—that would overcomplicate the example. The holes I’m referring to deal with the employment history-specific data that the table is intended for. The primary issue is that although I did include one CHECK constraint to make sure that the EndDate is after the StartDate (we hope that the office isn’t so bad that people are quitting on their first day), I failed to include a primary key.

Deciding what constitutes the primary key in this case requires a bit of thought. Employee alone is not sufficient; then employees would not be able to get new titles during the course of their employment (or at least, it would no longer be a “history” of those changes). The next candidate might be Employee and Title, but this also has a problem. What if an employee leaves the company for a while, and later comes to his senses and begs to be re-hired with the same title? The good thing about this table is that such a gap can be represented; but constraining on both the Employee and Title columns will not allow that situation to happen.

Adding StartDate into the mix seems like it would fix the problem, but in actuality it creates a whole new issue. An employee cannot be in two places (or offices) at the same time, and the combination of the three columns would allow the same employee to start on the same day with two different titles. And although it’s common in our industry to wear many different hats, that fact is generally not reflected in our job title.

As it turns out, what we really need to constrain in the primary key is an employee starting on a certain day; uniqueness of the employee’s particular title is not important in that regard. The following key can be added:

ALTER TABLE EmploymentHistory
ADD PRIMARY KEY (Employee, StartDate)

This primary key takes care of an employee being in two places at once on the same day, but how about different days? Even with this constraint in place, the following two rows would be valid:

INSERT EmploymentHistory
(
    Employee,
    Title,
    StartDate,
    EndDate
)
SELECT 'Jones', 'Developer', '2006-01-01', NULL
UNION ALL
SELECT 'Jones', 'Senior Developer', '2006-06-01', NULL

According to this data, Jones is both Developer and Senior Developer, as of June 1, 2006—quite a bit of stress for one person! The first idea for a solution might be to add a unique constraint on the Employee and EndDate columns. In SQL Server, unique constraints allow for one NULL-valued row—so only one NULL EndDate would be allowed per employee. That would fix the problem with these rows, but it would still allow the following rows:

INSERT EmploymentHistory
(
    Employee,
    Title,
    StartDate,
    EndDate
)
SELECT 'Jones', 'Developer', '2006-01-01', '2006-07-01'
UNION ALL
SELECT 'Jones', 'Senior Developer', '2006-06-01', NULL

Now, Jones was both Developer and Senior Developer for a month. Again, this is probably not what was intended.

Fixing this problem will require more than some combination of primary and unique key constraints, and a bit of background is necessary before I present the solution. Therefore, I will return to this topic in the next section, which covers overlapping intervals.

To finish off this section, a few notes on basic queries on intervals represented with both start and end points. The main benefit of this type of model over the single-date model, beside support for overlaps, is support for gaps. Ignore for a moment the lack of proper constraints, and consider the following rows (which would be valid even with the constraints):

INSERT EmploymentHistory
(
    Employee,
    Title,
    StartDate,
    EndDate
)
SELECT 'Jones', 'Developer', '2004-01-05', '2004-09-01'
UNION ALL
SELECT 'Jones', 'Senior Developer', '2004-09-01', '2005-09-01'
UNION ALL
SELECT 'Jones', 'Principal Developer', '2005-09-01', '2005-10-07'
UNION ALL
SELECT 'Jones', 'Principal Developer', '2006-02-06', NULL

The scenario shown here is an employee named Jones, who started as a developer in January of 2004 and was promoted to Senior Developer later in the year. Jones was promoted again, to Principal Developer, in 2005, but quit a month later. However, a few months after that he decided to rejoin the company, and has not yet left or been promoted again.

The two main questions that you can pose when dealing with intervals that represent gaps are “What intervals are covered by the data?” and “What holes are present?” These types of questions are ubiquitous when working with any kind of interval data. Real-world scenarios include such requirements as tracking of service-level agreements for server uptime and managing worker shift schedules—and of course, employment history.

In this case, the questions can be phrased as “During what periods did Jones work for the firm?” and the opposite, “During which periods was Jones not working for the firm?” To answer the first question, the first requirement is to find all subinterval start dates—dates that are not connected to a previous end date. The following T-SQL accomplishes that goal:

SELECT
    theStart.StartDate
FROM EmploymentHistory theStart
WHERE
    theStart.Employee = 'Jones'
    AND NOT EXISTS
    (
        SELECT *
        FROM EmploymentHistory Previous
        WHERE
            Previous.EndDate = theStart.StartDate
            AND theStart.Employee = Previous.Employee
    )

This query finds rows for Jones (remember, there could be rows for other employees in the table), then filters them down to rows where there is no end date for a Jones subinterval that matches the start date of the row. The start dates for these rows are the start dates for the intervals covered by Jones’s employment.

The next step is to find the ends of the covering intervals. The end rows can be identified similarly to the starting rows; they are rows where the end date has no corresponding start date in any other rows. To match the end rows to the start rows, find the first end row that occurs after a given start row. The following T-SQL finds start dates using the query above, and end dates using a subquery that uses the algorithm just described:

SELECT
    theStart.StartDate,
    (
        SELECT
            MIN(EndDate)
        FROM EmploymentHistory theEnd
        WHERE
            theEnd.EndDate > theStart.StartDate
            AND theEnd.Employee = theStart.Employee
            AND NOT EXISTS
            (
                SELECT *
                FROM EmploymentHistory After
                WHERE
                    After.StartDate = theEnd.EndDate
                    AND After.Employee = theEnd.Employee
            )
    ) AS EndDate
FROM EmploymentHistory theStart
WHERE
    theStart.Employee = 'Jones'
    AND NOT EXISTS
    (
        SELECT *
        FROM EmploymentHistory Previous
        WHERE
            Previous.EndDate = theStart.StartDate
            AND theStart.Employee = Previous.Employee
    )

Finding non-covered intervals (i.e., holes) is a bit simpler. First find all end dates, using the same syntax used to find end dates in the covered intervals query. These dates are the start of non-covered intervals. Make sure to filter out rows where the EndDate is NULL—these subintervals have not yet ended, so it does not make sense to include them as holes. In the subquery for the end of the hole, find the first start date (if one exists) after the beginning of the hole—this is, in fact, the end of the hole. The following T-SQL finds these non-covered intervals:

SELECT
    theStart.EndDate AS StartDate,
    (
        SELECT MIN(theEnd.StartDate)
        FROM EmploymentHistory theEnd
        WHERE
            theEnd.StartDate > theStart.EndDate
            AND theEnd.Employee = theStart.Employee
    ) AS EndDate
FROM EmploymentHistory theStart
WHERE
    theStart.Employee = 'Jones'
    AND theStart.EndDate IS NOT NULL
    AND NOT EXISTS
    (
        SELECT *
        FROM EmploymentHistory After
        WHERE After.StartDate = theStart.EndDate
    )