Home Uncategorized T-SQL Tuesday #001: Exploring “Fuzzy” Interval Islands Using SQLCLR

    T-SQL Tuesday #001: Exploring “Fuzzy” Interval Islands Using SQLCLR

    909
    0

    When working with time intervals, we often want to ask a couple of basic questions:

    • Which time periods are not covered by our intervals? These are known as “gaps”.
    • What are the time ranges that we are fully covering? These are known as “islands”.

    If you’re unfamiliar with “gaps” and “islands” I highly recommend reading some of Itzik Ben-Gan’s recent work in SQL Server Magazine. He’s had a great series going on the topic. But one thing that he hasn’t found a good T-SQL solution for is a problem that I call “fuzzy islands.”

    When is an island fuzzy? When it doesn’t necessarily have a fixed end time. For an example of this, consider a store, selling a number of products. Management might want to see a report showing when each product was selling. This is, effectively, an island question. The goal is to find all of the covered ranges during which sales occurred. But running such a report, you might find that way too much data is returned. A given product may have sold units on Monday, Tuesday, and Thursday, but for some reason no one bought one on Wednesday. Creating a new island every time there is a small gap will create a 300-page report where a 1-page dashboard might suffice–not a good user experience, nor a good way of representing the data. The solution? Introduce a bit of fuzziness–a rule that says, for instance, that a gap is only a gap if it’s longer than 7 days.

    Answering the fuzzy islands question is not a very difficult thing to do in T-SQL. The basic algorithm follows:

    1. Find all of the “start” dates or times. These are simply those dates or times for which a previous date or time in the fuzzy interval does not exist. So if a row is dated 2009-12-08 and our fuzzy granularity is 7 days, we know we have a start date if there is no other covered data from the end of November.
    2. For each start date identified, find the minimum date greater than the start date. This is done by looking ahead rather than behind, so if our date is 2009-12-08 and we have a granularity of 7 days, we’ll look forward until December 15th.
    3. Optionally, add the fuzzy factor to the end date. This is something that I think is a good idea, as it introduces the concept of an “active” interval–a period over which, for example, a product is considered to have been selling. The interval shouldn’t necessarily terminate the day that the last sale occurred. But of course this depends on the situation in question.

    The following query produces a fuzzy islands report for each product in the AdventureWorks (or AdventureWorks2008) Production.TransactionHistory table. You can modify the @active_interval variable to tweak the fuzziness and change the output.

    --Find all "active" product time ranges, meaning that the product
    --has sold within the previous 7 days
    DECLARE @active_interval INT = 7
    
    SELECT DISTINCT
        t_s.ProductID,
        t_s.TransactionDate AS StartDate,
        DATEADD
        (
            dd,
            @active_interval,
            (
                SELECT
                    MIN(t_e.TransactionDate)
                FROM Production.TransactionHistory AS t_e
                WHERE
                    t_e.ProductID = t_s.ProductID
                    AND t_e.TransactionDate >= t_s.TransactionDate
                    AND NOT EXISTS
                    (
                        SELECT *
                        FROM Production.TransactionHistory AS t_ae
                        WHERE
                            t_ae.ProductID = t_s.ProductID
                            AND t_ae.TransactionDate BETWEEN 
                                DATEADD(dd, 1, t_e.TransactionDate) 
                                AND DATEADD(dd, @active_interval, t_e.TransactionDate)
                    )
            )
        ) AS EndDate
    FROM 
    (
        SELECT DISTINCT
            ProductID,
            TransactionDate
        FROM Production.TransactionHistory
    ) AS t_s
    WHERE
        NOT EXISTS
        (
            SELECT *
            FROM Production.TransactionHistory AS t_ps
            WHERE
                t_ps.ProductID = t_s.ProductID
                AND t_ps.TransactionDate BETWEEN 
                    DATEADD(dd, -@active_interval, t_s.TransactionDate) 
                    AND DATEADD(dd, -1, t_s.TransactionDate)
        )
    ORDER BY
        ProductID,
        StartDate
    GO

    Running this query you’ll find that it works… But the results are returned a bit more slowly than we might desire–15 to 16 seconds on my end. Looking at the query plan, the reason for this becomes quite obvious: Lots and lots of table scans. How can we eliminate all of the overhead?

    SQLCLR to the rescue. The best way to solve this problem–at least until the SQL Server team adds proper OVER clause support (LAG and LEAD, specifically)–is to use a cursor algorithm. We could do this in a T-SQL cursor, but why bother? Cursor logic in SQLCLR is much, much faster.

    To solve the problem, I implemented an enumerator, called active_products_enumerator. The enumerator is initialized using a SqlDataReader and an “active interval” — the number of days we’re allowing for fuzziness. The DataReader is expected to return rows ordered by ProductID and TransactionDate. The enumeration process uses the following algorithm:

    1. If the current ProductID is not the same as the previous ProductID, return an end date for the previous interval and start a new one
    2. If the current period date is greater than the previous period date plus the active interval, return an end date for the previous interval and start a new one
    3. Otherwise, continue

    Following is the MoveNext method for the enumerator (the complete code is attached to this post so that you can run it on your end without my bombarding you with a gigantic code-filled post):

    
    public bool MoveNext()
    {
        try
        {
            current_results = null;
    
            while (r.Read())
            {
                new_ProductID = r.GetSqlInt32(0);
                new_period_date = r.GetDateTime(1);
    
                if (new_ProductID != ProductID)
                {
                    if (ProductID != 0)
                    {
                        current_results = new results(
                            ProductID,
                            StartDate,
                            period_plus_interval);
                    }
    
                    ProductID = new_ProductID;
                    StartDate = new_period_date;
                    period_plus_interval = new_period_date.AddDays(activeInterval);
    
                    if (current_results != null)
                        return (true);
                }
                else
                {
                    if (period_plus_interval < new_period_date)
                    {
                        current_results = new results(
                            ProductID,
                            StartDate,
                            period_plus_interval);
    
                        StartDate = new_period_date;
                    }
    
                    period_plus_interval = new_period_date.AddDays(activeInterval);
    
                    if (current_results != null)
                        return (true);
                }
            }
    
            //return the last row of data
            if (ProductID != 0)
            {
                current_results = new results(
                    ProductID,
                    StartDate,
                    period_plus_interval);
    
                //set this to 0 so we don't return another row
                ProductID = 0;
            }
    
            if (current_results != null)
                return (true);
            else
            {
                r.Dispose();
                return (false);
            }
        }
        catch
        {
            r.Dispose();
            throw;
        }
    } 

    Put into a table-valued function, as the attached code does, this algorithm will return the same data as the T-SQL query in under a third of a second on my end–around 45 times faster than the T-SQL version.

    Note that I’ve played some games with a loopback connection to get this whole thing to work. That’s a topic for a future blog post, so stay tuned. In the meantime, please realize that you’ll have to catalog the assembly with EXTERNAL_ACCESS permission to make this happen.

    This post was created for T-SQL Tuesday, the revolving SQL Server blog party, hosted this month by… me. Enjoy!

    File Attachment: AdamMachanic_TSQLTuesday_ActiveProducts.zip

    Previous articleInvitation to Participate in T-SQL Tuesday #001: Date/Time Tricks
    Next articleT-SQL Tuesday #001 (Date/Time Tricks): The Roundup
    Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.