Managing Bitemporal Data
In this post, Data Education founder and trainer Adam Machanic discusses how to manage bitemporal data. Here, he explains that because data isn’t always straightforward, it’s important to update it very carefully.
A central truth that we need to embrace to be successful as database developers is that not all data is as great as it could be (or as great as we might wish it to be). Sometimes we’re forced to work with incomplete or incorrect data and correct things later as a more complete picture of reality becomes available.
Modifying data in the database is simple enough in and of itself: issue a DML statement and the work is done. But in systems that require advanced logging, and reproducibility of reports between runs for auditing purposes, a straightforward update, insert, or delete may be counterproductive. Such a modification can destroy the possibility of re-creating the same output on consecutive runs of the same query.
To avoid a simple in-place update when data turns out to be invalid, some systems use the idea of offset transactions. An offset transaction uses the additive nature of summarization logic to fix the data in place. For example, assume that part of a financial reporting system has a table that describes customer transactions. The following is a highly simplified representation of what such a table might look like:
CREATE TABLE Transactions
(
    TransactionId INT,
    Customer VARCHAR(50),
    TransactionDate DATETIME,
    TransactionType VARCHAR(50),
    TransactionAmount DECIMAL(9,2)
)
On June 12, 2005, customer Smith deposits $500. But a teller makes a keying error that is not caught in time, and by the time the reporting data is loaded, the amount that makes it into the system is $5000:
INSERT Transactions VALUES (1001, 'Smith', '2005-06-12', 'DEPOSIT', 5000.00)
The next morning, the erroneous data is detected. Although the transaction row itself could be updated in-place, this would destroy the audit trail, so an offset transaction must be issued.
There are a few ways of handling this scenario. The first method is to issue an offset transaction dated the same as the incorrect transaction:
INSERT Transactions VALUES (1001, 'Smith', '2005-06-12', 'OFFSET', -4500.00)
Back-dating the offset fixes the problem in summary reports that group by any dimension (transaction number, customer, date, or transaction type), but loses the fact that the error was actually caught on June 13. Properly dating the offset record is imperative for data auditing purposes:
INSERT Transactions VALUES (1001, 'Smith', '2005-06-13', 'OFFSET', -4500.00)
Unfortunately, proper dating does not fix all of the issues, and it introduces new ones. After properly dating the offset, a query of the data for customer Smith for all business done through June 12 does not include the correction; only by including data from June 13 does the query return the correct total. And although a correlated query could be written to return the correct summary report for June 12, the data is in a somewhat strange state when querying ranges after June 12, e.g., June 13-15: the offset record is orphaned by any query that considers June 13 without also considering June 12.
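To make the orphaning concrete, consider a hypothetical summary query over the June 13-15 range (the range is arbitrary, chosen for illustration). It picks up the -$4500 offset but not the $5000 deposit the offset corrects, so Smith's total for the range is distorted:

```sql
-- Summary over June 13-15: the offset row dated June 13 is included,
-- but the $5000 deposit it corrects (dated June 12) is not, so Smith
-- appears to have a net -4500.00 for the range.
SELECT
    Customer,
    SUM(TransactionAmount) AS NetAmount
FROM Transactions
WHERE TransactionDate BETWEEN '2005-06-13' AND '2005-06-15'
GROUP BY Customer
```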
To get around these and similar issues, a bitemporal model is necessary. In a bitemporal table, each transaction has two dates: the actual date that the transaction took place, and a "valid" date, which represents the date that we know the updated data to be correct. The following modified version of the Transactions table shows the new column:
CREATE TABLE Transactions
(
    TransactionId INT,
    Customer VARCHAR(50),
    TransactionDate DATETIME,
    TransactionType VARCHAR(50),
    TransactionAmount DECIMAL(9,2),
    ValidDate DATETIME
)
When inserting the data for Smith on June 12, a valid date of June 12 is also applied:
INSERT Transactions VALUES (1001, 'Smith', '2005-06-12', 'DEPOSIT', 5000.00, '2005-06-12')
Effectively, this row can be read as “as of June 12, we believe that transaction 1001, dated June 12, was a deposit for $5000.” On June 13, when the error is caught, no offset record is inserted. Instead a corrected deposit record is inserted, with a new valid date:
INSERT Transactions VALUES (1001, 'Smith', '2005-06-12', 'DEPOSIT', 500.00, '2005-06-13')
This row indicates that as of June 13, transaction 1001 has been modified. But the important difference is that the transaction still maintains its correct date, so a report covering transactions that occurred on June 13 would not pick up this row; it belongs to June 12. In addition, this model eliminates the need for offset transactions. Rather than using an offset, queries should always find the last update for any given transaction within the valid range.
To understand this a bit more, consider a report run on August 5, looking at all transactions that occurred on June 12. The person running the report wants the most “correct” version of the data; that is, all available corrections should be applied. This is done by taking the transaction data for each transaction from the row with the maximum valid date:
SELECT
    T1.TransactionId,
    T1.Customer,
    T1.TransactionType,
    T1.TransactionAmount
FROM Transactions AS T1
WHERE
    T1.TransactionDate = '2005-06-12'
    AND T1.ValidDate =
    (
        SELECT MAX(T2.ValidDate)
        FROM Transactions AS T2
        WHERE T2.TransactionId = T1.TransactionId
    )
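Because this correlated subquery will be repeated in nearly every report against the table, it can be convenient to centralize it. One option (a sketch, not part of the original schema; the view name is hypothetical) is a view that exposes only the most recent valid version of each transaction:

```sql
-- Hypothetical convenience view: the latest valid version of each
-- transaction. Reports wanting the most "correct" picture can filter
-- this view on TransactionDate instead of repeating the subquery.
CREATE VIEW CurrentTransactions
AS
SELECT
    T1.TransactionId,
    T1.Customer,
    T1.TransactionDate,
    T1.TransactionType,
    T1.TransactionAmount,
    T1.ValidDate
FROM Transactions AS T1
WHERE T1.ValidDate =
(
    SELECT MAX(T2.ValidDate)
    FROM Transactions AS T2
    WHERE T2.TransactionId = T1.TransactionId
)
```

Note that such a view cannot serve the snapshot reports described next, since those need to cap the valid date at an arbitrary point in time.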
By modifying the subquery, it is possible to get "snapshot" reports based on data before updates were applied. For instance, assume that this same report was run on the evening of June 12. The output for Smith would show a deposit of $5000 for transaction 1001. To reproduce that report on August 5 (or any day after June 12), change the subquery so that it considers only rows valid on or before June 12:
SELECT
    T1.TransactionId,
    T1.Customer,
    T1.TransactionType,
    T1.TransactionAmount
FROM Transactions AS T1
WHERE
    T1.TransactionDate = '2005-06-12'
    AND T1.ValidDate =
    (
        SELECT MAX(T2.ValidDate)
        FROM Transactions AS T2
        WHERE T2.TransactionId = T1.TransactionId
            AND T2.ValidDate <= '2005-06-12'
    )
Note that in this case the subquery could have been eliminated altogether, with the search argument becoming AND T1.ValidDate = '2005-06-12'. However, the subquery is needed any time a range of dates is queried, so it's a good idea to leave it in place for ease of maintenance.
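To see why, here is a sketch of a snapshot report over a range of transaction dates (the June 10-12 range is arbitrary, chosen for illustration). Each transaction may have a different maximum valid date within the snapshot window, which a simple equality predicate on ValidDate cannot express:

```sql
-- Snapshot as of June 12 for all transactions dated June 10-12: each
-- transaction is matched to its own latest version valid on or before
-- the snapshot date, so the correlated subquery cannot be removed.
SELECT
    T1.TransactionId,
    T1.Customer,
    T1.TransactionType,
    T1.TransactionAmount
FROM Transactions AS T1
WHERE
    T1.TransactionDate BETWEEN '2005-06-10' AND '2005-06-12'
    AND T1.ValidDate =
    (
        SELECT MAX(T2.ValidDate)
        FROM Transactions AS T2
        WHERE T2.TransactionId = T1.TransactionId
            AND T2.ValidDate <= '2005-06-12'
    )
```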
Using this same pattern, data can also be booked in the future, before it is actually valid. It’s common when doing wire transfers, credit card payments, and other kinds of electronic funds transactions to be able to set the “posting date” on which the business will actually be executed. By working with the valid date, Smith can make a request for an outgoing transfer on June 14, but ask that the transfer actually take place on June 16:
INSERT Transactions VALUES (1002, 'Smith', '2005-06-16', 'TRANSFER', -1000.00, '2005-06-14')
Since the transaction date is June 16, a report dealing with transactions that occurred between June 1 and June 15 will not show the transfer. But a business manager can query on June 15 to find out which transactions will hit in the coming days or weeks, and audit when the data entered the system.
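Such a query might look like the following sketch, run on June 15 (the literal dates are illustrative; a real report would parameterize them): any row whose valid date has arrived but whose transaction date is still in the future is booked business that has not yet executed.

```sql
-- Transactions already entered into the system as of June 15 but not
-- yet executed: valid now, transaction-dated in the future.
SELECT
    TransactionId,
    Customer,
    TransactionDate,
    TransactionType,
    TransactionAmount,
    ValidDate
FROM Transactions
WHERE
    ValidDate <= '2005-06-15'
    AND TransactionDate > '2005-06-15'
```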
Modeling data bitemporally allows for an auditable, accurate representation of historical or future knowledge as data is updated. This can be tremendously useful in many scenarios, especially in the realm of financial reconciliation, where you may be forced to deal with back-dated contracts that change the terms of previous transactions, or with business booked in the future to avoid certain types of auditing issues.