It would be fair to say that there aren’t a huge number of programmability features added to SQL Server 2008 that will have a great impact on most of our day-to-day lives as SQL Server developers. The release was clearly manageability themed rather than programmability themed, and I can count the most interesting and far-reaching developer enhancements on a few fingers: the DATE data type is a great time-saver, TVPs are fun (but not nearly the game changer they could have been), and Change Tracking will certainly help with caching scenarios.
MERGE is a feature that initially made this list for me. I was really looking forward to using it and thought it would make life easier for many ETL tasks requiring “upsert” functionality. My first impression of the feature was that people wouldn’t love the syntax, and even now that I’m used to it I still think it leaves a bit to be desired. My second impression came when working on a large data load–hundreds of millions of rows–and I discovered that compared with the temp table approach I’d been using in the SQL Server 2005 version of the procedure, the MERGE performance was dismal. My first attempt showed a 20x decrease in performance (that’s twenty times slower, not percent). While spending a day and a half messing with the thing and experimenting with various combinations I saw as much as a 100x decrease in performance at one point, but after losing much blood and sweat, and having shed many tears, I finally managed to tune it to the point where it was only twice as slow as the previous methodology. And that’s when MERGE left my list of interesting features.
Open-minded kind of guy that I am, I didn’t completely dismiss MERGE, and it’s a good thing; it provides us with at least two features, independent of its ability to do “upsert”, that not only put it back on my list of interesting features, but right at the top–as the number one most important developer enhancement in SQL Server 2008.
The first feature is the ability to do joined updates, without encountering the potential data quality issues that the UPDATE FROM syntax can cause. Hugo Kornelis has already done a great job of describing that ability, so I won’t discuss it here. But please, read Hugo’s post before writing another UPDATE FROM. You’ll be happy you did.
The second enhancement is equally important, but I haven’t seen coverage of it in any of the blogs or other sources I read (aside from Ward Pond mentioning it in passing). The MERGE statement’s OUTPUT clause can do a special trick that other OUTPUT clauses cannot: it has the ability to output data not only from the “inserted” and “deleted” virtual tables, but also from either the source or destination table. This means that MERGE, even without its “upsert” capabilities, is surprisingly more powerful than the INSERT, UPDATE, or DELETE statements and can effect better logging, auditing, and more precise ETL processes. A true win-win.
But before we jump into MERGE’s capabilities in this area, a bit of background is in order for those who might not be well-versed in all of the T-SQL enhancements that have been added in the past two versions of SQL Server.
Background #1: The OUTPUT Clause
Added in SQL Server 2005, the OUTPUT clause gives DML constructs–INSERT, UPDATE, DELETE, and now MERGE–the ability to redirect the results of the DML either back to the caller as a rowset or to a table. This is accomplished by leveraging the same idea that we have in triggers: “inserted” and “deleted” virtual tables. When inserting data you have access to a virtual table called “inserted”, when deleting you have access to “deleted”, and when updating you have access to both tables. This is great for getting back IDENTITY values, among other things:
DECLARE @x TABLE ( x INT IDENTITY(1,1) NOT NULL, y INT NOT NULL ); INSERT @x ( y ) OUTPUT inserted.x, inserted.y SELECT 2 AS y UNION ALL SELECT 1; UPDATE @x SET y = y + 1 OUTPUT deleted.y AS old_y, inserted.y AS new_y; DELETE @x OUTPUT deleted.x WHERE y = 3;
In addition to simply getting back values, we can, as mentioned, also redirect them to a table:
DECLARE @x TABLE ( x INT IDENTITY(1,1) NOT NULL, y INT NOT NULL ); DECLARE @y TABLE ( x INT NOT NULL, y INT NOT NULL ); INSERT @x ( y ) OUTPUT inserted.x, inserted.y INTO @y SELECT 2 AS y UNION ALL SELECT 1; SELECT * FROM @y;
As an aside, what I would really love to be able to do is reroute the output into one or more scalar variables when I know that only a single row will be affected by the operation. If you agree, please vote on the Connect item I’ve filed asking for that capability.
I suspect that if you are a SQL Server developer and can not readily find–or have not already found–uses for the OUTPUT clause then you’re not getting challenged much in your day to day work, and it’s time to go look for a new job. On the projects I’ve done in the past few years I have found it to be one of the top few enhancements; it has really made all sorts of tasks much easier to do in SQL Server 2005 and 2008 than they were in SQL Server 2000. And that brings us to our next stop, the MERGE statement…
Background #2: Basic MERGE Support for the OUTPUT Clause
Prior to the release of SQL Server 2008, Itzik Ben-Gan wrote a massive white paper covering the various T-SQL enhancements we could expect with the new version. Included therein is a short section about MERGE, including some information on the OUTPUT clause. Itzik’s coverage is centered around the fact that MERGE can do something that could never be done before with a SQL Server DML statement: cause rows in a table to be inserted, updated, and deleted atomically, in a single statement.
To make life easier for those who want to write generic code, a special function was introduced that can be used in the OUTPUT clause with MERGE. This function, $action, outputs one of the following values to let you know which action affected tho row: INSERT, UPDATE, or DELETE. You can, of course, get this same functionality by looking at columns from both the inserted and deleted virtual tables and writing a CASE experssion, but it’s much easier to simply use a built-in function. The following example shows how the $action function works in practice:
DECLARE @x TABLE ( x INT IDENTITY(1,1) NOT NULL, y INT NOT NULL ); INSERT @x ( y ) SELECT 2 AS y UNION ALL SELECT 1; MERGE INTO @x AS tgt USING ( SELECT 2 AS y UNION ALL SELECT 3 ) AS src ON src.y = tgt.y WHEN MATCHED THEN UPDATE SET tgt.y = tgt.y + 1 WHEN NOT MATCHED THEN INSERT ( y ) VALUES ( src.y ) WHEN NOT MATCHED BY SOURCE THEN DELETE OUTPUT $action AS dml_action, inserted.x AS new_x, deleted.x AS old_x, inserted.y AS new_y, deleted.y AS old_y;
I should also mention here that SQL Server 2008 includes better support for the OUTPUT clause via a feature called “composable DML.” This is not a MERGE enhancement–it works with any DML statement–but it’s tangentially related and certainly of interest to people who might be reading this post.
The composable DML feature allows you to do something very strange indeed: treat DML statements that use the OUTPUT clause as derived tables, and use the result of the OUTPUT in an outer INSERT-SELECT. This is helpful when you need a bit more precision; for example, perhaps you want to populate a history table, but you’re only interested in archiving the data that’s been updated or deleted–the newly inserted data is already in the live table. To filter the results down using the composable DML feature you would do something similar to what is shown in the following example:
DECLARE @x TABLE ( x INT IDENTITY(1,1) NOT NULL, y INT NOT NULL ); INSERT @x ( y ) SELECT 2 AS y UNION ALL SELECT 1; DECLARE @history TABLE ( x INT, y INT NOT NULL, dml_action VARCHAR(6) NOT NULL, action_date_time DATETIME2(7) NOT NULL ) INSERT @history ( x, y, dml_action, action_date_time ) SELECT x.old_x, x.old_y, x.dml_action, SYSDATETIME() FROM ( MERGE INTO @x AS tgt USING ( SELECT 2 AS y UNION ALL SELECT 3 ) AS src ON src.y = tgt.y WHEN MATCHED THEN UPDATE SET tgt.y = tgt.y + 1 WHEN NOT MATCHED THEN INSERT ( y ) VALUES ( src.y ) WHEN NOT MATCHED BY SOURCE THEN DELETE OUTPUT $action AS dml_action, deleted.x AS old_x, deleted.y AS old_y ) AS x WHERE x.dml_action IN ('UPDATE', 'DELETE'); SELECT * FROM @history;
Note that as of SQL Server 2008 the results of the outer operation must be an INSERT-SELECT. That’s a rather frustrating limitation, and I really hope the SQL Server team loosens the restrictions in the next version of the product.
Payload: MERGE, OUTPUT, and Access to Source and Destination Columns
Now that the groundwork has been laid–MERGE, OUTPUT, OUTPUT with MERGE (or is it the other way around?), and composable DML–we can finally get to the impetus behind this post, which is based on two factors. First of all, MERGE adds more power to the OUTPUT clause than just the $action function; it also allows us to access source and destination columns that we can’t get to using INSERT, UPDATE, or DELETE. And second, MERGE can be used for standalone inserts, updates, or deletes. This means that it doesn’t matter if we can’t get to the data using the other three statements; we have MERGE and it can be those other statements for us when we need it to.
So what’s so great about the ability to access source columns that you’re not using as part of the data modification? Let’s take a look at a simple example. Here’s the scenario: You’ve been asked to load a flat file into the database. Each line of this flat file contains information about a person: a name, current salary, a list of up to three companies of employment, and a list of up to three childrens’ first names, all in comma-delimited format. Here are the tables we want to load the data into:
USE tempdb GO CREATE TABLE people ( name VARCHAR(100) NOT NULL, current_salary INT NOT NULL, person_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY ) GO CREATE TABLE previous_companies ( company_name VARCHAR(100) NOT NULL, person_id INT NOT NULL REFERENCES people (person_id), PRIMARY KEY (company_name, person_id) ) GO CREATE TABLE childrens_names ( child_name VARCHAR(100) NOT NULL, person_id INT NOT NULL REFERENCES people (person_id), PRIMARY KEY (child_name, person_id) ) GO
And here is some sample data:
After taking a quick glance at this data I hope you can see right away that we have some issues. Looking at the first and third lines you can see that we have duplicate names–Joe Smith–and therefore the name is not a good candidate for a primary key. Salary, with this small set, will work fine if combined with name, but of course it’s well within the realm of possibility that two people can have the same salary, especially when working for the same company as both of our Joe Smiths do. We could add childrens’ names to the mix, but of course there is even no guarantee that two people with the same name, working for the same company and making the same salary, won’t have a child with the same name. I’m sure it’s happened before and will happen again.
We’ve been reassured that each line really does represent a unique person, so we have to make it work. No problem once we’re in the database–we can deal with the issue using the surrogate key on the “people” table–but how should we handle the data on the way in?
Assuming that we’re doing all of this processing in T-SQL, the logical process is straightforward: we’re going to have to insert the names and salaries into the “people” table, get back one person_id per name, and then do the inserts first into previous_companies and then into childrens_names. Again, this wouldn’t be a big deal with the small sample data set here–we could, after inserting the names and salaries, join those back to the input data set (assume that it’s in a temp table at this point) in order to map back the surrogate keys. And that would work well enough; but as more columns are added this will get trickier and trickier, and more prone to the introduction of a nasty bug that will cause the keys to be mismapped. There must be a better way–and thanks to MERGE, there is.
The solution to this problem requires a couple of steps. First create a surrogate key on the incoming data, separate from the surrogate key that will eventually be assigned when rows are inserted into the “people” table:
CREATE TABLE #data ( name VARCHAR(100), salary INT, company1 VARCHAR(100), company2 VARCHAR(100), company3 VARCHAR(100), child1 VARCHAR(100), child2 VARCHAR(100), child3 VARCHAR(100), input_surrogate INT IDENTITY(1,1) NOT NULL PRIMARY KEY ); INSERT #data VALUES ('Joe Smith',110000,'Microsoft',NULL,NULL,'John','Sue','Ed'), ('Jane Baker',50000,'Oracle','Teradata',NULL,'George',NULL,NULL), ('Tom Jones',0,NULL,NULL,NULL,'Anne','Cathy',NULL), ('Joe Smith',100000,'Microsoft',NULL,NULL,'Steve',NULL,NULL), ('Kate Davis',150000,'Google','Oracle','Microsoft','Don','Amy',NULL);
Once we have this input_surrogate populated, we need to be able to access the values in order to match them to the surrogate generated during the insert to the “people” table. And this is where the power of MERGE kicks in:
CREATE TABLE #surrogate_map ( input_surrogate INT NOT NULL, person_id INT NOT NULL ); MERGE INTO people AS tgt USING #data AS src ON 1=0 --Never match WHEN NOT MATCHED THEN INSERT ( name, current_salary ) VALUES ( src.name, src.salary ) OUTPUT src.input_surrogate, inserted.person_id INTO #surrogate_map;
In this example I’ve forced the MERGE statement to behave as an INSERT statement by using a predicate of 1=0; since this will never match on anything, the WHEN NOT MATCHED clause will fire for every row, thereby causing that row to be inserted.
The key element on the OUTPUT side of the equation is the seemingly-innocent third-to-last line. I’m using “src.input_surrogate”–a column that is not used as part of the INSERT–as part of the OUTPUT clause. This little feature enables us to reliably map the person_id surrogate to the input_surrogate without playing any games or worrying about complex logic or coding bugs. Simply stated, it just works. And the rest of the ETL process, from here on out, is simple enough, thanks to this mapping, that I won’t have to bore you with it.
This OUTPUT feature of MERGE is not only useful for INSERTs and mapping of keys. Any time you’re doing DML based on joins between tables, you might consider using this feature to help with logging. For example, you might want to delete some rows from one table by matching the rows with another table based on a range predicate. Want to know which rows matched and caused deletions to occur? No problem:
MERGE INTO people AS tgt USING ( SELECT 100000 AS salary UNION ALL SELECT 50000 AS salary ) AS src ON tgt.current_salary BETWEEN src.salary AND src.salary * 1.10 WHEN MATCHED THEN DELETE OUTPUT src.salary AS source_salary, deleted.person_id, deleted.current_salary;
Prior to MERGE, properly doing this kind of thing would have required much more complex code: Most likely you would have to do the work in at least two steps, first finding potential matches and storing them in a temporary table, then going back to do the DELETE. MERGE and OUTPUT enables all of this work to be done in a single statement, and while this may or may not be more efficient (see my notes at the beginning of this post) it is definitely more elegant–and in most cases maintainability is more important when performance differences aren’t extreme.
There are a number of takeaways here. First of all, MERGE used as an “upsert” tool may or may not be the best choice. In my experience performance differences will be extreme enough to warrant not using it for that purpose. But I encourage you to give it a try and not rely on my experiences as a guide; and after testing, I would definitely like to hear what kinds of results you’re getting, especially if you’re migrating legacy code and testing both methods. Second, MERGE is not only an “upsert” tool. It can be used for INSERTs and DELETEs, and as Hugo showed in the post I linked to, it is the number one choice for UPDATEs when you need to do the update based on a join. Third, thanks to the power of the OUTPUT clause, MERGE is incredibly useful in the areas of ETL and logging–and probably many others that I haven’t considered yet.
I’ll conclude by saying that a lot of times the true power of a new feature can be subtle and difficult to identify without spending some time thinking and experimenting. As end-users we’re often quick to dismiss something we don’t immediately understand and while that’s probably not entirely unfair in a lot of cases, in this one it was. I initially cast MERGE aside but thanks to being somewhat open-minded I now realize that it is actually quite powerful when used in the correct context. I have already used it several times in situations similar to those described in this post and look forward to a long relationship with this new DML construct.