How to Screw Up Database Concurrency (Guest Post by Linchi Shea)
We’re delighted to welcome Linchi Shea as a guest blogger on DataEducation.com! Linchi is a longtime SQL Server MVP and works at a major Wall Street firm as the SQL Server technical lead. This NYC-based database guru is also very active with many SQL Server communities, so you may have seen him around.
You can screw up database concurrency easily in numerous ways, and people have written extensively about the subject, literally since the very first database came into existence. But it is always worth highlighting a really simple and silly screw-up when you see it in an application supposedly designed for high concurrency.
Recently, I came across an application grid made up of several application servers running essentially the same process. Among other things, they all connected to the same SQL Server instance and executed the same stored procedure to persist the messages from the application grid cache into a common message table. With every application server spawning about 60 database connections, and with seven or eight application servers in the grid, we had roughly 400 to 500 database connections in total from the grid at any given time.
From the very beginning of this application, the database connections would block each other constantly. But since the data volume was initially low, the database performance could keep pace with the application grid throughput as a whole, and the developers weren’t bothered by persistent blocking on the database. Then the volume started to pick up significantly, the database suddenly became a serious bottleneck that drew everyone’s attention, and I got the call to look at what was going on.
Note that each insert from these grid connections handled a different message with a different primary key value. Intuitively, they shouldn’t have blocked each other, right? Well, it turns out that not only did they block each other, but the connections were essentially serialized, defeating the very purpose of using an application grid for high concurrency.
Here was the crux of the problem, or at least a simplified account of it.
As mentioned, these grid connections all called the same stored procedure to populate the message table with the data cached on the grid. The stored procedure did more than a simple insert: it included a user-defined transaction covering multiple DML statements, the very first of which was an INSERT statement into the message table.
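The offending procedure looked roughly like the following sketch. The table, column, and procedure names (other than abc_id) are hypothetical, but the shape is the pattern at issue: a user-defined transaction whose first statement inserts into the message table.

```sql
-- Hypothetical reconstruction of the problematic pattern; object names
-- other than abc_id are invented for illustration.
CREATE PROCEDURE dbo.PersistMessage
    @Payload varchar(max)
AS
BEGIN
    BEGIN TRANSACTION;

    -- Every connection inserts the same placeholder value into abc_id,
    -- a column covered by a nonclustered index.
    INSERT INTO dbo.MessageTable (Payload, abc_id)
    VALUES (@Payload, 0);

    -- ...additional DML statements for the same message...

    -- In the real system, the commit was also deferred so that hundreds
    -- of messages were batched into a single transaction.
    COMMIT TRANSACTION;
END;
```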
A look at the waitresource on the blocked spids showed that it was of the format KEY: &lt;db_id&gt;:&lt;hobt_id&gt; (index_key_hash_value). More importantly, the hobt_id and the index key hash in the parentheses remained the same across all the blocked spids. The hobt_id led us to a nonclustered index on an integer column called abc_id.
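The kind of diagnostic query that surfaces this is sketched below. This is not the exact query used in the investigation, but sys.dm_exec_requests exposes the wait_resource column in the KEY format described above, and sys.partitions can map a hobt_id back to its table and index:

```sql
-- Find blocked sessions and their wait resources,
-- e.g. KEY: <db_id>:<hobt_id> (index_key_hash_value).
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_resource
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;

-- Map a hobt_id from wait_resource back to a table and index.
-- Replace 72057594039828480 with the hobt_id actually observed.
SELECT o.name AS table_name,
       i.name AS index_name
FROM sys.partitions AS p
JOIN sys.objects AS o
    ON o.object_id = p.object_id
JOIN sys.indexes AS i
    ON i.object_id = p.object_id
   AND i.index_id  = p.index_id
WHERE p.hobt_id = 72057594039828480;
```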
Initially, I speculated that we were simply having bad luck, with distinct values in the abc_id column hashing to the same index key hash value. But the cause turned out to be much simpler and sillier. Weeding through the stored procedure code, I discovered that, to meet a business requirement, the abc_id column needed to be assigned after the message had been inserted (probably because it had to receive the same value as another identity column). So the developer decided to insert a default value, then grab the identity value and update the abc_id column with it in the same user-defined transaction.
The result of that decision was that all the grid connections were inserting the same value into an indexed column. Naturally, they all ended up contending for the same index key hash value, which SQL Server had to keep locked until the transaction was either committed or rolled back. This essentially serialized all the connections for the duration of their user-defined transactions, ruining concurrency completely.
But that wasn’t the worst of it. The user-defined transaction was there not just to make the multiple DML statements for the same message an atomic unit, but also to batch up hundreds or more messages per commit on the database, so as not to commit a single message at a time. In most other cases this would be an excellent design choice, since larger and less frequent transaction commits are generally better for transaction logging performance than smaller and more frequent ones. In this scenario, unfortunately, it dramatically worsened the problem, making the blocked spids wait far longer, especially those at the bottom of the wait list.
Once the root cause was identified, the fixes were rather straightforward, each requiring only a minor change to the stored procedure code. The key to any fix, of course, was to ensure that different connections didn’t insert the same value into the abc_id column.
One solution was to use a random integer value as a placeholder upon insert. Another was to insert @@SPID into the abc_id column, again as a temporary placeholder; since the connections all had different spid values, there would be no collision and therefore no blocking.
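The @@SPID fix might look like the following sketch, again with hypothetical object names. SCOPE_IDENTITY() returns the identity value generated by the insert, so the subsequent update can backfill abc_id with it just as the original code did; the only change is the placeholder value:

```sql
-- Hypothetical sketch of the fixed insert/update sequence; object names
-- other than abc_id are invented.
INSERT INTO dbo.MessageTable (Payload, abc_id)
VALUES (@Payload, @@SPID);   -- @@SPID is unique per connection, so no key collision

-- Backfill abc_id with the identity value, per the business requirement.
UPDATE dbo.MessageTable
SET abc_id = SCOPE_IDENTITY()
WHERE MessageId = SCOPE_IDENTITY();
```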
This experience drives home the point that concurrency-impacting issues such as persistent blocking should not be ignored just because they aren’t currently crippling the application. They may surface when it matters most.