“Lonely but free I’ll be found
Drifting along with the tumbling tumbleweeds”
– Supremes, “Tumbling Tumble Weeds”
Welcome to the first installment of what I hope will be a regular feature on this blog, Anti-Patterns and Malpractices. As a consultant, I get the honor of seeing a lot of different systems, with a lot of different code. Some of it is good, and some of it — well — I’ll be featuring that which is not so good here. No names will be named, and code will be changed to protect the not-so-innocent; my goal is not to call out or embarrass anyone, but rather to expose those misguided patterns and practices which inevitably lead to problems (and a subsequent call to a consultant; perhaps if I post enough of these I’ll have fewer less-than-appealing encounters in my work!)
The topic du jour is the Tumbling Data Anti-Pattern, a name coined by my friend Scott Diehl. Much like the tumbleweed lazily blowing around in the dust, data which exhibits this pattern is slowly and painstakingly moved from place to place, gaining little value along the way.
So what exactly typifies this particular anti-pattern? Consider the following block of T-SQL, designed to count all married employees in the AdventureWorks HumanResources.Employee table and bucket them into age ranges of 20-35 and 36-50, grouped by gender. Employees older than 50 should be disregarded:
--Find all married employees SELECT * INTO #MarriedEmployees FROM HumanResources.Employee WHERE MaritalStatus = 'M' /* select * from #marriedemployees where employeeid = 20 */ --Find employees between 20 and 35 SELECT EmployeeId INTO #MarriedEmployees_20_35 FROM #MarriedEmployees WHERE DATEDIFF(year, birthdate, getdate()) BETWEEN 20 AND 35 --Find employees between 36 and 50 SELECT EmployeeId INTO #MarriedEmployees_36_50 FROM #MarriedEmployees WHERE DATEDIFF(year, birthdate, getdate()) BETWEEN 36 AND 50 --Remove the employees older than 50 DELETE FROM #MarriedEmployees WHERE EmployeeId NOT IN ( SELECT EmployeeId FROM #MarriedEmployees_20_35 ) AND EmployeeId NOT IN ( SELECT EmployeeId FROM #MarriedEmployees_36_50 ) --Count the remaining employees SELECT e.Gender, COUNT(*) AS theCount INTO #Employee_Gender_Count_20_35 FROM #MarriedEmployees e JOIN #MarriedEmployees_20_35 m ON e.EmployeeId = m.EmployeeId GROUP BY e.Gender --select * from #Employee_Gender_Count_20_35 SELECT e.Gender, COUNT(*) AS theCount INTO #Employee_Gender_Count_36_50 FROM #MarriedEmployees e JOIN #MarriedEmployees_36_50 m ON e.EmployeeId = m.EmployeeId GROUP BY e.Gender --Get the final answer SELECT a.Gender, a.theCount AS [20 to 35], b.theCount AS [36 to 50] FROM #Employee_Gender_Count_20_35 a JOIN #Employee_Gender_Count_36_50 b ON b.Gender = a.Gender
This kind of code tells us several things about the person who wrote it. Rather than thinking upfront and designing a complete solution or at least a game plan before typing, this person appears to have thought through the problem at hand in a step-by-step manner, coding along the way. A bit of debugging was done along the way, but the real goal was to spit out an answer as quickly as possible (or so it seemed at the time). No attempt was made to go back and fix the extraneous code or do any cleanup; why bother, when we already have an answer?
It’s important to mention that this is a simple example. I generally see this anti-pattern exploited when developers are tasked with producing large, complex reports against data sources that aren’t quite as well-designed as they could be. In an attempt to preserve sanity, the developer codes each tiny data transformation in a separate statement, slowly morphing the data into the final form he wishes to output. The resultant scripts are often thousands of lines long and take hours to run, during which time the server crawls (and throughout the office you can hear people muttering “the server sure is slow today”).
The solution to this problem is simple, of course, and the best software engineers do it automatically: Before writing a line of code sit back for just a moment and consider your end goal. Do you need to work in steps, or will a single query suffice? Can various join conditions and predicates be merged? Perhaps a Google search is a good idea; what is the best way to produce a histogram without a temp table?
The hurried atmosphere of many companies leads to a “get it done right now–even if it’s far from perfect” attitude that ends up wasting a lot more time than it saves. The above example code block took me around 10 minutes to put together. A single-query version took me under two minutes to code. It is less than a third of the length, runs approximately 500 times faster, and uses 0.4% of the resources. All because I spent a couple of moments reflecting on where I was going before I took the first step.
If you find yourself exploiting this anti-pattern, step back and question whether this code will have a life beyond the current query window. If it will ever be checked into source control, it’s probably a good idea to go back and do some cleanup.
If you find yourself tasked with maintaining code that looks like what I’ve posted, my suggestion is to simply re-write it from scratch. I was recently faced with a script containing over 2000 lines of this kind of thing, and I spent almost two days slowly working my way through the mess trying to make sense of it. On the evening of the second day, after talking with some of the shareholders, I realized that it was actually a simple problem to solve. One hour later and I had a new, totally functional solution — a couple of hundred lines long, and several orders of magnitude faster. Sometimes it’s best not to wade through a muddy puddle, when you can simply hop right over.