Home Uncategorized SQL University: Parallelism Week – Part 3, Settings and Options

    SQL University: Parallelism Week – Part 3, Settings and Options

    735
    14

    Congratulations! You’ve made it back for the the third and final installment of Parallelism Week here at SQL University. So far we’ve covered the fundamentals of multitasking vs. parallel processing and delved into how parallel query plans actually work. Today we’ll take a look at the settings and options that influence intra-query parallelism and discuss how best to set things up in various situations.

    Instance-Level Configuration

    Your database server probably has more than one logical processor. As a matter of fact, as of the time of this writing–May, 2010–I would be shocked to see any new piece of hardware requisitioned for use as a “database server” that contains fewer than eight cores. Processing power is relatively plentiful and thanks to SQL Server’s socket-based licensing scheme and the prevalence of multi-core CPUs, it’s almost ridiculously cheap to put together a server with a huge number of logical processors.

    So there you are with your oh-so-powerful server. You’ve done your disk configuration, set up Windows appropriately, and installed SQL Server. You’re almost done! But before you mark this project as finished, it’s important to ponder the impact of a few instance-level settings that may impact parallelism:

    • Affinity Mask controls which logical processors are used by the SQL Server instance. This may not be considered a parallelism setting per se, but it’s an important part of the story. Setting this option to too low a number obviously restricts what SQL Server can do in parallel. But it’s important to consider this option, especially for servers running multiple SQL Server instances.
    • Maximum Degree of Parallelism dictates, at an instance level, the maximum number of logical processors that will be used by a given query. More on this in the next section.
    • Cost Threshold for Parallelism is the primary threshold used by the query optimizer in determining whether it should consider parallel plans as part of the optimization process. More on this in the How Much Does it Cost section.

    I won’t discuss Affinity Mask any further at this point, except to mention that the sp_configure “affinity mask” setting is listed in Books Online as a deprecated feature. It’s been replaced with an option called ALTER SERVER CONFIGURATION SET PROCESS AFFINITY. As far as I can tell, the end result is the same.

    98 Degrees

    (Because how often, really, do you get to name a section of a technical article after what is described in Wikipedia as an “adult contemporary boy band?” And doesn’t “boy band” eliminate the possibility of said band making “adult” music? Or the other way around? But I digress…)

    Maximum Degree of Parallelism (MAXDOP) seems simple enough. You set it at the instance level and it controls the maximum number of logical processors that a given query can use, assuming that the query hasn’t overridden the setting. But here things start getting trickier. First there’s the question of what the “correct” setting actually is. And then there’s the monitoring question, and the confusion around tasks vs. processor utilization. I’ll clarify both of these as best I can.

    How do you set the thing? The default value is “0,” which means “use up to as many logical processors as have been assigned to the SQL Server instance by the Affinity Mask option.” This is, in my opinion, one of a few default settings that the SQL Server team really got wrong. SQL Server instances are usually used by a lot of people at once, and one of our goals as DBAs or DB developers is to maximize concurrency. This means getting all of the queries that are running at the same time to finish as quickly as possible, and that means being able to distribute the available resources in a relatively even manner.

    Processor time is one of the core resources that every query needs to use, and if a decent number of queries are each using every core simultaneously, processor time is going to quickly become the most limited resource on the system. In Part 1 of this week’s series I mentioned “thrashing,” and if you have 10 queries active on your system, each of which is using all 16 cores, thrashing is exactly what you’ll have. Each query’s tasks will spend as much–or more–time switching on and off of schedulers, and waiting to get back onto schedulers, as actually doing work. And that’s clearly what you don’t want to have happen in a highly concurrent environment.

    A lot of articles recommend an instance-wide MAXDOP setting of 1/2 of the number of logical processors allocated to the instance. I think that’s a good starting point, especially for instances supporting data warehouses and mixed workloads, but it’s certainly not the final answer. Some articles I’ve read recommend never going above a value of “8,” and I think that’s too Draconian a prescription. And a lot of people running OLTP workloads like to set the value to “1.” Which can also be a bit severe.

    The fact is, too much parallelism will hurt concurrency, and too little will hurt the performance of big queries. So how do you figure out what setting to use? There’s really only one surefire way: baselining, preferably in conjunction with load testing. Run a load test–or look at your production system’s metrics if you have no test environment–and establish a CPU and query performance baseline. Change the MAXDOP setting and try again. Rinse and repeat until you find something optimal.

    For OLTP workloads, a setting of “1” is, indeed, quite often a good idea for most queries. Unfortunately, this value is problematic because the MXDOP setting also controls non-query operations such as index rebuilds. You usually want your index rebuilds to process in parallel whenever possible because they’ll finish more quickly that way, thereby freeing resources for your users.

    Index operations and individual queries can override the instance-level MAXDOP setting by using the MAXDOP query hint. (A very well-named hint!) So on an instance with the sp_configure “maximum degree of parallelism” setting set to “1,” I could tell the query optimizer to use up to 10 logical processors for the following query by using the hint:

    
    SELECT *
    FROM Tbl
    WHERE
      Col > 5
    OPTION (MAXDOP 10) 

    Since you can override the setting on an individual query basis, the instance-wide value should be set in such a way as to positively influence the greater majority of queries running on the instance. But how do you find those that need special attention via hints? My friend Jimmy May pointed out the solution to me in a recent conversation: run a load test with a non-zero instance-level MAXDOP setting, baseline performance metrics for each query, then flip the value to “1,” test again, and compare the results. Or the other way around. The queries which perform better using the setting you don’t want to set at the instance level are the ones that need a hint. Simple enough!

    “I have MAXDOP set to ‘4.’ Why is my query spinning up 13 tasks?” As you become more familiar with how to monitor SQL Server you will invariably start looking at the sys.dm_os_tasks DMV and will quickly notice that MAXDOP does not seem to directly impact the number of tasks that a given request can use. This can be a bit confusing, since you might expect that a request using more tasks equates to a request using more logical processors. But that’s not exactly how it works. In truth, a query’s degree of parallelism determines how many tasks–at most–each individual iterator can spin up. And several iterators can run concurrently at any given time during a request’s life cycle. The setting also determines the maximum number of schedulers all of those tasks can–by virtue of their associated workers–be bound to.

    A query with a degree of parallelism of “4” that is using 13 tasks most likely has three active iterators, each of which is using four tasks. So why 13 and not 12? Because there is always a task dedicated to coordinating the other tasks. In sys.dm_os_tasks and other task-related DMVs it will be the task with an exec_context_id (execution context identifier) of “0”. The others will have a non-zero identifier.

    By the way, just because you ask for a given degree of parallelism doesn’t meant you’ll get it. The query processor may use a DOP lower than that which you requested. To find out the actual degree of parallelism for your query, take a look at the DegreeOfParallelism attribute of the QueryPlan node in the showplan XML.

    How Much Does it Cost?

    SQL Server uses cost-based optimization to determine how best to process your queries. This means that the optimizer considers a number of potential query plans and assigns each a cost, or a score that acts as a relative means by which to compare different plans. This number is calculated by using a fixed price for the estimated number of times every given operation that is a part of the plan is expected to take place in order to satisfy your query. This same cost-based model is used to figure out whether to process your query serially or in parallel.

    Query optimization is an expensive process, so it happens in phases. Each phase takes a bit longer than the previous and during the latter phases the query optimizer considers more complex plans that are more difficult to cost. Parallel plans are certainly more complicated than serial plans, and so are reserved for the later phases of the optimization process. And this is where the Cost Threshold for Parallelism setting comes into play.

    The parallel plan selection process can be described fairly simply: During the early parts of the optimization process, the optimizer considers serial plans. Should these plans exceed the set cost threshold, parallel plans may be considered at later phases of the process. If one of the parallel plans is cheaper than the serial version, it’s used.

    Modifying this setting can get a bit confusing, since it’s difficult to see from parallel plans what impact changing the setting will have. If your parallel plan has an estimated cost of “6.58,” setting the Cost Threshold to “7” may or may not cause the plan to be processed serially. What matters is not the parallel cost, but rather the cost of the alternative, serial version of the plan. And to discover that you must either change the instance-level MAXDOP setting, or use the MAXDOP query hint, to see how the estimated costs change.

    The default value of this setting is “5,” and tweaking it is done in a similar fashion to what I described for the instance-level MAXDOP setting–with the exception that Cost Threshold for Parallelism cannot be overridden on per-query basis. For this reason it’s even more important that you get it right, and many DBAs don’t touch it at all. An interesting way that it can be used is by setting it to a higher-than-default value (some articles suggest using a value of “20” or higher), as an alternative to an instance-wide MAXDOP setting of “1.” This guarantees that only queries estimated to be significantly expensive will be parallelized, and index rebuilds and other non-query processes will still be able to make use of parallelism without having to be overridden.

    Parallelism Inhibitors

    To finish things up I would like to point out one of the most frustrating parts of the parallelism game. You have your server configured the way you think it should be, you’ve written your massive query, and you’re ready to scale it across multiple CPUs. But alas, the optimizer just won’t generate a parallel plan. What’s going on?

    Assuming that the query’s estimated cost is higher than than Cost Threshold and that the MAXDOP setting isn’t “1,” you’re most likely encountering one of the various features that inhibit parallelism. The list below is taken from a presentation written by Craig Freedman, a member of the SQL Server query processor team.

    The following features force the entire plan to be processed serially:

    • All T-SQL UDFs*
    • CLR UDFs with data access
    • Miscellaneous built-ins such as:
      • OBJECT_ID(), ERROR_NUMBER(), @@TRANCOUNT, …
    • Dynamic cursors

    The following features force a “serial zone”, or a part of the plan to be processed serially even though other parts can be parallelized:

    • System table scans
    • Sequence functions
    • TOP
    • “Backward” scans
    • Recursive queries
    • TVFs*
    • Global scalar aggregate
    • Multi-consumer spool

    * By “UDFs,” Craig meant the scalar variety. And by “TVFs,” he meant the multi-statement flavor. Inline TVFs do not have this issue.

    If you’re having problems getting your query to go parallel, take a step back. Double-check the instance-wide settings. Double-check your query. Are you doing anything on the list above? If not, break your query into smaller component parts. Can you get them to go parallel? Are other queries operating in parallel as expected?

    This can be both tricky and frustrating, but be patient–you’ll get there. And your patience will be rewarded. When properly leveraged, parallelism makes a huge difference, allowing some queries to perform several times better than they do when processed serially.

    An End, and A Beginning

    When I was approached by Jorge Segarra and asked to write a week’s worth of posts, I chose a topic that I happened to be studying at the time and thought that it would be pretty easy to dash out a few articles. After investing over 12 hours in writing these three posts, I have to admit that I only covered around half of the topics I was hoping to, and some subjects at a much higher level than I originally intended.

    So while this is the end of my SQL University Parallelism Week contribution, you can expect several more posts on the topic of parallelism from me in the coming months. I have some very interesting things to share, and I hope you’ll keep reading. This topic, at least for me, gets more and more interesting the more I learn about it, and we’ve only scratched the surface.

    Acknowledgments

    I would like to thank the following people, each of whom who helped me in some way, either directly or indirectly, as I researched and wrote this series of posts:

    Jimmy May, Paul White, Craig Freedman, Conor Cunningham, Mladen Prajdic

    And thank you for reading. Enjoy, and as usual please let me know what points I can clarify.

    Previous articleSQL University: Parallelism Week – Part 2, Query Processing
    Next articleT-SQL Tuesday #007 and T-SQL Tuesday Has a Logo
    Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.

    14 COMMENTS

    1. Great series, thanks for posting it.
      Something to keep in mind is that you can use the Resource Governor to specify a MAXDOP for a Workload Group. Which will override the system default, and set a maximum for the MAXDOP hint in queries. I think that it would be a good way to go for OLTP systems, where you can put the OLTP transactions in their own workgroup with MAXDOP set to a low number, and still allow other queries to take advantage of the parallelism.

    2. Michael, great point on RG. I had that on the list of things to discuss, but it fell into the 50% of topics I wasn’t able to hit in just three blog posts. And since then I’ve thought of even more stuff. It’s amazing how deep this one area can go.

    3. Adam,
      Does the information presented here apply to all versions of SQL evenly? As in 2000, 2005, 2008 and R2?
      We have all flavors in our shop. Fortunately most of mine are 08 and 08 R2, but I do have a few 05 and one project is still 2000. That one will always be around even after upgrading (SOX sucks). Anyway. Just wondering if the "tweeking" is more efficient in new version as compared to old.
      Thanks for the articles. Great foundation to expand on. Looking forward to more articles on this subject.
      Cheers!

    4. Hi AJ,
      2005, 2008, and 2008 R2. Most if not all of it probably applies to 2000 as well but to be honest I don’t really remember. It’s been six years since I’ve done anything in-depth with a 2000 instance.

    5. Hi Adam,
      first off all thanks for your Blog, I learned a lot here.
      Today, I was Troubleshooting a query that was not going parallel. I saw no inhibitors, then went back to check with the list. But nothing from there in my query.
      I found out: Joining with a table on another SQL Server (linked) also disabled parallelism. When I copied the table to local tempdb, I got a parallel plan. The maxdop setting on the linked server (was 1 in my case) is not the cause, I tried another instance with maxdop 2, but still no parallel plan.
      Do you know this and is it a general rule?
      Maybe thats interesting, because I think its not obvious from the list above.

    6. One more info: It was not about cost of serial plan vs parallel plan, I tried querytraceon 8649.
      Greetings
      Daniel

    7. @Daniel
      Yes, linked servers do indeed inhibit parallelism — and a temp table is probably the best general case workaround. (I have another workaround, involving CLR, but you probably don’t want to go there.) And you’re right, I didn’t mention that in the list above — and I just checked some other lists around the Web and linked servers aren’t there either. Good catch!
      –Adam

    8. Greetings and thank you for teaching the parallel pre-con in Chicago this year.
      >I have MAXDOP set to ‘4.’ Why is my query spinning up 13 tasks?<
      We had a query spawning 17 threads on maxdop 2. After opening a case with Premier Support, the MS support engineer responded as follows.
      "… , you have 10 logical CPU’s per numa node. In this case, if we have max dop at 2 then we can span both numa nodes and could see up to 20 ecids …"
      This definition of maxdop seems to conflict with that above, and seems to make a specific maxdop number > 2 largely irrelevant. Is he all wet? Comments?
      (SQL 2008, 4 socket, 10 cores per socket, hyper-threaded, Instance using 40 cores of 80 available)

    9. @Dave
      My opinion is that your Microsoft support person is completely full of the excrement of a male bovine.
      But don’t trust me on this without proof. Open up the query plan. How many Repartition Streams and Gather Streams iterators can you count? Each one represents the boundary for a parallel zone. And each zone, potentially, can use 2 unique threads. (Sometimes the threads can be reused between zones.) 8 zones and you have, potentially, 17 threads. (16 + a parent.)
      Let me know 🙂
      –Adam

    10. You are correct, sir. The run-away query plan contained 10 Repartition Streams iterators and one Gather Streams iterator.
      We knew the query needed tuning, but the reason we opened the support case was that the queries NEVER completed or even consumed resources. They would cause an internal deadlock between different threads of the same query. SQL did not choose a deadlock victim and kill it. The queries (up to 60 concurrent copies) stayed in an active suspended state as shown by WhoIsActive, consuming threads until all instance workers were consumed. They had to be manually killed or the instance restarted. MS did not admit this was a bug, but just said to tune the query.
      Thanks for your insight.
      — Dave

    11. @Dave
      Internal deadlocks are resolved using a tempdb spill — no reason to kill the query. In any case it does sound like some tuning would be ideal, but I agree that there is probably a bug there too.
      –Adam

    Comments are closed.