I’m not sure how many times over the last several years I’ve seen the same tired article titles… “Climbing Trees in SQL,” “Climbing Up the SQL Tree,” or maybe, “Naked Coeds Playing in the Trees!” … Oh wait, I think that last one might be something else.
But anyway, the point is, I’m going to adhere to that standard. But I’m adding a Tarzan-esque theme to this post, because we’re not going to just climb a tree. We’re going to swing about between trees. Different types of tree representations are appropriate for different scenarios. Which is why, as I pointed out in my recent SearchSQLServer article on recursive Common Table Expressions, we have so many different ways of representing them. Adjacency Lists, Materialized Paths, Nested Sets, and Nested Intervals spring to mind. And there are probably others.
My article shows how to use the CTEs, in conjunction with a dynamically generated materialized path, to manipulate an Adjacency List, getting many of the benefits associated with using the Nested Sets Model. And that’s great. But the Nested Sets Model itself might be useful in your endeavors. So in this post, I will show how to extend the CTE discussed in that article.
The end-result CTE in the article, which can be run in the AdventureWorks database, looks something like this (renamed for the sake of this post):
WITH EmployeeLevels AS ( SELECT EmployeeId, CONVERT(VARCHAR(MAX), EmployeeId) AS thePath, 1 AS Level FROM HumanResources.Employee WHERE ManagerId IS NULL UNION ALL SELECT e.EmployeeId, x.thePath + '.' + CONVERT(VARCHAR(MAX), e.EmployeeId) AS thePath, x.Level + 1 AS Level FROM EmployeeLevels x JOIN HumanResources.Employee e on e.ManagerId = x.EmployeeId )
thePath is the materialized path, a ‘.’ delimited breadcrumb from the root to each node. We also end up with a level, which represents how many nodes away from the root in the hierarchy each employee sits (very important for those upwardly-mobile junior execs, no doubt!)
I’m going to assume that readers of this post are familiar with the Nested Sets Model, as popularized by Joe Celko. If not, read the link.
After staring at the Nested Sets for a while, I arrived at the following mathematical properties:
- Value of Lft for the root node is 1
- Value of Rgt for the root node is 2 * (Number of nodes)
- Value of Lft for any node is ((Number of nodes visited) * 2) – (Level of current node)
- Value of Rgt for any node is (Lft value) + ((Number of subnodes) * 2) + 1
I think the only factor here that requires further explanation is “(Number of nodes visited)”. By this, I mean the number of nodes that would have been visited (including the current node) if one were doing a preorder traversal of the tree. Luckily, this number is quite easy to determine. The row number for each row, as determined by ordering by the materialized path, is this number. I encourage readers to validate this with pencil and paper if it doesn’t quite make sense reading on the screen. Draw a simple tree and traverse, starting from the lefthand side of the root node.
But how to translate that into T-SQL? Luckily, SQL Server 2005, in addition to CTEs, also includes the very useful ROW_NUMBER() function. So to get the row number, we simply need to add another CTE into the chain (did you know that successive CTEs can use the results of previous CTEs?):
WITH EmployeeLevels AS ( SELECT EmployeeId, CONVERT(VARCHAR(MAX), EmployeeId) AS thePath, 1 AS Level FROM HumanResources.Employee WHERE ManagerId IS NULL UNION ALL SELECT e.EmployeeId, x.thePath + '.' + CONVERT(VARCHAR(MAX), e.EmployeeId) AS thePath, x.Level + 1 AS Level FROM EmployeeLevels x JOIN HumanResources.Employee e on e.ManagerId = x.EmployeeId ), EmployeeRows AS ( SELECT EmployeeLevels.*, ROW_NUMBER() OVER (ORDER BY thePath) AS Row FROM EmployeeLevels )
We now have current level and number of nodes visited, which gives us the Lft value for each node. But how to determine the number of subnodes, in order to get the Rgt value? Luckily, the materialized path also gives us that capability…
For any given node, number of subnodes can be determined by counting all nodes whose path value is LIKE the current path value + ‘.%’.
The resultant query should be fairly obvious at this point:
WITH EmployeeLevels AS ( SELECT EmployeeId, CONVERT(VARCHAR(MAX), EmployeeId) AS thePath, 1 AS Level FROM HumanResources.Employee WHERE ManagerId IS NULL UNION ALL SELECT e.EmployeeId, x.thePath + '.' + CONVERT(VARCHAR(MAX), e.EmployeeId) AS thePath, x.Level + 1 AS Level FROM EmployeeLevels x JOIN HumanResources.Employee e on e.ManagerId = x.EmployeeId ), EmployeeRows AS ( SELECT EmployeeLevels.*, ROW_NUMBER() OVER (ORDER BY thePath) AS Row FROM EmployeeLevels ) SELECT ER.EmployeeId, ER.thePath, ER.Level, ER.Row, (ER.Row * 2) - ER.Level AS Lft, ((ER.Row * 2) - ER.Level) + ( SELECT COUNT(*) * 2 FROM EmployeeRows ER2 WHERE ER2.thePath LIKE ER.thePath + '.%' ) + 1 AS Rgt FROM EmployeeRows ER ORDER BY thePath
… And that’s it! We have now converted the Adjacency List tree into a Nested Sets tree.
In the next installment, I will show how to determine the Nested Intervals encoding from an Adjacency List tree, also using recursive CTEs.
Questions?
Wow.. This is the best solution that I have seen for generating nested sets using SQL Server 2005!
Great work! and thanks a lot!
This model is excellent for converting a non-ordered adjacency model to nested sets. However how could you get it to work when there is an explicit sort order for the children underneath any particular parent? Ideally you’d be able to add an ORDER BY clause the the second SELECT statement but of course SQL doesn’t allow you to do this.
I’d very much appreciate your comments on this issue.
Simple Great! I was badly stuck with this problem, was abt to build the .Net tree…
But your solution saved my time and lots of work
Thank you so much
Great tip!!!
Right now, the above CTE will not work if it extends beyond 32767 (its rowcount > 32767). Can you provide a way to do this.
Thanks in advance.
Great tip!!!
Right now, the above CTE will not work if recursion extends beyond 32767 (its rowcount > 32767). Can you provide a way to do this.
Thanks in advance.
Sorry … double post
Updated ‘it’ -> resursion
Hi John,
Do you perchance have a cycle in your tree? There should be no reason for the CTE to recurse that many times, unless you have an INCREDIBLY deep hierarchy. If that’s the case, I would love to hear about it–I’ve never heard of any real-world case of a hierarchy with more than a couple of hundred levels. What kind of data are you working with?
Here’s code to test for cycles:
–Traverse up the hierarchy toward the
–leaf node
;WITH e AS
(
SELECT
e0.EmployeeId,
e0.ManagerId
FROM HumanResources.Employee AS e0
WHERE
NOT EXISTS
(
SELECT *
FROM HumanResources.Employee AS e1
WHERE
e1.ManagerId = e0.EmployeeId
)
UNION ALL
SELECT
e.EmployeeId,
et.ManagerId
FROM HumanResources.Employee et
JOIN e ON e.ManagerId = et.EmployeeId
WHERE
et.ManagerId IS NOT NULL
AND e.ManagerId <> e.EmployeeId
)
SELECT *
FROM e
WHERE
e.ManagerId = e.EmployeeId
I was trying to create a Nested Set of book categories heirarchy from
BrowseNodes.com.
Here is the web address for reference:
http://www.browsenodes.com/node–2000.html
It uses Adjacency List model to create the heirarchy. Yes, the heirarchy is incredibly deep. Amazing category heirarchy.
Again, thanks.
Hi John,
I can’t figure out how to download the full categories hierarchy, but I’ve done this for a real book Web site before, using data for around 5 million books, provided by Muze. I can tell you that in that case the hierarchy wasn’t even close to that deep–maybe 10 levels at most. Are you certain you’re not seeing the result of a cycle, or some other issue?
Hi Adam,
Thank you very much for your continuous help.
I ran your query for tree cycle test and returned 0 row.
I’ve implemented Celko’s procedure of converting AL to NS model. And it took 31 min. Its implmented in SQL/PCM and is not using the latest technology such as CTE.
I’ve read some books saying cursors are far better than CTE when doing recursion. I wonder if this is where CTE can be better.
Again, Thanks a lot.
BTW, my test is on tree pruning, removing categories with empty books.
The less dummy books in the tree, the faster it is. When I created large test books the slower it is.
thanks
Adam,
I’m struggling with some performance issues on SQL 2005 related to the final example in your article, which by the way is the most complete I’ve found, and could really use some help. We’re using it to completely rebuild the LFT and RGT columns in our employee table each night as the data is completely refreshed daily.
Our structure has 10 levels and 450K nodes and are taking up to 14 hours to complete on our development server. The EmployeeID and ManagerID columns are indexed and otherwise perform ok, but I can’t for the life of me get this to complete during our maintenance window.
Do you have any ideas? We’ve just migrated to SQL 2005 and I’m still learning about CTE and could use some assistance. We employed a different method on SQL 2000 using an ActiveX script in a DTS package and the process took less than 2 hours on a much less equipped server.
Thanks in Advance,
Bob
Hi Bob,
Grab the code samples for “Expert SQL Server 2005 Development”, here: http://apress.com/book/downloadfile/3498
… check out Chapter11.SQL, starting on line 1306. That method may be faster than the one in this post.
… Or maybe not. While researching the book I discovered that recursive CTEs are quite slow and found that I could get much better results using a temp table indexed on the “level” column. With only 450K nodes I think you could process the entire thing in a matter of minutes with a well-tuned solution, so I’m thinking that you might want to ditch the CTEs altogether.
By the way, why rebuild the whole thing every night instead of simply maintaining it in-place? Again, check out my book, or at least the samples therein 🙂 … and another by the way, I also discovered through extensive testing that Nested Sets is much slower than an optimized materialized path solution. But it may be too late for you to reconsider the design at this stage?
Best of luck!
Thanks for the tip, Adam! I’ll give it a shot tonight and let you know how it goes 🙂
Bob
Adam,
I’ve meant to get back here and thank you for sharing that second example. That took the processing time from over 14 hours to approximately 5 minutes for our whole package!
WHAT A DIFFERENCE! Thanks again!
Take care,
Bob
Cool – very nice – We use nested sets to hold our tree data (speed of access) – but this is hard to maintain – so convert from NS to AL – do our maintenance and then convert back to NS using the above. Thanks again, Paul,
Great piece of code.. this helped me in huge way 🙂 Thanks Adam 🙂
Left and Right indexes don’t work correct
declare @T2 table (RowId int , Product_Id int , Parent_ID int , LogicOrder int );
insert into @T2(RowId , Product_Id , Parent_ID , LogicOrder )
select 1,0,-1,0 UNION
select 2,1,0,3 UNION
select 3,2,0,2 UNION
select 4,3,0,1 UNION
select 5,4,1,4 UNION
select 6,5,1,3 UNION
select 7,6,1,2 UNION
select 8,7,1,1 UNION
select 9,8,7,2 UNION
select 10,9,7,1 ;
LEFT RIGHT
1 20
2 3
4 5
6 19
7 12
9 10
11 12
13 14
14 15
16 17
@miki: Looks perfect on this end, ignoring your LogicOrder column:
0 0 1 20
1 0.1 2 15
4 0.1.4 3 4
5 0.1.5 5 6
6 0.1.6 7 8
7 0.1.7 9 14
8 0.1.7.8 10 11
9 0.1.7.9 12 13
2 0.2 16 17
3 0.3 18 19
… if you want to use the LogicOrder, build your path based on a ROW_NUMBER OVER (ORDER BY LogicOrder) instead:
0 1 1 20
3 1.1 2 3
2 1.2 4 5
1 1.3 6 19
7 1.3.1 7 12
9 1.3.1.1 8 9
8 1.3.1.2 10 11
6 1.3.2 13 14
5 1.3.3 15 16
4 1.3.4 17 18
Excellent! I am building some demo hierarchies for a presentation, and it was taking forever to load a 50K node/9 level nested set model 1 row at a time, and using this code I can load it in seconds.
It definitely works with > 33000 nodes, as I just used it with a 55000 node tree, and next am going to try with a 500K one
Hi Adam, great article thanks 🙂
I use SQL Server at work but only have access to sqlite and MySQL at home. I was trying to convert your SQL to (for example) sqlite and I’m having a problem with
EmployeeRows AS
(
SELECT
EmployeeLevels.*,
ROW_NUMBER() OVER (ORDER BY thePath) AS Row
FROM EmployeeLevels
)
especially the ROW_NUMBER() OVER (ORDER BY thePath) AS Row part. I have translated this into…
(
SELECT
EmployeeLevels.*,
rowid AS Row
FROM EmployeeLevels
ORDER BY thePath
)
but the end result is the Row, Lft, Rgt aren’t populated in the results from the final SELECT at the end of the CTE query
the row number this part of the CTE query is looking for is the row number from the EmployeeRows part of the CTE isn’t it?
I tried including rowid i the creation of the earlier EmployeeLevels part of the CTE query and it gives a result but not the right result.
Any help appreciated 🙂
Hello Adam,
do you have update version of ‘Expert SQL Server 2005 Development’?
@ujwal
No, I never wrote another one. But I think the "T-SQL Querying" book that was released earlier this year comes pretty close to meeting the same goals.
–Adam
Just in case anyone is still interested in Adam’s method (and they should be if they’re working with hierarchies), I took Adam’s formula’s and tweaked the method. It’ll convert a million node Adjacency List to Nested Sets in about 54 seconds.
Here are the links to the articles. The first is for the conversion and the second is for how to skip nested sets and make a DW with most of the answers for questions you might ask of DAGs already answered. It takes about the same time as the 54 second nested set conversion.
https://www.sqlservercentral.com/articles/hierarchies-on-steroids-1-convert-an-adjacency-list-to-nested-sets
https://www.sqlservercentral.com/articles/hierarchies-on-steroids-2-a-replacement-for-nested-sets-calculations-1
I have to say it again, Adam… thank you for the formulas. They’re super easy to understand and with the extra little trip through the Hierarchical Path and the help of a special Tally Table, things absolutely fly.
Comments are closed.