Tuesday, May 26, 2009

poorman's connect by -- hirerarchical queries

See http://www.evolt.org/article/Four_ways_to_work_with_hierarchical_data/17/4047/index.html

Four ways to work with hierarchical data

Rated 4.28 (Ratings: 11) (Add your rating)

Log in to add a comment
(16 comments so far)

Want more?

Picture of ghost_inside

Kirby Fling

Member info | Full bio

User since: December 05, 2000

Last login: December 05, 2000

Articles written: 1

Many thanks to RatFace for his article BBS style recursion How-To. I have also recently been wrestling with the concept of displaying heirarchical lists, specifically in a discussion forum tool that I've been writing.

I've spent quite a bit of time working out the data structure and display methods for the concept of heirarchical data and I've come to the conclusion that there are four main programming methods to approach the problem:

  1. Recursion
  2. Stack
  3. Flat Table
  4. Modified Preorder Tree Traversal Algorithm

So let's begin our discussion of each method:

Recursion

To many computer science types, recursion is frequently found to be the most "elegant" method to drill down through a hierarchical list of data.

And I would have to agree with that summation, but then again, that might be my first year of computer science speaking. ("Welcome to CS 101: Recursion until your head explodes")

The reason it is such an "elegant" solution is that you essentually have one function that does all of your data processing, display, and iteration to possible children. You call the function once (with the root of the tree) and everything is done. The added benefit is that you can call the function with any subnode of the tree and it will display the entire subtree without an additional logic or coding effort. In pseudo-code, a hierarchical list display function would look something like this:

function DisplayChildren(uid, indent_level)
SELECT id, (info)
FROM (db)
WHERE ParentID = uid

if recordcount > 0 then
(indent by indent_level) (HTML to display info for this child)
child_id = recordset.id
DisplayChildren(child_id, indent_level+1)
end if
end function

That's really all there is to the recursive method, which is why it's seen as such an elegant solution.

Disadvantages

The problem is that the vast majority of coding solutions we have at our disposal handle recursion very poorly. The idea is that every function (ideally) gets executed in its own memory space. And have to leave cleanup to whatever middleware we are using. So let's do a quick little check on efficiency with arbitrary numbers.

Let's take the following to be true (just for example):

  • Each time we get one record from the DB, it takes n seconds.
  • Each DB connection/disconnection takes n * 2 seconds.
  • Allocating memory for a new incidence of the function takes n/10 * iteration seconds (in ASP, and Cold Fusion, recursive algorithms have demonstrated a logorithmic decay in performance).

So, if we had 1000 messages in our discussion forum, the equation would look something like this: (1000 * n) + (1000 * n * 2) + (n/10 * iteration). Take a look at the following chart for better clarification:

Record    1:   3 * n seconds
Record 250: 29 * n seconds
Record 500: 53 * n seconds
Record 750: 79 * n seconds
Record 1000: 103 * n seconds
----------------------------
Total = approx 50,000 * n seconds

As you can see, that logorithmic decay really ends up biting in the end. And granted, the numbers I used are just for the sake of example, but you can see that there is a fundamental issue of excess over head with the recursion model. It is a very elegant model, it fits in your head very well, and in the code even better, but unless the system you are using is designed for recursion (e.g. scheme, lisp, etc), you will always run into excessive overhead when the numbers start getting huge.

Using a Stack

The stack is usually the second method of approaching large, crazy lists. There are 2 types of stack programming: First In First Out and First In Last Out. When working with heirachcical lists, you have to use the FILO model. The idea is that you create a stack somewhere, which is essentially just a list. Then add the id of the record you are currently working with, and when you are done with that record, you pull it off the list. If that record has any children, then you start working with the first child and put it's id at the end of the list. etc.

Much like the recursion model, the data looks something like this:

Field:Type:Null:Comments:
UIDint, numeric, etc.nothe unique identifier for each record
name, title, etcvarchar(x)yesthe name or title of this record
ParentIDint, numeric, etc.yesthe UID of the parent to this object, NULL for the top level of the tree

But now the question is whether you want to do everything in the middlewareor closer to the data in an SQL stored procedure.

Advantages to running the stack model in your middleware:

  1. you can display your HTML inside each iteration.
  2. easier to code/read/modify?

Disadvantages:

  1. just as many DB connections being created as in the recursive model

Advantages of running the stack with a stored procedure:

  1. executes closer to the data, so it's faster
  2. gives you a compiled temp table that needs little effort in the middleware
  3. only one connection to the database

Disadvantages:

  1. you might not be able to run stored procedures on your DB system
  2. you might not have access to create stored procedured on your DB system
  3. more difficult to code/read/manage it it's remote and in a different language

Here is the code for a hierarchical stack algorithm that is outlined in the Microsost SQL 6.5 documentation as an example:

CREATE PROC expand (@current char(20)) AS
SET nocount on
DECLARE @level int, @line char(20)
CREATE TABLE #stack (item char(20), level int)
INSERT INTO #stack VALUES (@current, 1)
SELECT @level = 1
WHILE @level > 0
BEGIN
if EXISTS (SELECT * FROM #stack WHERE level = @level)
BEGIN
SELECT @current = item
FROM #stack
WHERE level = @level
SELECT @line = space(@level - 1) @current
PRINT @line
DELETE FROM #stack
WHERE level = @level
AND item = @current
INSERT #stack
SELECT child, @level 1
FROM hierarchy
WHERE parent = @current
if @@rowcount > 0
SELECT @level = @level 1
END
else
SELECT @level = @level - 1
END

You can modify the concept in this code to fit your particular middleware.

Now let's take a look at the theoretical efficiency numbers like we did in therecursive model:

Let's take the following to be true (just for example):

  • Each time we get one record from the DB, it takes n seconds.
  • Each DB connection/disconnection (from middleware) takes n * 2 seconds.
  • Time to execute the list/stack operations is almost nonexistant
  • Time to create the temporary table in the stored procedure takes n / 2 seconds

So, we have the same 1000 messages in our discussion forum, the equation would look something like this:

Running the Stack with middleware: (1000 * n) + (1000 * n * 2) = 3000 * n

Running the Stack with a stored procedure: (1000 * n) + (2 * n) + (1000 * (n / 2)) = 1502 * n

Flat Table Model

While I was building my discussion forum, I immediately went for the recursive model to get my data and display it on the screen. I successfully built my forum in both Cold Fusion and ASP. It worked in both, but I wasn't very happy with the performance of either. I decided to investigate other ways of turning a sequence of parent-child relationships into a hierarchical list of data. I started playing with the stack model, and eventually discovered the stored procedure method, and while I was working with that, I had this little flash of insight:

A flat table will always yield the fastest query.

So I started asking myself "How do I turn the data I've got (or am about to get) into a flat table?" As it turns out, it's really quite easy. Take the example of a discussion forum. What information is really on the page? If the information looks like this:

"hello world"johnDoe12/2/00
"Re: hello world"janeDoe12/4/00
"How is your foobar?"johnDoe12/3/00
"my foobar is fine"sysadmin12/2/00
"hello yourself"jackDoe12/2/00
"my app RuLeZ"idiotboy12/1/00
"learn how to spell" jacksnot12/2/00

Everybody sees that there is a subject, author and date. But the indenting on the left is usually seen as the parent-child relationship. "my foobar is fine" is a child of "How is your foobar?" etc. But you can also look at it apart from the parent-child relationship and see it as:

  1. the order in which to be displayed
  2. the level to be indented

So, we can contruct the data schema to look something like this:

Field:Type:Null:Comments:
UIDint, numeric, etc.nothe unique identifier for each record
name, title, etcvarchar(x)yesthe name or title of this record
ParentIDint, numeric, etc.yesthe UID of the parent to this object, NULL for the top level of the tree
rank (or display_order)int, numeric, etc.yesthe order to display all records
indent_levelint, numeric, etc.yeshow much to indent this item

Now, how do we work with this data structure once it's built? The selection of the data is easy:

function DisplayChildren()
SELECT id, (info), indent_level
FROM (db)
ORDER BY rank

while the recordset isn't empty {
(indent by indent_level) (HTML to display info for this child)
}
end function

But, as easy as that is, the difficulty is now shifted to the point of insertion. The indent level is easy, it is always (parent.indent_level + 1). The display_order (or rank) is a little more challenging. The rank of the new child needs to be greater than it's parent, but less than any of it's children or the next sibling at the same level. In the example above a new child of "hello world" should have a rank of 2. But if they were all ranked from 1 to 7 to begin with, that means that you need to increment the rank of all messages whose rank is greater than 1. Which just happens to be the parent of the child we are inserting. So you end up calling a stored procedure that looks something like this:

CREATE PROC increment_rank (@start_rank int) AS
BEGIN
UPDATE (table_name)
SET rank = rank + 1
WHERE rank > @start_rank
END

Note: you can do this in middleware if you need, but it's such a simply query that will gain such benefit from a pre-compiled query path that a stored procedure is the best way to go.

This solution ends up with the biggest database hit at the point of insertion, but even that hit isn't terribly bad, and the lessening of the hit for each view of the folder is VERY significant. Let's go back to our efficiency numbers for a moment:

Let's take the following to be true (just for example):

  • Each time we get one record from the DB, it takes n seconds.
  • Each DB connection/disconnection (from middleware) takes n * 2 seconds.

In order to view all of of our 1000 example messages, the equation looks like this: (1000 * n) + (2 * n) = 1002 * n

But in order to be fair, we also have to look at our numbers for the insertion of a given message...

Let's take the following to be true (just for example):

  • Each time a record is inserted, it takes n * 2 seconds.
  • Each time a record is updated, it takes n * 2 seconds.

So, for both the recursive and stack models, the cost of insertion is simply: (connection + insertion) n * 2 + n * 2 = n * 4

But for the flat table model, the cost of insertion is something akin to: (connection + updating 1-1000 records + insertion) n * 2 + (n * 2 * x) + n * 2 = [6n to 1004n]

As you can see, the cost of insertion is MUCH higher for the flat table model, but even the highest cost of insertion is not as high as the fastest viewing method that we've discussed thus far (the stored procedure/stack model @ 1502 * n). And the viewing speed is 30% faster. So if you expand the numbers to include some kind of usage metric, with an estimated 8,000 page views a month, and an estimated 300 messaged inserted a month. The flat table model is the hands down winner..... for a discussion forum....

Disadvantages

Which brings us to our disadvantages to the flat table model. It isn't easily ported to a lot of other uses. With a discussion forum, you almost always want it sorted hierarchically, then by date descending. And the rank-increment method works perfectly for this application. If you wanted to sort it by date ascending, it is still relatively easy (insert the child before the next message with the same parent id). But for other large lists of information that needs to be displayed hierarchically, it's very difficult to get them to display properly without some kind of re-ordering interface that you can expose to whoever is in charge of the information. In my experience, that is almost always the case anyway so the flat table model is once again a viable option for your hierarchical list pages.

A Modified Preorder Tree Traversal Algorithm

To be completely honest, I've never used this method for working with hierarchical data, and I don't even understand it that well. I read about in a book called SQL for Smarties where Joe Celko devotes 2 entire chapters to working with hierarchical data in databases that aren't really designed to. I figured I should at least make a reference to this method since it purportedly has so many great advantages to the traditional parent-child relationship model.

The basic idea is that you start at the right-most branch of the root of your tree and give it a 1 where the relationship starts and a 2 where it ends. Then you take that node's right-most relationship and give it a 3 where it starts and a 4 where it ends. So that node ends up with 2 numbers: left and right. Decrement the "left" and you get the "right" of it's parent, increment the "right" and you get either it's child or it's sibling at the same level. Continue that idea all the way around the tree until you are back at the root again. Each object ends up with 2 numbers and from those number you can run some fun queries and come up with lots of information about that particular node, the tree in general, paths between different nodes and lots of other stuff. But as I said, I don't fully understand it yet. Maybe when I do, I'll write another article.