Wednesday, February 07, 2007

Ragged Hierarchy Alert

So this week I was faced with writing yet another hierarchy script. I say this with affection, of course. I find that developing efficient hierarchies can sometimes be quite a challenge depending on the data set. When there are only a few records (a hundred or so) almost any method will work (SQL, procedures, or a mixture). As the data sets grow, the method becomes more and more important. Couple this with the hierarchy requirements (can a child have more than one parent? Is there a maximum depth? Will the result be used to populate a tree control? Do I need to process each branch now or can I just get the top level nodes? Etc.) and the tidy and efficient code you imagined is now becoming bloated and difficult to debug.

There are two ways to store hierarchies in a database: Using a recursive pointer or a bridge table. The recursive pointer is possibly the most widely used. Each record has a primary key and a foreign key that points to the primary key of another record in the same table (more on this in a bit). This approach is simple to maintain and intuitive. The bridge table provides much more flexibility (for example, a child can have more than one parent), but is more difficult to maintain (you do have another table to update, and there could be issues with time stamps and sequences if a child can have multiple parents). The discussion on bridge table hierarchies ends here. I may address it in a future blog, however.

The key to ragged hierarchies is that any record in the table can participate to any depth or degree. Organizational hierarchies (which I will demonstrate in a moment) are very common and take on a ragged frame; they are also perfect for the recursive pointer. The following figure will give you an idea of how a ragged hierarchy might look:



Like I said earlier in this post, there are lots of ways to snake through this diagram: from using the OVER clause in SQL Server 2005 to writing a procedural program in FoxPro, the possibilities are as varied as the structure. In my VFP applications, I always find that a mixture of SQL, some procedural code, and a recursive function provides the most flexibility and greatest performance. To demonstrate, consider the following:


CREATE CURSOR Employee (emp_id i, emp_name c(25), sup_id i)
INSERT INTO Employee VALUES (1,"Martha Jones",0)
INSERT INTO Employee VALUES (2,"Dan Brown",1)
INSERT INTO Employee VALUES (3,"Ed Smith",2)
INSERT INTO Employee VALUES (4,"Sarah Parker",2)
INSERT INTO Employee VALUES (5,"Henry Johnson",1)
INSERT INTO Employee VALUES (6,"Samual Smyth",5)
INSERT INTO Employee VALUES (7,"Ali Jennah",1)
INSERT INTO Employee VALUES (8,"Tori Heart",7)
INSERT INTO Employee VALUES (9,"Bob Jones",7)
INSERT INTO Employee VALUES (10,"Kelly Robinson",7)
INSERT INTO Employee VALUES (11,"Manny Diaz",7)
INSERT INTO Employee VALUES (12,"Jack White",11)
INSERT INTO Employee VALUES (13,"Melinda Jo",11)

CREATE CURSOR crTree (;
top_level l,;
this_id i,;
description c(25),;
parent_id i,;
deep n(2))

*-- run the recursive function
SELECT Employee
LOCATE
snake(0,0) && start at the top

FUNCTION snake (tnParenID, tnDeep)

LOCAL lcAlias
lcAlias = "crTemp_" + TRANSFORM(tnParenID)
SELECT * FROM Employee WHERE sup_id = tnParenID INTO CURSOR (lcAlias)

tnDeep = tnDeep + 1

SELECT (lcAlias)
SCAN
put_record(&lcAlias..emp_id, &lcAlias..emp_name, &lcAlias..sup_id, tnDeep)
snake(&lcAlias..emp_id, tnDeep)
ENDSCAN
USE IN (lcAlias)

ENDFUNC

FUNCTION put_record (tnThis_id, tcDescription , tnParent_id, tnDeep)
INSERT INTO crTree (;
top_level ,;
this_id ,;
description,;
parent_id ,;
deep ) ;
VALUES (;
tnParent_id=0,;
tnThis_id ,;
tcDescription,;
tnParent_id ,;
tnDeep )
ENDFUNC




I start off by creating a dummy Employee table, followed by a tree cursor (crTree)
to hold the results of my work. Next I call the recursive function 'snake', whose job it is to find each child of the passed in parent, tracking its depth as it goes. The assumption in this program is that there is at least one zero sup_id key for the top-most level. This code works well to populate treeview controls (you could easily add a unique character sequence ID to each record). You can also output a text representation:


FUNCTION show_tree()
SELECT crTree
LOCATE
SCAN
? REPL("-",crTree.deep) + " " + ALLT(crTree.description)
ENDSCAN
ENDFUNC


There you have it! For 1000 employees, the snake function takes about .8 seconds to run. Anything more than that and you should consider populating the tree only to a certain level (perhaps all parents and one child deep) and filling in the rest as needed. Another approach would be to turn this cursor into a table, relate it to your Employee table and update it as part of a save routine.

No comments: