Avoid repeated traversing of hierarchical data when parsing a big tree frequently

Question 1

I have a hierarchical/tree data structure in Oracle/RDBMS where I need to traverse the tree frequently. The tree is huge and therefore parsing it every time will not be efficient. A better way could be to parse and store the data to give constant time turn out.

Consider the relationship below

Entity T1, T2, T3, T4

They have hierarchical structure i.e

T1 --one-to-many--> T2 --one-to-many--> T3 --one-to-many-->T4

All of them have many(optional)-to-many with XYZ.

I have to traverse/update this tree frequently where some nodes may even not have zero 'xyz'. This can be a traversal of 100 of thousands of nodes. How can I avoid such repeated expensive traversal? Will a materialized view help here?

enter image description here

We have to consider that T2, T3, T4 can be millions in number but the link T1<==>XYZ, T2<==>XYZ, T3<==>XYZ, T4<==>XYZ are Thousands in number.

Maybe the data structure below is one solution:

T1[id, name]
T2[id, name, T1.id]
T3[id, name, T2.id]
T4[id, name, T3.id]
XYZ[id, name]
LINK_T1_XYZ[T1.id, XYZ.id]
LINK_T1_T2_XYZ[T1.id, T2.id, XYZ.id] with a check T1 should parent of T2 
LINK_T1_T3_XYZ[T1.id, T3.id, XYZ.id] "
LINK_T1_T4_XYZ[T1.id, T4.id, XYZ.id] "

Consider the analogy of provider-receiver here now any of the entity in hierarchy[T1, T2, T3, T4] can provide any of it. Example:

t1_a1 --> xyz_a --> t2_a2
t2_a2 --> xyz_a --> t2_a3

A very simple design could be all permutation combination give a total of 4X4=16

Concept Link ==> Provider | Stuff | Receiver
T1_XYZ_T1_LINK [T1.id, XYZ.id, T1.id]
T1_XYZ_T2_LINK [T1.id, XYZ.id, T2.id]
T1_XYZ_T3_LINK [T1.id, XYZ.id, T3.id]
T1_XYZ_T4_LINK [T1.id, XYZ.id, T4.id]
T2_XYZ_T1_LINK [T2.id, XYZ.id, T1.id]
T2_XYZ_T2_LINK [T2.id, XYZ.id, T2.id]
T2_XYZ_T3_LINK [T2.id, XYZ.id, T3.id]
T2_XYZ_T4_LINK [T2.id, XYZ.id, T4.id]
T3_XYZ_T1_LINK [T3.id, XYZ.id, T1.id]
T3_XYZ_T2_LINK [T3.id, XYZ.id, T2.id]
T3_XYZ_T3_LINK [T3.id, XYZ.id, T3.id]
T3_XYZ_T4_LINK [T3.id, XYZ.id, T4.id]
T4_XYZ_T1_LINK [T4.id, XYZ.id, T1.id]
T4_XYZ_T2_LINK [T4.id, XYZ.id, T2.id]
T4_XYZ_T3_LINK [T4.id, XYZ.id, T3.id]
T4_XYZ_T4_LINK [T4.id, XYZ.id, T4.id]

That is so many join tables, there must be a better design to do this.

Business Context

Many of us want to know this but it is quite complex than it's abstract model, I would simplify and try to give an analogy.

Consider content distribution network like Netflix. Now Netflix (T1 entity e.g T1[N001, 'Netflix']) is the Organization with head office at New York City, USA. Under this, it has again regional offices (T2 entity, say east, west, north, south) and then under that sub-regional offices (T3 entity, state level, say New York, Virginia etc) and then branch offices - (T4 entity, say city level).

Consider the analogy at international level so that T2 can go up 100, 000.

XYZ - consider it content like TV series e.g: House of cards.

Every organization has contents which are either produced in-house or purchased from outside. Example: Netflix produces 'House of cards' so it can broadcast it in its own network and may sell it to some other network too.

Now the organization O1 can make a deal to sell it's stuff/show/movies (XYZ, say HOC) to another organization O2.

You may say a tuple T1|XYZ|T2 is enough to represent that i.e O1|HOC|O2 but we also need to store how we network/stream or distribute 'HOC' from O1 to O2 and there can be 16 ways (permutation of T1, T2, T3, T4) to do it,

Let's take an example:

o1
 |---r1.1
 | |---c1.1.1
 | | |---h1.1.1.1
 | | |---h1.1.1.2
 | | |---h1.1.1.3
 | |---c2
 | | |---h1.1.2.1
 | | |---h1.1.2.2
 | |---C3
 | |---h1.1.3.1
 | |---h1.1.3.2
 | 
 |---r1.2
 |---C1.2.3
 |---h1.2.3.1
 |---h1.2.3.2
o2
 |---r2.1
 | |---c2.1.1
 | | |---h2.1.1.1
 | | |---h2.1.1.2
 | | |---h2.1.1.3
 xyz = hoc

Now some example ways in which o1 can provide hoc o2:

o1 --> hoc --> o2 
o1 --> hoc --> r2.1
o1 --> hoc --> c2.1.1.1
o1 --> hoc --> h2.1.1.1
r1.1 --> hoc --> o2 
r1.1 --> hoc --> r2.1
r1.1 --> hoc --> c2.1.1.1
r1.1 --> hoc --> h2.1.1.1

and so on.

So actually we need one structure to entertain two models:

Organization model
Distribution network

To be noted, Content is very limited here i.e XYZ may have records less than 50 mostly.

Question 2

So initially it's

T2[
 id PK, 
 name, 
 T1_id NOT NULL, -- because it is hierarchy, T2 can not exist outside the parent (T1).
 FK1 = T1_id ref T1(id) 
]
...

and you also say

LINK_T1_T2_XYZ[T1.id, T2.id, XYZ.id] with a check T1 should parent of T2

You can migrate keys down the entities hierarchy

T2[
 id, 
 name, 
 T1_id NOT NULL,
 PK2 = (id, T1_id)
 FK2 = T1_id ref T1(id) 
]
T3[
 id, 
 name, 
 T1_id NOT NULL, 
 T2_id NOT NULL,
 PK3 = (id, T1_id, T2_id)
 FK3 = (T2_id, T1_id) ref T2(id, T1_id) 
]
...

The code for supporting T1, ..T4 entities and the entities which refer them will be a bit complicated but you'll get instant access to any of T1,..T4 from any referencing entity without trading off RI.

Question 3

Thanks do I get this right?==> I need to see what T1 has I will have to join all 4 tables T1, T2, T3 and T4 and 4 link tables and find all X belonging to T1.id. I know about trigger but can we have check constraint to validate such complex validation? that will be quite interesting, I would like to know the SQL.

Question 4

At T4 you'll have keys for all upper tables T3, T2, T1. So if you need T1 info you do join T1 only. No need to join intermediate T2 and T3.

Question 5

First off, if your example is ACTUALLY a good representation of what you are trying to accomplish, then you are going about it all wrong.. You have three domains here - one is businesses, one is business offices, and one is products.

Table BUSINESS should only contain information about the actual business (Netflix, for example) ... no office information at all .. just very very high level "this is information about "NETFLIX").

Table BUSINESS_OFFICE should contain information about ALL offices .. the tier structure of the offices is completely irrelevant to this table. They are all offices of some sort of another, so they all belong in this table. Now, since there IS a hierarchy of offices, you might consider adding a "parent_id" to this table (which would point to an office's higher-up) .. obviously, if an office didn't have a parent_id, then it would be the top level office. IF the structure, however, is NOT hierarchical in nature (one office could potentially have two parents), then you would want to split that information out into a separate join table.

Next, you have the products (Movies, tv shows, etc, that each office might be able to offer).

Now things are much more simple when you need to get the information you want.

You can recursively query on id/parent_id in the offices table to build a list of ids that are all part of the tree you want.. Then you join to the products table with a where clause where the office id is in that list.

Will add example DDL statements shortly.

CREATE TABLE BUSINESS (
 ID INTEGER PRIMARY KEY,
 NAME VARCHAR2(100),
 -- OTHER COLS THAT MIGHT APPLY
)
CREATE TABLE BUSINESS_OFFICE (
 ID INTEGER PRIMARY KEY,
 BUSINESS_ID INTEGER REFERENCES BUSINESS (ID),
 NAME VARCHAR2(100),
 -- OTHER COLS, LIKE ADDRESS, ETC
)
CREATE TABLE PRODUCT (
 ID INTEGER PRIMARY KEY,
 NAME VARCHAR2(100),
 -- OTHER COLS THAT MIGHT APPLY
)

If the structure isn't quite containing the information you need, additional tables can be added quite painlessly.. For examples

CREATE TABLE BUSINESS_PRODUCT (
 BUSINESS_ID INTEGER REFERENCES BUSINESS (ID),
 PRODUCT_ID INTEGER REFERENCES PRODUCT (ID),
 IS_OWNER NUMBER(1) NOT NULL DEFAULT 0
)

etc etc

Determining what a business CAN offer as a product should be a separate issue than determining WHERE it can distributed from.

Question 6

After giving it a lot of thought, going this way is a trade off, each of level of T have a small set of different attributes than each other but a lot of relational difference like billing system tables talk to T3, hardware tables related to T4 and so on. I will strengthening network model but weaken the organization model in the end.

Question 7

@SaurabhTripathi Each type of office can have it's specific metadata stored in a different table .. you can query based off of those tables if you need a specific subset. Or have an "office_type" column.

Question 8

You can simply use IN for XYZ, so that you don't need to have one for each relation):

SELECT *
FROM T1
LEFT JOIN T2 ON (T2.T1_ID = T1.ID)
LEFT JOIN T3 ON (T3.T2_ID = T2.ID)
LEFT JOIN T4 ON (T4.T3_ID = T3.ID)
LEFT JOIN XYZ ON (XYZ.ID IN (T1.XYZ_ID, T2.XYZ_ID, T3.XYZ_ID, T4.XYZ_ID))

Serg Serg 1,2667 silver badges12 bronze badges · Answer 1 · 2016-05-25 19:34:19Z

So initially it's

T2[
 id PK, 
 name, 
 T1_id NOT NULL, -- because it is hierarchy, T2 can not exist outside the parent (T1).
 FK1 = T1_id ref T1(id) 
]
...

and you also say

LINK_T1_T2_XYZ[T1.id, T2.id, XYZ.id] with a check T1 should parent of T2

You can migrate keys down the entities hierarchy

T2[
 id, 
 name, 
 T1_id NOT NULL,
 PK2 = (id, T1_id)
 FK2 = T1_id ref T1(id) 
]
T3[
 id, 
 name, 
 T1_id NOT NULL, 
 T2_id NOT NULL,
 PK3 = (id, T1_id, T2_id)
 FK3 = (T2_id, T1_id) ref T2(id, T1_id) 
]
...

The code for supporting T1, ..T4 entities and the entities which refer them will be a bit complicated but you'll get instant access to any of T1,..T4 from any referencing entity without trading off RI.

Thanks do I get this right?==> I need to see what T1 has I will have to join all 4 tables T1, T2, T3 and T4 and 4 link tables and find all X belonging to T1.id. I know about trigger but can we have check constraint to validate such complex validation? that will be quite interesting, I would like to know the SQL.
At T4 you'll have keys for all upper tables T3, T2, T1. So if you need T1 info you do join T1 only. No need to join intermediate T2 and T3.

Joishi Bodio Joishi Bodio 3,5082 gold badges18 silver badges20 bronze badges · Answer 2 · 2016-05-25 19:21:59Z

First off, if your example is ACTUALLY a good representation of what you are trying to accomplish, then you are going about it all wrong.. You have three domains here - one is businesses, one is business offices, and one is products.

Table BUSINESS should only contain information about the actual business (Netflix, for example) ... no office information at all .. just very very high level "this is information about "NETFLIX").

Table BUSINESS_OFFICE should contain information about ALL offices .. the tier structure of the offices is completely irrelevant to this table. They are all offices of some sort of another, so they all belong in this table. Now, since there IS a hierarchy of offices, you might consider adding a "parent_id" to this table (which would point to an office's higher-up) .. obviously, if an office didn't have a parent_id, then it would be the top level office. IF the structure, however, is NOT hierarchical in nature (one office could potentially have two parents), then you would want to split that information out into a separate join table.

Next, you have the products (Movies, tv shows, etc, that each office might be able to offer).

Now things are much more simple when you need to get the information you want.

You can recursively query on id/parent_id in the offices table to build a list of ids that are all part of the tree you want.. Then you join to the products table with a where clause where the office id is in that list.

Will add example DDL statements shortly.

CREATE TABLE BUSINESS (
 ID INTEGER PRIMARY KEY,
 NAME VARCHAR2(100),
 -- OTHER COLS THAT MIGHT APPLY
)
CREATE TABLE BUSINESS_OFFICE (
 ID INTEGER PRIMARY KEY,
 BUSINESS_ID INTEGER REFERENCES BUSINESS (ID),
 NAME VARCHAR2(100),
 -- OTHER COLS, LIKE ADDRESS, ETC
)
CREATE TABLE PRODUCT (
 ID INTEGER PRIMARY KEY,
 NAME VARCHAR2(100),
 -- OTHER COLS THAT MIGHT APPLY
)

If the structure isn't quite containing the information you need, additional tables can be added quite painlessly.. For examples

CREATE TABLE BUSINESS_PRODUCT (
 BUSINESS_ID INTEGER REFERENCES BUSINESS (ID),
 PRODUCT_ID INTEGER REFERENCES PRODUCT (ID),
 IS_OWNER NUMBER(1) NOT NULL DEFAULT 0
)

etc etc

Determining what a business CAN offer as a product should be a separate issue than determining WHERE it can distributed from.

After giving it a lot of thought, going this way is a trade off, each of level of T have a small set of different attributes than each other but a lot of relational difference like billing system tables talk to T3, hardware tables related to T4 and so on. I will strengthening network model but weaken the organization model in the end.
@SaurabhTripathi Each type of office can have it's specific metadata stored in a different table .. you can query based off of those tables if you need a specific subset. Or have an "office_type" column.

score 0 · Answer 3 · 2016-05-25 23:53:58Z

You can simply use IN for XYZ, so that you don't need to have one for each relation):

SELECT *
FROM T1
LEFT JOIN T2 ON (T2.T1_ID = T1.ID)
LEFT JOIN T3 ON (T3.T2_ID = T2.ID)
LEFT JOIN T4 ON (T4.T3_ID = T3.ID)
LEFT JOIN XYZ ON (XYZ.ID IN (T1.XYZ_ID, T2.XYZ_ID, T3.XYZ_ID, T4.XYZ_ID))

Stack Exchange Network

Avoid repeated traversing of hierarchical data when parsing a big tree frequently

Business Context

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Avoid repeated traversing of hierarchical data when parsing a big tree frequently

Business Context

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions