Database Design

Lecture 14
Functional Dependencies

Functional dependencies (FD) are are type of constraint that is based on keys. A superkey is defined as in the relational schema R , where:

subkey

r(R)

t₁

t₂

t₁

t₂

t₁[K]

t₂[K].

Or, no two rows (tuples) have the same value in the attribute(s) K, which is the key. Now, if there are two attributes (or sets of attributes) A and B that are legal in the relation schema R, we can have a functional dependency where

for all pairs of tuples such that t₁[A]is equal to t₂[A] and t₁[B]is equal to t₂[B]. This allows us to state that K is a superkey of R if K implies R. For example, in a relation that has names and social security numbers, whenever your Social Security number is the student ID, the name in that tuple can only contain your name. That is because your name is not unique, but your Social Security is. If I go to the Social Security Administration and search their database for the name "Gary Burt", the results is a large number of people. If I search of the social security number "123-45-6789", the result is one and only one person.

Another example is in the loan information that we looked at before:

Loan-info-schema = (branch-name, loan-name, customer-name, amount)

it can be shown that the loan-number implies both the amount and the branch-name. It does not imply the customer-name because there may be more than one person listed on the load, such as a husband and wife, or parent and child (when the parent co-signs the loan).

Functional dependencies:

specify a set of constraints on a legal relation.
test relations to see if they are legal.

Some relations are said to be trivial when they are satisfied by all relations:.

A implies A
A implies B and B implies A.

Closure of a Set of Functional Dependencies

It is not enough to look at a single FD. All FDs must be considered in a relation! Given the schema R = ( A, B, C, G, H, I) and the FDs of:

We can show that A implies H because A implies B which implies H.

The notional for a FD is F. The notation of F⁺ is the set of all FDs logically implied by F. There is a set of rules, called Armstrong's axioms, that we can use to to compute closure.

Reflexivity rule: If A is a set of attributes, and B is a set of attributes that are completely contained in A, the A implies B.
Augmentation rule: If A implies B, and C is a set of attributes, then if A implies B, then AC implies BC.
Transitivity rule: If A implies B and B implies C, then A implies C.

These can be simplified if we also use:

Union rule: If A implies B and A implies C, the A implies AC.
Decomposition rule: If A implies BC then A implies B and A implies C.
Pseudotransitivity rule: If A implies B and CB implies D, then AC implies D.

Using mathematical principles, we can not test a set of attributes to see if they are a legal superkey.

Pitfalls in Relational-Database Design

Obviously, we can have good and bad designs. Among the undesirable design items are:

Repetition of information
Inability to represent certain information

The relation lending with the schema is an example of a bad design:

Lending-Schema=(branch-name, branch-city, assets, cutomer-name, loan-number, amount)

branch-name branch-city assets customer-name loan-number amount

Downtown Brooklyn

9000000

Jones L-17

1000

Redwood Palo Alto

2100000

Smith L-23

2000

Perryridge Horseneck

1700000

Hayes L-15

1500

Downtown Brooklyn

9000000

Jackson L-14

1500

Mianus Horseneck

400000

Jones L-93

500

Round Hill Horseneck

8000000

Turner L-11

900

Pownal Bennington

300000

Williams L-29

1200

North Town Rye

3700000

Hayes L-16

1300

Downtown Brooklyn

9000000

Johnson L-23

2000

Perryridge Horseneck

1700000

Glenn L-25

2500

Brighton Brooklyn

7100000

Brooks L-10

2200

Looking at the Downtown and Perryridge, when a new loan is added, the branch-city and assets must be repeated. That makes updating the table more difficult, because the update must guarantee that all tuples are updated. Additional problems come from having two people take out one loan (L-23). More complexity is involved when Jones took out a loan at a second branch (maybe one near home and the other near work.) Notice that there is no way to represent information on a branch unless there is a loan.

Decomposition

The obvious solution is that we should decompose this relation. As an alternative design, we can use the Decomposition rule: If A implies BC then A implies B and A implies C.

This gives us the schemas:

branch-customer-schema = (branch-name, branch-city, assets, customer-name)
customer-loan-schema = (customer-name, loan-number, amount)

branch-name branch-city assets customer-name

Downtown Brooklyn

9000000

Jones

Redwood Palo Alto

2100000

Smith

Perryridge Horseneck

1700000

Hayes

Downtown Brooklyn

9000000

Jackson

Mianus Horseneck

400000

Jones

Round Hill Horseneck

8000000

Turner

Pownal Bennington

300000

Williams

North Town Rye

3700000

Hayes

Downtown Brooklyn

9000000

Johnson

Perryridge Horseneck

1700000

Glenn

Brighton Brooklyn

7100000

Brooks

customer-name loan-number amount

Jones L-17

1000

Smith L-23

2000

Hayes L-15

1500

Jackson L-14

1500

Jones L-93

500

Turner L-11

900

Williams L-29

1200

Hayes L-16

1300

Johnson L-23

2000

Glenn L-25

2500

Brooklyn L-10

2200

Then when we need to get back to the original table, we can do a natural join on the two relations branch-customer and customer-loan.

Evaluating this design, how does it compare to the first version?

Looking at the Downtown and Perryridge, when a new loan is added, the branch-city and assets must be repeated. Problem still exists.
Problems come from having two people take out one loan (L-23). Problem still exists.
More complexity is involved when Jones took out a loan at a second branch. Problem still exists.
Notice that there is no way to represent information on a branch unless there is a loan. Problem still exists.

Worse, there is a new problem! When we do the natural join, we get back four additional tuples that did not exists in the original table:

(Downtown, Brooklyn, 9000000, Jones, L-93, 500)
(Perryridge, Horseneck, 1700000, Hayes, L-16, 1300)
(Mianus, Horseneck, 400000, Jones, L-17, 1000)
(North Town, Rye, 3700000, Hayes, L-15, 1500)

We are no long able to represent in the database information about which customers are borrows from which branch. This is called a lossy decomposition or lossy-join decomposition. A decomposition that is not a lossy-decomposition is a lossless-join decomposition. Lossless-joins are a requirement for good design and this causes constraints on the set of possible relations. We say that a relation is legal if it satisfies all rules, or constraints, imposed.

The proper way to decomposition this example so that we can have a lossless-join is to use three relations.

branch-schema = (branch-name, assets, branch-city)
loan-schema = (branch-name, loan-number, amount)
borrower-schema = (customer-name, loan-number)

Normalization Using Functional Dependencies

Using FDs, it is possible to define several normal forms to help develop good database designs. The two that we will example are Boyce-Codd Normal Form (BCNF) and Third Normal Form (3NF). The requirements for good decomposition are

Lossless-Join Decomposition
Dependency Preservation
Lack of Repetition of Information

We've discussed the lossless decomposition. Dependency preservation specifies that the design insure that when an update is made to the database, that it does not create an illegal relation. In regard to the repetition of information, it is necessary to include the key of another table, so that the joins can be properly formed. That is the only information that should be in both tables!

Boyce-Codd Normal Form

A relation schema R is said to be in BCNF with respect to a set F of FDs, if for all FDs in F⁺ of the form A implies B, where A is a subset of R and B is a subset of R and at least one of the following rules is true:

A implies B is a trivial FD (B is a subset of A)
A is a superkey for schema R

Without doing the mathematically proofs, it can be shown that the BCNF results in dependency preservation.

Third Normal Form

A relation schema R is in 3NF with respect to a set F of FDs if, for all FDs in the F⁺ of the form A implies B, where A is a subset of R and B is a subset of R and at least one of the following rules is true:

A implies B is a trivial FD
A is a superkey for schema R
Each attribute in the result of the expression B-A is contained in a candidate key for R

Lecture 14 Functional Dependencies

Closure of a Set of Functional Dependencies

Pitfalls in Relational-Database Design

Decomposition

Normalization Using Functional Dependencies

Boyce-Codd Normal Form

Third Normal Form

Lecture 14
Functional Dependencies