Database normalization
Normalization is a set of rules to apply after designing what data should be stored in the DB. Normalization guarantees certain properties of the DB without the loss of information
- Better organization
- Enforcement of constraints
- Absence of redundant data
- Absence of update anomalies
First normal form (1NF) definition
Tables can only contain atomic values, no nesting is allowed
Functional dependencies (FDs) definition
A functional dependency is a general type of constraints on attributes
We can detect which FDs hold now, but we don’t necessarily have to enforce all of them in the future, that is for the DBA to determine!
Inference rules for FDs
Given a minimal set of FDs from the architect, it is important to know that there are other FDs, implicitly implied by that set.
Armstrong’s axioms
Reflexivity rule lemma
Right-Hand-Side Split/Combine rule lemma
Transitivity rule lemma
Inferred FDs
Initial FDs from the provided set and the inferred FDs hold equally, with no distinction between two types
Exercise
- Payment(ID, name, title, amount, date, department)
- ID
name, title - title
amount - amount, department
date
Which FDs can be inferred?
Reflexivity: ID nameReflexivity: ID titleReflexivity: amount, department amountReflexivity: amount, department departmentReflexivity: ID, department IDTransitivity: ID amountTransitivity: ID, department amountReflexivity: ID, department departmentCombination: ID, department amount, departmentTransitivity: ID, department date- etc.
Number of different inferred FDs grows exponentially, so it is reasonable to only leave the most “interesting” ones:
- Minimal left-side
- Maximal right-side
For example
- ID
name, title, amount - ID, department
name, title, amount, date
Attribute closure definition
Returning to the previous exercice:
Defining a key based on FDs lemma
We will use key for the general case (some arbitrary key), and minimal key/super key for keys that we know to be minimal or not minimal
Exercise
- Enrollment(student, address, course, room, time)
- student
address - room, time
course - student, course
room, time
What would be the minimal key?
- Clearly, student must be in any key (any attribute that only appears in the left-side must be in the key)
- Clearly, address will not be in the minimal key (any attribute that only appears in the right-side won’t be in the minimal key)
Relation decomposition
Let there be a relation
| Name | ID | Phone | City |
|---|---|---|---|
| Alice | 123456789 | 03-1111111 | Ramat Gan |
| Alice | 123456789 | 050-1111112 | Ramat Gan |
| Bob | 012345678 | 03-1111113 | Tel Aviv |
| Bob | 012345678 | 052-1111114 | Tel Aviv |
- ID
Name, City - ID
Phone - Redundancy - repeated data
- Update anomalies - what if Alice moves to Tel Aviv?
- Deletion anomalies - what if Alice drops all phone numbers?
What is the solution?
- Decompose relation without losing any information
- Obtain a set of relations without abnormalities (a normal form)
For example:
| Name | ID | City |
|---|---|---|
| Alice | 123456789 | Ramat Gan |
| Bob | 012345678 | Tel Aviv |
| ID | Phone |
|---|---|
| 123456789 | 03-1111111 |
| 123456789 | 050-1111112 |
| 012345678 | 03-1111113 |
| 012345678 | 052-1111114 |
| But how can we say for sure that a decomposition is correct, i.e. it doesn’t lose any information? |
Boyce-Codd normal form (BCNF) definition
Lossless decomposition step lemma
-
During decomposition, it is important to be careful with duplicate rows
-
For simplicity, we will assume that
DISTINCTis applied to every relation -
All the FDs that hold in the original relation, also hold in projected relations, including implied FDs, but, excluding FDs involving attributes that were projected out
-
Prefer FDs with minimal left-side and maximal right-side for decomposition
-
Prefer FDs that split the relation roughly in half
Example
- Person(name, SSN, age, hairColor, phoneNumber)
- SSN
name, age - age
hairColor
First, find all implied FDs
- SSN
name, age, hairColor
Second, find all keys
and its supersets
Clearly,
- R1(SSN, name, age, hairColor) - all FDs are preserved
- age
hairColor still violates BCNF, decompose again- R11(age, hairColor) =
- R12(age, SSN, name) =
- R11(age, hairColor) =
- age
- R2(SSN, phoneNumber) =
Decomposition