Database normalization

Normalization is a set of rules to apply after designing what data should be stored in the DB. Normalization guarantees certain properties of the DB without the loss of information

  • Better organization
  • Enforcement of constraints
  • Absence of redundant data
  • Absence of update anomalies

First normal form (1NF) definition

Tables can only contain atomic values, no nesting is allowed

Functional dependencies (FDs) definition

A functional dependency is a general type of constraints on attributes

We can detect which FDs hold now, but we don’t necessarily have to enforce all of them in the future, that is for the DBA to determine!

Inference rules for FDs

Given a minimal set of FDs from the architect, it is important to know that there are other FDs, implicitly implied by that set.

Armstrong’s axioms

Reflexivity rule lemma

Right-Hand-Side Split/Combine rule lemma

Transitivity rule lemma

Inferred FDs

Initial FDs from the provided set and the inferred FDs hold equally, with no distinction between two types

Exercise
  • Payment(ID, name, title, amount, date, department)
  • ID name, title
  • title amount
  • amount, department date

Which FDs can be inferred?

  • Reflexivity: ID name
  • Reflexivity: ID title
  • Reflexivity: amount, department amount
  • Reflexivity: amount, department department
  • Reflexivity: ID, department ID
  • Transitivity: ID amount
  • Transitivity: ID, department amount
  • Reflexivity: ID, department department
  • Combination: ID, department amount, department
  • Transitivity: ID, department date
  • etc.

Number of different inferred FDs grows exponentially, so it is reasonable to only leave the most “interesting” ones:

  • Minimal left-side
  • Maximal right-side

For example

  • ID name, title, amount
  • ID, department name, title, amount, date

Attribute closure definition

Returning to the previous exercice:

Defining a key based on FDs lemma

We will use key for the general case (some arbitrary key), and minimal key/super key for keys that we know to be minimal or not minimal

Exercise

  • Enrollment(student, address, course, room, time)
  • student address
  • room, time course
  • student, course room, time

What would be the minimal key?

  • Clearly, student must be in any key (any attribute that only appears in the left-side must be in the key)
  • Clearly, address will not be in the minimal key (any attribute that only appears in the right-side won’t be in the minimal key)

Relation decomposition

Let there be a relation

NameIDPhoneCity
Alice12345678903-1111111Ramat Gan
Alice123456789050-1111112Ramat Gan
Bob01234567803-1111113Tel Aviv
Bob012345678052-1111114Tel Aviv
  • ID Name, City
  • ID Phone
  • Redundancy - repeated data
  • Update anomalies - what if Alice moves to Tel Aviv?
  • Deletion anomalies - what if Alice drops all phone numbers?

What is the solution?

  • Decompose relation without losing any information
  • Obtain a set of relations without abnormalities (a normal form)

For example:

NameIDCity
Alice123456789Ramat Gan
Bob012345678Tel Aviv
IDPhone
12345678903-1111111
123456789050-1111112
01234567803-1111113
012345678052-1111114
But how can we say for sure that a decomposition is correct, i.e. it doesn’t lose any information?

Boyce-Codd normal form (BCNF) definition

Lossless decomposition step lemma

  • During decomposition, it is important to be careful with duplicate rows

  • For simplicity, we will assume that DISTINCT is applied to every relation

  • All the FDs that hold in the original relation, also hold in projected relations, including implied FDs, but, excluding FDs involving attributes that were projected out

  • Prefer FDs with minimal left-side and maximal right-side for decomposition

  • Prefer FDs that split the relation roughly in half

Example

  • Person(name, SSN, age, hairColor, phoneNumber)
  • SSN name, age
  • age hairColor

First, find all implied FDs

  • SSN name, age, hairColor

Second, find all keys

  • and its supersets

Clearly, is not in BCNF, all three FDs violate it Let us decompose on SSN name, age, hairColor

  • R1(SSN, name, age, hairColor) - all FDs are preserved
    • age hairColor still violates BCNF, decompose again
      • R11(age, hairColor) =
      • R12(age, SSN, name) =
  • R2(SSN, phoneNumber) =

Decomposition is a BCNF decomposition of