BCNF and 3NF
Can BCNF be problematic? Decomposition into BCNF is lossless, but is that all we want? It turns out that yes, there are problematic examples!
- R(Film, Actor, Character)
- Character → Film
- Film, Actor → Character
Let us decompose on the first FD
- R1(Character, Film)
- R2(Character, Actor)
Both tables are in BCNF and it is indeed a lossless decomposition. Let us insert some legal data into these tables!
- R1: (Austin Powers, Austin Powers), (Dr. Evil, Austin Powers)
- R2: (Austin Powers, Mike Myers), (Dr. Evil, Mike Myers)
And then join them back into R
- (Austin Powers, Mike Myers, Austin Powers)
- (Austin Powers, Mike Myers, Dr. Evil)
We got a violation of the second FD! Why? What happened? The FD Film, Actor → Character mentions attributes from both tables, so neither R1 nor R2 alone can enforce it. This is indeed a problem with BCNF: sometimes it is too restrictive and forces too much decomposition. We would like not only to preserve the data (losslessness), but also to preserve the FDs (dependency preservation)!
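To make this concrete, here is a small Python sketch (my own illustration, not part of the lecture) that builds R1 and R2 with the tuples above, joins them on Character, and checks the FD Film, Actor → Character on the result:

```python
# The decomposed tables, each individually consistent with Character -> Film
R1 = {("Austin Powers", "Austin Powers"),   # (Character, Film)
      ("Dr. Evil",      "Austin Powers")}
R2 = {("Austin Powers", "Mike Myers"),      # (Character, Actor)
      ("Dr. Evil",      "Mike Myers")}

# Natural join on Character reconstructs R(Film, Actor, Character)
R = {(film, actor, character)
     for (character, film) in R1
     for (c2, actor) in R2 if c2 == character}

# Check the FD (Film, Actor) -> Character on the joined result
seen = {}
for film, actor, character in R:
    key = (film, actor)
    if key in seen and seen[key] != character:
        print("FD violated:", key, "->", {seen[key], character})
    seen[key] = character
```

Running this reports that (Austin Powers, Mike Myers) maps to two different characters, even though every insert into R1 and R2 individually looked perfectly legal.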
Third normal form (3NF) definition
A relation R is in 3NF if, for every nontrivial FD X → A that holds on R, either X is a superkey of R, or A is a prime attribute (i.e. A belongs to some candidate key of R). Clearly, 3NF is less restrictive than BCNF: it additionally allows FDs whose right-hand side attributes are prime.
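As an illustration of the definition, here is a rough Python sketch (helper names such as `closure` and `check_nf` are my own) that computes attribute closures, finds prime attributes by brute force, and classifies each FD of the running example:

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(attrs, relation, fds):
    return closure(attrs, fds) >= relation

def prime_attributes(relation, fds):
    """Attributes appearing in some candidate key (brute force over subsets)."""
    primes = set()
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            key = set(combo)
            if is_superkey(key, relation, fds) and \
               not any(is_superkey(key - {a}, relation, fds) for a in key):
                primes |= key
    return primes

def check_nf(relation, fds):
    primes = prime_attributes(relation, fds)
    for lhs, rhs in fds:
        if rhs <= lhs:                              # trivial FD: never a violation
            continue
        if not is_superkey(lhs, relation, fds):     # lhs is not a superkey
            if rhs - lhs <= primes:
                print(set(lhs), "->", set(rhs), ": violates BCNF, allowed by 3NF")
            else:
                print(set(lhs), "->", set(rhs), ": violates both BCNF and 3NF")

R   = {"Film", "Actor", "Character"}
fds = [(frozenset({"Character"}), frozenset({"Film"})),
       (frozenset({"Film", "Actor"}), frozenset({"Character"}))]
check_nf(R, fds)   # {'Character'} -> {'Film'} : violates BCNF, allowed by 3NF
```

In the running example every attribute is prime, so Character → Film is exactly the kind of FD that BCNF forbids but 3NF tolerates.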
Decomposition to 3NF
How can we decompose the relation to be in 3NF?
- Using the BCNF lossless decomposition, but only on the FDs that violate 3NF
  - This method is lossless, but we might still lose some FDs!
- Using another decomposition algorithm, called the “synthesis algorithm”, that is both lossless and dependency-preserving (a simplified sketch follows)!
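For intuition, the following simplified Python sketch of the synthesis idea assumes the given FDs already form a minimal cover (the full algorithm computes that cover and merges FDs with equal left-hand sides first); the helper names are my own:

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_key(relation, fds):
    """Find one candidate key by greedily dropping attributes from the relation."""
    key = set(relation)
    for attr in sorted(relation):
        if closure(key - {attr}, fds) >= relation:
            key -= {attr}
    return key

def synthesize_3nf(relation, fds):
    """Simplified synthesis: one schema per FD (lhs ∪ rhs), drop subsumed schemas,
    and add a key schema if no schema already contains a candidate key."""
    kept = []
    for lhs, rhs in fds:
        schema = set(lhs | rhs)
        if any(schema <= other for other in kept):
            continue
        kept = [other for other in kept if not other <= schema]
        kept.append(schema)
    if not any(closure(schema, fds) >= relation for schema in kept):
        kept.append(candidate_key(relation, fds))
    return kept

R   = {"Film", "Actor", "Character"}
fds = [(frozenset({"Character"}), frozenset({"Film"})),
       (frozenset({"Film", "Actor"}), frozenset({"Character"}))]
print(synthesize_3nf(R, fds))
# [{'Film', 'Actor', 'Character'}] -- R stays whole here, so both FDs are preserved
```

On the running example the synthesis keeps R as a single relation, which is precisely why both FDs survive, unlike in the BCNF decomposition above.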
Server side
The high-level architecture of the server tier looks something like this
```mermaid
flowchart TD
    user(App/User)
    compiler[Query compiler/optimizer]
    engine[Execution engine]
    index[Index/Record manager]
    buffer[Buffer manager]
    transaction[Transaction Manager]
    storage[(Storage)]
    user -- Transaction commands --> transaction
    user -- Query / DML / DDL --> compiler
    compiler -- Query execution plan --> engine
    engine -- Record, Index requests --> index
    index -- Page commands --> buffer
    buffer -- Read/write pages --> storage
    transaction --> engine
    transaction --> buffer
```
Storage
We will talk about the buffer manager and storage.
| | Main memory | External memory |
|---|---|---|
| Examples | RAM, cache | HDD, SSD, removable storage |
| Processor access | Directly by CPU | Requires loading the data into main memory |
| Volatility | Volatile | Persistent, does not depend on power supply |
| Cost | Expensive | Cheap |
| Access time | Tens of nanoseconds, less for cache | A few milliseconds, one page/sector at a time (typically 4 KB) |
Main memory buffer management
The buffer manager of the DBMS loads pages from external memory into frames of the buffer pool. The buffer pool is a collection of frames kept in main memory for fast access. Which frame the data is loaded into is determined by the replacement policy. The buffer manager
- Decides which pages to load
- Decides which pages to overwrite (replacement policy)
- Decides which pages to write back to external memory
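As a toy illustration (not how a real DBMS implements it), here is a minimal buffer pool with an LRU replacement policy; `disk` stands for any dict-like object mapping page ids to page contents:

```python
from collections import OrderedDict

class BufferManager:
    """Toy buffer manager: a fixed number of frames with LRU replacement.
    Real systems add pinning, latching, and smarter replacement policies."""

    def __init__(self, disk, num_frames):
        self.disk = disk
        self.num_frames = num_frames
        self.frames = OrderedDict()   # page_id -> (page, dirty_flag), kept in LRU order
        self.io_count = 0

    def get_page(self, page_id):
        """Return the page, loading it from 'disk' on a miss (evicting if the pool is full)."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)          # mark as most recently used
            return self.frames[page_id][0]
        if len(self.frames) >= self.num_frames:
            self._evict()
        self.io_count += 1                            # one read I/O
        page = self.disk[page_id]
        self.frames[page_id] = (page, False)
        return page

    def mark_dirty(self, page_id):
        page, _ = self.frames[page_id]
        self.frames[page_id] = (page, True)

    def force(self, page_id):
        """Force a page to disk immediately (needed e.g. for logging/recovery)."""
        page, dirty = self.frames[page_id]
        if dirty:
            self.disk[page_id] = page
            self.io_count += 1                        # one write I/O
            self.frames[page_id] = (page, False)

    def _evict(self):
        """Evict the least recently used frame, writing it back first if it is dirty."""
        victim_id, (page, dirty) = self.frames.popitem(last=False)
        if dirty:
            self.disk[victim_id] = page
            self.io_count += 1

bm = BufferManager(disk={1: b"page one", 2: b"page two"}, num_frames=1)
bm.get_page(1); bm.get_page(2)        # second access evicts page 1 (LRU)
print(bm.io_count)                    # 2 reads
```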
Note that the buffer manager cannot simply be replaced by the operating system:
- DBMS may be able to anticipate access patterns
- DBMS then may also be able to perform prefetching
- DBMS needs the ability to force pages to disk
In main-memory algorithms, the main cost factor is CPU operations. In database algorithms, the main cost factor is I/O operations, i.e. reading/writing pages from/to external memory. We want to redesign algorithms to minimize that cost!
I/O model of computation
- Assume the cost of reading/writing one page to be 1. The cost of an algorithm is then the number of pages read/written
- We are interested in the exact average cost: asymptotic estimates do not suffice, the coefficient matters!
- We are able to make assumptions about the data itself, e.g. “a tuple is small enough to fit into a page”
2-way external merge sort
Imagine we have to sort 400 GB of data with 16 GB of RAM. Is it possible? Not with the usual sorting algorithms: we simply won’t be able to fit the whole file in memory!
2-way external merge sort is one of the solutions!
- Pass 1 - read a page, sort it in place, write it
  - only one buffer page is used
- Pass 2 and on - read two pages, merge them into an output page, write the output page
  - three buffer pages are used
  - Each time the output page is filled, write it to the end of the sorted block
  - Each time an input page is fully merged, replace it with the next page from the same sorted block (if it exists)
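Below is a minimal in-memory simulation of the algorithm (an illustrative sketch, not production code): the “disk” is a list of pages, pass 1 sorts each page, and later passes merge sorted blocks pairwise while keeping only two input pages and one output page “in memory” at a time.

```python
def two_way_external_sort(disk_pages, page_size=4):
    """disk_pages: list of pages, each page a list of at most page_size records."""
    # Pass 1: sort each page individually -> one-page sorted blocks
    blocks = [[sorted(page)] for page in disk_pages]

    # Pass 2 and on: repeatedly merge sorted blocks pairwise until one remains
    while len(blocks) > 1:
        merged = []
        for i in range(0, len(blocks), 2):
            if i + 1 < len(blocks):
                merged.append(merge_blocks(blocks[i], blocks[i + 1], page_size))
            else:
                merged.append(blocks[i])        # odd block carries over to the next pass
        blocks = merged
    return blocks[0]

def merge_blocks(block_a, block_b, page_size):
    """Merge two sorted blocks (lists of pages) page by page:
    only two input pages and one output page are 'in memory' at any moment."""
    out_block, out_page = [], []
    ia = ib = 0
    page_a = list(block_a[0])                   # current input page of each block
    page_b = list(block_b[0])
    while page_a or page_b:
        # Take the smaller head element (or drain the non-empty side)
        if not page_b or (page_a and page_a[0] <= page_b[0]):
            out_page.append(page_a.pop(0))
            if not page_a:                      # input page exhausted: load the next one
                ia += 1
                page_a = list(block_a[ia]) if ia < len(block_a) else []
        else:
            out_page.append(page_b.pop(0))
            if not page_b:
                ib += 1
                page_b = list(block_b[ib]) if ib < len(block_b) else []
        if len(out_page) == page_size:          # output page full: "write" it to the block
            out_block.append(out_page)
            out_page = []
    if out_page:
        out_block.append(out_page)
    return out_block

print(two_way_external_sort([[9, 3, 7, 1], [8, 2, 6, 4], [5, 0]]))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```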
Let the data consist of $N$ pages. Let the size of the main memory buffer (in pages) be $B$.
- In pass 1 we can use $B$ buffer pages instead of 1, producing $\lceil N/B \rceil$ sorted blocks
- In pass 2 and on we can use $B-1$ input pages instead of just 2 (plus one output page)
This will then reduce the number of passes to $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$.
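As a sanity check on this formula, here is a small calculation (my own, illustrative only) for the earlier scenario of 400 GB of data and 16 GB of RAM, assuming the 4 KB page size mentioned in the table above:

```python
import math

PAGE = 4 * 1024                        # page size in bytes (4 KB, as in the table above)
N = 400 * 1024**3 // PAGE              # data size in pages   (~1.05e8)
B = 16 * 1024**3 // PAGE               # buffer size in pages (~4.2e6)

sorted_blocks = math.ceil(N / B)       # sorted blocks produced by pass 1
passes_full_buffer = 1 + math.ceil(math.log(sorted_blocks, B - 1))
passes_2_way = 1 + math.ceil(math.log2(N))

print(sorted_blocks)        # 25
print(passes_full_buffer)   # 2  passes when all B buffer pages are used
print(passes_2_way)         # 28 passes for the plain 2-way variant
# Each pass reads and writes all N pages, so total I/O cost = 2 * N * passes.
```

So with the full buffer the 400 GB file is sorted in just two passes, whereas the plain 2-way variant would need 28.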