Query complier/optimizer
flowchart TD user(App/User) compiler[Query compiler/optimizer] engine[Execution engine] index[Index/Record manager] buffer[Buffer manager] transaction[Transaction Manager] storage[(Storage)] user -- Transaction commands --> transaction user -- Query / DML / DDL --> compiler compiler -- Query execution plan --> engine engine -- Record, Index requests --> index index -- Page commands --> buffer buffer -- Read/write pages --> storage transaction --> engine transaction --> buffer
First, a query is converted to a query plan (tree), e.g.
SELECT buyer
FROM Purchase, Person
WHERE buyer=name
AND city = 'Seattle'
AND phone > '5430000'This query can be converted into two query plans
flowchart TD proj1(Projection) select1(Selection) join1(Join) pur1(Purchase) per1(Person) proj2(Projection) select2(Selection) join2(Join) pur2(Purchase) per2(Person) subgraph Filter first proj1 --> select1 select1 --> join1 join1 --> pur1 join1 --> per1 end subgraph Join first proj2 --> join2 join2 --> pur2 join2 --> select2 select2 --> per2 end
Which plan is more efficient? Join is an expensive operation, so it is best to reduce the size of operands as much as possible! Filter first IS the better one.
Transformations in relational algebra
Algebraic laws: Union, Intersection, Join
Commutativity and Associativity
- this applies to both bag and set semantics
Note that each transformation yields a different operation order.
Distributivity
- The usual for
Algebraic laws: Selection
- When
is defined on attributes from only
Selection everywhere
First rule of thumb - perform selection as soon as possible and wherever possible
- Smaller input to subsequent operations
- “Free” - can be done on tuples already in memory anyway
BUT!
- Selection may not help if the next operation could use an index
Selection movearound
Let there be the following views and query:
CREATE VIEW adults AS
SELECT persName, age
FROM Person
WHERE Person.age >= 18
AND persName NOT LIKE '%secret%';
# ------------------------------------
CREATE VIEW retailPerson AS
SELECT buyer AS retailPerson, product
FROM Purchase
UNION
SELECT seller AS retailPerson, product
FROM Purchase;
# ------------------------------------
SELECT *
FROM adults, retailPeople
WHERE persName = retailPerson
AND product = 'iPud23';This can be optimized by moving selection from one query to another!
First, we move criterion on product to retailPerson
CREATE VIEW retailPerson AS
SELECT buyer AS retailPerson, product
FROM Purchase
WHERE product = 'iPud23'
UNION
SELECT seller AS retailPerson, product
FROM Purchase
WHERE product = 'iPud23';
# ------------------------------------
SELECT *
FROM adults, retailPeople
WHERE persName = retailPerson;And then optimized again, an intermediate step
We can move criterion on persName to the main query
CREATE VIEW adults AS
SELECT persName, age
FROM Person
WHERE Person.age >= 18;
# ------------------------------------
SELECT *
FROM adults, retailPeople
WHERE persName = retailPerson
AND persName NOT LIKE '%secret%';And final optimization
We move criterion on persName to both adults and retailPerson
CREATE VIEW adults AS
SELECT persName, age
FROM Person
WHERE Person.age >= 18
AND persName NOT LIKE '%secret%';
# ------------------------------------
CREATE VIEW retailPerson AS
SELECT buyer AS retailPerson, product
FROM Purchase
WHERE product = 'iPud23'
AND persName NOT LIKE '%secret%'
UNION
SELECT seller AS retailPerson, product
FROM Purchase
WHERE product = 'iPud23'
AND persName NOT LIKE '%secret%';
# ------------------------------------
SELECT *
FROM adults, retailPeople
WHERE persName = retailPerson;The resulting query has a MUCH smaller join, which is significantly cheaper
Selection and HAVING
Let there be a query
SELECT product, MAX(age)
FROM Person, Purchase
WHERE persName = buyer
GFROUP BY product
HAVING MAX(age) > 40Can we somehow optimize it?
We only return the maximum age of a buyer per product. This age will always be larger than 40. This means that tuples with
SELECT product, MAX(age)
FROM Person, Purchase
WHERE persName = buyer
AND age > 40
GFROUP BY productIs this always possible? Definitely not! For example if we take MIN instead of MAX, tuples with
Algebraic laws: Projection
- where
are appropriate subsets of attributes of
- where
Projection everywhere
Second rule of thumb - perform bag projection as soon as possible and wherever possible
- Smaller input to subsequent operations
- “Free” - can be done on tuples already in memory anyway
BUT!
- Make sure that columns that are projected out are not needed later
- Projection may not help if the next operation could use an index
Unnesting queries
Let there be a query
SELECT DISTINCT product
FROM Purchase
WHERE buyer IN (SELECT persName
FROM Person
WHERE age >= 18) This can be unnested!
SELECT DISTINCT product
FROM Purchase, Person
WHERE buyer=persName
AND age >= 18Another example:
SELECT DISTINCT x.name, x.maker
FROM Product x
WHERE x.color = 'blue'
AND x.price >= ALL (SELECT y.price
FROM Product y
WHERE x.maker = y.maker
AND y.color = 'blue')What can we done? We can compute the complement first
SELECT DISTINCT x.name, x.maker
FROM Product x
WHERE x.color = 'blue'
AND x.price < ANY (SELECT y.price
FROM Product y
WHERE x.maker = y.maker
AND y.color='blue')And this can now be unnested
SELECT DISTINCT x.name, x.maker
FROM Product x, Product y
WHERE x.color = 'blue'
AND x.price < y.price
AND x.maker = y.maker
AND y.color = 'blue'And now take the complement of this
SELECT DISTINCT x.name, x.maker
FROM Product x
WHERE x.color = 'blue'
EXCEPT
SELECT x.name, x.maker
FROM Product x, Product y
WHERE x.color = 'blue'
AND x.price < y.price
AND x.maker = y.maker
AND y.color = 'blue'Pipelining and plan cost estimation
Computing the cost of a plan
We need to sum the costs of the plan operations
- Some operations may be cheaper with other operations
- We saw for example projection + distinct, cartesian product + selection
- The output of previous operation may be sorted
- The cost depends on the input size
- size of input relations is known
- we need to compute output size of previous operation
Pipelining
The idea is simple - pass each output tuple of one operation directly to the next operation (in the main memory)
We save 1 write of current operation output and 1 read of next operation input
There is, of course, a condition - sufficient memory. If we use entire
An example of pipelining for this query:
SELECT product
FROM Purchase, Person
WHERE Person.persName = Purchase.buyer
AND Person.age >= 18Let the query plan be:
The pipeline would then look something like this: - Selection (without writing output to the disk)
- Hash join step 1 for Person (with writing output to the disk)
- Hash join step 1 for Purchase (with writing output to the disk)
- Hash join step 2 (without writing output to the disk)
- Projection (with writing output somewhere)
We saved 1 write + read of selection result and 1 write + read of join result!
Computing output size
Cartesian product
- Input:
- Output number of tuples:
- Output tuple size:
- Output number of pages:
Reduction factor
- when selection one value out of values in column - this is correct under the assumption of a uniform distribution
- when selecting a range between and when the values in the column range from to - this is correct under the assumption of uniform distribution of values between
and
- this is correct under the assumption of uniform distribution of values between
- for the join - this is correct under the assumption of uniform distribution in
and in
- this is correct under the assumption of uniform distribution in
Cost of operations
| Operation | Without index | Clustered index | Non-clustered index | Pipelined |
|---|---|---|---|---|
| Selection | B(input) | Index search | Index search | 0 |
| Projection | B(input) | B(index) | B(index) | 0 |
- 0 if in memory, 1 for hash index, 2-4 for sorted index
- only relevant to sorted non-clustered index not in memory
- assuming the index is dense and contains the projected fields
| Operation | Without index | Clustered index on S | Non-clustered index on S | R is pipelined |
|---|---|---|---|---|
| Nested loop join | B(R) + B(R) | - | - | B(R) |
| Block nested loop join | B(R) + | - | - | |
| Index nested loop join | - | B(R) + T(R) | B(R) + T(R) | T(R) |
| Sort-merge join | (2 | B(S) | - | (2 |
| Hash join | 3(B(R)+B(S)) | - | - | 2B(R)+3B(S) |
- assuming no pipelining from sorting to merge
- sorted index
- but including a reduction factor for tuples that have a match in S
Example
(A, B, C) (D, E) (F, G)
Assumptions:
- Hash join is possible
- Sort-merge join can always be done in 2 passes, the result is written to disk
- The indices are in the main memory
- Output is pipelined except for sorting output
Both joins are hash joins
- Selection cost is
, output size is - Cost of join on
is , output size is - Cost of join on
is 2(selection output size) + 0 - previous results are pipelines, saves one full read
- join output is already partitioned by hash
Total cost is
Sorted clustered index on C, first join is a hash join, second join is sort merge join
- Selection cost is
, output size is - Cost of join on
is , output size is - Cost of join on
is - previous results are pipelined
Total cost is
Sorted clustered index on
- Selection cost is
, output size is - Cost of join on
is , output size is - Cost of join
is - selection result is pipelined
- previous join is pipelined and sorted by
Total cost is