Query complier/optimizer

flowchart TD

user(App/User)
compiler[Query compiler/optimizer]
engine[Execution engine]
index[Index/Record manager]
buffer[Buffer manager]
transaction[Transaction Manager]
storage[(Storage)]

user -- Transaction commands --> transaction
user -- Query / DML / DDL --> compiler
compiler -- Query execution plan --> engine
engine -- Record, Index requests --> index
index -- Page commands --> buffer
buffer -- Read/write pages --> storage

transaction --> engine
transaction --> buffer

First, a query is converted to a query plan (tree), e.g.

SELECT buyer
FROM   Purchase, Person
WHERE  buyer=name
       AND city = 'Seattle'
       AND phone > '5430000'

This query can be converted into two query plans

flowchart TD

proj1(Projection)
select1(Selection)
join1(Join)
pur1(Purchase)
per1(Person)

proj2(Projection)
select2(Selection)
join2(Join)
pur2(Purchase)
per2(Person)


	subgraph Filter first
		proj1 --> select1
		select1 --> join1
		join1 --> pur1
		join1 --> per1
	end
	
	subgraph Join first
		proj2 --> join2
		join2 --> pur2
		join2 --> select2
		select2 --> per2
	end

Which plan is more efficient? Join is an expensive operation, so it is best to reduce the size of operands as much as possible! Filter first IS the better one.

Transformations in relational algebra

Algebraic laws: Union, Intersection, Join

Commutativity and Associativity

  • - this applies to both bag and set semantics

Note that each transformation yields a different operation order.

Distributivity

  • The usual for

Algebraic laws: Selection

  • When is defined on attributes from only

Selection everywhere

First rule of thumb - perform selection as soon as possible and wherever possible

  • Smaller input to subsequent operations
  • “Free” - can be done on tuples already in memory anyway

BUT!

  • Selection may not help if the next operation could use an index

Selection movearound

Let there be the following views and query:

CREATE VIEW adults AS
SELECT persName, age
FROM   Person
WHERE  Person.age >= 18
       AND persName NOT LIKE '%secret%';
 
# ------------------------------------
 
CREATE VIEW retailPerson AS
SELECT buyer AS retailPerson, product
FROM   Purchase
	UNION
SELECT seller AS retailPerson, product
FROM   Purchase;
 
# ------------------------------------
 
SELECT *
FROM   adults, retailPeople
WHERE  persName = retailPerson
       AND product = 'iPud23';

This can be optimized by moving selection from one query to another! First, we move criterion on product to retailPerson

CREATE VIEW retailPerson AS
SELECT buyer AS retailPerson, product
FROM   Purchase
WHERE  product = 'iPud23'
	UNION
SELECT seller AS retailPerson, product
FROM   Purchase
WHERE  product = 'iPud23';
 
# ------------------------------------
 
SELECT *
FROM   adults, retailPeople
WHERE  persName = retailPerson;

And then optimized again, an intermediate step We can move criterion on persName to the main query

CREATE VIEW adults AS
SELECT persName, age
FROM   Person
WHERE  Person.age >= 18;
 
# ------------------------------------
 
SELECT *
FROM   adults, retailPeople
WHERE  persName = retailPerson
	   AND persName NOT LIKE '%secret%';

And final optimization We move criterion on persName to both adults and retailPerson

CREATE VIEW adults AS
SELECT persName, age
FROM   Person
WHERE  Person.age >= 18
	   AND persName NOT LIKE '%secret%';
 
# ------------------------------------
 
CREATE VIEW retailPerson AS
SELECT buyer AS retailPerson, product
FROM   Purchase
WHERE  product = 'iPud23'
	   AND persName NOT LIKE '%secret%'
	UNION
SELECT seller AS retailPerson, product
FROM   Purchase
WHERE  product = 'iPud23'
	   AND persName NOT LIKE '%secret%';
 
# ------------------------------------
 
SELECT *
FROM   adults, retailPeople
WHERE  persName = retailPerson;

The resulting query has a MUCH smaller join, which is significantly cheaper

Selection and HAVING

Let there be a query

SELECT    product, MAX(age)
FROM      Person, Purchase
WHERE     persName = buyer
GFROUP BY product
HAVING    MAX(age) > 40

Can we somehow optimize it? We only return the maximum age of a buyer per product. This age will always be larger than 40. This means that tuples with do not affect the result!

SELECT    product, MAX(age)
FROM      Person, Purchase
WHERE     persName = buyer
		  AND age > 40
GFROUP BY product

Is this always possible? Definitely not! For example if we take MIN instead of MAX, tuples with do affect the result

Algebraic laws: Projection

    • where are appropriate subsets of attributes of

Projection everywhere

Second rule of thumb - perform bag projection as soon as possible and wherever possible

  • Smaller input to subsequent operations
  • “Free” - can be done on tuples already in memory anyway

BUT!

  • Make sure that columns that are projected out are not needed later
  • Projection may not help if the next operation could use an index

Unnesting queries

Let there be a query

SELECT DISTINCT product
FROM   Purchase
WHERE  buyer IN (SELECT persName
				 FROM   Person
				 WHERE  age >= 18)         

This can be unnested!

SELECT DISTINCT product
FROM   Purchase, Person
WHERE  buyer=persName
       AND age >= 18

Another example:

SELECT DISTINCT x.name, x.maker
FROM   Product x
WHERE  x.color = 'blue'
       AND x.price >= ALL (SELECT y.price
                           FROM   Product y
						   WHERE x.maker = y.maker
						   AND y.color = 'blue')

What can we done? We can compute the complement first

SELECT DISTINCT x.name, x.maker
FROM   Product x
WHERE  x.color = 'blue'
       AND x.price < ANY (SELECT y.price
                           FROM   Product y
						   WHERE x.maker = y.maker
						   AND y.color='blue')

And this can now be unnested

SELECT DISTINCT x.name, x.maker
FROM   Product x, Product y
WHERE  x.color = 'blue'
       AND x.price < y.price
       AND x.maker = y.maker
       AND y.color = 'blue'

And now take the complement of this

SELECT DISTINCT x.name, x.maker
FROM   Product x
WHERE  x.color = 'blue'
	EXCEPT
SELECT x.name, x.maker
FROM   Product x, Product y
WHERE  x.color = 'blue'
       AND x.price < y.price
       AND x.maker = y.maker
       AND y.color = 'blue'

Pipelining and plan cost estimation

Computing the cost of a plan

We need to sum the costs of the plan operations

  • Some operations may be cheaper with other operations
    • We saw for example projection + distinct, cartesian product + selection
    • The output of previous operation may be sorted
  • The cost depends on the input size
    • size of input relations is known
    • we need to compute output size of previous operation

Pipelining

The idea is simple - pass each output tuple of one operation directly to the next operation (in the main memory) We save 1 write of current operation output and 1 read of next operation input There is, of course, a condition - sufficient memory. If we use entire pages in memory for sorting by name, we cannot pipeline to sorting by GPA

An example of pipelining for this query:

SELECT product
FROM   Purchase, Person
WHERE  Person.persName = Purchase.buyer
	   AND Person.age >= 18

Let the query plan be:

  • The pipeline would then look something like this:
  • Selection (without writing output to the disk)
  • Hash join step 1 for Person (with writing output to the disk)
  • Hash join step 1 for Purchase (with writing output to the disk)
  • Hash join step 2 (without writing output to the disk)
  • Projection (with writing output somewhere)

We saved 1 write + read of selection result and 1 write + read of join result!

Computing output size

Cartesian product

  • Input:
  • Output number of tuples:
  • Output tuple size:
  • Output number of pages:

Reduction factor

  • - when selection one value out of values in column
    • this is correct under the assumption of a uniform distribution
  • - when selecting a range between and when the values in the column range from to
    • this is correct under the assumption of uniform distribution of values between and
  • - for the join
    • this is correct under the assumption of uniform distribution in and in

Cost of operations

OperationWithout indexClustered indexNon-clustered indexPipelined
SelectionB(input)Index search + B(input) RFIndex search (+ B(index) RF) + T(input) RF0
ProjectionB(input)B(index)B(index)0
  1. 0 if in memory, 1 for hash index, 2-4 for sorted index
  2. only relevant to sorted non-clustered index not in memory
  3. assuming the index is dense and contains the projected fields
OperationWithout indexClustered index on SNon-clustered index on SR is pipelined
Nested loop joinB(R) + B(R)B(S)--B(R)B(S)
Block nested loop joinB(R) + --
Index nested loop join-B(R) + T(R)(selection on S)B(R) + T(R)(selection on S)T(R)(B(R) + T(R)(selection on S)
Sort-merge join(2of passes+1)(B(R)+B(S))B(S)+(2of passes+1)B(R)-(2of passes)B(R) + (2of passes+1)B(S)
Hash join3(B(R)+B(S))--2B(R)+3B(S)
  1. assuming no pipelining from sorting to merge
  2. sorted index
  3. but including a reduction factor for tuples that have a match in S

Example

  • (A, B, C)
  • (D, E)
  • (F, G)

Assumptions:

  • Hash join is possible
  • Sort-merge join can always be done in 2 passes, the result is written to disk
  • The indices are in the main memory
  • Output is pipelined except for sorting output

Both joins are hash joins

  • Selection cost is , output size is
  • Cost of join on is , output size is
  • Cost of join on is 2(selection output size) + 0
    • previous results are pipelines, saves one full read
    • join output is already partitioned by hash

Total cost is

Sorted clustered index on C, first join is a hash join, second join is sort merge join

  • Selection cost is , output size is
  • Cost of join on is , output size is
  • Cost of join on is
    • previous results are pipelined

Total cost is

Sorted clustered index on and on , joins are sort-merge

  • Selection cost is , output size is
  • Cost of join on is , output size is
  • Cost of join is
    • selection result is pipelined
    • previous join is pipelined and sorted by

Total cost is