Aggregation — Grouping
Group by clause definition
GROUP BY separates the resulting set into groups (subsets) by the given column(s) and thus allows for calculation of aggregates for each such group (subset) instead of the whole result set
GROUP BY clause
Columns used in GROUP BY can be present in the result set as is, but don’t have to Any other columns must only be used in some aggregate
SELECT product, SUM(price*quantity) AS salesPerProduct
FROM Purchase
WHERE date >= '2020.01.01'
GROUP BY productThis query will return sum of sales since 2020 for each product
Some GROUP BY queries can be written without GROUP BY using sub-queries, although it is not really useful
Having clause definition
HAVING clause is used in combination with GROUP BY and is roughly equivalent to WHERE for each group created by GROUP BY instead of a condition for the whole table
SELECT product, MAX(price) AS maxPrice
FROM Purchase
WHERE date >= '2020.01.01'
GROUP BY product
HAVING MAX(price) < 100 AND MIN(price) > 10Using HAVING without aggregation is usually pointless as it can be done with WHERE or a sub-query, or can have unexpected results, different from defining conditions on the whole table
Quote from Wikipedia
WHEREis taken into account at an earlier stage of a query execution, filtering the rows read from the tables. If a query containsGROUP BY, rows from the tables are grouped and aggregated. After the aggregating operation,HAVINGis applied, filtering out the rows that don’t match the specified conditions. Therefore,WHEREapplies to data read from tables, andHAVINGshould only apply to aggregated data, which isn’t known in the initial stage of a query.
Exercises
- Author(login, name)
- Wrote(login, title) Find names of authors who wrote at least 10 documents
SELECT Author.name
FROM Author, Wrote
WHERE Author.login = Wrote.login
GROUP BY Author.login, Author.name
HAVING COUNT(*) >= 10Find the maximal number of articles that any author wrote
SELECT MAX(nums.numArticles) AS maxArticles
FROM (SELECT COUNT(*) AS numArticles
FROM Wrote
GROUP BY Wrote.login) AS numsor
SELECT COUNT(*) AS maxArticles
FROM Wrote
GROUP BY Wrote.login
HAVING COUNT(*) >= ALL
(SELECT COUNT(*)
FROM Wrote
GROUP BY Wrote.login)NULLs in SQL definition
NULL is a missing or unknown atomic value in the table
We can either allow NULLs or prohibit them in the table/column
Problems with NULLs
- Selection, is
NULL = 2? - Distinct, are two NULLs distinct or not?
- Joins, are two NULLs equal or not?
- Intersect, are two NULLs equal or not?
...To solve this problem, in SQL there are three(!) boolean values:FALSE = 0UNKNOWN = 0.5(conceptually)TRUE = 1UNKNOWNis used specifically for comparisons withNULL
Logical operations with UNKNOWN
ANDTRUE AND UNKNOWN = UNKNOWNFALSE AND UNKNOWN = FALSEUNKNOWN AND UNKNOWN = UNKNOWN
ORTRUE OR UNKNOWN = TRUEFALSE OR UNKNOWN = UNKNOWNUNKNOWN OR UNKNOWN = UNKNOWN
NOTNOT UNKNOWN = UNKNOWNThese can be interpreted (conceptually) as:
x AND y = min(x, y)x OR y = max(x, y)NOT x = 1 - xRows with the value ofWHEREcondition beingUNKNOWNwill not be returned
IS NULL and IS NOT NULL definition
Operators IS NULL and IS NOT NULL check exactly what is written
If a value is equal (not equal) to NULL, then it returns TRUE, otherwise FALSE
Explicit joins
Benefits
- Explicitly define join criteria and distinguish them from selection criteria
- Avoid bugs like unintended cartesian products (i.e. forgotten join clause in
WHERE)
SELECT Product.name, Purchase.store
FROM Product JOIN Purchase ON Product.name = Purchase.prodNameis equivalent to
SELECT Product.name, Purchase.store
FROM Product, Purchase
WHERE Product.name = Purchase.prodNameOuter joins definition
Outer joins include rows from the first table that did not match any rows from the second table
LEFT OUTER JOIN,LEFT JOIN- for each rows on the left do a regular join if there is a match, or match with a row of NULLs from the right tableRIGHT OUTER JOIN,RIGHT JOIN- same asLEFT OUTER JOINbut symmetricFULL OUTER JOIN- include non-matching rows from both left and right tables- Missing from
MySQL, but can be emulated
- Missing from
Inner joins definition
INNER JOIN,JOIN- same as implicit join (with a joining condition!)
Exercises
- Product(pname, price, category, manufacturer)
- Purchase(buyer, seller, store, product)
- Company(cname, stockPrice, country) Find people who bought anything from Lucy or bought products from a company whose stock price is over $50
SELECT DISTINCT pur.buyer
FROM Purchase AS pur
JOIN Product AS p ON pur.product = p.pname
JOIN Company AS c ON p.manufacturer = c.cname
WHERE pur.seller = 'Lucy' OR c.stockPrice > 50Let’s talk about the last query, there is a problem:
What if we have a NULL somewhere?
Conditions in OR are asymmetric, that is they examine different tables, and they should not depend on the contents of other tables when joining. Any failure in the joining chain would break the query
Let’s try using LEFT OUTER JOIN to eliminate this problem!
SELECT DISTINCT pur.buyer
FROM Purchase AS pur
LEFT JOIN Product AS p ON pur.product = p.pname
LEFT JOIN Company AS c ON p.manufacturer = c.cname
WHERE pur.seller = 'Lucy' OR c.stockPrice > 50Or by using UNION:
(SELECT pur.buyer
FROM Purchase AS pur
WHERE pur.seller = 'Lucy')
UNION
(SELECT pur.buyer
FROM Purchase AS pur
JOIN Product AS p ON pur.product = p.pname
JOIN Company AS c ON p.manufacturer = c.cname
WHERE c.stockPrice > 50)Ways to deal with NULLs
CASE WHEN definition
Switch case clause in SQL
SELECT product,
CASE
WHEN price < 1500 THEN 'Low price'
WHEN price BETWEEN 1500 AND 2000 THEN 'Medium price'
WHEN price > 2000 THEN 'High price'
ELSE 'Unknown'
END AS PriceCategory
...COALESCE definition
Choose the first non-NULL value from the list
SELECT COALESCE(p.category, 'misc') AS category,
SUM(price*quantity) AS totalSales
FROM Purchase AS pur
LEFT JOIN Product AS p ON pur.product = p.pname
GROUP BY COALESCE(p.category, 'misc')- Purchase(pid, buyer, seller, product, date, price) For each purchase, calculate the sum of all previous purchases of the same buyer
SELECT pur1.pid, COALESCE((SELECT SUM(pur2.price)
FROM Purchase AS pur2
WHERE pur2.buyer = pur1.buyer
AND pur2.date < pur1.date
), 0) AS previousTotal
FROM Purchase AS pur1Can we do this some other way?
Window functions definition
- For each row, open a “window” to other related rows
- Similar to aggregate functions, but return a value for each row instead of each group
SELECT col1, col2,
window_function(...)
OVER([PARTITION BY] [ORDER BY]) AS new_column
FROM TableNamewindow_function()can be an aggregate function or some other:RANK,DENSE_RANK,ROW_NUM,NTILE,LAG,LEAD
OVERdefines the windowPARTITION BY(Optional) defines the associated group for each rowORDER BY(Optional) sets the order. When used, the window often ends with the current transaction Previous query then becomes
SELECT pid,
SUM(price) OVER (
PARTITION BY buyer
ORDER BY date
RANGE BETWEEN UNBOUNDED PRECEEDING AND 1 PRECEEDING
) AS previousTotal
FROM PurchaseAnother example is
- Student(name, dept, avgGrade, …) Find all the number one and two students with the highest average in each department. Include ties
SELECT ptable.name
FROM (SELECT name, dept, avgGrade,
DENSE_RANK() OVER (
PARTITION BY dept
ORDER BY avgGrade DESC
) AS studentRank
FROM Student) as ptable
WHERE ptable.studentRank <= 2