%load ../assets/tutorial.db
Every professional interacting with data will come across SQL sooner or later. In this post, we cover a core set of vocabulary, grammer, expression for efficient use of this language.
vocabulary
from where group by having select distinct union order by limit/fetch first rows only
these vocabularies are fundamental in the day to day usage of SQL. They are listed in the execution order as well, meaning that the from clause first gets executed (which table do we want), then where (which row) and so on. We don’t have to include all the clauses in one query, but if we do, then knowing the order in which they are executed matters, especially for debugging errors.
In this post, we use a jupyter kernel called xsqlite with sqlite3 backend (the most commonly used in-memory database) to demonstrate the SQL language. To install the kernel, plesae visit here.
Here are simple queries with some of the vocabs.
select * from customers limit 5
| customer_id | name | visited_on | amount |
| 1 | Jhon | 2019-01-01 | 100 |
| 2 | Daniel | 2019-01-02 | 110 |
| 3 | Jade | 2019-01-03 | 120 |
| 4 | Khaled | 2019-01-04 | 130 |
| 5 | Winston | 2019-01-05 | 110 |
select * from customers where customer_id<3
| customer_id | name | visited_on | amount |
| 1 | Jhon | 2019-01-01 | 100 |
| 2 | Daniel | 2019-01-02 | 110 |
| 1 | Jhon | 2019-01-10 | 130 |
select * from customers where customer_id=2
union
select * from customers where customer_id=2
| customer_id | name | visited_on | amount |
| 2 | Daniel | 2019-01-02 | 110 |
select * from customers where customer_id=2
union all
select * from customers where customer_id=2
| customer_id | name | visited_on | amount |
| 2 | Daniel | 2019-01-02 | 110 |
| 2 | Daniel | 2019-01-02 | 110 |
select distinct customer_id, visited_on
from customers
order by visited_on, customer_id
| customer_id | visited_on |
| 1 | 2019-01-01 |
| 2 | 2019-01-02 |
| 3 | 2019-01-03 |
| 4 | 2019-01-04 |
| 5 | 2019-01-05 |
| 6 | 2019-01-06 |
| 7 | 2019-01-07 |
| 8 | 2019-01-08 |
| 9 | 2019-01-09 |
| 1 | 2019-01-10 |
| 3 | 2019-01-10 |
aggregate function
It is very common that one wants summary statistics of different group of people/items/products. In SQL, this is done with the group by clause together with aggregate function(s).
select customer_id, min(visited_on), count(*)
from customers
where customer_id<3
group by customer_id
| customer_id | min(visited_on) | count(*) |
| 1 | 2019-01-01 | 2 |
| 2 | 2019-01-02 | 1 |
select customer_id, min(visited_on), count(*) as cnt
from customers
where customer_id<3
group by customer_id
having cnt=1
| customer_id | min(visited_on) | cnt |
| 2 | 2019-01-02 | 1 |
notice that having and where play a similar role which is to select relevant rows, but one gets executed before group by, the other after.
windows function
windows function and aggregate function are similar-ish in that they both act on groups (with different synatx though). The difference is that an aggregate function collapses rows within the same group into one, whereas a windows function keeps all the rows within the same group/window, and add a new value to each row.
select row_number() over(partition by visited_on), visited_on from customers
| row_number() over(partition by visited_on) | visited_on |
| 1 | 2019-01-01 |
| 1 | 2019-01-02 |
| 1 | 2019-01-03 |
| 1 | 2019-01-04 |
| 1 | 2019-01-05 |
| 1 | 2019-01-06 |
| 1 | 2019-01-07 |
| 1 | 2019-01-08 |
| 1 | 2019-01-09 |
| 1 | 2019-01-10 |
| 2 | 2019-01-10 |
here row_number() is the windows function acting on the groups obtained from partition by visited_on.
We can use order by instead of partition by, in which case there is a single group, and the function is executed in the requested order.
select row_number() over(order by visited_on), visited_on from customers
| row_number() over(order by visited_on) | visited_on |
| 1 | 2019-01-01 |
| 2 | 2019-01-02 |
| 3 | 2019-01-03 |
| 4 | 2019-01-04 |
| 5 | 2019-01-05 |
| 6 | 2019-01-06 |
| 7 | 2019-01-07 |
| 8 | 2019-01-08 |
| 9 | 2019-01-09 |
| 10 | 2019-01-10 |
| 11 | 2019-01-10 |
it is possible to use both partition by and order by in over().
subquery and common table expression
One can nest one query in another, called subquery. They are very useful in practice. Indeed, one complicated query needs to be decomposed into a few tasks. After figuring out the intermediate steps, one can put things together by either chaining them sequentially, or union, or a combination of both.
nested subqueries can be hard to read as the level of nested queries increases. this is where CTE comes into rescue. CTE is like defining intermediate variables in a general-purpose language such as python.
To illuastrate this, we consider three tables which are relational through ids.
select * from movies
| movie_id | title |
| 1 | Avengers |
| 2 | Frozen 2 |
| 3 | Joker |
select * from userss
| user_id | name |
| 1 | Daniel |
| 2 | Monica |
| 3 | Maria |
| 4 | James |
select * from movierating
| movie_id | user_id | rating | created_at |
| 1 | 1 | 3 | 2020-01-12 |
| 1 | 2 | 4 | 2020-02-11 |
| 1 | 3 | 2 | 2020-02-12 |
| 1 | 4 | 1 | 2020-01-01 |
| 2 | 1 | 5 | 2020-02-17 |
| 2 | 2 | 2 | 2020-02-01 |
| 2 | 3 | 2 | 2020-03-01 |
| 3 | 1 | 3 | 2020-02-22 |
| 3 | 2 | 4 | 2020-02-25 |
first we join three tables (search online for all kinds of join methods!)
select * from movierating left join userss using (user_id) left join movies using (movie_id)
| movie_id | user_id | rating | created_at | name | title |
| 1 | 1 | 3 | 2020-01-12 | Daniel | Avengers |
| 1 | 2 | 4 | 2020-02-11 | Monica | Avengers |
| 1 | 3 | 2 | 2020-02-12 | Maria | Avengers |
| 1 | 4 | 1 | 2020-01-01 | James | Avengers |
| 2 | 1 | 5 | 2020-02-17 | Daniel | Frozen 2 |
| 2 | 2 | 2 | 2020-02-01 | Monica | Frozen 2 |
| 2 | 3 | 2 | 2020-03-01 | Maria | Frozen 2 |
| 3 | 1 | 3 | 2020-02-22 | Daniel | Joker |
| 3 | 2 | 4 | 2020-02-25 | Monica | Joker |
now we define t as the result of joining with CTE, then find the person’s name who watched the largest number of movies, in case of a tie, choose the name that is lexicographical smaller (i.e. appears first in English dictionary). We would need the counts in order to find the name, this is where a subquery is required.
-- CTE
with t as (
select * from movierating left join userss using (user_id) left join movies using (movie_id)
)
select name as results from (
-- subquery
select name, count(*) cnt from t group by user_id, name order by cnt desc, name
)
limit 1
| results |
| Daniel |
next steps
writing efficient SQL is a matter of practice. In the sql directory of this repo, you can find the solution to 50 exercises collected from leetcode. They cover a wide range of problems, including what we’ve covered in this post and date operations, regular expressions… have fun learning SQL!