%load ../assets/tutorial.db
Every professional interacting with data will come across SQL sooner or later. In this post, we cover a core set of vocabulary, grammer, expression for efficient use of this language.
vocabulary
from
where
group by
having
select
distinct
union
order by
limit/fetch first rows only
these vocabularies are fundamental in the day to day usage of SQL. They are listed in the execution order as well, meaning that the from
clause first gets executed (which table do we want), then where
(which row) and so on. We don’t have to include all the clauses in one query, but if we do, then knowing the order in which they are executed matters, especially for debugging errors.
In this post, we use a jupyter kernel called xsqlite
with sqlite3
backend (the most commonly used in-memory database) to demonstrate the SQL language. To install the kernel, plesae visit here.
Here are simple queries with some of the vocabs.
select * from customers limit 5
customer_id | name | visited_on | amount |
1 | Jhon | 2019-01-01 | 100 |
2 | Daniel | 2019-01-02 | 110 |
3 | Jade | 2019-01-03 | 120 |
4 | Khaled | 2019-01-04 | 130 |
5 | Winston | 2019-01-05 | 110 |
select * from customers where customer_id<3
customer_id | name | visited_on | amount |
1 | Jhon | 2019-01-01 | 100 |
2 | Daniel | 2019-01-02 | 110 |
1 | Jhon | 2019-01-10 | 130 |
select * from customers where customer_id=2
union
select * from customers where customer_id=2
customer_id | name | visited_on | amount |
2 | Daniel | 2019-01-02 | 110 |
select * from customers where customer_id=2
union all
select * from customers where customer_id=2
customer_id | name | visited_on | amount |
2 | Daniel | 2019-01-02 | 110 |
2 | Daniel | 2019-01-02 | 110 |
select distinct customer_id, visited_on
from customers
order by visited_on, customer_id
customer_id | visited_on |
1 | 2019-01-01 |
2 | 2019-01-02 |
3 | 2019-01-03 |
4 | 2019-01-04 |
5 | 2019-01-05 |
6 | 2019-01-06 |
7 | 2019-01-07 |
8 | 2019-01-08 |
9 | 2019-01-09 |
1 | 2019-01-10 |
3 | 2019-01-10 |
aggregate function
It is very common that one wants summary statistics of different group of people/items/products. In SQL, this is done with the group by
clause together with aggregate function(s).
select customer_id, min(visited_on), count(*)
from customers
where customer_id<3
group by customer_id
customer_id | min(visited_on) | count(*) |
1 | 2019-01-01 | 2 |
2 | 2019-01-02 | 1 |
select customer_id, min(visited_on), count(*) as cnt
from customers
where customer_id<3
group by customer_id
having cnt=1
customer_id | min(visited_on) | cnt |
2 | 2019-01-02 | 1 |
notice that having
and where
play a similar role which is to select relevant rows, but one gets executed before group by
, the other after.
windows function
windows function and aggregate function are similar-ish in that they both act on groups (with different synatx though). The difference is that an aggregate function collapses rows within the same group into one, whereas a windows function keeps all the rows within the same group/window, and add a new value to each row.
select row_number() over(partition by visited_on), visited_on from customers
row_number() over(partition by visited_on) | visited_on |
1 | 2019-01-01 |
1 | 2019-01-02 |
1 | 2019-01-03 |
1 | 2019-01-04 |
1 | 2019-01-05 |
1 | 2019-01-06 |
1 | 2019-01-07 |
1 | 2019-01-08 |
1 | 2019-01-09 |
1 | 2019-01-10 |
2 | 2019-01-10 |
here row_number()
is the windows function acting on the groups obtained from partition by visited_on
.
We can use order by
instead of partition by
, in which case there is a single group, and the function is executed in the requested order.
select row_number() over(order by visited_on), visited_on from customers
row_number() over(order by visited_on) | visited_on |
1 | 2019-01-01 |
2 | 2019-01-02 |
3 | 2019-01-03 |
4 | 2019-01-04 |
5 | 2019-01-05 |
6 | 2019-01-06 |
7 | 2019-01-07 |
8 | 2019-01-08 |
9 | 2019-01-09 |
10 | 2019-01-10 |
11 | 2019-01-10 |
it is possible to use both partition by
and order by
in over()
.
subquery and common table expression
One can nest one query in another, called subquery. They are very useful in practice. Indeed, one complicated query needs to be decomposed into a few tasks. After figuring out the intermediate steps, one can put things together by either chaining them sequentially, or union, or a combination of both.
nested subqueries can be hard to read as the level of nested queries increases. this is where CTE comes into rescue. CTE is like defining intermediate variables in a general-purpose language such as python.
To illuastrate this, we consider three tables which are relational through ids.
select * from movies
movie_id | title |
1 | Avengers |
2 | Frozen 2 |
3 | Joker |
select * from userss
user_id | name |
1 | Daniel |
2 | Monica |
3 | Maria |
4 | James |
select * from movierating
movie_id | user_id | rating | created_at |
1 | 1 | 3 | 2020-01-12 |
1 | 2 | 4 | 2020-02-11 |
1 | 3 | 2 | 2020-02-12 |
1 | 4 | 1 | 2020-01-01 |
2 | 1 | 5 | 2020-02-17 |
2 | 2 | 2 | 2020-02-01 |
2 | 3 | 2 | 2020-03-01 |
3 | 1 | 3 | 2020-02-22 |
3 | 2 | 4 | 2020-02-25 |
first we join
three tables (search online for all kinds of join
methods!)
select * from movierating left join userss using (user_id) left join movies using (movie_id)
movie_id | user_id | rating | created_at | name | title |
1 | 1 | 3 | 2020-01-12 | Daniel | Avengers |
1 | 2 | 4 | 2020-02-11 | Monica | Avengers |
1 | 3 | 2 | 2020-02-12 | Maria | Avengers |
1 | 4 | 1 | 2020-01-01 | James | Avengers |
2 | 1 | 5 | 2020-02-17 | Daniel | Frozen 2 |
2 | 2 | 2 | 2020-02-01 | Monica | Frozen 2 |
2 | 3 | 2 | 2020-03-01 | Maria | Frozen 2 |
3 | 1 | 3 | 2020-02-22 | Daniel | Joker |
3 | 2 | 4 | 2020-02-25 | Monica | Joker |
now we define t
as the result of joining with CTE, then find the person’s name who watched the largest number of movies, in case of a tie, choose the name that is lexicographical smaller (i.e. appears first in English dictionary). We would need the counts in order to find the name, this is where a subquery is required.
-- CTE
with t as (
select * from movierating left join userss using (user_id) left join movies using (movie_id)
)
select name as results from (
-- subquery
select name, count(*) cnt from t group by user_id, name order by cnt desc, name
)
limit 1
results |
Daniel |
next steps
writing efficient SQL is a matter of practice. In the sql
directory of this repo, you can find the solution to 50 exercises collected from leetcode. They cover a wide range of problems, including what we’ve covered in this post and date operations, regular expressions… have fun learning SQL!