SQL 101 – Xiaochuan’s blog

Every professional interacting with data will come across SQL sooner or later. In this post, we cover a core set of vocabulary, grammer, expression for efficient use of this language.

vocabulary

from where group by having select distinct union order by limit/fetch first rows only

these vocabularies are fundamental in the day to day usage of SQL. They are listed in the execution order as well, meaning that the from clause first gets executed (which table do we want), then where (which row) and so on. We don’t have to include all the clauses in one query, but if we do, then knowing the order in which they are executed matters, especially for debugging errors.

In this post, we use a jupyter kernel called xsqlite with sqlite3 backend (the most commonly used in-memory database) to demonstrate the SQL language. To install the kernel, plesae visit here.

Here are simple queries with some of the vocabs.

%load ../assets/tutorial.db

select * from customers limit 5

customer_id	name	visited_on	amount
1	Jhon	2019-01-01	100
2	Daniel	2019-01-02	110
3	Jade	2019-01-03	120
4	Khaled	2019-01-04	130
5	Winston	2019-01-05	110

select * from customers where customer_id<3

customer_id	name	visited_on	amount
1	Jhon	2019-01-01	100
2	Daniel	2019-01-02	110
1	Jhon	2019-01-10	130

select * from customers where customer_id=2
union
select * from customers where customer_id=2

customer_id	name	visited_on	amount
2	Daniel	2019-01-02	110

select * from customers where customer_id=2
union all
select * from customers where customer_id=2

customer_id	name	visited_on	amount
2	Daniel	2019-01-02	110
2	Daniel	2019-01-02	110

select distinct customer_id, visited_on 
from customers 
order by visited_on, customer_id

customer_id	visited_on
1	2019-01-01
2	2019-01-02
3	2019-01-03
4	2019-01-04
5	2019-01-05
6	2019-01-06
7	2019-01-07
8	2019-01-08
9	2019-01-09
1	2019-01-10
3	2019-01-10

aggregate function

It is very common that one wants summary statistics of different group of people/items/products. In SQL, this is done with the group by clause together with aggregate function(s).

select customer_id, min(visited_on), count(*) 
from customers 
where customer_id<3 
group by customer_id

customer_id	min(visited_on)	count(*)
1	2019-01-01	2
2	2019-01-02	1

select customer_id, min(visited_on), count(*) as cnt 
from customers 
where customer_id<3
group by customer_id 
having cnt=1

customer_id	min(visited_on)	cnt
2	2019-01-02	1

notice that having and where play a similar role which is to select relevant rows, but one gets executed before group by, the other after.

windows function

windows function and aggregate function are similar-ish in that they both act on groups (with different synatx though). The difference is that an aggregate function collapses rows within the same group into one, whereas a windows function keeps all the rows within the same group/window, and add a new value to each row.

select row_number() over(partition by visited_on), visited_on from customers

row_number() over(partition by visited_on)	visited_on
1	2019-01-01
1	2019-01-02
1	2019-01-03
1	2019-01-04
1	2019-01-05
1	2019-01-06
1	2019-01-07
1	2019-01-08
1	2019-01-09
1	2019-01-10
2	2019-01-10

here row_number() is the windows function acting on the groups obtained from partition by visited_on.

We can use order by instead of partition by, in which case there is a single group, and the function is executed in the requested order.

select row_number() over(order by visited_on), visited_on from customers

row_number() over(order by visited_on)	visited_on
1	2019-01-01
2	2019-01-02
3	2019-01-03
4	2019-01-04
5	2019-01-05
6	2019-01-06
7	2019-01-07
8	2019-01-08
9	2019-01-09
10	2019-01-10
11	2019-01-10

it is possible to use both partition by and order by in over().

subquery and common table expression

One can nest one query in another, called subquery. They are very useful in practice. Indeed, one complicated query needs to be decomposed into a few tasks. After figuring out the intermediate steps, one can put things together by either chaining them sequentially, or union, or a combination of both.

nested subqueries can be hard to read as the level of nested queries increases. this is where CTE comes into rescue. CTE is like defining intermediate variables in a general-purpose language such as python.

To illuastrate this, we consider three tables which are relational through ids.

select * from movies

movie_id	title
1	Avengers
2	Frozen 2
3	Joker

select * from userss

user_id	name
1	Daniel
2	Monica
3	Maria
4	James

select * from movierating

movie_id	user_id	rating	created_at
1	1	3	2020-01-12
1	2	4	2020-02-11
1	3	2	2020-02-12
1	4	1	2020-01-01
2	1	5	2020-02-17
2	2	2	2020-02-01
2	3	2	2020-03-01
3	1	3	2020-02-22
3	2	4	2020-02-25

first we join three tables (search online for all kinds of join methods!)

select * from movierating left join userss using (user_id) left join movies using (movie_id)

movie_id	user_id	rating	created_at	name	title
1	1	3	2020-01-12	Daniel	Avengers
1	2	4	2020-02-11	Monica	Avengers
1	3	2	2020-02-12	Maria	Avengers
1	4	1	2020-01-01	James	Avengers
2	1	5	2020-02-17	Daniel	Frozen 2
2	2	2	2020-02-01	Monica	Frozen 2
2	3	2	2020-03-01	Maria	Frozen 2
3	1	3	2020-02-22	Daniel	Joker
3	2	4	2020-02-25	Monica	Joker

now we define t as the result of joining with CTE, then find the person’s name who watched the largest number of movies, in case of a tie, choose the name that is lexicographical smaller (i.e. appears first in English dictionary). We would need the counts in order to find the name, this is where a subquery is required.

-- CTE
with t as (
    select * from movierating left join userss using (user_id) left join movies using (movie_id)
)

select name as results from (
    -- subquery
    select name, count(*) cnt from t group by user_id, name order by cnt desc, name 
) 
limit 1

results
Daniel

next steps

writing efficient SQL is a matter of practice. In the sql directory of this repo, you can find the solution to 50 exercises collected from leetcode. They cover a wide range of problems, including what we’ve covered in this post and date operations, regular expressions… have fun learning SQL!