SQL 101

data engineering
Author

Xiaochuan Yang

Published

January 27, 2024

Every professional interacting with data will come across SQL sooner or later. In this post, we cover a core set of vocabulary, grammer, expression for efficient use of this language.

vocabulary

from where group by having select distinct union order by limit/fetch first rows only

these vocabularies are fundamental in the day to day usage of SQL. They are listed in the execution order as well, meaning that the from clause first gets executed (which table do we want), then where (which row) and so on. We don’t have to include all the clauses in one query, but if we do, then knowing the order in which they are executed matters, especially for debugging errors.

In this post, we use a jupyter kernel called xsqlite with sqlite3 backend (the most commonly used in-memory database) to demonstrate the SQL language. To install the kernel, plesae visit here.

Here are simple queries with some of the vocabs.

%load ../assets/tutorial.db
select * from customers limit 5
customer_id name visited_on amount
1 Jhon 2019-01-01 100
2 Daniel 2019-01-02 110
3 Jade 2019-01-03 120
4 Khaled 2019-01-04 130
5 Winston 2019-01-05 110
select * from customers where customer_id<3
customer_id name visited_on amount
1 Jhon 2019-01-01 100
2 Daniel 2019-01-02 110
1 Jhon 2019-01-10 130
select * from customers where customer_id=2
union
select * from customers where customer_id=2
customer_id name visited_on amount
2 Daniel 2019-01-02 110
select * from customers where customer_id=2
union all
select * from customers where customer_id=2
customer_id name visited_on amount
2 Daniel 2019-01-02 110
2 Daniel 2019-01-02 110
select distinct customer_id, visited_on 
from customers 
order by visited_on, customer_id
customer_id visited_on
1 2019-01-01
2 2019-01-02
3 2019-01-03
4 2019-01-04
5 2019-01-05
6 2019-01-06
7 2019-01-07
8 2019-01-08
9 2019-01-09
1 2019-01-10
3 2019-01-10

aggregate function

It is very common that one wants summary statistics of different group of people/items/products. In SQL, this is done with the group by clause together with aggregate function(s).

select customer_id, min(visited_on), count(*) 
from customers 
where customer_id<3 
group by customer_id
customer_id min(visited_on) count(*)
1 2019-01-01 2
2 2019-01-02 1
select customer_id, min(visited_on), count(*) as cnt 
from customers 
where customer_id<3
group by customer_id 
having cnt=1
customer_id min(visited_on) cnt
2 2019-01-02 1

notice that having and where play a similar role which is to select relevant rows, but one gets executed before group by, the other after.

windows function

windows function and aggregate function are similar-ish in that they both act on groups (with different synatx though). The difference is that an aggregate function collapses rows within the same group into one, whereas a windows function keeps all the rows within the same group/window, and add a new value to each row.

select row_number() over(partition by visited_on), visited_on from customers
row_number() over(partition by visited_on) visited_on
1 2019-01-01
1 2019-01-02
1 2019-01-03
1 2019-01-04
1 2019-01-05
1 2019-01-06
1 2019-01-07
1 2019-01-08
1 2019-01-09
1 2019-01-10
2 2019-01-10

here row_number() is the windows function acting on the groups obtained from partition by visited_on.

We can use order by instead of partition by, in which case there is a single group, and the function is executed in the requested order.

select row_number() over(order by visited_on), visited_on from customers
row_number() over(order by visited_on) visited_on
1 2019-01-01
2 2019-01-02
3 2019-01-03
4 2019-01-04
5 2019-01-05
6 2019-01-06
7 2019-01-07
8 2019-01-08
9 2019-01-09
10 2019-01-10
11 2019-01-10

it is possible to use both partition by and order by in over().

subquery and common table expression

One can nest one query in another, called subquery. They are very useful in practice. Indeed, one complicated query needs to be decomposed into a few tasks. After figuring out the intermediate steps, one can put things together by either chaining them sequentially, or union, or a combination of both.

nested subqueries can be hard to read as the level of nested queries increases. this is where CTE comes into rescue. CTE is like defining intermediate variables in a general-purpose language such as python.

To illuastrate this, we consider three tables which are relational through ids.

select * from movies
movie_id title
1 Avengers
2 Frozen 2
3 Joker
select * from userss
user_id name
1 Daniel
2 Monica
3 Maria
4 James
select * from movierating
movie_id user_id rating created_at
1 1 3 2020-01-12
1 2 4 2020-02-11
1 3 2 2020-02-12
1 4 1 2020-01-01
2 1 5 2020-02-17
2 2 2 2020-02-01
2 3 2 2020-03-01
3 1 3 2020-02-22
3 2 4 2020-02-25

first we join three tables (search online for all kinds of join methods!)

select * from movierating left join userss using (user_id) left join movies using (movie_id)
movie_id user_id rating created_at name title
1 1 3 2020-01-12 Daniel Avengers
1 2 4 2020-02-11 Monica Avengers
1 3 2 2020-02-12 Maria Avengers
1 4 1 2020-01-01 James Avengers
2 1 5 2020-02-17 Daniel Frozen 2
2 2 2 2020-02-01 Monica Frozen 2
2 3 2 2020-03-01 Maria Frozen 2
3 1 3 2020-02-22 Daniel Joker
3 2 4 2020-02-25 Monica Joker

now we define t as the result of joining with CTE, then find the person’s name who watched the largest number of movies, in case of a tie, choose the name that is lexicographical smaller (i.e. appears first in English dictionary). We would need the counts in order to find the name, this is where a subquery is required.

-- CTE
with t as (
    select * from movierating left join userss using (user_id) left join movies using (movie_id)
)

select name as results from (
    -- subquery
    select name, count(*) cnt from t group by user_id, name order by cnt desc, name 
) 
limit 1
results
Daniel

next steps

writing efficient SQL is a matter of practice. In the sql directory of this repo, you can find the solution to 50 exercises collected from leetcode. They cover a wide range of problems, including what we’ve covered in this post and date operations, regular expressions… have fun learning SQL!