teyuanliu.github.io

PostgreSQL is an open-source relational database.

System architecture

Server
- Root server process
  - This is the manager process that creates a connection server process for each client connection and administers its lifecycle.
- Connection server process
  - This is the process that communicates with a client via a TCP/IP network connection, managing database files, and performs database actions on behalf of the client.
Client
- A client can be a text-oriented tool, a graphical application, or a web server.

Database creation and deletion

A project often uses one database. We can use the binaries provided by PostgreSQL to create or delete a database.

Create a database. A database name must start with an alphabetic character and is limited to 63 bytes in length.
```
createdb <DATABASE_NAME>
```
Delete a database. This removes all associated database files.
```
dropdb <DATABASE_NAME>
```

Database operations

After creating a database, we can interact with the database using Structured Query Language (SQL) commands via the following ways.

Use the PostgreSQL interactive terminal, psql.
Use a driver to connect a custom application to the database.

Psql

We can run psql to start a terminal.

psql <DATABASE_NAME>    # Assume DATABASE_NAME is mydb.

mydb=> SELECT current_date; # Print the current date.

mydb=> \h   # Get help.

mydb=> \d   # Get the list of tables and sequences.

mydb=> \dt  # Get the list of tables.

mydb=> \i commands.sql  # Read and execute the commands.sql file.

mydb=> \q   # Exit the terminal.

SQL

We can use SQL commands to interact with a database, but what is inside a database? A database consists of one or multiple tables.

Table
- A table is used to store relational data.
Column
- A column has a specific data name and data type. A table has a set of columns that group data with different name and type together. Taking a user data table for example, one column can be name and another column can be email. Note that the columns of a table have a fixed order.
Row
- A row is like a record in the table. In the user data table example, a row represents a user, and those column entries of that row are the name, the email, and other information of that user. Note that the rows of a table aren't guaranteed to have a fixed order.

Data name

A table has its name. A column also has its name.

In PostgreSQL, using unquoted names is a best practice for naming.

Unquoted names must begin with a letter or an underscore. Subsequent characters can be letters, digits, underscores, or dollar signs. Unquoted names are automatically folded to lowercase so it is recommended to just use lowercase letters.

We can use arbitrary characters like uppercase letter, spaces and hyphens for naming if we enclose the name in double quotes. However, this is a poor practice because we then have to use double quotes everywhere.

Double quotes are used for identifiers like table and column names that contain reserved keywords, special characters or spaces.

Single quotes are used for string literals.

Data types

Each column has its own data type, and here is a list of the standard SQL data types.

smallint (2 bytes)
integer/int (4 bytes)
bigint (8 bytes)
real (4 bytes)
- single-precision floating-point number
double precision (8 bytes)
- double-precision floating-point number
char(N) (1 byte length metadata overhead + actual data length)
- string with a fixed length of N
- padding with blank
- Each character can use 1 to 4 bytes.
- data size limit is 1 GB
varchar(N) (1 byte length metadata overhead + actual data length)
- variable-length string up to N characters in length
- Each character can use 1 to 4 bytes.
- data size limit is 1 GB
text
- string with no length limit
- Each character can use 1 to 4 bytes.
- data size limit is 1 GB
date (4 bytes)
- date only
time (8 bytes)
- time of day only
timestamp (8 bytes)
- date and time without time zone
timestamptz (8 bytes)
- date and time with time zone
interval (16 bytes)
- 3 fields of months, days, seconds

SQL command

SQL commands ignore extra spaces.

Comment

Use -- before a comment and everything is ignored up to the end of the line.

Table creation

Assume we want to create a gym user table.

CREATE TABLE user (
    id int generated by default as identity primary key,
    name varchar(100),
    email varchar(254) unique,
    height_m real,
    weight_kg real,
    creation_date date default current_date
);

Table deletion

DROP TABLE user;

Row creation

-- Explicit columns
-- The column order can be altered.
-- Any unrequired column can be omitted.
INSERT INTO user (name, email, height_m, weight_kg, creation_date) VALUES ('Bob', 'bob@gmail.com', 1.8, 70, '2024-12-23');

-- Shorter version
INSERT INTO user VALUES ('Bob', 'bob@gmail.com', 1.8, 70, '2024-12-23');

We can use the COPY keyword to load data from text files too.

COPY user FROM '/tmp/user.txt';

Row read

We use the SELECT statement to query data from a table. The SELECT statement has 3 parts.

Select list
- List the columns or expressions to be returned.
- Parsed from the entries after the keyword SELECT
Table list
- List the tables from which to retrieve the data.
- Parsed from the entries after the keyword FROM
Row qualification
- Specify any row filtering criteria.
- Parsed from the entries after the keyword WHERE
- Optional

-- Select all columns from the user table without row filtering criteria.
SELECT * FROM user;

-- Select the name column and bmi expression from the user table without row filtering criteria.
SELECT name, weight_kg/(height_m*height_m) AS bmi FROM user;

-- Select the name and height_m column from the user table with a qualification specifying that only rows with height_m > 1.8 and weight_kg < 70 are wanted. Remove rows with duplicate names. The result should be ordered by name in descending order.
SELECT DISTINCT name, height_m FROM user WHERE height_m > 1.8 AND weight_kg < 70 ORDER BY name DESC;

Use * to select all columns.
Use AS to relabel a output column.
Use AND, OR, and NOT in the WHERE qualification expression.
Use ORDER by to sort the returned result.

Table join

Assume we also have a gym course enrollment table that links a course to each enrolled user.

CREATE TABLE course_enrollment (
    id int generated by default as identity primary key,
    course_name varchar(100),
    user_name varchar(100),
    creation_date date default current_date
);

Then we can make join queries to access multiple tables at once, or access the same table in such a way that multiple rows of the table are processed at the same time.

In a join query, it is good practice to qualify all column names so that the query won't fail if a duplicate column name exists in the query. We can relabel the tables with aliases to save some typing.

Self join

A self join is joining a table against itself.

SELECT u1.name, u1.height_m AS u1_height, u2.name, u2.height_m AS u2_height FROM user u1 JOIN user u2 ON u1.height_m < u2.height_m;

Inner join

We can inner join the user table and the course enrollment table in a SQL query. Only the rows that have matching values in the user_name/name column in the course_enrollment/user table are returned.

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce JOIN user u ON ce.user_name = u.name;

Full Outer join

For outer joins, columns with missing data are substituted with null values.

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce FULL OUTER JOIN user u ON ce.user_name = u.name;

Exclusive outer join

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce FULL OUTER JOIN user u ON ce.user_name = u.name WHERE course_enrollment.user_name IS NULL OR user.name IS NULL;

Left outer join

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce LEFT OUTER JOIN user u ON ce.user_name = u.name;

Exclusive left outer join

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce LEFT OUTER JOIN user u ON ce.user_name = u.name WHERE user.name IS NULL;

Right outer join

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce RIGHT OUTER JOIN user u ON ce.user_name = u.name;

Exclusive right outer join

SELECT ce.course_name, u.name, u.email FROM course_enrollment ce RIGHT OUTER JOIN user u ON ce.user_name = u.name WHERE course_enrollment.user_name IS NULL;

Aggregate function

An aggregate function computes a single result from multiple input rows. Here are some common functions.

count
sum
avg
min
max

-- Get the max height from the user table.
SELECT max(height_m) FROM user;

Subquery

-- Get the user who has the max height using a subquery.
SELECT name FROM user WHERE height_m = (SELECT max(height_m) FROM user)

Group by having

We can use GROUP BY to group multiple rows into one.

-- Get the user name and number of course enrollments for each user.
SELECT user_name, count(*) FROM course_enrollment GROUP BY user_name;

We can further filter the grouped rows with HAVING.

-- Get the user name and number of course enrollments for users who have more than 2 course enrollments.
SELECT user_name, count(*) FROM course_enrollment GROUP BY user_name HAVING count(*) > 2;

Filter

The FILTER removes rows from the input of the particular aggregate function that it is attached to.

-- Get the user name and number of course enrollments created after 2025-12-24 for each user.
SELECT user_name, count(*) FILTER (WHERE creation_date > '2025-12-24') FROM course_enrollment GROUP BY user_name;

Row update

UPDATE course_enrollment SET creation_date = '2025-12-24' WHERE creation_date = '2025-12-25';

Row deletion

DELETE FROM course_enrollment WHERE course_name = 'cardio';

Note that deletion without a qualification results in the deletion of all rows in the table.

-- This delete all rows in the course enrollment table.
DELETE FROM course_enrollment;

View

We can create a view based on a query for reuse.

CREATE VIEW course_enrollment_email_view AS
    SELECT ce.course_name, u.name, u.email
        FROM course_enrollment ce
        JOIN user u
        ON ce.user_name = u.name;

SELECT * FROM course_enrollment_email_view

Foreign key referential integrity

Often times we want to make sure that a foreign key column of a table do have a matching entry in that foreign key table. This is called the referential integrity.

CREATE TABLE user (
    id int generated by default as identity primary key,
    name varchar(100),
    email varchar(254) unique,
    height_m real,
    weight_kg real,
    creation_date date default current_date
);

CREATE TABLE course_enrollment (
    id int generated by default as identity primary key,
    course_name varchar(100),
    user_id int references user(id),
    user_name varchar(100),
    creation_date date default current_date
);

Transaction

A transaction is a single, all-or-nothing operation that consists of multiple statements. The intermediate state made by each statement of a transaction is invisible to other transactions. If any statement fails, the entire operation is canceled and no change is applied to the database, leaving the database intact.

This is known as the atomic property. Here is an example of bundling multiple statements into one transaction block.

BEGIN;
-- first statement
UPDATE user SET creation_date = '2025-12-24'
    WHERE creation_date = '2025-12-25';
-- another one
COMMIT;

If we don't want to commit the ongoing transaction, we can use ROLLBACK instead of COMMIT, and then all updates will be canceled.

PostgreSQL treats every SQL statement as being executed within a transaction by implicitly wrapping every command with a BEGIN and, if successful, a COMMIT, or a ROLLBACK otherwise.

We can use SAVEPOINT and ROLLBACK TO to keep the changes earlier than the savepoint and discard the rest. A savepoint can be released if it is no longer needed. Releasing or rolling back to a savepoint automatically releases all savepoints defined after it.

Window function

A window function performs a calculation across a set of rows called the window frame. If the PARTITION BY clause is supplied, the partition is the frame unless ORDER BY is supplied. If ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that have value equal to the current row based on the ORDER BY clause.

Unlike an aggregate function, a window function doesn't group the frame rows into one row.

The OVER clause determines how the input rows are split up for processing by the window function. The PARTITION BY clause within OVER divides the rows into partitions. We can omit the PARTITION BY to operate on all rows. We can control the order of rows processed by the window function using ORDER BY within OVER.

Window functions are permitted only in the SELECT list and the ORDER BY clause and are forbidden elsewhere because they execute after other clauses like WHERE or GROUP BY.

We can use a sub-select if we need to filter rows after the window function's calculation.

When we need to put multiple window functions inside a query, it is recommended to name each windowing in a WINDOW clause and then referenced in OVER.

-- Compare each user's weight with the average weight of the users with the same height.
SELECT id, name, weight_kg, avg(weight_kg), row_number() OVER (PARTITION BY height_m ORDER BY weight_kg) FROM user;

Execution order

WHERE selects the rows.
Aggregate function operate on group rows obtained from GROUP BY.
HAVING selects the group rows, often using aggregate functions.
Window function

Inheritance

Inheritance lets an inheriting table inherit columns from an inherited table.

CREATE TABLE cities (
    name text,
    population int,
    elevation int
);

CREATE TABLE capitals (
    state char(2) UNIQUE NOT NULL
) INHERITS (cities);

The date type of the column name is text, a native type for variable length character strings.

If we want the query to be performed on just the inherited table, we can do the following.

SELECT name, elevation
    FROM ONLY cities
    WHERE elevation > 500;

References

PostgreSQL tutorial

←Previous Next→