Cornell Database Course: Free SQL & Database Design Guide

Why Study Databases Formally?

Many developers learn SQL by picking up fragments as needed, but a formal study of databases reveals the "why" behind design decisions. Understanding relational algebra explains why certain SQL queries are efficient. Understanding normalization prevents the data anomalies that plague real applications. Understanding transaction isolation levels prevents subtle bugs in concurrent systems.

This guide synthesizes the core content from computer science database curricula — the kind of content covered in courses like Cornell's CS 4320 (Intro to Databases) and similar programs at other top universities.

The Relational Model

The relational model, proposed by E.F. Codd in 1970, organizes data into tables (relations) with well-defined rules:

A relation is a set of tuples (rows)
Each tuple has the same set of attributes (columns)
Each attribute has a domain (data type)
A key is a minimal set of attributes that uniquely identifies each tuple
The primary key is the chosen unique identifier for a relation
Foreign keys create references between relations

Relational algebra is the mathematical foundation of SQL. Key operations:

Operation	SQL Equivalent	Description
σ (Selection)	WHERE	Filter rows
π (Projection)	SELECT columns	Choose columns
⋈ (Join)	JOIN	Combine tables
∪ (Union)	UNION	Combine results
- (Difference)	EXCEPT	Rows in A not in B
× (Cartesian Product)	CROSS JOIN	Every pair of rows

Entity-Relationship (ER) Modeling

Before creating tables, model your domain with ER diagrams:

Entities: Objects with independent existence (Customer, Product, Order)

Relationships: Associations between entities:

One-to-One (1:1): Person has one Passport
One-to-Many (1:N): Customer places many Orders
Many-to-Many (M:N): Students enroll in many Courses; Courses have many Students

Attributes: Properties of entities (name, email, price)

Simple: Atomic value (age)
Composite: Multiple components (address → street, city, state)
Multi-valued: Multiple values per entity (phone numbers)
Derived: Computed from other attributes (age from birth_date)

Mapping ER to Relations:

Each entity becomes a table
1:N relationships add a foreign key to the "many" side
M:N relationships require a junction/bridge table

-- Student-Course M:N relationship
CREATE TABLE students (
    student_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(200) UNIQUE NOT NULL
);

CREATE TABLE courses (
    course_id SERIAL PRIMARY KEY,
    code VARCHAR(20) UNIQUE NOT NULL,
    title VARCHAR(200) NOT NULL,
    credits SMALLINT CHECK (credits BETWEEN 1 AND 6)
);

-- Junction table for M:N
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id) ON DELETE CASCADE,
    course_id INTEGER REFERENCES courses(course_id) ON DELETE CASCADE,
    enrolled_at TIMESTAMP DEFAULT NOW(),
    grade CHAR(2),
    PRIMARY KEY (student_id, course_id)
);

Normalization: Eliminating Redundancy

Normalization is the process of organizing tables to reduce data redundancy and avoid update anomalies.

First Normal Form (1NF)

Every column must contain atomic (indivisible) values. No repeating groups.

Violation: StudentID | Name | Courses where Courses = "Math, Physics, CS" Fix: Create a separate row for each course enrollment.

Second Normal Form (2NF)

Must be in 1NF, and every non-key attribute must depend on the entire primary key (no partial dependencies).

Violation in a composite-key table:

Enrollment(StudentID, CourseID, StudentName, CourseName, Grade)

StudentName depends only on StudentID, not the full composite key.

Fix: Separate into Students and Courses tables.

Third Normal Form (3NF)

Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).

Violation: Employee(EmpID, DeptID, DeptName) DeptName depends on DeptID, which is a non-key attribute.

Fix: Separate into Employees(EmpID, DeptID) and Departments(DeptID, DeptName).

Boyce-Codd Normal Form (BCNF)

A stricter version of 3NF: every determinant must be a key.

Advanced SQL

Window Functions

Window functions perform calculations across related rows without collapsing them:

-- Rank employees by salary within each department
SELECT
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS overall_rank,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
    salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

Common Table Expressions (CTEs)

CTEs make complex queries readable by naming intermediate results:

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(total) AS revenue
    FROM orders
    GROUP BY 1
),
revenue_growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
        ROUND(
            (revenue - LAG(revenue) OVER (ORDER BY month))
            / LAG(revenue) OVER (ORDER BY month) * 100, 2
        ) AS growth_pct
    FROM monthly_sales
)
SELECT * FROM revenue_growth
WHERE growth_pct IS NOT NULL
ORDER BY month;

Recursive CTEs

Traverse hierarchical data (org charts, bill of materials):

WITH RECURSIVE org_chart AS (
    -- Base case: CEO (no manager)
    SELECT id, name, manager_id, 0 AS level, name AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case
    SELECT e.id, e.name, e.manager_id, o.level + 1, o.path || ' > ' || e.name
    FROM employees e
    JOIN org_chart o ON e.manager_id = o.id
)
SELECT level, path FROM org_chart ORDER BY path;

Indexing and Query Optimization

Indexes dramatically speed up queries but add overhead on writes.

B-Tree indexes (default): Excellent for equality and range queries, sorting:

CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created ON orders(created_at);
-- Composite index for queries filtering on both columns
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);

Hash indexes: Only for equality comparisons, faster than B-tree for exact matches (PostgreSQL).

Partial indexes: Index only a subset of rows:

-- Only index active products (reduces index size)
CREATE INDEX idx_active_products ON products(name) WHERE is_active = true;

EXPLAIN ANALYZE: Understand query execution plans:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;
-- Look for: Seq Scan (slow) vs Index Scan (fast), actual vs estimated rows

Transactions and Isolation Levels

ACID properties guarantee reliable transactions:

Atomicity: All or nothing — if one step fails, all steps roll back
Consistency: Data always moves from one valid state to another
Isolation: Concurrent transactions don't interfere with each other
Durability: Committed data persists even after a crash

Isolation levels trade consistency for concurrency:

Level	Dirty Read	Non-Repeatable Read	Phantom Read
READ UNCOMMITTED	Possible	Possible	Possible
READ COMMITTED	Not possible	Possible	Possible
REPEATABLE READ	Not possible	Not possible	Possible
SERIALIZABLE	Not possible	Not possible	Not possible

PostgreSQL defaults to READ COMMITTED. Most applications need REPEATABLE READ or SERIALIZABLE only for critical financial transactions.

NoSQL vs. Relational

Understand when to choose each:

Relational databases excel at:

Complex queries with joins
Strong consistency requirements (financial data)
Well-defined, stable schemas
Applications requiring ACID transactions

NoSQL databases (MongoDB, Cassandra, DynamoDB) excel at:

Flexible schemas that evolve frequently
Massive horizontal scalability
High write throughput
Hierarchical document structures (JSON)

Many modern applications use both — relational for transactional data, NoSQL for user activity logs, product catalogs, or session data.

Cornell University Database Course: Free Database Design and...

Cornell University Database Course: Free Database Design and SQL Fundamentals

Why Study Databases Formally?

The Relational Model

Entity-Relationship (ER) Modeling

Normalization: Eliminating Redundancy

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Advanced SQL

Window Functions

Common Table Expressions (CTEs)

Recursive CTEs

Indexing and Query Optimization

Transactions and Isolation Levels

NoSQL vs. Relational

Cornell University Database Course: Free Database Design and...

Cornell University Database Course: Free Database Design and SQL Fundamentals

Why Study Databases Formally?

The Relational Model

Entity-Relationship (ER) Modeling

Normalization: Eliminating Redundancy

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Advanced SQL

Window Functions

Common Table Expressions (CTEs)

Recursive CTEs

Indexing and Query Optimization

Transactions and Isolation Levels

NoSQL vs. Relational

Related Posts

Python and SQL Databases: The Complete Guide

How to Learn Redis: The Complete Beginner's Guide