Cornell University Database Course: Free Database Design and SQL Fundamentals

Cornell University Database Course: Free Database Design and SQL Fundamentals

Why Study Databases Formally?

Many developers learn SQL by picking up fragments as needed, but a formal study of databases reveals the "why" behind design decisions. Understanding relational algebra explains why certain SQL queries are efficient. Understanding normalization prevents the data anomalies that plague real applications. Understanding transaction isolation levels prevents subtle bugs in concurrent systems.

This guide synthesizes the core content from computer science database curricula — the kind of content covered in courses like Cornell's CS 4320 (Intro to Databases) and similar programs at other top universities.

The Relational Model

The relational model, proposed by E.F. Codd in 1970, organizes data into tables (relations) with well-defined rules:

  • A relation is a set of tuples (rows)
  • Each tuple has the same set of attributes (columns)
  • Each attribute has a domain (data type)
  • A key is a minimal set of attributes that uniquely identifies each tuple
  • The primary key is the chosen unique identifier for a relation
  • Foreign keys create references between relations

Relational algebra is the mathematical foundation of SQL. Key operations:

OperationSQL EquivalentDescription
σ (Selection)WHEREFilter rows
π (Projection)SELECT columnsChoose columns
⋈ (Join)JOINCombine tables
∪ (Union)UNIONCombine results
- (Difference)EXCEPTRows in A not in B
× (Cartesian Product)CROSS JOINEvery pair of rows

Entity-Relationship (ER) Modeling

Before creating tables, model your domain with ER diagrams:

Entities: Objects with independent existence (Customer, Product, Order)

Relationships: Associations between entities:

  • One-to-One (1:1): Person has one Passport
  • One-to-Many (1:N): Customer places many Orders
  • Many-to-Many (M:N): Students enroll in many Courses; Courses have many Students

Attributes: Properties of entities (name, email, price)

  • Simple: Atomic value (age)
  • Composite: Multiple components (address → street, city, state)
  • Multi-valued: Multiple values per entity (phone numbers)
  • Derived: Computed from other attributes (age from birth_date)

Mapping ER to Relations:

  • Each entity becomes a table
  • 1:N relationships add a foreign key to the "many" side
  • M:N relationships require a junction/bridge table
-- Student-Course M:N relationship
CREATE TABLE students (
    student_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(200) UNIQUE NOT NULL
);

CREATE TABLE courses (
    course_id SERIAL PRIMARY KEY,
    code VARCHAR(20) UNIQUE NOT NULL,
    title VARCHAR(200) NOT NULL,
    credits SMALLINT CHECK (credits BETWEEN 1 AND 6)
);

-- Junction table for M:N
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id) ON DELETE CASCADE,
    course_id INTEGER REFERENCES courses(course_id) ON DELETE CASCADE,
    enrolled_at TIMESTAMP DEFAULT NOW(),
    grade CHAR(2),
    PRIMARY KEY (student_id, course_id)
);

Normalization: Eliminating Redundancy

Normalization is the process of organizing tables to reduce data redundancy and avoid update anomalies.

First Normal Form (1NF)

Every column must contain atomic (indivisible) values. No repeating groups.

Violation: StudentID | Name | Courses where Courses = "Math, Physics, CS" Fix: Create a separate row for each course enrollment.

Second Normal Form (2NF)

Must be in 1NF, and every non-key attribute must depend on the entire primary key (no partial dependencies).

Violation in a composite-key table:

Enrollment(StudentID, CourseID, StudentName, CourseName, Grade)

StudentName depends only on StudentID, not the full composite key.

Fix: Separate into Students and Courses tables.

Third Normal Form (3NF)

Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).

Violation: Employee(EmpID, DeptID, DeptName) DeptName depends on DeptID, which is a non-key attribute.

Fix: Separate into Employees(EmpID, DeptID) and Departments(DeptID, DeptName).

Boyce-Codd Normal Form (BCNF)

A stricter version of 3NF: every determinant must be a key.

Advanced SQL

Window Functions

Window functions perform calculations across related rows without collapsing them:

-- Rank employees by salary within each department
SELECT
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS overall_rank,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
    salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

Common Table Expressions (CTEs)

CTEs make complex queries readable by naming intermediate results:

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(total) AS revenue
    FROM orders
    GROUP BY 1
),
revenue_growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
        ROUND(
            (revenue - LAG(revenue) OVER (ORDER BY month))
            / LAG(revenue) OVER (ORDER BY month) * 100, 2
        ) AS growth_pct
    FROM monthly_sales
)
SELECT * FROM revenue_growth
WHERE growth_pct IS NOT NULL
ORDER BY month;

Recursive CTEs

Traverse hierarchical data (org charts, bill of materials):

WITH RECURSIVE org_chart AS (
    -- Base case: CEO (no manager)
    SELECT id, name, manager_id, 0 AS level, name AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case
    SELECT e.id, e.name, e.manager_id, o.level + 1, o.path || ' > ' || e.name
    FROM employees e
    JOIN org_chart o ON e.manager_id = o.id
)
SELECT level, path FROM org_chart ORDER BY path;

Indexing and Query Optimization

Indexes dramatically speed up queries but add overhead on writes.

B-Tree indexes (default): Excellent for equality and range queries, sorting:

CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created ON orders(created_at);
-- Composite index for queries filtering on both columns
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);

Hash indexes: Only for equality comparisons, faster than B-tree for exact matches (PostgreSQL).

Partial indexes: Index only a subset of rows:

-- Only index active products (reduces index size)
CREATE INDEX idx_active_products ON products(name) WHERE is_active = true;

EXPLAIN ANALYZE: Understand query execution plans:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;
-- Look for: Seq Scan (slow) vs Index Scan (fast), actual vs estimated rows

Transactions and Isolation Levels

ACID properties guarantee reliable transactions:

  • Atomicity: All or nothing — if one step fails, all steps roll back
  • Consistency: Data always moves from one valid state to another
  • Isolation: Concurrent transactions don't interfere with each other
  • Durability: Committed data persists even after a crash

Isolation levels trade consistency for concurrency:

LevelDirty ReadNon-Repeatable ReadPhantom Read
READ UNCOMMITTEDPossiblePossiblePossible
READ COMMITTEDNot possiblePossiblePossible
REPEATABLE READNot possibleNot possiblePossible
SERIALIZABLENot possibleNot possibleNot possible

PostgreSQL defaults to READ COMMITTED. Most applications need REPEATABLE READ or SERIALIZABLE only for critical financial transactions.

NoSQL vs. Relational

Understand when to choose each:

Relational databases excel at:

  • Complex queries with joins
  • Strong consistency requirements (financial data)
  • Well-defined, stable schemas
  • Applications requiring ACID transactions

NoSQL databases (MongoDB, Cassandra, DynamoDB) excel at:

  • Flexible schemas that evolve frequently
  • Massive horizontal scalability
  • High write throughput
  • Hierarchical document structures (JSON)

Many modern applications use both — relational for transactional data, NoSQL for user activity logs, product catalogs, or session data.

Share: