Why Study Databases Formally?
Many developers learn SQL by picking up fragments as needed, but a formal study of databases reveals the "why" behind design decisions. Understanding relational algebra explains why certain SQL queries are efficient. Understanding normalization prevents the data anomalies that plague real applications. Understanding transaction isolation levels prevents subtle bugs in concurrent systems.
This guide synthesizes the core content from computer science database curricula — the kind of content covered in courses like Cornell's CS 4320 (Intro to Databases) and similar programs at other top universities.
The Relational Model
The relational model, proposed by E.F. Codd in 1970, organizes data into tables (relations) with well-defined rules:
- A relation is a set of tuples (rows)
- Each tuple has the same set of attributes (columns)
- Each attribute has a domain (data type)
- A key is a minimal set of attributes that uniquely identifies each tuple
- The primary key is the chosen unique identifier for a relation
- Foreign keys create references between relations
Relational algebra is the mathematical foundation of SQL. Key operations:
| Operation | SQL Equivalent | Description |
|---|---|---|
| σ (Selection) | WHERE | Filter rows |
| π (Projection) | SELECT columns | Choose columns |
| ⋈ (Join) | JOIN | Combine tables |
| ∪ (Union) | UNION | Combine results |
| - (Difference) | EXCEPT | Rows in A not in B |
| × (Cartesian Product) | CROSS JOIN | Every pair of rows |
Entity-Relationship (ER) Modeling
Before creating tables, model your domain with ER diagrams:
Entities: Objects with independent existence (Customer, Product, Order)
Relationships: Associations between entities:
- One-to-One (1:1): Person has one Passport
- One-to-Many (1:N): Customer places many Orders
- Many-to-Many (M:N): Students enroll in many Courses; Courses have many Students
Attributes: Properties of entities (name, email, price)
- Simple: Atomic value (age)
- Composite: Multiple components (address → street, city, state)
- Multi-valued: Multiple values per entity (phone numbers)
- Derived: Computed from other attributes (age from birth_date)
Mapping ER to Relations:
- Each entity becomes a table
- 1:N relationships add a foreign key to the "many" side
- M:N relationships require a junction/bridge table
-- Student-Course M:N relationship
CREATE TABLE students (
student_id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(200) UNIQUE NOT NULL
);
CREATE TABLE courses (
course_id SERIAL PRIMARY KEY,
code VARCHAR(20) UNIQUE NOT NULL,
title VARCHAR(200) NOT NULL,
credits SMALLINT CHECK (credits BETWEEN 1 AND 6)
);
-- Junction table for M:N
CREATE TABLE enrollments (
student_id INTEGER REFERENCES students(student_id) ON DELETE CASCADE,
course_id INTEGER REFERENCES courses(course_id) ON DELETE CASCADE,
enrolled_at TIMESTAMP DEFAULT NOW(),
grade CHAR(2),
PRIMARY KEY (student_id, course_id)
);
Normalization: Eliminating Redundancy
Normalization is the process of organizing tables to reduce data redundancy and avoid update anomalies.
First Normal Form (1NF)
Every column must contain atomic (indivisible) values. No repeating groups.
Violation: StudentID | Name | Courses where Courses = "Math, Physics, CS" Fix: Create a separate row for each course enrollment.
Second Normal Form (2NF)
Must be in 1NF, and every non-key attribute must depend on the entire primary key (no partial dependencies).
Violation in a composite-key table:
Enrollment(StudentID, CourseID, StudentName, CourseName, Grade)
StudentName depends only on StudentID, not the full composite key.
Fix: Separate into Students and Courses tables.
Third Normal Form (3NF)
Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).
Violation: Employee(EmpID, DeptID, DeptName) DeptName depends on DeptID, which is a non-key attribute.
Fix: Separate into Employees(EmpID, DeptID) and Departments(DeptID, DeptName).
Boyce-Codd Normal Form (BCNF)
A stricter version of 3NF: every determinant must be a key.
Advanced SQL
Window Functions
Window functions perform calculations across related rows without collapsing them:
-- Rank employees by salary within each department
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS overall_rank,
AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;
Common Table Expressions (CTEs)
CTEs make complex queries readable by naming intermediate results:
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total) AS revenue
FROM orders
GROUP BY 1
),
revenue_growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month))
/ LAG(revenue) OVER (ORDER BY month) * 100, 2
) AS growth_pct
FROM monthly_sales
)
SELECT * FROM revenue_growth
WHERE growth_pct IS NOT NULL
ORDER BY month;
Recursive CTEs
Traverse hierarchical data (org charts, bill of materials):
WITH RECURSIVE org_chart AS (
-- Base case: CEO (no manager)
SELECT id, name, manager_id, 0 AS level, name AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case
SELECT e.id, e.name, e.manager_id, o.level + 1, o.path || ' > ' || e.name
FROM employees e
JOIN org_chart o ON e.manager_id = o.id
)
SELECT level, path FROM org_chart ORDER BY path;
Indexing and Query Optimization
Indexes dramatically speed up queries but add overhead on writes.
B-Tree indexes (default): Excellent for equality and range queries, sorting:
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created ON orders(created_at);
-- Composite index for queries filtering on both columns
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
Hash indexes: Only for equality comparisons, faster than B-tree for exact matches (PostgreSQL).
Partial indexes: Index only a subset of rows:
-- Only index active products (reduces index size)
CREATE INDEX idx_active_products ON products(name) WHERE is_active = true;
EXPLAIN ANALYZE: Understand query execution plans:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 10;
-- Look for: Seq Scan (slow) vs Index Scan (fast), actual vs estimated rows
Transactions and Isolation Levels
ACID properties guarantee reliable transactions:
- Atomicity: All or nothing — if one step fails, all steps roll back
- Consistency: Data always moves from one valid state to another
- Isolation: Concurrent transactions don't interfere with each other
- Durability: Committed data persists even after a crash
Isolation levels trade consistency for concurrency:
| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible |
| READ COMMITTED | Not possible | Possible | Possible |
| REPEATABLE READ | Not possible | Not possible | Possible |
| SERIALIZABLE | Not possible | Not possible | Not possible |
PostgreSQL defaults to READ COMMITTED. Most applications need REPEATABLE READ or SERIALIZABLE only for critical financial transactions.
NoSQL vs. Relational
Understand when to choose each:
Relational databases excel at:
- Complex queries with joins
- Strong consistency requirements (financial data)
- Well-defined, stable schemas
- Applications requiring ACID transactions
NoSQL databases (MongoDB, Cassandra, DynamoDB) excel at:
- Flexible schemas that evolve frequently
- Massive horizontal scalability
- High write throughput
- Hierarchical document structures (JSON)
Many modern applications use both — relational for transactional data, NoSQL for user activity logs, product catalogs, or session data.