EasyDB Architecture Explained: Building Blocks of a Toy Database

This blog post explores the internal architecture of EasyDB, a lightweight educational database written in Java. We'll start with a high-level overview of modern database architecture and key algorithmic design choices, then walk through EasyDB's components across four areas: storage, MVCC & ACID, query execution, and indexing/query planning. Each section covers design trade-offs and includes source code examples.

Database Architecture Overview

A modern relational database typically consists of the following subsystems:

Storage Engine: Manages how data is physically laid out and persisted. The core challenge is balancing cost and throughput. For example, PostgreSQL uses 4KB pages because that size aligns well with physical disk block sizes and minimizes I/O overhead. The storage engine includes page formatting (e.g., slotted pages) to support transaction management, index and vacuum etc.
Transaction Management: Ensures ACID (Atomicity, Consistency, Isolation, Durability) properties through concurrency control and logging.
Query Processor: Parses SQL, plans execution, and runs logical operators using execution models like Volcano.
Indexing and Optimizer: Supports fast data retrieval using index structures and selects efficient query plans.

Algorithm Comparison Across Core Areas

Component	Algorithm	Description	Pros	Cons
Storage	Tuple-based Layout and Vacuum support	Classic tuple based storage	Fast in-place CRUD	Needs Auto Vacuum to cleanup the zombie records
	Null Bitmaps	Compression to save space	Compact format to save spaces	Extra calculation and state managment
Transactions	MVCC with XMIN/XMAX	Snapshot-based read isolation	Readers don’t block writers	Old versions need cleanup

Query Execution	Volcano Iterator Model	Operators implement `next()` to fetch rows	Memory efficient, modular	No vectorization or parallelism
	Rule-based Planning	Uses indexes if available	Simple to implement	Not cost-optimal
Indexing	Clustered B+ Tree	Sorted, balanced tree for equality/range queries	Fast range scans, efficient search	Slower writes, not version-aware
	Heuristic Plan Selection	Chooses plans based on field/index availability	Easy to understand	No cost-based decisions

With this context in mind, let's explore how EasyDB implements each of these components.

1. Database Storage

Overview: EasyDB persists data using slotted page storage and write-ahead logging, organized with in-memory buffer management.

Core Structures:

Tuple.java: Tuple header layout and version management.
PageImpl.java: Page structure and raw byte array management.
Stroage.java: Handles file reads/writes and page allocation.

Algorithms & Concepts:

Header Format: Tracks validity and size metadata for each record.
Null bitmaps: Bitmasks to track the null columns.

Pros:

Clear and efficient control over memory and I/O.
Easy to understand format.

Cons:

Manual memory management with potential for leaks.
No automatic eviction or LRU.

2. MVCC and ACID Guarantees

Overview: EasyDB uses Multi-Version Concurrency Control (MVCC) and a transaction table to enforce ACID semantics.

Core Structures:

TupleHeader.java: Transaction table and state tracking.
VersionManagerImpl.java: MVCC logic and isolation.
LockTable.java: Two-phase locking and deadlock detection.

Algorithms & Concepts:

XMIN/XMAX per record: Identifies creating and deleting transactions.
Visibility Rules: Based on commit status and transaction start time.
Undo Logging: Stores oldRaw in each DataItem for rollback.

Pros:

Snapshot isolation with non-blocking reads.
Rollback and recovery support with WAL and oldRaw.

Cons:

Manual version chain cleanup required.
Writers can still block due to locks.

Code Highlights:

public static boolean isVersionSkip(Transaction t, long xmin, long xmax) {
    return t.getLevel() == 3 && t.getXid() != xmin && xmax != 0;
}

Ensures repeatable read isolation by detecting version skips.

3. Query Engine

Overview: EasyDB implements a unified query processing pipeline that spans three key stages: parsing, planning, and execution. This modular flow is built on top of JSQLParser and operates using a Volcano-style iterator model to deliver results.

Parsing

EasyDB uses the external JSQLParser library to convert SQL strings into structured syntax trees. This parsing stage breaks down a query into its components—select fields, tables, conditions, and expressions—making it easier to reason about the query in subsequent steps.

Code Path:

TableManagerImpl.java – Handles SQL statement type dispatch.
Table.java – Extracts parsed clauses (e.g., WHERE conditions) and converts them into internal representations.

Example:

Statement stmt = CCJSqlParserUtil.parse(sql);

Planning

Once parsed, EasyDB builds a logical execution plan. This is a rule-based planner that chooses between index scans and full table scans based on available indexes and WHERE conditions.

Plan Strategies:

No index available: Full table scan.
Single index available: Index scan with optional post-filter.
Two indexes: Intersect or union results for AND/OR logic.

Code Path:

Table.parseWhere() – Translates conditions into index-friendly filters.

Example:

if (f1 != null && f2 != null) {
    uids.retainAll(uids2); // AND logic with index intersection
}

Execution

Execution is carried out through a simple iterator pipeline that mimics the Volcano model. Each query returns a list of candidate UIDs (record IDs), which are then read using MVCC filters and transformed into result rows.

Execution Flow:

Iterate over candidate UIDs.
Use VersionManager.read(xid, uid) to fetch visible records.
Apply WHERE clause filters.
Project selected fields.

Code Path:

Table.select() – Performs MVCC-based filtering and row materialization.
VersionManagerImpl.java – Controls record visibility.

Example:

for (Long uid : uids) {
    byte[] raw = versionManager.read(xid, uid);
    if (raw == null) continue;
    Map<String, Object> entry = parseEntry(raw);
    if (matchesWhere(entry)) result.add(entry);
}

Pros:

Modular and easy to extend.
Clear separation between parse, plan, and execute.

Cons:

No cost-based optimization.
No pipelining between operators or intermediate plan tree.

4. Indexing, Stats & Query Planning

Overview: EasyDB supports clustered B+ tree indexes and uses basic heuristics for query planning.

Core Structures:

BPlusTree.java: Tree operations and page structure.
Node.java: Leaf and internal node definitions.
Field.java: Index lookup and condition handling.

Algorithms & Concepts:

Clustered B+ Tree: Leaf nodes store data UIDs; inner nodes guide search.
Range Scan Support: Sibling pointers for fast traversal.
Rule-Based Plan Selection: Favor indexes over full scans.

Pros:

Fast equality and range lookups.
Efficient key ordering and storage.

Cons:

Indexes not versioned — may point to stale records.
No histogram-based cost estimates.

Code Highlights:

List<Long> uids = f1.search(res1.left, res1.right);

Uses B+ tree index to retrieve UIDs matching a range.

Final Thoughts

EasyDB (Java version) demonstrates many core components of database systems — from slotted pages and WAL to MVCC and B+ tree indexing. While simplified, it offers a clear and accessible path to understand database internals.

For more details, browse the EasyDB GitHub repository and try hacking on it yourself!

Architect Rocks

Monday, October 13, 2025

EasyDB Architecture Explained: Building Blocks of a Tiny Database

EasyDB Architecture Explained: Building Blocks of a Toy Database

Database Architecture Overview

Algorithm Comparison Across Core Areas

1. Database Storage

2. MVCC and ACID Guarantees

3. Query Engine

Parsing

Planning

Execution

Pros:

Cons:

4. Indexing, Stats & Query Planning

Final Thoughts

No comments:

Post a Comment

[Joy]License Plate Pattern Recognition

Labels

Blog Archive