Starting my batch at the Recurse Center, I knew one of the things I wanted to learn was the basics of database internals. My goal was to answer a basic question: “what does a database look like physically, from the perspective of files on disk?”
B+Trees
I started with “Build Your Own Database From Scratch In Go”. By working through the first half of the book in Rust, I learned:
- “Logs” in databases mean something pretty different from what I’m used to. Instead of storing information about a program’s operation history, a “write-ahead log” is a data structure you can safely append incoming operations to, so that concurrent operations are stored durably even if the process crashes mid-write (see the sketch after this list).
- To build a persistent key/value store, it’s important to use a data structure where changes are atomic (again to be resilient to crashes mid-write). There are LSM-trees; all I know about them is that the acronym expands to Log-Structured Merge. There are also B+Trees, which I’ve gotten up close and personal with.
- For efficient manipulation of files on disk, it’s important to keep in mind segment sizes. The smallest possible disk read is often a few kilobytes; reading or writing less than that at a time is inefficient IO. Also, memory-mapping is the best approach if you need random access to disk.
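To make the write-ahead-log idea concrete, here is a minimal sketch in Rust (not the book’s implementation; the file name and record format are made up): each operation is appended to the log and fsynced before the caller treats it as durable.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};

// Minimal write-ahead-log sketch: each operation is appended to the log and
// fsynced before the caller treats it as durable. Names are illustrative.
fn append_to_wal(path: &str, op: &[u8]) -> io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;
    // Length-prefix the record so a reader can detect a torn final write.
    log.write_all(&(op.len() as u32).to_be_bytes())?;
    log.write_all(op)?;
    // fsync: without this, a crash can lose writes we already acknowledged.
    log.sync_all()?;
    Ok(())
}

fn main() -> io::Result<()> {
    append_to_wal("example.wal", b"SET key=value")?;
    Ok(())
}
```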
Following along, I built a memory-mapped B+Tree in Rust to make sure I understood the concepts. A B+Tree consists of internal nodes with variable numbers of children and leaf nodes that hold the data. Every entry is identified by a key, and keys can be ordered relative to one another. Internal nodes store the key ranges of their children, which lets a lookup decide which child to recurse into. There’s also a rebalancing algorithm that splits nodes that grow too large and merges nodes that shrink too small, so the height of the tree stays small (and thus fast to traverse).
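As a rough sketch of that structure (a simplified in-memory version, not my actual memory-mapped code; the types and test data are illustrative), lookup works like this:

```rust
// Simplified, in-memory shape of a B+Tree (my real version maps nodes onto
// raw page bytes). Keys in an internal node act as separators: child i holds
// keys less than keys[i]; the last child holds everything else.
enum Node {
    Internal { keys: Vec<u64>, children: Vec<Node> },
    Leaf { keys: Vec<u64>, values: Vec<Vec<u8>> },
}

fn get<'a>(node: &'a Node, key: u64) -> Option<&'a [u8]> {
    match node {
        Node::Internal { keys, children } => {
            // Descend into the first child whose separator is greater than
            // the key, or the last child if no separator is.
            let idx = keys
                .iter()
                .position(|k| key < *k)
                .unwrap_or(children.len() - 1);
            get(&children[idx], key)
        }
        Node::Leaf { keys, values } => keys
            .iter()
            .position(|k| *k == key)
            .map(|i| values[i].as_slice()),
    }
}

fn main() {
    let tree = Node::Internal {
        keys: vec![10],
        children: vec![
            Node::Leaf { keys: vec![1, 5], values: vec![b"one".to_vec(), b"five".to_vec()] },
            Node::Leaf { keys: vec![10, 20], values: vec![b"ten".to_vec(), b"twenty".to_vec()] },
        ],
    };
    assert_eq!(get(&tree, 20), Some(&b"twenty"[..]));
}
```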
I think my implementation doesn’t fully rebalance itself (making it inefficient), but it does correctly handle get/insert/remove operations. Using `mmap` means that the file is automatically updated under the hood, which is how some real database systems work. Unfortunately `mmap` is super unergonomic in Rust, because I’m taking a byte array and doing a bunch of unsafe witchery to map the bytes to data structures. I think my use of `MaybeUninit` in this code is actually unsound? So I wouldn’t use that B+Tree code as a good example of anything.
SQLite Format
In week 2 I joined some of my batchmates in writing SQLite from scratch. I decided to go about this in Zig, partially to learn a new language and partially to avoid the memory-mapping problems I had in Rust. The first step is to parse the database file’s header, which is pretty simple. Things immediately kicked up a notch on the next step, as I began to dig into SQLite’s internal structures. A SQLite database file is divided into pages (4096 bytes in the test database, but the page size can be configured per database). Each page consists of a header, an array of pointers to “cells”, and then the cells themselves. A page is a node in a B+Tree, either a leaf or a branch; these trees represent SQLite’s tables and indices.
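To give a flavor of that first step, here’s a rough Rust equivalent of the header parsing (my actual code is in Zig; the file name is just an example, and the offsets below are the ones documented in the SQLite file format):

```rust
use std::fs;
use std::io;

fn main() -> io::Result<()> {
    let bytes = fs::read("test.db")?; // example path
    let header = &bytes[..100]; // the database header is the first 100 bytes

    // Bytes 0..16 are the magic string "SQLite format 3\0".
    assert_eq!(&header[..16], &b"SQLite format 3\0"[..]);

    // Bytes 16..18 hold the page size as a big-endian u16
    // (a stored value of 1 means 65536).
    let page_size = u16::from_be_bytes([header[16], header[17]]);
    println!("page size: {page_size}");
    Ok(())
}
```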
All this actually sounds pretty simple. Let me assure you that there was plenty of room to write frustrating, hard-to-understand bugs along the way:
- All the integers in the format are big endian. Somehow I managed to read in exactly one value in the header in little endian, which took me hours to track down.
- Cells have a variable length, so their locations in the page data cannot be determined except by the cell-offset array.
- Pages are 1-indexed. Assuming they’re 0-indexed won’t fail loudly, but it will produce a program that silently reads the wrong pages and reports odd things.
- The cell offsets on the first database page treat the 100-byte database header as belonging to that page, so they aren’t relative to the page’s B-tree header but rather to byte 0 of the file (see the sketch after this list).
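For concreteness, here’s roughly how locating cells on a page shakes out once those quirks are handled (a Rust sketch rather than my Zig code; the page-header offsets follow the SQLite file-format documentation):

```rust
// Locate the cells on one B-tree page.
fn cell_offsets(file: &[u8], page_size: usize, page_number: usize) -> Vec<usize> {
    // Pages are 1-indexed; page 1 starts at byte 0 of the file, but its
    // B-tree header sits *after* the 100-byte database header.
    let page_start = (page_number - 1) * page_size;
    let header_start = if page_number == 1 { page_start + 100 } else { page_start };

    // Byte 0 of the page header is the page type.
    let header_len = match file[header_start] {
        0x02 | 0x05 => 12, // interior pages carry an extra rightmost-child pointer
        0x0a | 0x0d => 8,  // leaf pages
        other => panic!("not a b-tree page: {other:#x}"),
    };
    // Bytes 3..5 of the page header are the cell count, big endian.
    let cell_count =
        u16::from_be_bytes([file[header_start + 3], file[header_start + 4]]) as usize;

    // The cell pointer array follows the page header; each entry is a
    // big-endian u16 offset relative to the start of the page (which for
    // page 1 is also byte 0 of the file, the quirk above).
    let array_start = header_start + header_len;
    (0..cell_count)
        .map(|i| {
            let p = array_start + 2 * i;
            page_start + u16::from_be_bytes([file[p], file[p + 1]]) as usize
        })
        .collect()
}

fn main() -> std::io::Result<()> {
    let file = std::fs::read("test.db")?; // same example database as above
    // The page size would really come from the header; 4096 is assumed here.
    println!("{:?}", cell_offsets(&file, 4096, 1));
    Ok(())
}
```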
After I had worked through these initial obstacles, my program was able to enumerate the tables in a SQLite database file and read the `sqlite_schema` table to determine their names. For now I’ve stopped there, because I feel like I’ve developed a good understanding of the physical basis of database file storage.
Takeaways
With what I learned from my projects, I can return to my initial question of “what does a database look like physically?” At its simplest, my answer is two core facts:
- A key-value store is represented by a tree (probably a B+Tree or an LSM tree) which allows for atomic writes.
- Each table in a SQL database can be thought of as a key-value store, with primary keys as keys and rows as values (sketched below). Each column’s data is contained in the row, or in an overflow section the row points to.
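As a mental model only (not how SQLite literally lays out bytes), that second fact looks something like this:

```rust
use std::collections::BTreeMap;

// Conceptual model: a SQL table as an ordered map from primary key (rowid)
// to the encoded row. Real engines store this as a tree of pages on disk,
// not an in-memory map.
type Row = Vec<Option<String>>; // one entry per column
type Table = BTreeMap<u64, Row>;

fn main() {
    let mut users: Table = BTreeMap::new();
    users.insert(1, vec![Some("ada".into()), Some("ada@example.com".into())]);
    users.insert(2, vec![Some("grace".into()), None]);

    // "SELECT * FROM users WHERE id = 2" is conceptually a single key lookup.
    println!("{:?}", users.get(&2));
}
```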
My immediate curiosity is quenched, but the field of database implementation is vast and there are still plenty of things I’d like to learn!
- How are queries planned? This is sort of the ‘missing middle’ in my mental model of SQL queries. I understand parsing pretty well and I think I now understand what primitives the database has to manipulate its tables, but I’m curious what happens in between. This is the question I’m most interested in, and the one I’m most likely to return to during my batch.
  - Sub-question: does a single-value query like `SELECT * FROM mytable WHERE mytable.id = 12345;` go through the query planner, or is it equivalent to a single key lookup operation? Likewise for full-table scans like `SELECT * FROM mytable;`.
- How are tables with multiple primary keys (like you’d use in an M:N relationship) represented? Are the keys concatenated?
- What is an index? I’ve never had cause to use one but I’ve also never worked with large volumes of production data in SQL.
- How do databases implement locking? My understanding is that SQLite only allows one writer to the database at a time, but larger engines like MySQL or Postgres lock on a much more granular basis.
- What determines database performance characteristics? I’ve heard that at truly large scales (e.g. Google) SQL doesn’t run fast enough; why is that? Is it a consequence of data layout, or the SQL query language, or the guarantees of the relational model?
- How are document databases (like Mongo or Firestore) similar to or different from a key/value store or relational database under the hood?
Thanks to Nicholas Walton and Anna Hope for feedback on a draft of this post.