We had an interesting incident at work recently. We run huge databases on PostgreSQL, and to reduce disk fragmentation, we run pg_repack[1] on a regular basis. This tool rewrites tables and indexes so that they are stored in contiguous blocks on disk. It reduces the overhead of disk seeks, improves read performance, and reduces the size of the database.
Before we discuss what VACUUM does and its implications, we need to understand how data is actually stored on disk. What happens when a tuple is inserted, updated, or deleted? Understanding this will help us understand what VACUUM does, why it’s needed, and its implications.
The most compelling feature of relational database systems is ACID(Atomicity Consistency Isolation Durability). Isolation is achieved using transactions. Isolation is needed to avoid race conditions when concurrent actors act upon the same row.
SQL standard defines isolation with different levels. Every isolation level offers certain guarantees and possible anomalies that can occur.
Databases have always fascinated me, and I have always dreamt of building a simple, hobby database for fun. I have read several blog posts about building redis, git, compiler, and interpreter, but none about building a database. This post is an outcome of a long and treacherous search of posts on making a database and eventually encountering the Bitcask paper.
I have been working with a client with close to 600k images on their home page. The photos are tagged with multiple categories. The index API returns paginated pictures based on various sets of filters on classes.
Recently, they normalized their data, and every image was mapped to 4 different categories through join tables.