Databases have always fascinated me, and I have always dreamt of building a simple, hobby database for fun. I have read several blog posts about building redis, git, compiler, and interpreter, but none about building a database. This post is an outcome of a long and treacherous search of posts on making a database and eventually encountering the Bitcask paper.
Bitcask
Bitcask is a local key/value store where the data is persisted in a file. One operating system process will open the file for writing at any time. When the file size reaches a considerable theoretical threshold, we close the file and open a new one.
Goals
- Low latency per item, read or written.
- High throughput, especially when writing an incoming stream of random items.
- Ability to handle datasets much more significant than RAM w/o degradation.
- Crash friendliness, both in terms of a fast recovery and not losing data.
- Ease of backup and restore.
- A relatively simple, understandable code structure and data format.
Data Format
Let’s start breaking it down.
What’s the essential thing that we need to store?
- Key
- Value
Given a cursor at the location of the data in file, will you be able to read the data?
The answer is No!
Let me explain
As key and value is variable length. We will need to figure out how much to read. So we need to store the size of the key and its value.
Now that we have the key and value size written in the file. Given a file cursor pointing to the location of the data. We know the first 8 bytes represent key and value size. Once we read that, we see the size of the actual key and the value to be read.
Storing primitive data types
Although our kv database is MVP-ready. Suppose we want to store data types like integer, float and string. Once the data is written in the file, the data type is lost, and everything will be treated as a string, as we are not storing any information.
To preserve type information, let’s store two more fields, keytype
and valuetype
, representing the type of data stored.
Auditing & Security
For auditing and security, bitcask suggests storing 32-bit epoch timestamp and CRC(Cyclic Redundancy Check), respectively. These values are generated when data is written to the file.
The final data format would look like something below. We would store 20 bytes of additional data for every key and value.
- 1st 4 bytes are a 32-bit integer representing CRC.
- The following 4 bytes are a 32-bit integer representing epoch timestamp.
- The following 8 bytes are two 32-bit integers representing
keysize
andvaluesize
. - The next 4 bytes are two 16-bit integers representing
keytype
andvaluetype
. - The remaining bytes are our key and value.
Number System
Computers represent data in sets of binary digits. The representation comprises bits grouped into more extensive collections, such as bytes. The first 24 bytes are unsigned integers if you notice our data format. By default, when integers are written to a file, they are not stored in binary.
Let me explain
Suppose we run the below code. What would be the size of the data written in the file?
|
|
It’s 36 bytes because, by default, they are written as strings where each character is 1 byte.
$${\textsf{ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 36 bytes}}$$
But this could be more efficient. In our data format, we discussed that for any key and value, we would be storing 20 bytes of metadata. So we cannot keep them as strings, as it would result in a variable length field. The solution is to encode it and store it in binary format.
If we encoded them as 4-byte integers and stored them in binary format. The size of the file would 32 bytes.
$${\textsf{ 4 * 8 = 32 bytes}}$$
Largest Key & Value Size
The largest key
or value
stored in file is a function of the type of integer of keysize
or valuesize
respectively.
$${\large{\mathsf{\max_{\substack{\mathsf{1<x<2^{x}-1}}} f(x)}}} \textsf{where x is the size of an integer in bits}$$
As our keysize
or valuesize
are 4 bytes(32 bits) unsigned integers. The largest value of key or value that we can store is
$${{\mathsf{ 2^{32-1}} = \textsf{4294967295 bytes = 4.29 GB}}}$$
Stiching Everything
Let’s create a Serializer
module which encapsulates serialization and deserialization logic. We need to encode our data into a binary sequence. Ruby has Array.pack, which packs the data into a binary sequence according to a directive. Directives inform how data should be encoded in a binary sequence. Below is the list of directives we will be using.
- L< : 32-bit unsigned integer, little endian (uint32_t)
- S< : 16-bit unsigned integer, little endian (uint16_t)
- q< : 64-bit signed integer, little endian (int64_t)
- E : double-precision, little-endian
All of our directives have little-endian byte order. This does not have any performance benefit. You can encode them in big-endian order as well. We specify the endianness to ensure our database file is cross-platform compatible.
|
|
Serialize & Deserialize
serialize
serializes the data to our data format. It identifies the type of data, generates a header and computes crc. It returns the length of the data and the binary encoded data.deserialize
does the opposite ofserialize
. It validates crc, decodes the binary encoded data and returns epoch, key and value.
|
|
Below are a few helper functions used for packing, unpacking and generating crc.
|
|
DiskStore
Finally, let us create a class called DiskStore
, our persistent key-value store. It has the following methods.
- Get : Returns value for a given key from the store.
look_up table?} C -->|yes| D(seek to
data loc in file) D --> F(read
log_size bytes) F --> G{crc valid?} G --> |no| E G --> |yes| H(deserialize) H --> J(return value) C -->|no| E(return
empty string)
- Put : Write key-value into the store.
|
|
Closing thoughts
To avoid overloading you folks with too much data, I have left out a few things from the article. The next improvement that you should be thinking about is How to initialize the DiskStore with a pre-existing data file?
You can implement it in the language of your choice. If you are blocked anywhere, check out the complete working code at https://github.com/dineshgowda24/bitcask-rb. It has implementations for initializing the DiskStore
with a pre-existing data file and benchmarks if you are interested in any of those.
Happy implementing your own KV store!