I have been working with a client whose home page serves close to 600k images. The photos are tagged with multiple categories, and the index API returns paginated pictures based on various sets of category filters.
Recently, they normalized their data, and every image was mapped to 4 different categories through join tables. So to load the home page, the app server joined 8 tables at runtime and returned the results.
Below is the query that filters the images tagged ABSTRACT for a page and limits the results to 10.
```sql
-- Reconstructed sketch of the original query. `images`, `sections` and
-- `images_sections` are named in the post; the other three category tables
-- and all join-column names are placeholders.
SELECT images.external_uuid
FROM images
INNER JOIN images_sections ON images_sections.image_id = images.id
INNER JOIN sections        ON sections.id = images_sections.section_id
INNER JOIN images_styles   ON images_styles.image_id = images.id
INNER JOIN styles          ON styles.id = images_styles.style_id
INNER JOIN images_moods    ON images_moods.image_id = images.id
INNER JOIN moods           ON moods.id = images_moods.mood_id
INNER JOIN images_themes   ON images_themes.image_id = images.id
INNER JOIN themes          ON themes.id = images_themes.theme_id
WHERE sections.name = 'ABSTRACT'
ORDER BY images.id
LIMIT 10 OFFSET 0;  -- OFFSET varies with the requested page
```
The row count on the actual category tables is meagre: fewer than 5k rows per table. The join tables mostly have a 1:1 (image : category) mapping, meaning every image has been tagged into at least 4 categories. If our filter predicate results in 100k images, we are essentially joining 5 tables (images + 4 categories) of roughly 100k rows each.
Let’s break down the EXPLAIN statement. Below is the result of EXPLAIN ANALYZE for the query that filters the images tagged ABSTRACT for a specific page, sorts by images.id, and returns the first 10 results.
[EXPLAIN ANALYZE output, with per-table stats and per-node stats]
The query took 900 ms to execute, which resulted in a terrible user experience: users had to wait that long for the home page to load. It also stresses the database instance when traffic is high.
There are 8 joins, of which 6 are nested loop joins and 2 are hash joins. Almost all of the joins result in an index scan. There are a few sequential scans on tables whose row counts are so small that the query planner chose them deliberately; they can be ignored here since they contribute little to the latency.
Looking at the per-node and table stats, the primary contributors to the latency are the nested loop joins and the index scans, which together account for about 85% of the overall latency.
The nested loops join is the most fundamental join algorithm. It works like using two nested queries: the outer or driving query to fetch the results from one table and a second query for each row from the driving query to fetch the corresponding data from the other table.
The hash join algorithm aims for the weak spot of the nested loops join: The many B-tree traversals when executing the inner query. Instead, it loads the candidate records from one side of the join into a hash table that can be probed quickly for each row from the other side.
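To make the difference concrete, here is a minimal, database-free sketch of the two algorithms in Python. The table data is made up, and real Postgres implementations are far more sophisticated, but the asymptotic difference is the same:

```python
# Toy rows: (id, name) for images and (image_id, tag) for a join table.
images = [(1, "img-a"), (2, "img-b"), (3, "img-c")]
tags = [(1, "ABSTRACT"), (2, "NATURE"), (3, "ABSTRACT")]

def nested_loop_join(outer, inner):
    # For every outer row, scan the inner side looking for matches:
    # O(len(outer) * len(inner)) comparisons.
    return [(oid, name, tag)
            for (oid, name) in outer
            for (iid, tag) in inner if oid == iid]

def hash_join(outer, inner):
    # Build a hash table on the inner side once, then probe it per outer row:
    # roughly O(len(outer) + len(inner)).
    table = {}
    for iid, tag in inner:
        table.setdefault(iid, []).append(tag)
    return [(oid, name, tag)
            for (oid, name) in outer
            for tag in table.get(oid, [])]

print(nested_loop_join(images, tags) == hash_join(images, tags))  # True
```

Both produce the same rows; the hash join just replaces the repeated inner-side lookups with a single build phase plus cheap probes, which is why it wins when both sides are large.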
Optimizations
1. Joins
Joins take more work to optimize. Hash joins perform better than nested loop joins when both sides of the join are large, but we have no direct control over when the query optimizer chooses a hash join. Still, there are a few optimizations we can apply to joins:
- Select fewer columns in the join, which results in fewer heap fetches.
- Join the tables on indexed columns to avoid full table scans.
- Use appropriate filter predicates so that the query optimizer works with fewer rows.
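The second point is easy to see with any database. Here is a runnable Python/SQLite sketch (the post's database is Postgres and the table name is borrowed from it, but the principle is identical): the same filter goes from a full table scan to an index probe once the column is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images_sections (image_id INTEGER, section_id INTEGER)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry a human-readable detail in the last column.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

query = "SELECT * FROM images_sections WHERE image_id = 42"

before = plan(query)   # no index yet: full table scan
conn.execute("CREATE INDEX idx_sections_image_id ON images_sections (image_id)")
after = plan(query)    # now the planner can probe the index

print(before)  # e.g. SCAN images_sections
print(after)   # e.g. SEARCH images_sections USING INDEX idx_sections_image_id (image_id=?)
```

The same reasoning applies to join columns: if the inner side of a nested loop join has no index on the join key, every outer row triggers a scan instead of a cheap index lookup.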
2. In-memory sort
The final results are sorted by images.id using an external merge sort on disk. If we set work_mem to a larger value, we could make the query optimizer use an in-memory quicksort. But looking at the per-node stats, the sort accounts for just 30 ms, which is negligible compared to the overall query execution, so we skipped this optimization.
3. Join tables based on filter predicate
If we look at our original query, whatever the filter predicate may be, we join all four categories through the mapping tables. That is sub-optimal and unnecessary: if a user wants to filter ABSTRACT images, we can get away with joining only sections via the images_sections table, since all we need is images.external_uuid. This significantly reduces the execution time because we join fewer tables.
So the app service should decide which tables to join based on the filter predicate.
```sql
-- Reconstructed sketch of the reduced query; join-column names are placeholders.
SELECT images.external_uuid
FROM images
INNER JOIN images_sections ON images_sections.image_id = images.id
INNER JOIN sections        ON sections.id = images_sections.section_id
WHERE sections.name = 'ABSTRACT'
ORDER BY images.id
LIMIT 10 OFFSET 0;
```
What would happen when a user filters all the categories?
We would join all the tables, since the query has a filter predicate on every category. But the number of rows is vastly reduced by every join, which keeps query performance acceptable.
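The "join only what the filter needs" idea can be sketched as a small query builder. This is illustrative Python, not the client's code; only sections/images_sections are named in the post, the "style" entry is a hypothetical second category, and a real Rails app would express this with ActiveRecord scopes rather than raw strings:

```python
# Each filter key maps to the joins and the predicate it requires.
FILTERS = {
    "section": {
        "joins": [
            "INNER JOIN images_sections ON images_sections.image_id = images.id",
            "INNER JOIN sections ON sections.id = images_sections.section_id",
        ],
        "predicate": "sections.name = ?",
    },
    "style": {  # hypothetical second category, for illustration
        "joins": [
            "INNER JOIN images_styles ON images_styles.image_id = images.id",
            "INNER JOIN styles ON styles.id = images_styles.style_id",
        ],
        "predicate": "styles.name = ?",
    },
}

def build_query(active_filters):
    """Join only the tables needed by the active filters (assumes at least one)."""
    joins, predicates, params = [], [], []
    for name, value in active_filters.items():
        spec = FILTERS[name]
        joins.extend(spec["joins"])
        predicates.append(spec["predicate"])
        params.append(value)
    sql = " ".join(
        ["SELECT images.external_uuid FROM images"]
        + joins
        + ["WHERE " + " AND ".join(predicates), "ORDER BY images.id LIMIT 10"]
    )
    return sql, params

sql, params = build_query({"section": "ABSTRACT"})
print("images_styles" in sql)  # False: only the section tables are joined
```

Filtering on every category simply activates every entry, reproducing the full join while single-category requests stay cheap.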
Although optimization 3 reduced the query execution times, it was not enough. Even the optimized query can quickly become a bottleneck for an application serving 20k RPM at peak.
Usage patterns
Even after these optimizations, we could not achieve sub-50 ms query execution times. So we took an entirely different approach and started looking at the read/write patterns of the service.
The service operations team usually uploads 10k images per month, and users access the home pages up to 1-1.5 million times a month.
The read-to-write ratio is 1M : 10k, i.e. 100:1. For every 100 reads, there is one write: the service is read-heavy.
In this use case, real-time data is not a requirement; even if new images take a few hours to show up on the home page, that is acceptable.
With all the new data points at hand, we decided to give the materialized view a try.
Materialized Views
In computing, a materialized view is a database object containing the results of a query. For example, it may be a local copy of data located remotely, a subset of the rows or columns of a table or join result, or a summary using an aggregate function. The query results used to create the materialized view are snapshotted and persisted to disk.
Once we create a materialized view, we can query the snapshot with regular SQL.
Creating the materialized view
```sql
-- Sketch of the view definition; the view name and column list are placeholders.
CREATE MATERIALIZED VIEW images_snapshot AS
SELECT images.id, images.external_uuid, sections.name AS section_name
FROM images
INNER JOIN images_sections ON images_sections.image_id = images.id
INNER JOIN sections        ON sections.id = images_sections.section_id;
-- (the real view would also join and expose the other three categories)
```
Now that we have created our materialized view, let's query it.
```sql
-- View and column names are placeholders.
SELECT external_uuid
FROM images_snapshot
WHERE section_name = 'ABSTRACT'
ORDER BY id
LIMIT 10;
```
We got the results in less than 1 ms, which is a significant improvement over 200 ms and great for the user experience. If needed, you can also add indexes on materialized views, just like on traditional tables.
[Chart: response times of the homepage post-deployment]
The significant difference between querying tables and a materialized view is that a materialized view does not reflect subsequent writes: if we insert new rows into an underlying table, the view keeps returning the old snapshot until it is refreshed.
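The whole snapshot-and-refresh lifecycle can be simulated end to end. SQLite (used here so the sketch is runnable without a server) has no materialized views, so this simulates one with a plain snapshot table; all names are illustrative, and Postgres handles all of this natively:

```python
import sqlite3

# Illustrative schema, loosely following the post's table names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE images (id INTEGER PRIMARY KEY, external_uuid TEXT);
    CREATE TABLE sections (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE images_sections (image_id INTEGER, section_id INTEGER);
    INSERT INTO images VALUES (1, 'uuid-1'), (2, 'uuid-2');
    INSERT INTO sections VALUES (10, 'ABSTRACT'), (11, 'NATURE');
    INSERT INTO images_sections VALUES (1, 10), (2, 11);
""")

SNAPSHOT_QUERY = """
    SELECT images.id, images.external_uuid, sections.name AS section_name
    FROM images
    JOIN images_sections ON images_sections.image_id = images.id
    JOIN sections ON sections.id = images_sections.section_id
"""

def refresh_snapshot(conn):
    # The moral equivalent of REFRESH MATERIALIZED VIEW: rebuild the snapshot.
    conn.execute("DROP TABLE IF EXISTS images_snapshot")
    conn.execute("CREATE TABLE images_snapshot AS " + SNAPSHOT_QUERY)

refresh_snapshot(conn)
# Reads hit the precomputed snapshot: no joins at query time.
rows = conn.execute(
    "SELECT external_uuid FROM images_snapshot WHERE section_name = 'ABSTRACT'"
).fetchall()

# A new write is invisible until the next refresh: reads are stale.
conn.execute("INSERT INTO images VALUES (3, 'uuid-3')")
conn.execute("INSERT INTO images_sections VALUES (3, 10)")
stale = conn.execute(
    "SELECT COUNT(*) FROM images_snapshot WHERE section_name = 'ABSTRACT'"
).fetchone()[0]
refresh_snapshot(conn)
fresh = conn.execute(
    "SELECT COUNT(*) FROM images_snapshot WHERE section_name = 'ABSTRACT'"
).fetchone()[0]
print(rows, stale, fresh)  # [('uuid-1',)] 1 2
```

The stale/fresh counts make the trade-off explicit: reads are cheap because the join was paid for at refresh time, at the cost of serving slightly old data in between refreshes.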
We can update the materialized view using the query below. Note that it takes a lock on the materialized view, which remains unusable for reads until the refresh completes.
```sql
REFRESH MATERIALIZED VIEW images_snapshot;  -- view name is a placeholder
```
To refresh the materialized view without locking, you can use CONCURRENTLY, but this option requires a unique index on the materialized view.
```sql
-- CONCURRENTLY requires a unique index; the column choice here is a placeholder.
CREATE UNIQUE INDEX ON images_snapshot (id);
REFRESH MATERIALIZED VIEW CONCURRENTLY images_snapshot;
```
Closing notes
When to use materialized views?
- Queries require multiple joins or compute aggregate data at runtime.
- The application can tolerate data that is a few hours stale.
Materialized views can become a great resource and lead to significant performance increases if used appropriately. But if they are misused, they can lead to stale data and refresh bottlenecks.
Note
This post has been featured in the following newsletters:
- Ruby Weekly (@RubyDiscussions), January 2, 2023: "Materialised views for serious performance gains" by @dineshgowda24
- Ruby LibHunt (@RubyLibHunt), December 31, 2022: "Materialised views for serious performance gains" by @dineshgowda24
Thank you for all the suggestions and feedback :-)
I am a polyglot programmer interested in databases, and I have built a project that walks you through implementing your own persistent key-value store based on the Bitcask paper. Please check it out at https://github.com/dineshgowda24/bitcask-rb.