Sixteen years ago, our founder Peter Zaitsev covered this topic, and some of the points described there are still valid; we will cover more of them in this blog. Even though the technologies have evolved and matured, some people still think that MySQL is only for small projects or that it can’t perform well with large tables.
Startups such as Facebook, Uber, and Pinterest adopted MySQL in their early days, and they are now big, successful companies that prove MySQL can run on large databases and on heavily used sites.
With disks being faster nowadays and CPU and memory resources being cheaper, we can easily say MySQL can handle TBs of data with good performance. For instance, in Percona Managed Services, we have many clients with TBs worth of data that perform well.
In this blog post, we will review key topics to consider for managing large datasets more efficiently in MySQL.
Primary keys:
This is one of the most important things to consider when creating a new table in MySQL: we should always create an explicit primary key (PK). InnoDB sorts the data in primary key order, and the primary key is used to reference the actual data pages on disk. If we don’t specify a primary key, MySQL will check for other unique indexes as candidates for the PK, and if there are none, it will create an internal clustered index to serve as the primary key, which is not optimal.
When there is no application logic or candidate to choose as a primary key, we can use an auto_increment column as the primary key.
NOTE: As of MySQL 8.0.30, Generated Invisible Primary Keys were introduced to add an invisible primary key when no explicit PK is defined. You can refer to the documentation for further details.
Also, keep in mind that the primary key columns are appended to each secondary index record, so try to avoid long strings as the primary key, as they will make the secondary indexes larger and performance will not be optimal.
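As a minimal sketch of the above (the table and column names are hypothetical), an explicit auto_increment primary key looks like this:
CREATE TABLE user_sessions (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
user_id INT UNSIGNED NOT NULL,
created_at DATETIME NOT NULL,
PRIMARY KEY (id),
KEY idx_user_id (user_id)
) ENGINE=InnoDB;
-- On MySQL 8.0.30+, you can check whether Generated Invisible Primary Keys are enabled:
SELECT @@sql_generate_invisible_primary_key;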
Redundant indexes:
It is known that accessing rows through an index is more efficient than through a table scan in most cases. However, the same column is sometimes defined in multiple indexes in order to serve different query patterns, and some of those indexes end up being redundant, leading to more overhead when inserting or deleting data (as every index has to be updated) and to increased disk space for storing the indexes of the table.
You can use our tool pt-duplicate-key-checker to detect duplicate or redundant keys.
Example (using the employee sample DB):
Suppose we have the following schema:
db1 employees> show create table employees\G
*************************** 1. row ***************************
Table: employees
Create Table: CREATE TABLE `employees` (
`emp_no` int NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
PRIMARY KEY (`emp_no`),
KEY `idx_last_name` (`last_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Now, suppose that we need to filter by last_name and hire_date; we would create the following index:
ALTER TABLE employees ADD INDEX idx_last_name_hire_date (last_name,hire_date);
We would end up with the following schema:
db1 employees> show create table employees\G
*************************** 1. row ***************************
Table: employees
Create Table: CREATE TABLE `employees` (
`emp_no` int NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
PRIMARY KEY (`emp_no`),
KEY `idx_last_name` (`last_name`),
KEY `idx_last_name_hire_date` (`last_name`,`hire_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Now, the indexes idx_last_name and idx_last_name_hire_date share the same prefix (last_name).
The new index idx_last_name_hire_date can serve queries that filter only by last_name or by both last_name and hire_date, leaving the idx_last_name index redundant.
We can corroborate that by using pt-duplicate-key-checker:
[user1] percona@db1: ~ $ pt-duplicate-key-checker -d employees
# ########################################################################
# employees.employees
# ########################################################################
# idx_last_name is a left-prefix of idx_last_name_hire_date
# Key definitions:
# KEY `idx_last_name` (`last_name`),
# KEY `idx_last_name_hire_date` (`last_name`,`hire_date`)
# Column types:
# `last_name` varchar(16) not null
# `hire_date` date not null
# To remove this duplicate index, execute:
ALTER TABLE `employees`.`employees` DROP INDEX `idx_last_name`;
# ########################################################################
# Summary of indexes
# ########################################################################
# Size Duplicate Indexes 350357634
# Total Duplicate Indexes 1
# Total Indexes 17
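After dropping the redundant index, you can double-check with EXPLAIN that queries filtering only on last_name still use the composite index (the values below are from the employee sample DB, and the exact plan may vary with optimizer statistics):
ALTER TABLE employees DROP INDEX idx_last_name;
-- Both queries can use idx_last_name_hire_date; check the "key" column of the EXPLAIN output
EXPLAIN SELECT * FROM employees WHERE last_name = 'Facello';
EXPLAIN SELECT * FROM employees WHERE last_name = 'Facello' AND hire_date = '1986-06-26';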
Data types:
It’s not uncommon to find databases where data types are not sized correctly. There are many cases of int columns whose data could fit in a smallint, or fixed-size char columns that could be stored in a variable-size varchar. This may not be a huge problem for small tables, but for tables with millions of records, oversized data types only make the table bigger on disk and in memory, and performance will not be optimal.
Make sure you design the data types correctly while planning for the future growth of the table.
Example:
Creating four simple tables to store strings but using different data types:
db1 test> CREATE TABLE tb1 (id int auto_increment primary key, test_text char(200));
Query OK, 0 rows affected (0.11 sec)
db1 test> CREATE TABLE tb2 (id int auto_increment primary key, test_text varchar(200));
Query OK, 0 rows affected (0.05 sec)
db1 test> CREATE TABLE tb3 (id int auto_increment primary key, test_text tinytext);
Query OK, 0 rows affected (0.13 sec)
db1 test> CREATE TABLE tb4 (id int auto_increment primary key, test_text text);
Query OK, 0 rows affected (0.11 sec)
Inserting 2,000 rows with text:
[user1] percona@db1: ~ $ for i in {1..2000}; do for tb in {1..4}; do mysql test -e "INSERT INTO tb$tb (test_text) VALUES ('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse euismod, nulla sit amet rhoncus venenatis, massa dolor lobortis nisi, in.');"; done; done
All four tables have 2,000 rows:
[user1] percona@db1: ~ $ mysql test -e "select count(*) from tb1; select count(*) from tb2; select count(*) from tb3; select count(*) from tb4;"
+----------+
| count(*) |
+----------+
| 2000 |
+----------+
+----------+
| count(*) |
+----------+
| 2000 |
+----------+
+----------+
| count(*) |
+----------+
| 2000 |
+----------+
+----------+
| count(*) |
+----------+
| 2000 |
+----------+
Let’s look at the disk space usage for the tables:
[user1] percona@db1: ~ $ sudo ls -lh /var/lib/mysql/test/|grep tb
-rw-r-----. 1 mysql mysql 592K Dec 30 02:48 tb1.ibd
-rw-r-----. 1 mysql mysql 464K Dec 30 02:48 tb2.ibd
-rw-r-----. 1 mysql mysql 464K Dec 30 02:48 tb3.ibd
-rw-r-----. 1 mysql mysql 464K Dec 30 02:48 tb4.ibd
We can see that tb1 is larger than the others, as it stores the text in a fixed-size char(200) field that always occupies the declared 200 characters regardless of the actual string length, while the varchar, tinytext, and text fields are variable-size and store only the actual length of the string (in the example, we inserted 143 characters).
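Before shrinking an existing column, it is a good idea to verify that the current data fits the smaller type; a minimal sketch using the tb1 table above:
-- Check the longest stored string before converting the fixed-size char(200) to varchar
SELECT MAX(CHAR_LENGTH(test_text)) AS max_len FROM tb1;
-- If the data fits, resize the column (this rebuilds the table, so plan it carefully on large tables)
ALTER TABLE tb1 MODIFY test_text varchar(200);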
Compression:
Compression is the process of restructuring the data by changing its encoding in order to store it in fewer bytes. There are many compression tools and algorithms for data out there.
MySQL supports native compression for InnoDB tables using the Zlib library with the LZ77 compression algorithm. It saves disk space and memory at the expense of CPU usage for compressing and decompressing the data. If CPU is not a bottleneck in your setup, you can leverage compression to improve performance: less data needs to be read from disk and kept in memory, and indexes are compressed too. It can also help save costs on storage and reduce backup times.
The compression ratio depends on multiple factors, but as with any other compression method, it is more efficient on text than on binaries, so tables with text fields will have a better compression ratio.
Example (using the employee sample DB):
Created a new table employees_compressed:
mysql> CREATE TABLE employees_compressed LIKE employees;
Query OK, 0 rows affected (0.12 sec)
mysql> ALTER TABLE employees_compressed ROW_FORMAT=COMPRESSED;
Query OK, 0 rows affected (0.14 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> INSERT INTO employees_compressed SELECT * FROM employees;
Size comparison:
[user1] percona@db1: ~ $ sudo ls -lh /var/lib/mysql/employees/|grep employees
-rw-r-----. 1 mysql mysql 704M Dec 30 02:28 employees.ibd
-rw-r-----. 1 mysql mysql 392M Dec 30 17:19 employees_compressed.ibd
In this simple example, the compressed table is ~45% smaller!
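As a variation of the above, you can also set the compressed page size with KEY_BLOCK_SIZE and compare table sizes from information_schema (a sketch; the optimal block size depends on your data):
ALTER TABLE employees_compressed ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
-- Compare the footprint reported by the data dictionary
SELECT table_name, ROUND((data_length + index_length)/1024/1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'employees' AND table_name LIKE 'employees%';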
There are a couple of blog posts from Yves that describe and benchmark MySQL compression:
Compression Options in MySQL (Part 1)
Compression Options in MySQL (Part 2)
Archive or purge old or non-used data:
Some companies have to retain data for multiple years either for compliance or for business requirements. However, there are many cases where data is stored and needed only for a short time; for example, why keep application session information for many years?
While MySQL can handle large data sets, it is always recommended to keep only the actively used data in the database, as this makes data access more efficient and also helps save costs on storage and backups. There is a good blog post from Gaurav, MySQL Data Archival With Minimal Disruption, showing how we can easily archive old data using pt-archiver.
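If pt-archiver is not an option, the same idea can be sketched with small, repeated DELETE batches so that purging does not hold locks for long (table and column names are hypothetical; repeat until no more rows are affected):
-- Purge session rows older than 90 days in small batches
DELETE FROM app_sessions
WHERE last_seen < NOW() - INTERVAL 90 DAY
LIMIT 1000;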
Partitioning:
Partitioning is a feature that allows dividing a large table into smaller sub-tables based on a partition key. The most common use case for table partitioning is to divide the data by date.
For example, partitioning a table by year can be beneficial if you have data for many years and your queries filter by year. In this case, it is more efficient to read only one smaller partition rather than one large table holding information from many years.
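A minimal sketch of RANGE partitioning by year (table and column names are hypothetical; note that the partitioning column must be part of every unique key, including the primary key):
CREATE TABLE orders (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
order_date DATE NOT NULL,
amount DECIMAL(10,2) NOT NULL,
PRIMARY KEY (id, order_date)
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- A query filtering on order_date can be pruned to a single partition:
EXPLAIN SELECT SUM(amount) FROM orders WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';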
It is very important to analyze the query patterns before choosing a partition key, because if queries do not always use the partition key as a filtering condition, they will need to scan one or many partitions to get the desired data, which results in a huge performance penalty.
Partitioning is a cool feature, but as mentioned above, it is not suitable for every workload and needs to be planned carefully, as choosing a poor partition key can severely hurt performance.
Sharding:
Sharding is the concept of splitting data horizontally, i.e., distributing it across multiple servers (shards), so that different portions of a given table may be stored on different servers. This helps split large data sets into smaller ones stored across multiple servers.
The data is split in a similar way to partitioning, using a sharding key, which defines how the data is distributed among the shards. This needs to be handled at the application layer, or by a coordinator that parses each query and routes it to the specific shard where the data is stored.
Also, it is important to carefully select the sharding key based on the table’s query patterns so that the majority of queries can be solved by routing to a single shard, as having to gather information from many shards and then filter, process, and aggregate it is an expensive operation.
For these reasons, not all applications or workloads are a good fit for sharding, and since it has to be handled properly by the application, it adds complexity to the environment.
MongoDB supports sharding natively; MySQL doesn’t, but there are some efforts in the MySQL world to implement it.
Some of them are:
MySQL NDB Cluster is an in-memory database clustering solution developed by Oracle for MySQL. It supports native sharding that is transparent to the application. It is available under a paid subscription.
ProxySQL is a feature-rich, open-source MySQL proxy solution that allows query routing for the most common MySQL architectures (PXC/Galera, Replication, Group Replication, etc.).
It allows sharding by configuring a set of backend servers (shards) and a set of query rules that route the application queries to the specified shards.
Note that it requires some handling in the application, as it doesn’t support merging and retrieving data from multiple shards.
You can find more information in this good blog from Marco: MySQL Sharding with ProxySQL
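As a rough idea of how this looks (hostnames, hostgroups, and matching patterns below are hypothetical; see Marco’s post for a complete setup), shards are defined as backend hostgroups and query rules route traffic to them from the ProxySQL admin interface:
-- Run against the ProxySQL admin interface (port 6032 by default)
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, 'shard1.example.com', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, 'shard2.example.com', 3306);
-- Route queries carrying a shard hint in their text to the matching hostgroup
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply) VALUES (1, 1, 'shard=1', 10, 1);
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply) VALUES (2, 1, 'shard=2', 20, 1);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;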
Vitess is an open-source database clustering solution that is compatible with the MySQL engine and supports native sharding. You can find more information about Vitess in this blog post from Alkin: Introduction to Vitess on Kubernetes for MySQL – Part I of III.
MyRocks:
MyRocks is an open source storage engine developed by Facebook, designed to optimize data storage and access for big data sets. MyRocks is shipped with Percona Server for MySQL.
There is a cool blog post from Vadim covering big data sets in MyRocks:
MyRocks Use Case: Big Dataset
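A minimal sketch, assuming the MyRocks engine is already installed and enabled in your Percona Server for MySQL instance (table names are hypothetical):
-- Create a new table on MyRocks...
CREATE TABLE events_rocks (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
payload VARCHAR(255) NOT NULL,
PRIMARY KEY (id)
) ENGINE=ROCKSDB;
-- ...or convert an existing table (this copies the data, so plan it carefully on large tables)
ALTER TABLE events ENGINE=ROCKSDB;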
Query tuning:
It is common to find applications that perform very well at the beginning but whose performance decreases as data grows. The most common cause is that poorly written queries or a poor schema design perform fine with a minimal amount of data, but as data grows, all of those problems are uncovered. You can make use of the slow_query_log and pt-query-digest to find your problematic queries.
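A minimal sketch of enabling the slow query log from SQL, so that the resulting file can later be summarized with pt-query-digest (the threshold is just an example; adjust it to your workload):
-- Log queries taking longer than one second
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
-- Locate the file to feed into pt-query-digest
SHOW VARIABLES LIKE 'slow_query_log_file';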
Administration:
Performing administrative tasks on large tables can be painful, specifically schema changes and backups.
For schema changes, Percona has a tool, pt-online-schema-change, that can help us perform schema changes with minimal disruption to the database. It works by creating a new table with the desired schema change applied and copying the existing data in batches from the original table to the new one. Ongoing changes are copied from the original table to the new one using triggers.
This way, on a large table, instead of one huge blocking ALTER operation, pt-OSC can run in the background, without locking and in batches, to minimize the performance impact.
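For changes that InnoDB supports as online DDL, a native alternative is to request an in-place, non-locking ALTER explicitly, so the statement fails instead of silently taking a blocking path (a sketch; the table and column are hypothetical, and not every change supports these options):
ALTER TABLE orders ADD COLUMN note VARCHAR(100) NULL, ALGORITHM=INPLACE, LOCK=NONE;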
For backups of large datasets, Percona XtraBackup can help reduce backup and recovery times. It is a hot physical backup solution that copies the data files of the tables while saving the ongoing changes to the database as redo logs. It supports native compression and encryption.
Remember that monitoring your databases is always important to help you find issues or bottlenecks. You can install and use Percona Monitoring and Management for free to take a deeper look at your servers’ and databases’ health. It provides QAN (Query Analytics) to help you find problematic queries in your databases.
Conclusion
The old belief that MySQL can’t handle large datasets is nothing but a myth. With hardware being more powerful and cheaper, and with the technology evolving, it is now easier than ever to manage large tables in MySQL.